Monday 31 March 2008

UK Repositories claiming to hold data

The OpenDOAR and ROAR services both present self-reported claims by repositories across the world about their contents, backed up by some harvested facts. I’m interested in those UK repositories that claim to hold data.

My first problem is that neither service allows me simply to choose data. OpenDOAR allows me to search on “Datasets” (63 world-wide, 8 in the UK), while ROAR allows me to search for “Database/A&I Index” (24 world-wide, 6 in the UK). I thought the latter was a surprisingly “library science” classification, given the origins of ROAR. Not surprisingly, most repositories are in only one of the lists. Also not surprisingly, given the origins of these services in the Open Access and OAI-PMH movements, there are many first-class data repositories NOT listed here (UKDA and BADC, for example).

The UK repositories listed fall under two headings:

OpenDOAR “Datasets”
Looking at the OpenDOAR listing, and linking through to the repositories themselves, I find it very difficult in most cases actually to FIND the datasets. Looking at ERA, for example, there is no effective search for these datasets; browsing soon leads to the realisation that the contents are papers, articles, theses, etc. Some of these may have datasets associated with or within them, but they are a bit shy! The Edinburgh DataShare repository is a pilot, but does have a couple of real datasets. In a different way, Nature Precedings is also shy of disclosing its datasets.

The three that do have serious amounts of data are DSpace @ Cambridge, eCrystals and NDAD. DSpace @ Cambridge is dominated by the 100,000-plus collection of chemical structures encoded in CML, but there are plenty of other datasets there, including some from Archaeology. Sadly, there are also plenty of empty collections, and many collections where the last deposit was in 2006 (I guess around when the funded project died). eCrystals is entirely crystal structures, and has some very nice features: find a compound and, while you are still looking (perhaps rather bemused) at the page, a Java applet loads and there, right on the data page, is a rotatable image of the molecular structure! NDAD also has many ex-Government datasets, some of them very large.

ROAR “Database/A&I Index”
NDAD and eCrystals (under a slightly different name) appear again in this ROAR set. Of the others, HEER seems to be closed (it wanted a password for every page I found), and ReFer seems to have been withdrawn. ReOrient seems to be frozen and to present its data in the form of maps, while the Linnaean Collection seems to be images (which can, of course, be good data as well).

It’s a rather sad study! I do hope that the Open Repositories 2008 conference in Southampton over the next couple of days leads to an improvement. I can't get there, unfortunately, but I hope someone will report from it here. I particularly liked the idea of the developers' challenges. Can we have some oriented towards data, please?

Repositories for scientists

In a post to Staudinger's Semantic Molecules, Nico Adams has added to the scenarios for repositories for scientists:
"Now today it struck me: a repository should be a place where information is (a) collected, preserved and disseminated (b) semantically enriched, (c) on the basis of the semantic enrichment put into a relationship with other repository content to enable knowledge discovery, collaboration and, ultimately, the creation of knowledge spaces. "
The linkage seems to be the important thing, which current repositories don't do well. He goes on:
"Let’s take a concrete example [CR: scenario, I think] to illustrate what I mean: my institutional repository provides me with a workspace, which I can use in my scientific work everyday. In that workspace, I have my collection of literature (scientific papers, other people’s theses etc.), my scientific data (spectra, chromatograms etc) as well as drafts of my papers that I am working on at the moment. Furthermore, I have the ability to share some of this stuff with my colleagues and also to permanently archive data and information that I don’t require for projects in the future."
He then develops this argument further, making a lot of sense, and concludes:
"Now all of the technologies for this are, in principle, in place: we have DSpace, for example, for storage and dissemination, natural language processing systems such as OSCAR3 or parts of speech taggers for entity recognition, RDF to hold the data, OWL and SWRL for reasoning. And, although the example here, was chemistry specific, the same thing should be doable for any other discipline. As an outsider, it seems to me that what needs to happen now, is for these technologies to converge and integrate. Yes it needs to be done in a subject specific manner. Yes, every department should have an embedded informatician to take care of the data structures that are specific for a particular discipline. But the important thing is to just make the damn data do work!"
I have heard it suggested elsewhere that DSpace may not be up to this kind of linkage, and also that Fedora might be a more suitable candidate, but I don't have sufficient experience of the workings of either to be sure. Can anyone comment?
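To make the linkage idea a bit more concrete for myself, here is a minimal sketch of the sort of cross-linking Nico describes, written with the Python rdflib library. To be clear, this is not something DSpace or Fedora does out of the box; the item identifiers and the "repo" vocabulary are entirely hypothetical, and only Dublin Core Terms is a real, widely used vocabulary here.

# A minimal sketch of cross-linking repository content as RDF triples.
# The item URIs and the "repo" vocabulary are hypothetical; Dublin Core
# Terms (dcterms) is the only real vocabulary used.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import DCTERMS

REPO = Namespace("http://repository.example.org/vocab/")   # hypothetical
ITEM = Namespace("http://repository.example.org/item/")    # hypothetical

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("repo", REPO)

spectrum = ITEM["nmr-spectrum-0042"]                        # hypothetical item
draft = ITEM["draft-paper-0007"]                            # hypothetical item
compound = URIRef("http://example.org/compound/benzoic-acid")  # hypothetical

# The semantic enrichment step: say what each item is and how items relate.
g.add((spectrum, DCTERMS.type, Literal("NMR spectrum")))
g.add((spectrum, REPO.characterises, compound))
g.add((draft, DCTERMS.references, spectrum))

print(g.serialize(format="turtle"))

The point is simply that once the relationships are stated explicitly like this, a query (or a reasoner) can traverse them, which is the knowledge-space idea in embryo.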

Wednesday 26 March 2008

Significant Properties workshop

For my sins, I have agreed to chair the BL/DPC/JISC workshop "What to preserve? Significant Properties of Digital Objects" in just over a week. Mostly my duties will be to introduce speakers, and to keep them to the (ferocious) timetable, with perhaps a little light stimulation of discussion towards the end. However, I thought I should spend some time thinking about the idea of Significant Properties in digital preservation before the meeting. I hope (but don't guarantee, due to other commitments) that a couple of further posts on the topic will appear shortly.

This idea is something that has been on the fringes of my consciousness for a long time. The earliest reference I have was in 2001, when the CEDARS project (part of the eLib programme that I managed up to 2000) suggested in their final report that during the pre-ingest phase, an archive would have to assess the significant properties of the objects to be ingested:
"The archive will need to make decisions about what level of preservation is appropriate for each digital object (or each class of objects). This involves assessing which properties of a particular digital object are regarded as significant. These decisions influence the levels and methods of access that will be possible for the object, and the level of preservation metadata required for long-term retention."
The project later held a workshop where participants attempted to agree on the significant properties of a sample set of digital objects, and the work continued and overlapped with the CAMiLEON project, funded by JISC/NSF and run jointly between Leeds and Michigan.

OK, so that's the ancient history (I would be interested to know of anything even more ancient; I could not find any reference in OAIS, for example, so it may be that the CEDARS team invented the concept). Thinking about it now, I have a whole bunch of questions in my mind, including:
  • what properties?
  • of which objects?
  • for whom?
  • for what purposes?
  • when?
  • and maybe where?
I'm sure there will be a bunch of definitions of significant properties coming up during the meeting. I liked one used in a separate context, from Barbara Sierman of the Netherlands KB (I hope she doesn't mind me quoting it):
"an important property of a certain digital object, as experienced by the user. Significant properties can be classified by five aspects of a digital object: structure, content, context, appearance, and behaviour. Examples: text (content), chapters (structure), metadata (context), colour (appearance), zoom-functionality (behaviour)."
The forthcoming workshop will feature reports from a number of projects and studies that JISC has funded in this area. It should turn out to be very interesting!

Tuesday 25 March 2008

Pulsating white dwarf stars!

I really like the physics arXiv blog, which has quirky posts on interesting stuff recently posted to the arXiv. This post (New type of pulsating star discovered) in particular interests me, in this case from a digital curation rather than a general interest point of view. The post starts:
"New types of stars aren’t found very often but last year, Patrick Dufour and pals discovered several white dwarfs with carbon atmospheres. Before then white dwarfs were thought to come in two flavours: with atmospheres dominated by either hydrogen or helium. Astronomers suddenly had a new toy to play with.

"Dufour found nine examples of his carbon dwarfs in the data regurgitated by the Sloan Digital Sky Survey and more are likely to be found as the skies continue to be searched."
What's more they worked out that some of these new stars should be pulsating. An update suggests they found one shortly afterwards. A nice example of science emerging from the data; no doubt there are many more in astronomy, with these huge sky surveys and virtual observatories!

Thursday 20 March 2008

Migration on Request: OpenOffice as a platform?

Following on from my previous post relating to legacy formats, I was thinking again about the problems of dealing with documents in those formats. For some, the answer lies in emulation and perpetual licences for the original software packages, but for me that just doesn't cut the mustard. I won't have access to those packages, but I might still want access to the documents. Some of them, for example, might be PowerPoint 4 presentations created on a predecessor of the Macintosh that I use now, but which are unreadable with my current PowerPoint software (I CAN get at them by copying them to a colleague's Windows machine; her version of PowerPoint has input filters unavailable on my Mac).

So I want some form of migration. In the example above, this is known as "Save as"!

However, I know that every time I do a migration I introduce errors of some sort. So if I migrate from those PowerPoint 4 files to today's PowerPoint, and then from today's to tomorrow's PowerPoint, and then from tomorrow's to the next great thing, I will introduce cumulative errors whose impact I will only be able to assess at some horribly cringe-making moment, like the middle of a presentation on a host's machine. So the best way to do migration is to start from the original file and migrate to today's version. Always. It's nuts for Microsoft to drop old file format support from its software (at least from this point of view).

This approach of migrating from original version to today's version is called Migration on Request, and was described in a paper by Mellor, Wheatley and Sergeant back in 2002 (I referred to it earlier), but the idea hasn't caught on much. They had some other great ideas, like writing the migration tool in a specially portable version of C with all the nasty bits removed, called C--.
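The principle is easy to sketch in code. The converter below is a hypothetical stand-in (it is not Mellor, Wheatley and Sergeant's C-- tool, nor any real filter); the point is only that every access request goes back to the preserved original, never to the output of a previous migration.

# A minimal sketch of Migration on Request: keep the original bitstream
# untouched, and migrate it to the format the user wants at access time.
# The converter function is a hypothetical stand-in for a real tool.

def convert_ppt4_to_current(original_bytes: bytes) -> bytes:
    raise NotImplementedError("stand-in for a real PowerPoint 4 input filter")

# One entry per (source format, target format) pair that we have a tool for.
CONVERTERS = {
    ("powerpoint4", "current-presentation-format"): convert_ppt4_to_current,
}

def migrate_on_request(original_path: str, source_fmt: str, target_fmt: str) -> bytes:
    """Always migrate from the stored original, never from an earlier
    migration, so conversion errors cannot accumulate across generations."""
    with open(original_path, "rb") as f:
        original = f.read()
    convert = CONVERTERS[(source_fmt, target_fmt)]
    return convert(original)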

I have wondered from time to time, however, for that class of documents we call Office documents (word processing, spreadsheets, presentations), whether tacking onto an open source project with a strong developer community might be a better approach. Something like OpenOffice. I'm not sure how many file formats it already supports (always growing, I guess), but Chapter 3 of the "Getting Started" documentation lists the following:
Microsoft Word 6.0/95/97/2000/XP (.doc and .dot)
Microsoft Word 2003 XML (.xml)
Microsoft WinWord 5 (.doc)
StarWriter formats (.sdw, .sgl, and .vor)
AportisDoc (Palm) (.pdb)
Pocket Word (.psw)
WordPerfect Document (.wpd)
WPS 2000/Office 1.0 (.wps)
DocBook (.xml)
Ichitaro 8/9/10/11 (.jtd and .jtt)
Hangul WP 97 (.hwp)
.rtf, .txt, and .csv
... which is not a bad list (and that's just the word processing part), and maybe it has been extended in more up-to-date versions. For interest, their FAQ has a question "Why does OpenOffice not support the file format my application uses?"
"There may be several reasons, for example:
  • The file formats may not be open and available.
  • There may not be enough developers available to do the work (either paid or volunteer).
  • There may not be enough interest in it.
  • There may be reasonable, available workarounds."
Making legacy file formats more open was the subject of my previous post, and I guess we have to wait and see. But there are plenty of legacy word processing formats not on that list (Samna, for example, which later evolved into Lotus Word Pro, as well as formats for obsolete computers like the Atari, such as the German word processor SIGNUM, supposedly very good for mathematical formulae). What about earlier versions of MS Word? Wikipedia lists a bunch of word processors; there must be many documents in obscure locations in these formats.

With a concerted effort, we could gradually build OpenOffice input filters for these obsolete document types, thus bringing them into the preservable digital world. And this is an effort that could draw in that extraordinary community of enthusiasts who do so much to build document converters and other kinds of software, and who are so much ignored by the digital preservation community!

Legacy document formats

On the O'Reilly XML blog, which I always read with interest (particularly in relation to the shenanigans over OOXML and ODF standardisation), Rick Jelliffe writes An Open Letter to Microsoft, IBM/Lotus, Corel and others on Lodging Old File Formats with ISO. He points out that
"Corporations who were market leaders in the 1980s and 1990s for PC applications have a responsibility to make sure that documentation on their old formats are not lost. Especially for document formats before 1990, the benefits of the format as some kind of IP-embodying revenue generator will have lapsed now in 2008. However the responsibility for archiving remains.

"So I call on companies in this situation, in particular Microsoft, IBM/Lotus, Corel, Computer Associates, Fujitsu, Philips, as well as the current owners of past names such as Wang, and so on, to submit your legacy binary format documentation for documents (particularly home and office documents) and media, to ISO/IEC JTC1 for acceptance as Technical Specifications.[...] Handing over the documentation to ISO care can shift the responsibility for archiving and making available old documentation from individual companies, provide good public relations, and allow old projects to be tidied up and closed."
This is in principle a Good Idea. However, ISO documents are not Open Access, and the specifications Rick refers to would benefit greatly from being Open; they would form vitally important parts of our effort to preserve digital documents. Instead of being deposited with ISO, they should be regarded as part of the Representation Information for those file types, and deposited in a variety (more than one, for safety's sake) of services, such as PRONOM at The National Archives in the UK, the proposed Harvard/Mellon Global Digital Format Registry, the Library of Congress Digital Preservation activity, or the DCC's own Registry/Repository of Representation Information.

Tuesday 18 March 2008

Novartis/Broad Institute Diabetes data

Graham Pryor spotted an item on the CARMEN blog, pointing to a Business Week article (from 2007, we later realised) about a commercial pharma (Novartis) making research data from its Type 2 Diabetes studies available on the web. This seemed to me an interesting thing to explore (as a data person, not a genomics scientist), both for what it was, and for how they did it.

I could not find a reference to these data on the Novartis site, but I did find a reference to a similar claim dating back to 2004, made in the Boston Globe and then in some press releases from the Broad Institute in Cambridge, MA, referring to their joint work with Novartis (eg initial announcement, first results and further results). The first press release identified David Altshuler as the PI, and he was kind enough to respond to my emails and point me to their pages that link to the studies and to the results they are making available.

Why make the data available? The Boston Globe article said "Commercially, the open approach adopted by Novartis represents a calculated gamble that it will be better able to capitalize on the identification of specific genes that play a role in Type 2 diabetes. The firm already has a core expertise in diabetes. Collaborating on the research will give its scientists intimate knowledge of the results."

The Business Week article said "...the research conducted by Novartis and its university partners at MIT and Lund University in Sweden merely sets the stage for the more complex and costly drug identification and development process. According to researchers, there are far more leads than any one lab could possibly follow up alone. So by placing its data in the public domain, Novartis hopes to leverage the talents and insights of a global research community to dramatically scale and speed up its early-stage R&D activities."

Thus far, so good. Making data available for un-realised value to be exploited by others is at the heart of the digital curation concept. There are other comments on these announcements that cynically claim that the data will have already been plundered before being made accessible; certainly the PIs will have first advantage, but there is nothing wrong with that. The data availability itself is a splendid move. It would be very interesting to know if others have drawn conclusions from the data (I did not see any licence terms, conditions, or even requests such as attribution, although maybe this is assumed as scientific good practice in this area).

Business Week go on to draw wider conclusions:
"The Novartis collaboration is just one example of a deep transformation in science and invention. Just as the Enlightenment ushered in a new organizational model of knowledge creation, the same technological and demographic forces that are turning the Web into a massive collaborative work space are helping to transform the realm of science into an increasingly open and collaborative endeavor. Yes, the Web was, in fact, invented as a way for scientists to share information. But advances in storage, bandwidth, software, and computing power are pushing collaboration to the next level. Call it Science 2.0."
I have to say I'm not totally convinced by the latter phrase. Magazines like Business Week do like buzz-words like Science 2.0, but so far comparatively little science is affected by this kind of "radical sharing". Genomics is definitely one of the poster children in this respect, but the vast majority of science continues to be lab- or small-group-based, with an orientation towards publishing results as papers, not data.

So what have they made available? There are 3 diabetes projects listed:
  1. Whole Genome Scan for Type 2 Diabetes in a Scandinavian Cohort
  2. Family-based linkage scan in three pedigrees with extreme diabetes phenotypes
  3. A Whole Genome Admixture Scan for Type 2 Diabetes in African Americans
The second of these does not appear to have data available online. The third project has results data in the form of an Excel spreadsheet, with 20 columns and 1,294 rows; the data appear relatively simple (a single sheet, with no obvious formulae or Excel-specific issues that I could see), and could probably have been presented just as easily as CSV or another text variant. There is a small amount of header text in row 2 that spans columns, plus some colour coding, which may have justified the use of Excel. Short- to medium-term access to these data should be simple.
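Dumping such a sheet to a plainer form would be trivial; here is a sketch using the Python xlrd and csv modules, with hypothetical filenames. A real conversion would still need a decision about that spanning header text and the colour coding, which CSV cannot carry.

# A minimal sketch: dump a simple single-sheet Excel file to CSV.
# Filenames are hypothetical; formatting such as colour coding is lost.
import csv
import xlrd  # third-party library for reading .xls workbooks

book = xlrd.open_workbook("admixture_scan_results.xls")
sheet = book.sheet_by_index(0)

with open("admixture_scan_results.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row_index in range(sheet.nrows):
        writer.writerow(sheet.row_values(row_index))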

The first project shows two different types of results, with a lot more data: Type 2 Diabetes results and Related Traits results. The Type 2 Diabetes results comprise a figure in JPEG or PDF, plus data in two forms: a HTML table of "top single-marker and multi-marker results", and a tab-delimited text file (suitable for analysis with Haploview 4.0) of "all single-marker and multi-marker results". These data are made available both as the initial release of February 2007, and an updated release from March 2007. There is a link to Instructions for using the results files, effectively short-hand instructions for feeding the data into Haploview and doing some analyses on them. The HTML table is just that; data in individual cells are numbers or strings, without any XML or other encoding. There are links to entries in NCBI, HapMap and Ensembl, however.

The Related Traits results also come in an initial release (also February 2007) and an updated release from September 2007. The results again have a summary, a table this time but still in JPEG or PDF form. The detailed results are more complex; there is a HTML table of traits in 4 groups (Glucose, Obesity, Lipid and Blood Pressure), and for each trait (eg Fasting Glucose) up to 4 columns of data. The first column is a description of the trait as a PDF, the next is a link to a HTML Table of Top Single Marker Results for Association, the next is a link to a text Table of All Single Marker Results for Association, and the last is a link to a text table of Phenotype summary statistics by genotype (both these have the same format as above, although the latter has different columns).

It seems clear that there is a lot of data here; how useful they are to other scientists is not for me to judge. Certainly a scientist looking through these pages could form judgments on the usefulness and relevance of these data to his or her work. There's not much, though, to help a robot looking for science data on the Internet. I'm not sure what form such information might take, although there are examples in Chemistry. Perhaps the data cells should be automatically encoded according to a relevant ontology, so that the significance of the data travels with them; possibly microformats or RDFa could have (or come to have) some relevance here (a sketch of the idea follows below). However, both the HTML and text formats are very durable (more so than the Excel format for project 3) and should be easily accessible (or transformed into later forms) at least as long as the Broad Institute wishes to continue to make them available.
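I don't know which ontology would be the right one, but as an illustration of what "the significance travelling with the data" might mean, here is a sketch that expresses one invented single-marker result as RDF using the Python rdflib library. The "gwas" vocabulary, the SNP identifier and the values are all made up for the purpose; nothing here is taken from the Broad/Novartis files.

# A sketch of carrying the meaning of a data cell along with its value:
# one invented single-marker association result expressed as RDF.
# The "gwas" vocabulary namespace and the SNP URI are hypothetical.
from rdflib import Graph, Namespace, BNode, Literal, URIRef
from rdflib.namespace import RDF, XSD

GWAS = Namespace("http://example.org/gwas-vocab/")      # hypothetical vocabulary
SNP = URIRef("http://example.org/snp/rs0000000")        # hypothetical marker URI

g = Graph()
g.bind("gwas", GWAS)

result = BNode()
g.add((result, RDF.type, GWAS.SingleMarkerAssociation))
g.add((result, GWAS.marker, SNP))
g.add((result, GWAS.trait, Literal("Type 2 Diabetes")))
g.add((result, GWAS.pValue, Literal("1.0e-6", datatype=XSD.double)))

print(g.serialize(format="turtle"))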

Friday 7 March 2008

Data, repositories and Google

In a post last year, Peter Murray-Rust criticised DSpace as a place to keep data:
"The search engines locate content. Try searching for NSC383501 (the entry for a molecule from the NCI) and you’ll find: DSpace at Cambridge: NSC383501

"But the actual data itself (some of which is textual metadata) is not accessible to search engines so isn’t indexed. So if you know how to look for it through the ID, fine. If you don’t you won’t. [...]

"So (unless I’m wrong and please correct me), deposition in DSpace does NOT allow Google to index the text that it would expose on normal web pages. [...]

"If this is true, then repositing at the moment may archive the data but it hides it from public view except to diligent humans. So people are simply not seeing the benefit of repositing - they don’t discover material though simple searches."
Peter isn't often wrong, but in this case it was clear from comments to his post that Google does normally index DSpace content, not just the metadata. There were a couple of reasons for the effects Peter saw, but the key one related to the nature of the data. Jim Downing wrote, for example:
"Not sure what to tell you about your ChemML files. Possibly Google doesn’t know what to do with them and doesn’t try?

"That’s my understanding - interestingly, if you lie about the MIME type, Google does index CML (here, for example)."
The data Peter refers to are Chemical Markup Language data, in a file with the extension .cml. My Mac does not know what that is, and I guess no more does Google… unless perhaps you tell Google that it’s text, as Jim Downing seemed to be suggesting in his comment (I’m not sure this constitutes lying, more a selective use of the truth). I can open CML files in my text editor fine, although of course to process them into something chemically interesting I would need some additional software or plugins… Here's a chunk of that file [sorry, I tried to include some XML here but Blogger swallowed it up]...

There's real data here [trust me: InChI and SMILES at least, plus bond strengths etc.] that could be indexed but isn't. The point is, surely, that this would be just as much a problem if the repository were simply a filestore full of CML files, which is how data is often made available. But unlike the filestore, there is usually some useful metadata in the repository which can assist data users (i.e. people, in this case); in a filestore, this is either absent, encoded in filenames, or kept in some conventional place such as README.TXT, whose relation to the actual data file is problematic.
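As a rough illustration of what a data-aware indexer could do that Google evidently doesn't, here is a sketch that harvests the text strings inside a CML file so they could be fed to a full-text index. It deliberately assumes nothing about the CML schema, and the filename is only borrowed from Peter's example for illustration.

# A rough sketch of harvesting indexable text from a CML file: walk the
# XML and collect element text and attribute values, so that a search
# engine or repository indexer could treat them as plain text.
import xml.etree.ElementTree as ET

def harvest_text(cml_path):
    """Collect the readable strings inside an XML (here CML) file."""
    tree = ET.parse(cml_path)
    terms = []
    for element in tree.iter():
        for value in list(element.attrib.values()) + [element.text or ""]:
            value = value.strip()
            if value:
                terms.append(value)
    return terms

if __name__ == "__main__":
    terms = harvest_text("NSC383501.cml")  # hypothetical filename
    # InChI strings announce themselves, so they are easy to surface:
    inchis = [t for t in terms if t.startswith("InChI=")]
    print(len(terms), "indexable strings;", len(inchis), "of them InChI identifiers")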

So: in the first place, Google et al are unlikely to index data, particularly unusual data types. And in the second place, repositories encourage metadata, which does get indexed. So from this point of view at least, a repository may provide better exposure for your data (and hence more data re-use) than simply making the files web-accessible.

This doesn't mean that current, library-oriented repositories are yet fit for purpose for science data! Far from it...