Friday, 7 March 2008

Data, repositories and Google

In a post last year, Peter Murray-Rust criticised DSpace as a place to keep data:
"The search engines locate content. Try searching for NSC383501 (the entry for a molecule from the NCI) and you’ll find: DSpace at Cambridge: NSC383501

"But the actual data itself (some of which is textual metadata) is not accessible to search engines so isn’t indexed. So if you know how to look for it through the ID, fine. If you don’t you won’t. [...]

"So (unless I’m wrong and please correct me), deposition in DSpace does NOT allow Google to index the text that it would expose on normal web pages. [...]

"If this is true, then repositing at the moment may archive the data but it hides it from public view except to diligent humans. So people are simply not seeing the benefit of repositing - they don’t discover material though simple searches."
Peter isn't often wrong, but in this case it was clear from comments to his post that Google does normally index DSpace content, not just the metadata. There were a couple of reasons for the effects Peter saw, but the key one related to the nature of the data. Jim Downing wrote, for example:
"Not sure what to tell you about your ChemML files. Possibly Google doesn’t know what to do with them and doesn’t try?

"That’s my understanding - interestingly, if you lie about the MIME type, Google does index CML (here, for example)."
The data Peter refers to is Chemical Markup Language data in a file with the extension .cml. My Mac does not know what it is, and I guess Google doesn’t either… unless perhaps you tell Google that it’s text, as Jim Downing seemed to be suggesting in his comment (I’m not sure this constitutes lying, more selective use of the truth). I can open CML files in my text editor just fine, although of course to process them into something chemically interesting I would need some additional software or plugins… Here's a chunk of that file [sorry, tried to include some XML here but Blogger swallowed it up]...

There's real data here [trust me: InChI and SMILES at least, plus bond strengths etc] that could be indexed but isn't. The point is, surely, that this would be just as much a problem if the repository were simply a filestore full of CML files, which is how data is often made available. But unlike the filestore, the repository usually holds some useful metadata that can assist data users (i.e. people, in this case); in a filestore, this is either absent, encoded in filenames, or kept in some conventional place such as README.TXT, where its relation to the actual data file is problematic.

So: in the first place, Google et al are unlikely to index data, particularly unusual data types. And in the second place, repositories encourage metadata, which does get indexed. So from this point of view at least, a repository may provide better exposure for your data (and hence more data re-use) than simply making the files web-accessible.
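To make Jim Downing's point about MIME types a little more concrete, here is a minimal Python sketch of how a server might decide which Content-Type to send for a file. The override table and the function name are my own illustration, not anything DSpace actually does, and whether a given search engine then indexes the file is an assumption based on Jim's comment rather than something this code can show.

```python
import mimetypes
import os

# Hypothetical override table: declare CML files to be plain text, so a
# crawler that skips unfamiliar types would treat them like any text file.
OVERRIDES = {".cml": "text/plain"}


def content_type_for(filename, overrides=OVERRIDES):
    """Return the Content-Type a server might send for `filename`.

    Overrides are checked first; otherwise fall back to the standard
    library's guess, and finally to a generic binary type.
    """
    ext = os.path.splitext(filename)[1].lower()
    if ext in overrides:
        return overrides[ext]
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


if __name__ == "__main__":
    print(content_type_for("NSC383501.cml"))
    print(content_type_for("index.html"))
```

With an empty override table, the answer for a .cml file depends on whatever MIME database the machine happens to have, which is roughly the situation described above: an unusual data type that the software may simply not recognise.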

This doesn't mean that current, library-oriented repositories are yet fit for purpose for science data! Far from it...


  1. Honest and serious question: What is wrong with the 'library-oriented' repositories for science data? Could you write about this for us?

    ta, Stephanie

  2. Thanks for the comment, I will come back to repositories more in future. But we do have to remember that the current set of repository platforms (particularly EPrints, and I think also DSpace, perhaps less so Fedora, not sure of others) were mainly created with text in mind. When you get to data, there are issues of scale in size of dataset, numbers of datasets, rates of deposit, update and curation/correction that play out differently from "text"...


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.