Wednesday, 18 November 2009

Data and the journal article

I recently had a discussion (billed as a presentation, but it was on such an (ahem) intimate scale that it became a discussion) at Ithaka, the organisation in New York that runs JSTOR, ArtSTOR and Portico. We talked about some of the issues surrounding supporting journal articles better with data. Both research funders and some journals are starting to require researchers/authors to keep and to make available the data that supports the conclusions in their articles. How can they best do this?

It seems to me that there are 4 ways of associating data with an article. The first is through the time-honoured (but not very satisfactory) Supplementary Materials, the second is through citations and references to external data, the third is through databases that are in some way integrated with the article, and the fourth is through data encoded within the article text.

My expectation was that most supplementary materials that included data would actually be in Excel spreadsheets, and a few would be in CSV files, while even fewer would be in domain-specific, science-related encodings. I was quite shocked after a little research to find, at least for the Nature journals I looked at, that nearly all supplementary data were in PDF files, while a few were in Word tables. I don't think I found any that were Excel, let alone CSV. This doesn't do much for data re-usability! As things stand currently, data in a PDF document (eg in tables) will probably need to be extracted by hand: by re-keying, or possibly by cut and paste followed by extensive manual clean-up.
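To make the re-usability point concrete, here is a minimal sketch of how little work a CSV supplementary table needs before it can be re-used. The table, its column names and its values are all invented for illustration; nothing here comes from a real article.

```python
import csv
import io

# Hypothetical supplementary table: the columns and values are invented
# purely to illustrate the point, not taken from any real article.
SUPPLEMENTARY_CSV = """sample_id,temperature_c,yield_percent
A1,25.0,71.5
A2,40.0,78.2
A3,60.0,83.9
"""


def load_supplementary_table(text):
    """Parse CSV text into a list of dicts, converting numeric columns to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "sample_id": record["sample_id"],
            "temperature_c": float(record["temperature_c"]),
            "yield_percent": float(record["yield_percent"]),
        })
    return rows


rows = load_supplementary_table(SUPPLEMENTARY_CSV)
mean_yield = sum(r["yield_percent"] for r in rows) / len(rows)
print(f"{len(rows)} rows, mean yield {mean_yield:.1f}%")
```

The same table locked in a PDF would first need its rows and columns recovered, which is exactly the hand work described above.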

I would expect that looking away from the generalist journals towards domain-specific titles would reveal more appropriate formats. However, a ridiculously quick check of Chem-Comm, a Royal Society of Chemistry title, showed supplementary data in PDF even for an "electronically enhanced article" (eg Experimental procedures, spectra and characterization data, perhaps not openly accessible...).

There’s a bit of concern in some quarters about journals managing data, particularly that data would disappear behind the pay wall, limiting opportunities for re-use.

What would be ideal? I guess data that are encoded in domain-specific, standardised formats (perhaps supported by ontologies, well-known schemas, and/or open software applications) would be pretty useful. I’ve also got a vague sense of unease about the lack of any standardised approach to describing context, experimental conditions, instrument calibrations, or other critical metadata needed to interpret the data properly. This is a tough area, as we want to reduce the disincentives to deposit as well as increase the chances of successful re-use.

Clearly there are many cases where the data are not appropriate for inclusion as supplementary materials, and should be available by external reference. Such would be the case for genomics data, for example, which must have been deposited in an appropriate database (the journal should demand deposit/accession data before publication).

External data will be fine as long as they are on an accessible (not necessarily open) and reasonably permanent database, data centre or repository somewhere. I do worry that many external datasets will be held on personal web sites. Yes, these can be web-accessible, and Google-indexed, but researchers move, researchers die, and departments reorganise their web presence, which means those links will fail, and the data will disappear (see the nice Book of Trogool article "... and then what?").

Sometimes such external data can be simply linked, eg a parenthetical or foot-noted web link, but I would certainly like to encourage increasing use of proper citations for data. Citations are the currency of academia, and the sooner they accrue for good data, the sooner researchers will start to regard their re-usable data as valuable parts of their output! It’s interesting to see the launch of the DataCite initiative coming up soon in London.
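As a rough illustration of what a "proper citation" for data might carry, compared with a bare web link, here is a small sketch. The record fields, the citation pattern (creators, year, title, publisher, identifier) and the example dataset are all my assumptions for illustration; this is not the official format of any initiative.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata needed to cite a dataset."""
    creators: list
    year: int
    title: str
    publisher: str
    identifier: str


def format_data_citation(record):
    """Build a citation string: Creators (Year): Title. Publisher. Identifier"""
    names = "; ".join(record.creators)
    return (
        f"{names} ({record.year}): {record.title}. "
        f"{record.publisher}. {record.identifier}"
    )


# Invented example record; the names and the identifier are placeholders.
example = DatasetRecord(
    creators=["Smith, J.", "Jones, A."],
    year=2009,
    title="Hypothetical rainfall measurements",
    publisher="Example Data Centre",
    identifier="doi:10.1234/example-dataset",
)
citation = format_data_citation(example)
print(citation)
```

The point is simply that a citation names creators and a responsible holder of the data, so credit can accrue to them in a way that a footnoted URL does not allow.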

There is this interesting idea of the overlay data journal, which rather turns my last paragraph on its head; the data are the focus and the articles describe the data. Nucleic Acids Research Database Issue articles would be prime examples here in existing practice, although they tend to describe the dataset as a persistent context, rather than as the focus for some discovery. The OJIMS project described a proposed overlay journal in Meteorology; they produced a sample issue and a business analysis, but I’m not sure what happened then.

The best (and possibly only) example I know of the database-as-integral-part-of-article approach is Internet Archaeology, set up in 1996 (by the eLib programme!) as an exemplar for true internet-enabled publishing. 13 years later it's still going strong, but has rarely been emulated. Maybe what it provides does not give real advantages? Maybe it's too risky? Maybe it’s too hard to create such articles? Maybe scholarly publishing is just too blindly conservative? I don't know, but it would be good to explore in new areas.

Peter Murray-Rust has argued eloquently about the tragedy of data trapped and rendered useless in the text, tables and figures of articles. We would like to see articles semantically enriched so that these data can be extracted and processed. Looking for encoded data points us to a few examples, such as the Shotton enhanced article described in Shotton et al 2009, and also to the Murray-Rust/Sefton TheoREM-ICE approach (although that was designed for theses, I think). I think the key problem here is the lack of authoring tools. It is still rather difficult to actually do this stuff, eg to write an article that contains meaningful semantic content. The Shotton target article was marked up by hand, with support from one of the authors of the W3C SKOS standard, ie an expert! The chemists have been working on tools for their community: the ICE example, and also MS Chem4Word, maybe ChemMantis, etc.

This last paragraph also points us towards the thesis area; I think this is one that librarians really ought to be interested in tackling. What is the acceptable modern equivalent to the old (but never really acceptable) practice of tucking disks into a pocket inside the back cover of a thesis? Many universities are now accepting theses in digital form; we need some good practice in how to deal with their associated data.

So, we seem to be quite a way from universal good practice in associating data with our research articles.

Shotton, D., Portwin, K., Klyne, G., & Miles, A. (2009). Adventures in semantic publishing: exemplar semantic enhancements of a research article. PLoS Computational Biology, 5(4). doi: 10.1371/journal.pcbi.1000361.

1 comment:

  1. Hi Chris,

    linking the paper object to the data object via OAI-ORE seems eminently feasible for this. And as to context: there are more and more ontologies coming on-stream now that will allow you to mark up your data with experimental context: see the OBI ontology for example (for biomedical investigations)

