Wednesday 25 November 2009

IDCC 2009 Amplified!

As Chris has recently announced, the annual International Digital Curation Conference is almost upon us. This year's event will be amplified using a range of online social media tools to help include those who can't make it to London on 3rd and 4th December, and to capture the online conversation surrounding the event for future reference.

This blog will form the centre point of the coverage. There will be summaries of each of the sessions, video interviews with speakers and delegates, and much more. So, if you are reading via the RSS feed, expect a flurry of updates throughout the conference! If you're not subscribed to the RSS feed, make sure you check back regularly during the event to see what's been covered.

You will also be able to follow the official live commentary of each of the plenary sessions on Twitter by following @idcclive, and take part in the conversation using the event hash tag #idcc09. If you have a question for a speaker, simply tweet your question to @idcclive and it will be relayed to the speaker for you at an appropriate point.

We look forward to seeing you at IDCC 09 – whether in person or online!

The amplification of IDCC 09 will be co-ordinated by Kirsty McGill. Kirsty is the Creative Director of communications and training firm TConsult Ltd.

Wednesday 18 November 2009

Workshops prior to the International Digital Curation Conference

Pre-conference workshops can be very useful and interesting; they can be a good part of the justification for attending a conference, offering an extended opportunity to focus on a single topic before the broader (but shallower) look at many topics at the conference itself. This time it is quite frustrating, as I would very much like to go to all the workshops! There is still time to register for your choice, and for the IDCC conference itself.

Disciplinary Dimensions of Digital Curation: New Perspectives on Research Data

Our SCARP Project case studies have explored data curation practice across a variety of clinical, life, social, humanities, physical and engineering research communities. This workshop is the final event in SCARP, and will present the reports and synthesis.

See the full programme [PDF]

Digital Curation 101 Lite Training

Research councils and funding bodies are increasingly requiring evidence of adequate and appropriate provisions for data management and curation in new grant funding applications. This one-day training workshop is aimed at researchers and those who support researchers and want to learn more about how to develop sound data management and curation plans.

See the full programme [PDF]

Citability of Research Data

Handling research datasets as unique, independent, citable research objects offers a wide variety of opportunities.

The goal of the new DataCite cooperation is to establish a not-for-profit agency that enables organisations to register research datasets and assign persistent identifiers to them.

Citable datasets are accessible and can be integrated into existing catalogues and infrastructures. Citable datasets furthermore reward scientists for the extra work of storing and quality-controlling their data, by granting scientific reputation through citation counts. The workshop will examine the different methods for enabling citable datasets and discuss common best practices and challenges for the future.
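The mechanics behind this are simple to sketch: a registered persistent identifier resolves to the dataset's landing page through the standard doi.org resolver, and the identifier can then appear in an ordinary reference-list entry. A minimal illustration (the DOI and citation below are hypothetical; 10.5072 is the DOI test prefix, and real dataset DOIs are assigned by a registration agency such as DataCite):

```python
# Sketch: how a dataset's persistent identifier becomes a resolvable
# citation link. The DOI and citation details here are invented for
# illustration; 10.5072 is the reserved DOI test prefix.

def doi_to_url(doi: str) -> str:
    """Turn a DOI into a clickable URL via the standard doi.org resolver."""
    return "https://doi.org/" + doi

dataset_doi = "10.5072/example-dataset-2009"  # hypothetical identifier

# A reference-list entry in a creator/year/title/publisher/identifier style:
citation = (
    "Smith, J. (2009): Example ocean temperature dataset. "
    "Hypothetical Data Centre. doi:" + dataset_doi
)

print(doi_to_url(dataset_doi))
# https://doi.org/10.5072/example-dataset-2009
```

Because the resolver indirection is maintained by the registration agency, the citation stays stable even if the dataset moves between repositories.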

See the full programme [PDF]

Repository Preservation Infrastructure (REPRISE)
(co-organised by the OGF Repositories Group, OGF-Europe, D-Grid/WissGrid)

Following on from the successful Repository Curation Service Environments (RECURSE) Workshop at IDCC 2008, this workshop discusses digital repositories and their specific requirements for/as preservation infrastructure, as well as their role within a preservation environment.

Data and the journal article

I recently had a discussion (billed as a presentation, but it was on such an (ahem) intimate scale that it became a discussion) at Ithaka, the organisation in New York that runs JSTOR, ArtSTOR and Portico. We talked about some of the issues surrounding supporting journal articles better with data. Both research funders and some journals are starting to require researchers/authors to keep and to make available the data that supports the conclusions in their articles. How can they best do this?

It seems to me that there are four ways of associating data with an article. The first is through the time-honoured (but not very satisfactory) Supplementary Materials, the second is through citations and references to external data, the third is through databases that are in some way integrated with the article, and the fourth is through data encoded within the article text.

My expectation was that most supplementary materials that included data would actually be in Excel spreadsheets, and a few would be in CSV files, while even fewer would be in domain-specific, science-related encodings. I was quite shocked after a little research to find, at least for the Nature journals I looked at, that nearly all supplementary data were in PDF files, while a few were in Word tables. I don't think I found any that were Excel, let alone CSV. This doesn't do much for data re-usability! As things stand, data in a PDF document (eg in tables) will probably need to be extracted by hand, or by cut and paste followed by extensive manual clean-up.
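The re-usability gap is easy to demonstrate. A supplementary table shipped as CSV can be loaded and re-analysed in a few lines of standard-library code, whereas the same table frozen into a PDF offers no such route. A small sketch (the table contents are made up):

```python
import csv
import io

# Illustrative only: a (fictional) supplementary data table as it might
# be shipped in CSV form alongside an article.
supplementary_csv = """sample,concentration_mM,absorbance
A1,0.5,0.12
A2,1.0,0.25
A3,2.0,0.49
"""

# Machine-readable data is immediately re-usable: parse it and pull out
# a column for further analysis.
rows = list(csv.DictReader(io.StringIO(supplementary_csv)))
concentrations = [float(row["concentration_mM"]) for row in rows]
print(concentrations)  # [0.5, 1.0, 2.0]
```

No equivalent three-line route exists for a table rendered into PDF, which is the heart of the complaint above.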

I would expect that looking away from the generalist journals towards domain-specific titles would reveal more appropriate formats. However, a ridiculously quick check of Chem-Comm, a Royal Society of Chemistry title, showed supplementary data in PDF even for an "electronically enhanced article" (eg experimental procedures, spectra and characterisation data; perhaps not openly accessible...).

There’s a bit of concern in some quarters about journals managing data, particularly that data would disappear behind the pay wall, limiting opportunities for re-use.

What would be ideal? I guess data that are encoded in domain-specific, standardised formats (perhaps supported by ontologies, well-known schemas, and/or open software applications) would be pretty useful. I’ve also got a vague sense of unease about the lack of any standardised approach to describing context, experimental conditions, instrument calibrations, or other critical metadata needed to interpret the data properly. This is a tough area, as we want to reduce the disincentives to deposit as well as increase the chances of successful re-use.

Clearly there are many cases where the data are not appropriate for inclusion as supplementary materials, and should be available by external reference. Such would be the case for genomics data, for example, which must have been deposited in an appropriate database (the journal should demand deposit and the accession details before publication).

External data will be fine as long as they are on an accessible (not necessarily open) and reasonably permanent database, data centre or repository somewhere. I do worry that many external datasets will be held on personal web sites. Yes, these can be web-accessible, and Google-indexed, but researchers move, researchers die, and departments reorganise their web presence, which means those links will fail, and the data will disappear (see the nice Book of Trogool article "... and then what?").

Sometimes such external data can be simply linked, eg via a parenthetical or footnoted web link, but I would certainly like to encourage increasing use of proper citations for data. Citations are the currency of academia, and the sooner they accrue for good data, the sooner researchers will start to regard their re-usable data as valuable parts of their output! It’s interesting to see the launch of the DataCite initiative coming up soon in London.

There is this interesting idea of the overlay data journal, which rather turns my last paragraph on its head; the data are the focus and the articles describe the data. Nucleic Acids Research Database Issue articles would be prime examples here in existing practice, although they tend to describe the dataset as a persistent context, rather than as the focus for some discovery. The OJIMS project described a proposed overlay journal in Meteorology; they produced a sample issue and a business analysis, but I’m not sure what happened then.

The best (and possibly only) example I know of the database-as-integral-part-of-article approach is Internet Archaeology, set up in 1996 (by the eLib programme!) as an exemplar for true internet-enabled publishing. 13 years later it's still going strong, but has rarely been emulated. Maybe what it provides does not give real advantages? Maybe it's too risky? Maybe it’s too hard to create such articles? Maybe scholarly publishing is just too blindly conservative? I don't know, but it would be good to explore in new areas.

Peter Murray-Rust has argued eloquently about the tragedy of data trapped and rendered useless in the text, tables and figures of articles. We would like to see articles semantically enriched so that these data can be extracted and processed. The encoded-data approach points us to a few examples, such as the enhanced article described in Shotton et al 2009, and also the Murray-Rust/Sefton TheoREM-ICE approach (although that was designed for theses, I think). I think the key here is the lack of authoring tools. It is still rather difficult to actually do this stuff, eg to write an article that contains meaningful semantic content. The Shotton target article was marked up by hand, with support from one of the authors of the W3C SKOS standard, ie an expert! The chemists have been working on tools for their community: the ICE example, and also Microsoft's Chem4Word, maybe ChemMantis, etc.
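To see what such enrichment buys you, consider a data point carried in article markup with machine-readable attributes. The markup convention below is hypothetical (loosely in the spirit of RDFa/microdata, not the Shotton or ICE encodings themselves), but it shows how an enriched fragment lets a program recover the value, property and unit without scraping prose:

```python
from html.parser import HTMLParser

# A (hypothetical) semantically enriched article fragment: the reported
# value is duplicated in machine-readable data-* attributes.
fragment = (
    '<p>The melting point was '
    '<span class="datum" data-property="meltingPoint" '
    'data-value="173.4" data-unit="K">173.4 K</span>.</p>'
)

class DatumExtractor(HTMLParser):
    """Collect (property, value, unit) triples from marked-up data points."""

    def __init__(self):
        super().__init__()
        self.data_points = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("class") == "datum":
            self.data_points.append(
                (attrs["data-property"],
                 float(attrs["data-value"]),
                 attrs["data-unit"])
            )

parser = DatumExtractor()
parser.feed(fragment)
print(parser.data_points)  # [('meltingPoint', 173.4, 'K')]
```

The extraction side is the easy part, as this sketch suggests; the hard part, as noted above, is giving authors tools that produce such markup in the first place.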

This last paragraph also points us towards the thesis area; I think this is one that Librarians really ought to be interested in tackling. What is the acceptable modern equivalent to the old (but never really acceptable) practice of tucking disks into a pocket inside the back cover of a thesis? Many universities are now accepting theses in digital form; we need some good practice in how to deal with their associated data.

So, we seem to be quite a way from universal good practice in associating data with our research articles.

Shotton, D., Portwin, K., Klyne, G., & Miles, A. (2009). Adventures in semantic publishing: exemplar semantic enhancements of a research article. PLoS Computational Biology, 5(4). doi: 10.1371/journal.pcbi.1000361.

Friday 13 November 2009

5th International Digital Curation Conference : Register Now!

Hear ye, hear ye! [Shameless promotion here, but with useful information embedded!]

Time to register for this premier curation event, coming up in London, in the first week in December. We have a great programme this year, with Douglas Kell, head of BBSRC as the opening Keynote, and Timo Hannay of Nature as the closing keynote. In between we have perspectives on scale from US viewpoints, particularly the two large NSF-funded Datanet projects, and from the UK with reports linked to neurosciences and social simulation.

In the first afternoon we have our popular Minute Madness, followed by the Community Space: part of the conference shaped by you, plus a symposium on citizen science.

The second day has a wide range of interesting papers. Do you want to know how curation is being tackled in some US universities? The implications of Chronopolis or CASPAR? What those Australians are doing in data curation? How to preserve software, or to do emulation a bit better? What metadata might be appropriate for scientific datasets, or how to extract metadata from resources better? What are the information requirements of Life Sciences, or the Arts and Humanities? How to curate a database that’s constantly changing? Then come to Kensington in December!

Nearly forgot to mention the pre-conference workshops, some of which deserve blog posts of their own.