Thursday 12 July 2007

Very long term data

Rothamsted Research is an agricultural research organisation based near Harpenden in England. There are many interesting features of this organisation (only a few of which I know), including its “classical experiments”. One of these, started in 1843, must surely be one of the longest-running experiments with resulting time-series data anywhere. I visited this week, spoke to Chris Rawlings and others handling the data for this “Broadbalk” experiment on wheat yields, and also a couple of scientists working on somewhat younger experiments collecting moths (1930s) and aphids (1960s) on a daily basis (using light traps and vacuum traps respectively).

Digital preservation theory tells us that digital data are at risk from various kinds of changes in the environment. Most often we focus on the media, or the risk of format incompatibility. OAIS rightly asks us to think about contextual metadata of various kinds, but also recognises that there is a risk of semantic drift and/or semantic loss rendering once-clear resources incomprehensible (this is the business about the Designated Community and its Knowledge Base that I wrote about recently). The OAIS model seems to me (though some argue against this view) to be premised on a preservation or perhaps archival view: resources are ingested, preserved within the archive, and then disseminated at a later date. I’ve always had doubts about how easily this fits with a more continuous curation model, where the data are ingested, managed, preserved and disseminated simultaneously.

However, intuitively, even in the curation situation I have described, if “long enough” time passes, many of the concerns of digital preservation will apply. And the Broadbalk wheat yield experiment since 1843 is certainly long enough. In that time there has been semantic drift (words mean different things), changes in units (imperial to metric), changes in plot size/granularity and plot labelling more than once, changes in what is measured (e.g. dropping wet hay weight while keeping dry yield) and how it is measured (the whole plot or a sample), changes in accuracy of measurement, and many changes in treatments. These are very serious changes that could significantly affect interpretation and analysis.
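To make the unit problem concrete, here is a small sketch. It is not based on how Rothamsted actually record things; the field names and values are hypothetical, and it assumes purely for illustration that early yields were entered in hundredweight per acre and later ones in tonnes per hectare. The point is that the unit in force at the time of recording has to travel with each observation:

    # Illustration only: field names, labels and values are hypothetical,
    # not Rothamsted's. Each observation carries the unit it was recorded in,
    # so it can be normalised later without guessing.

    KG_PER_CWT = 50.80234544    # 1 imperial (long) hundredweight in kilograms
    HA_PER_ACRE = 0.40468564    # 1 acre in hectares

    def to_tonnes_per_hectare(value, unit):
        """Normalise a recorded yield to tonnes per hectare."""
        if unit == "t/ha":
            return value
        if unit == "cwt/acre":
            return value * KG_PER_CWT / 1000.0 / HA_PER_ACRE
        raise ValueError(f"Unknown unit: {unit}")

    observations = [
        {"year": 1852, "plot": "2",  "yield": 14.0, "unit": "cwt/acre"},  # hypothetical values
        {"year": 1995, "plot": "02", "yield": 6.2,  "unit": "t/ha"},
    ]

    for obs in observations:
        normalised = to_tonnes_per_hectare(obs["yield"], obs["unit"])
        print(obs["year"], obs["plot"], round(normalised, 2), "t/ha")

Note that the two plot labels (“2” and “02”) hint at the labelling changes mentioned above: converting units is easy, but deciding whether two differently-labelled plots are the “same” series is exactly the kind of interpretive decision that needs to be recorded.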

In the case of the aphids experiment, I saw a log book in which changes in interpretation and the like were meticulously recorded. Unfortunately I don't have a copy of any pages from that book to describe in more detail the kinds of issues it raised. Although weekly aphid bulletins are made available (explicitly not "published"!), the book itself has, I believe, not been published. My guess is that this book forms a critical part of the provenance information by which the quality of the data can be judged.

Of course, the Broadbalk experiment has not been digital since 1843. I didn’t see the records, but the early ones would have been longhand in ledgers or notebooks, and later perhaps typed up. I get the impression it was quite systematic from the beginning, so converting it to an Oracle database in around 1991 was a feasible if major task. Now they are planning to convert it to a new database system and perhaps make the data more widely available, so they are asking questions about how they should better handle these changes.

By the way, in one respect the experiments continue to be decidedly non-digital. Since the beginning, they have been collecting soil samples from the plots, and now have over 10,000 of them! This makes a fascinating physical collection as a reference comparator for the digital records.

I should say that they have done a lot of hard thinking and good work on some of these issues. In particular, they have arranged all their input data into batches (known as “sheets”) which are pretty much self-describing, expressed in what is effectively a purpose-built data description language. This means they can roll back and roll forward their databases using different approaches. These sheets are all prepared using the assumptions of their time (remembering that some were prepared more than a hundred years after the data they contain was first recorded), but in theory there is enough information about these assumptions to make decisions. So once they have decided how to deal with some of these changes, they are able to do the best job possible of implementing it.
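I didn’t see the sheet format itself, so what follows is only a rough sketch of the general idea, with invented field names, labelling-scheme names and values: each sheet carries a header recording the assumptions of its time (units, plot labelling, measurement method, when it was keyed in), and the working database is rebuilt by re-interpreting every sheet under whatever set of decisions is currently preferred.

    # Sketch only: this is not Rothamsted's actual sheet format; all names and
    # values are invented. Each sheet declares the assumptions it was prepared
    # under, so the database can be rolled back and re-derived from scratch.

    sheet_1855 = {
        "header": {
            "prepared": 1992,                      # keyed in long after the observations
            "covers": 1855,
            "units": {"grain_yield": "cwt/acre"},
            "plot_labelling": "pre-1968 scheme",   # hypothetical labelling convention
            "method": "whole-plot weight",
        },
        "rows": [
            {"plot": "2", "grain_yield": 13.5},    # hypothetical values
            {"plot": "3", "grain_yield": 9.0},
        ],
    }

    def roll_forward(sheets, interpret):
        """Rebuild the working database by re-interpreting every sheet with
        whatever interpretation policy is currently preferred."""
        return [interpret(sheet["header"], row)
                for sheet in sheets
                for row in sheet["rows"]]

    def current_policy(header, row):
        """One possible policy: normalise units and map old plot labels to a
        newer scheme (the mapping here is invented for illustration)."""
        value = row["grain_yield"]
        if header["units"]["grain_yield"] == "cwt/acre":
            value *= 0.12553                       # cwt/acre -> t/ha
        return {"year": header["covers"],
                "plot": row["plot"].zfill(2),      # e.g. "2" -> "02"
                "grain_yield_t_ha": round(value, 2)}

    print(roll_forward([sheet_1855], current_policy))

The attraction of the design, as I understand it, is that the sheets themselves are never altered; only the interpretation changes, so earlier decisions can be revisited without losing what was originally entered.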

A Swedish forestry scientist once argued forcefully, in the discussion session to a presentation, that the original observations were sacrosanct, must be kept and should not be changed. Do what you like with the analysis, he suggested: re-run it with varied parameters, re-analyse with new models, (implicitly) make the sorts of changes suggested here. This approach would suggest starting a new time series dataset whenever there is a change of the sort we have been describing. That’s not exactly what has been happening with the Broadbalk experiment, but the sheets approach is, if anything, even finer-grained.

I’m interested to collate experience from other long-running experiments that may have faced these and related issues. So far I have heard of the Harvard Forest project, part of the NSF Long Term Ecological Research (LTER) Network (thanks Raj Bose), and some very long-running birth cohort studies in the social sciences. Further suggestions on comparators and literature to look at would be welcome.

1 comment:

  1. DCB readers interested in finding out more about the long term datasets that Chris described might like to know that an overview of the Rothamsted Long Term experiments has been published recently.

    http://www.rothamsted.bbsrc.ac.uk/resources/LongTermExperiments.pdf

