Wednesday 30 May 2007

Supplementary Material in journal publishing

I went to a couple of interesting events in March and April. The first was the eJournal Archiving & Preservation Workshop held at the British Library on 27 March 2007 (sponsored by the Digital Preservation Coalition, BL and JISC). The keynote was Anne Kenney talking about the "Metes and Bounds" report on eJournal preservation, and there were many good talks. But the interesting thing in this context was how often the problem of "supplementary materials" came up in questions. This meant all sorts of things, including multimedia, but in particular it related to associated data. A good example of a journal that specifically aims to exploit the Internet's capabilities in providing access to supplementary materials (rather than just transporting two-column PDFs around for me to print off) is Internet Archaeology. The problem, of course, is that this kind of material falls outside the bounds of things that can be easily preserved with the developing preservation systems.

The next event was an ALPSP workshop on the transformation of research communication, on 13 April at the IMechE in London. This event was mainly aimed at publishers, but again supplementary materials and specifically data came up several times in questions and panel sessions, and this time data also came up in presentations (e.g. Christine Borgman and Simon Coles).

Of course, while having a journal literature that enables connection to underlying data is a pretty neat idea, not everyone is happy with the publishers taking control of that data and adding it to their lock-in of scholarly intellectual property. Generally I agree with these concerns, but we have yet to establish a sustainable alternative model for providing continuing access to these data either. I think this one will run and run!

Tuesday 29 May 2007

Life cycle approach

Digital Curation is maintaining and adding value to a trusted body of digital information over the life cycle of scholarly and scientific materials, for current and future use. It is our belief in the DCC that the curation of digital data requires this whole of life approach. Critical decisions on the curation of data are taken before the data are even created, often at the time the associated project is conceived, or funding is sought. This is not least because curation requires resources that must be allowed for within the work plan. It is increasingly clear that for any project involving data of value, you should provide a data management plan within the project proposal (NSF, 2007).

Digital curation includes good management of data for current purposes, and also in many cases the preservation of those data for the long term. Long term preservation is not necessarily an essential part of curation in all cases, although it is usually a desirable aspect (subject to appraisal and selection decisions). So we can think of curation as having two important components, which we can label “data publication”, for the process of making current data available for use by other contemporaries, and “data preservation”, for the process of making those data available for future users.

Data publication recognises that more and more “reference” works are migrating into the digital domain as curated databases, and that increasingly these are data (or sometimes combinations of data and text) rather than pure text. There are interesting questions on when and how a dataset is “published” such as those raised by Bryan Lawrence in the linked post, but I’ll skip those for now! Such reference datasets can change quite frequently, including the correction or deletion of information as well as the addition of new information. The tension between the requirements for integrity and stability and the need for change to promote accuracy brings special problems. During development of the resource, if you are interested in the long term, you will have to ensure that the contextual and other information needed for preservation is gathered. This may have to happen even before you make firm decisions about preservation!

As your dataset stabilises, and in particular as it comes out of current use, it may be eligible for long term preservation. This is not an automatic choice, as resources are currently spread too thinly to preserve everything. It will be important in any data management plan to identify candidate archives for preservation that serve the appropriate “designated community” (perhaps your scientific discipline).

The locus of preservation remains an issue. Where applicable and stable, a discipline-oriented repository should ensure that selected data are curated for the long term by domain experts (unfortunately the recent disturbing decision by the AHRC to end funding of the AHDS provides a worrying precedent). The choice of repository will be important in deciding in the first case what associated information is required, and later in treating the data at critical stages in the life cycle, such as changes in the technological infrastructure or the designated community.

Should there be no discipline-oriented repository, an institutional or other local repository may be appropriate… but may be quite unprepared to deal with data (on a recent search using OpenDOAR, only 5 institutional repositories in the UK claimed to include data, and some of those claims were false)! Keep at them…

In both cases, ensuring that users in your designated community can easily understand the meaning or information content of the data is critical. The Open Archival Information System (CCSDS, 2002) has useful recommendations for supporting the long term understandability of information, and is applicable to digital curation for that reason, although at the moment we are short of good examples of some of its concepts, and it does require some interpretation!

CCSDS (2002) Reference Model for an Open Archival Information System (OAIS). IN CCSDS (Ed.), NASA.
NSF (2007) Cyberinfrastructure Vision for 21st Century Discovery. Arlington, Virginia, National Science Foundation.

Introduction to the Digital Curation Blog

The Digital Curation Centre runs a Forum for its Associate Network community, based on bulletin board technology. This exercise has been a limited success; although we have many Associate Network members, and some considerable traffic reading the Forum, posting has been neither substantial nor frequent, and overall most posts have been from DCC staff. The Forum has been successful compared with some other such ventures in the JISC community, but its success has not satisfied us. We have tried two approaches to improving things: the first was to simplify the structure of the Forum, and the second was to begin to treat it rather more like a blog.

The latter has some appeal, but (as usual) needs more sustained effort than we have been able to provide. It also became clear that despite an RSS feed, the “Forum as blog” did not integrate well into the blogosphere. Nevertheless we think a blog is the way to go, and so we have decided to trial two blogs. The first is the DCC Blawg, providing “snippet” information about relevant legal developments relating to data and curation. The second is this Curation Blog.

The aim here is to write a series of short pieces exploring aspects of curation and preservation, particularly some of the more puzzling issues, and also reflecting on events we have attended. Initially I will be the author, but we will decide whether to expand it to other DCC authors later.

Among the issues I am interested in writing about are:

- the lifecycle approach to digital curation, leading towards our curation white paper.

- a whole lot of stuff about repositories, and curation implications.

- interesting events, currently including one on preserving e-Journals (where associated data featured strongly), and one on publishing, where the same thing happened. We also had an interesting research workshop on preserving databases.

- The relationship between data and information is a current topic of some interest to me. Some of this relates to the relationship between models and reality, some to the social construction of data, some more concretely to what it means to publish data, some to the relationships between documents and databases. And these thoughts lead on to, or back from…

- OAIS (the Open Archival Information System) and its implications for curation are also interesting. Although it seems to me that OAIS has an “end of life” focus while curation has an “all of life” focus (both of those being shorthand for more complicated thoughts), nevertheless we are convinced that OAIS has value. Key to it is the idea of representation information, which is everything that connects data to information for a designated community. We’re still a bit short of concrete examples for these concepts, despite the wide acceptance in general terms of OAIS.

- I’m also very interested in data citation, and there are many issues to explore here.

- The National Library of Medicine Permanence Ratings are an interesting idea that could do with further exploration.

- Sustainability issues are of continuing concern: both economic (cost and value) and societal, as well as technical.

- And all this without yet mentioning annotation, provenance, workflows, GRIDS, trusted digital repositories, etc, etc!

So there’s plenty to write about. Can we keep it up? So far my biggest concern has been the amount of time required to write a shortish piece with some value, making connections between resources and discussions elsewhere on the web, including a reasonable amount of checking. Half a day at a time can easily be consumed in one piece, and there aren’t enough of those spare. But nevertheless, I’m going to try. Feedback would be nice, from time to time!