Thursday, 14 June 2007

Building expertise in digital curation and preservation

It’s curious when something new has been around for 30 years or so. However, that’s the paradoxical case with digital curation. The term itself is relatively new, and there is still (as usual) confusion as to what exactly it means. But taking the simple definition (maintaining and adding value to a trusted body of digital information over the life-cycle of scholarly and scientific materials, for current and future use), it is clear that some disciplines have had organisations doing this for many years. The Social Sciences have had archives pretty much as long as they have had computing facilities (eg UKDA, ICPSR), and there are environmental science archives dating from at least the early years of satellite technology (eg National Satellite Land Remote Sensing Data Archive and NOAA). In both these cases, the unrepeatable nature of the observations has been a major factor.

One interesting aspect of this is that in those and other disciplines, the archives invented what is becoming digital curation, separately and for themselves. There was often communication among the separate archives in a discipline, but not necessarily with other disciplines, even if quite closely related.

Another strange factor is how this isolation of mindset carries on in other guises today. One can occasionally hear discussions on how a data archive is not a repository (eg it doesn’t support the OAI-PMH!). There have been several contemporary groups doing related things in almost complete isolation from each other. However, I think this tendency is diminishing, and we certainly hope to aid this process!

In fact we could recognise at least the following as elements in some sort of repository space, relating to the curation of science and research knowledge:

• Physical repositories, such as archives, libraries and museums. There is a physical-digital boundary, but in some ways it ought to be irrelevant to the information user. Information should certainly be accessible from either domain, and the valuable skills of curating the information should be deployed from both.

• Institutional, subject and discipline repositories based on software supporting the Open Archives Initiative principles and its Protocol for Metadata Harvesting (OAI-PMH). DSpace, FEDORA and EPrints are often seen as the prime technology candidates here (described in a page from the EThOS project), and there are examples in the UK of institutional repositories built on all of them. Subject/discipline repositories tend to be much more DIY, maybe because they mainly pre-date the OAI phase. However, this is where long term preservation repositories linking themselves to the Open Archival Information System standard (OAIS) also belong, as Trusted Digital Repositories (TDRs). There are no archetypical TDRs at this point, although there are some candidates (eg DAITSS and kopal), and a number of suppliers who will build a TDR to requirements.

• Databases, as in the bio-sciences for example. It’s clear that databases will play an increasing role in the future, whether as relational, object, XML, RDF or in some other form. Much scientific data is held in databases (particularly if we are flexible and count a product like Excel as a primitive database).

• Shared document repositories with their supporting infrastructure. SourceForge is probably the best-known example, heavily used in the open source software field. Google Docs may provide a related capability for more “office-like” objects. Wikis, perhaps bulletin board systems (like the DCC Forum), maybe email list archives and to an extent blogs fit somewhere in this space.

• Perhaps file systems (distributed, networked, shared) can fall into this space as well. While this would neglect many important aspects of curation above the bit stream level, a persistent (and therefore redundantly distributed) file store is a requirement for long term curation of data.

• Web space of some kinds can also qualify (wikis fit here as well). How often do we meet the argument that there’s no point in putting some object in a repository “because it’s on my web site”? One of Andy Powell's arguments at the JISC Repositories Conference was that we should make repositories much more like the good Web 2.0 web sites: the ones that the public like to use, find fun, and see value in.

• Finally perhaps we would include data grids, used for some kinds of applications in e-Science and e-Research more generally. These can be massive distributed data systems, where the user need not know “where” the data are; theoretically an economic decision can be taken on the fly as to whether to move the computation to the data, or the data to the computation. SRB and OGSA-DAI (apples and oranges, I know) are the technology candidates here.
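For readers who have not met the OAI-PMH mentioned in the list above, it may help to see how little is involved at the request level: a harvester sends an ordinary HTTP GET with a `verb` parameter and a few arguments, and gets XML back. The following is only a minimal sketch of building such request URLs. The base URL and the helper name `oai_pmh_request` are invented for illustration; `Identify`, `ListRecords` and the `oai_dc` metadata prefix are defined by the protocol itself.

```python
from urllib.parse import urlencode

# Hypothetical base URL, for illustration only; each real repository
# publishes its own OAI-PMH endpoint.
BASE_URL = "https://repository.example.org/oai"


def oai_pmh_request(base_url, verb, **arguments):
    """Build the URL for an OAI-PMH request (sent as a plain HTTP GET).

    This helper is an illustrative assumption, not part of any library.
    """
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)


# Ask the repository to describe itself.
identify_url = oai_pmh_request(BASE_URL, "Identify")

# Ask for all records in unqualified Dublin Core, the one metadata
# format every OAI-PMH repository is required to support.
list_url = oai_pmh_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(list_url)
```

The sketch stops at the URL on purpose. Actually harvesting records would also mean parsing the XML response and following resumption tokens for large result sets, which is where the real work of a harvester lies.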

The point of this list is that each of these “spaces” has its own traditions and associated skill sets, its adherents and detractors, and its own advantages and disadvantages (which will not be the same for all users and cases). It would be helpful if we could abstract the good and problematic qualities from each and inform the others, so that problems can be overcome and good practice spread.

This is clearly a significant role for the Digital Curation Centre. Our activities in assembling resources, building tools and providing events have been working towards this end: identifying and sharing good practice amongst practitioners.



Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.