Thursday, 20 November 2008

Curation services based on the DCC Curation Lifecycle Model

I’ve had a go at exploring some curation services that might be appropriate at different stages of a research project. I thought it might also be worth trying to explore curation services suggested by the DCC Curation Lifecycle Model, which I also mentioned in a blog post a few weeks ago.

The model claims that it “can be used to plan activities within an organisation or consortium to ensure that all necessary stages are undertaken, each in the correct sequence… enables granular functionality to be mapped against it; to define roles and responsibilities, and build a framework of standards and technologies to implement”. So it does look like an appropriate place to start.

At the core of the model are the data of interest. These should absolutely be determined by the research need, but there are encodings and content standards that improve the re-usability and longevity of your data (an example might be PDF/A, which strips out some of the riskier parts of PDF and which should make the data more usable for longer).

Curation service: collected information on standards for content, encoding etc in support of better data management and curation.
Curation service: advice, consultancy and discussion forums on appropriate formats, encodings, standards etc.

The next few rings of the model include actions appropriate across the full lifecycle. “Description and Representation Information” is maybe a bit of a mouthful, but it does cover an important step, and one that can easily be missed. I do worry that participants in a project in some sense “know too much”. Everyone in my project knows that the sprogometer raw output data is always kept in a directory named after the date it was captured, on the computer named Coral-sea. It’s so obvious we hardly need to record it anywhere. And we all know what format sprogometer data are in, although we may have forgotten that the manufacturer upgraded the firmware last July and all the data from before was encoded slightly differently. But then 2 RAs leave and the PI is laid up for a while, and someone does something different. You get the picture! The action here is to ensure you have good documentation, and to explicitly collect and manage all the necessary contextual and other information (“metadata” in library-speak, “representation information” when thinking of the information needed to understand your data long term).

Curation service: sharable information on formats and encodings (including registries and repositories), plus the standards information above.
Curation service: Guidance on good practice in managing contextual information, metadata, representation information, etc.
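
As a rough illustration of "explicitly collect and manage" the contextual information, here is a minimal Python sketch that writes a small context record into the same directory as the data. Everything in it is my own assumption for the sake of the example: the file name CONTEXT.json, the field names and the helper functions are hypothetical, not something the Lifecycle Model prescribes.

```python
import json
from datetime import date
from pathlib import Path


def build_record(instrument: str, host: str, captured: date, encoding_note: str) -> dict:
    """Collect the things 'everyone knows' into one explicit record.

    The field names are illustrative assumptions, not a metadata standard.
    """
    return {
        "instrument": instrument,
        "host": host,
        "captured": captured.isoformat(),
        "encoding_note": encoding_note,
    }


def write_context_record(data_dir: Path, record: dict) -> Path:
    """Write the record as CONTEXT.json (a hypothetical name) beside the data."""
    out = data_dir / "CONTEXT.json"
    out.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return out
```

A project might call build_record("sprogometer", "Coral-sea", date(2008, 11, 20), "captured after the July firmware upgrade") once per capture directory, so the knowledge survives when the RAs move on.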

Preservation (and curation) planning are necessary throughout the lifecycle. Increasingly you will be required by your research funder to provide a data management plan.

Curation service: Data Management Plan templates
Curation service: Data Management Plan online wizards?
Curation service: consultancy to help you write your Data Management Plan?

Community Watch and Participation is potentially a key function for an organisation standing slightly outside the research activity itself. The NSB Long-lived data report suggested a “community proxy” role, which would certainly be appropriate for domain curation services. There are related activities that can apply in a generic service.

Curation service: Community proxy role. Participate in and support standards and tools development.
Curation service: Technology watch role. Keep an eye on new developments and, critically, on obsolescence that might affect preservation and curation.

Then the Lifecycle model moves outwards to “sequential actions”, in the lifecycle of specific data objects or datasets, I guess. As noted before, curation starts before creation, at the stage called here “conceptualise: Conceive and plan the creation of data, including capture method and storage options”. Issues here perhaps covered by earlier services?

We then move on to “doing it”, called here “Create and Receive”. The issues differ slightly depending on whether you are in a project creating the data, or a repository or data service (or project) getting data from elsewhere. In the former case, we’re back with good practice on contextual information and metadata (see service suggested above); in the latter case, there are all of these, plus conformance to appropriate collecting policies.

Curation service: Guidance on collection policies?

Appraisal is one of those areas that is less well-understood outside the archival world (not least by me!). You’ve maybe heard the stories that archivists (in the traditional, physical world) throw out up to 97% of what they receive (and there’s no evidence they were wrong, ho ho). But we tend to ignore the fact that this is based on objective, documented and well-established criteria. Now we wouldn’t necessarily expect the same rejection fraction to apply to digital data; it might for the emails and ordinary document files for a project, but might not for the datasets. Nevertheless, on reasonable, objective criteria your lovingly-created dataset may not be judged appropriate for archiving, or selected for re-use. My guess is that this would often be for reasons that might have been avoided if different actions had been taken earlier on. Normally in the science world, such decisions are taken on the basis of some sort of peer review.

Curation service: Guidance on approaches to appraisal and selection.

At this point, the Lifecycle model begins to take on a view as from a Repository/Data Centre. So whereas in the Project Life Course view, I suggested a service as “Guidance on data deposit”, here we have Ingest: the transfer of data in.

Curation service: documented guidance, policies or legal requirements.

Any repository or data centre, or any long-lived project, has to undertake actions from time to time to ensure their data remain understandable and usable as technology moves forward. Cherished tools become neglected, obsolescent, obsolete and finally stop working or become unavailable. These actions are called preservation actions. We’ve all done them from time to time (think of the “Save As” menu item as a simple example). We may test afterwards that there has been no obvious corruption (and will often miss something that turns up later). But we’ve usually not done much to ensure that the data remain verifiably authentic. This can be as simple as keeping records of the changes that were made (part of the data’s provenance).

Curation service: Guidance on preservation actions
Curation service: Guidance on maintaining authenticity
Curation service: Guidance on provenance
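
To make "keeping records of the changes that were made" a little more concrete, here is a minimal Python sketch under my own assumptions: it takes a SHA-256 checksum of the file before and after a preservation action and appends one line per action to a plain-text log. The log layout and the function names are hypothetical; real provenance schemes are considerably richer than this.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_action(log_path: Path, source: Path, result: Path, action: str, agent: str) -> dict:
    """Append one provenance entry (as a JSON line) describing a change."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "agent": agent,
        "source": {"name": source.name, "sha256": sha256_of(source)},
        "result": {"name": result.name, "sha256": sha256_of(result)},
    }
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def still_matches(path: Path, expected_sha256: str) -> bool:
    """Check later that a file is bit-for-bit what the log says it was."""
    return sha256_of(path) == expected_sha256
```

A checksum only shows that the bits have not changed since the record was made; it says nothing about whether the "Save As" lost something in the first place, which is why the description of the action matters as much as the hash.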

I noted previously that storing data securely is not trivial. It certainly deserves thoughtful attention. At worst, failure here will destroy your project, and perhaps your career!

Curation service: Briefing paper on keeping your data safe?

Making your data available for re-use by others isn’t the only point of curation (re-use by yourself at a later date is perhaps more significant). You may need to take account of requirements from your funder or publisher or community on providing access. You will of course have to remain aware of legal and ethical restrictions on making data accessible. The Lifecycle Model calls these issues “Access, Use and Reuse”.

Curation service: Guidance on funder, publisher and community expectations and norms for making data available
Curation service: Guidance on legal, ethical and other limitations on making data accessible
Curation service: Suggestions on possible approaches to making data accessible, including data publishing, databases, repositories, data centres, and robust access controls or secure data services for controlled access
Curation service: provision of data publishing, data centre, repository or other services to accept, curate and make available research data.

Combining and transforming data can be key to extracting new knowledge from them. This is perhaps situation-specific, rather than curation-generic?

The Lifecycle model concludes with 3 “occasional actions”. The first of these is the downside of selection: disposal. The key point is “in accordance with documented policies, guidance or legal requirements”. In some cases disposal may be to a repository, or another project. In some cases, the data are, or perhaps must be, destroyed. This can be extremely hard to do, especially if you have been managing your backup procedures without this in mind. Just imagine tracking down every last backup tape, CD-ROM, shared drive, laptop copy, home computer version etc! Sometimes especially secure destruction techniques are required; there are standards and approaches for this. I’m not sure that using my 4lb hammer on the hard drives followed by dumping in the furnace counts! (But it sure could be satisfying.)

Curation Service: Guidance on data disposal and destruction

We also have re-appraisal: “Return data which fails validation procedures for further appraisal and reselection”. While this mostly just points back to appraisal and selection again, it does remind me that there’s nothing in this post so far about validation. I’m not sure whether it’s too specific, but…

Curation service: Guidance on validation
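
For what it's worth, the "return data which fails validation for re-appraisal" step could be sketched like this in Python. This is only an illustration under my own assumptions: the data are taken to be small CSV texts, the only rule checked is that required columns exist and are non-empty, and the function names are hypothetical. Real validation would depend entirely on the data and the discipline.

```python
import csv
import io


def validate_rows(text: str, required_columns: list) -> list:
    """Return a list of problems found in one CSV text; empty means it passed."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in required_columns if c not in header]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
        return problems
    # Data rows start on line 2, after the header line.
    for line_no, row in enumerate(reader, start=2):
        for column in required_columns:
            if not (row.get(column) or "").strip():
                problems.append(f"line {line_no}: empty {column}")
    return problems


def triage(datasets: dict, required_columns: list):
    """Split datasets into those accepted and those sent back for re-appraisal."""
    accepted, reappraise = [], {}
    for name, text in datasets.items():
        problems = validate_rows(text, required_columns)
        if problems:
            reappraise[name] = problems
        else:
            accepted.append(name)
    return accepted, reappraise
```

The point of returning the list of problems, rather than a bare pass/fail, is that whoever re-appraises the dataset can see why it was sent back.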

And finally in the Lifecycle model, we have migration. To be honest, this looks to me as if it should simply be seen as one of the preservation actions already referred to.

So that’s a second look at possible services in support of curation. I’ve also got a “service-provider’s view” that I’ll share later (although it was thought of first).

[PS what's going on with those colours? Our web editor is going to kill me... again! They are fine in the JPEG on my Mac, but horrible when uploaded. Grump.]

1 comment:

  1. Chris, you are right to say that "migration" is one of the preservation actions which might be performed, but it is highlighted separately in the Model because it leads to the creation of a new dataset which has, in turn, to be curated (thus re-entering the circle). Migration aims to make an authoritative copy of the data to preserve it from obsolescence. Good practice would keep both the original data stream and the migrated data. This is distinct from data transformation, which also creates a new set of data which needs to be curated, but may be, e.g., a subset or an annotated set of the original data, serving a different purpose.


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.