Two fundamentally different views on data curation

A few months ago, we interviewed two scientists for two quite different posts in the DCC. Both came from a genomics background, and I was very struck by how strongly held, but from my point of view how narrow, their view of curation was. As a result, I’m beginning to realise that there are two fundamentally different approaches to data curation.

The Wikipedia definition of biocurator was quoted by Judith Blake, the keynote speaker [slides] at the 2nd International BioCuration Meeting in San Jose:
“A biocurator is a professional scientist who collects, annotates, and validates information that is disseminated by biological and model organism databases. The role of a biocurator encompasses quality control of primary biological research data intended for publication, extracting and organizing data from original scientific literature, and describing the data with standard annotation protocols and vocabularies that enable powerful queries and biological database inter-operability. Biocurators communicate with researchers to ensure the accuracy of curated information and to foster data exchanges with research laboratories.

Biocurators (also called scientific curators, data curators or annotators) have been recognized as the ‘museum catalogers of the Internet age’ [Bourne & McEntyre]”
In essence, this suggests that curation (let’s call it biocuration for now) is the construction of authoritative annotations linking significant objects (e.g. genes) with evidence about them in the literature and elsewhere. And this is indeed the sort of thing you see in genomics databases.

In other parts of science, I believe curation is not so much about constructing annotations on objects, but rather about caring for arbitrary datasets so that they can be used and re-used, by the originators or others, now or in the future. That care includes adding descriptions and information about how to use the datasets, transferring them to longer term homes, clarifying conditions of use and provenance information and, yes, adding annotations. This approach incorporates elements of digital preservation, but is not solely defined by “long term” in the way that digital preservation tends to be. Clearly we will find cases where we need to curate datasets which themselves result from biocuration!

The problem is that both meanings of the term are well embedded in their respective communities. I don’t think either of my two interviewees had any idea how much broader our concept was than theirs. We need to watch out for the inevitable misunderstandings that can arise from this clash of meanings! However, both communities are likely to continue to use the term in their own ways (although perhaps the term biocurator, relatively new to me, is an effort to disambiguate).

