Sunday 16 December 2007

L'enfer, c'est les metadata des autres...?

Some reflections and connections on IDCC3, where it was said that l’enfer, c’est les metadata des autres… (hell is other people's metadata)
As a venue for the IDCC3 pre-conference reception, the National Museum of the American Indian would be hard to surpass, but on Tuesday evening I found myself reflecting on its relevance to an international conference dedicated to the more contemporary challenges of digital curation – assuming, of course, that relevance was a legitimate preoccupation when we were gathered in the midst of such a compelling array of historical artefacts. The connection soon became obvious as I climbed up through the circuitous galleries, experiencing an almost seamless transition from watercraft to weaponry to kitchenware and costume: the sense of continuity in the culture of the Indian nations, preserved here and made so very accessible, forged the link with our own gathering. Here, for example, were ceremonial dresses from past centuries displayed alongside their counterparts from the present day, and the motifs from artwork used to celebrate births, battles and harvests from as long ago as 400 years were repeated in the modern representations of the cultural tradition. The only recognisable difference was the brightness of the new beads and the tautness of the fresh braiding.
So what had sustained this cohesion of style and story across several epochs? – a powerful sense of a shared identity, for one, an identity strengthened as the nations of the Chesapeake joined together in the face of massive change, which was being forced upon them by the invasion from Europe. And then there was the simple opportunity to preserve and replicate the physical totems that were important to them for practical or spiritual reasons, three-dimensional objects that could be passed from one hand to another, along with the thoughts and words that were carried by an unbroken oral tradition.

So is that all it took? – that and the dedication of the Museum’s curators, of course. The benefits of a very visible cultural record, the devotion of a community with an allegiance to that culture, it all seemed a long way away from the ‘inhuman’ world described by Rhys Francis the following day, when he referred to the scale of the data deluge as being outside all previous human experience, so vast but invisible.

In his own presentation about the Arts & Humanities, later on Day 1, Myron Gutmann made a valuable contribution to the definition of curation when he referred to it as the preparation of data for preservation and re-use. It also marked the difference between what we were discussing and the manner in which the Indian cultural record had survived. There had been no thoughts of preparation there, only a sense of being. Yet while we might surmise that they had observed no pressing need to devise rules and organisation and standards to ensure preservation and re-use, not when everything was all so familiar and tangible, we can celebrate their concern to preserve the metadata – the designs for the war bonnets, their patterns for the rain dance gowns – and this was metadata with a common currency, free from the threat posed in Clifford Lynch’s reworking of Sartre’s definition of hell.

In that closing keynote speech on Day 2, Clifford Lynch exhorted us to be mindful of the several aspects of scale. While huge amounts and many types of data are being generated, we must take care to understand the characteristics and value of data sets, irrespective of their individual scale, for one person’s Gigabyte may be another’s Petabyte, and data should be selected for preservation with reference to criteria other than mere size. Perhaps, he suggested, one option that deserves more serious consideration is re-computation as an alternative to preservation – just keep the metadata, in a repository of course, not hell – although that strategy would assume the enabling software will remain functional. In parallel to this possibility he pointed out that what we hadn’t discussed over the past two days was loss – i.e. what could we afford to lose? It struck me that while such a choice would be a difficult luxury for our own time, it was an even greater imperative for the Indian nations. Reduced and dispersed from their homelands they had already lost much, but there was a feeling that there was yet more that could not be broken.
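Lynch's recompute-instead-of-preserve option is, at bottom, a cost trade-off. A back-of-envelope sketch of that trade-off (my own illustration, not from the talk; the function name and all numbers are hypothetical):

```python
def prefer_recompute(storage_cost_per_year, retention_years,
                     recompute_cost, expected_accesses):
    """Crude illustration of the suggestion above: keep only the metadata
    (and the workflow) and re-derive the data on demand, if the expected
    recomputation bill undercuts the long-term storage bill.

    As noted above, this assumes the enabling software remains runnable.
    """
    total_storage = storage_cost_per_year * retention_years
    total_recompute = recompute_cost * expected_accesses
    return total_recompute < total_storage

# e.g. cheap to recompute, rarely accessed, long retention period:
print(prefer_recompute(storage_cost_per_year=100, retention_years=20,
                       recompute_cost=50, expected_accesses=3))  # prints True
```

Of course the real decision also hangs on factors no formula captures – whether the source data can be regathered at all, and whether anyone will still understand the workflow in twenty years.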

Carole Goble had opened Day 2 of the conference by declaring that in the realm of digital data we do not have a stable identity or environment. She referred to the traditionally selfish culture of scientific research as Me-Science, and pleaded for a move to We-Science, where there would be true collaboration and sharing, with the reinforcement of tribal bonding and the crossing of tribal boundaries. Had she too been reflecting on the history of the North American Indian, the joining together of the Nanticoke, Piscataway and all those other tribes as the wooden ships from Europe set anchor in the Chesapeake? Or was this simply an axiomatic reflection of the way things always shake down in the end?

At this conference we heard a lot about who should be doing what to ensure the embedding of good curation methodology, standards and processes in the world of e-science. At times it seemed overly top down and fraught with the risk of being unworkable on the grounds of insufficient investment or buy-in from scientific practitioners. We also heard about smart new solutions that were bubbling up from within the e-science community, fit for purpose, potentially short term, and with the credentials that they had been put together by the discipline communities themselves, in order to resolve some very real issues that they were facing.

Oh well, that brings me back to the Indians again…

Friday 14 December 2007

Murray-Rust on Digital Curation Conference day 2

Peter Murray-Rust has written two blog posts (here and here; I'm not sure if those are permanent URLs...) about day 2 of the International Digital Curation Conference in Washington DC. Thanks, Peter.

In the first post, he began:
"There is definitely an air of optimism in the conference - we know the tasks are hard and very very diverse but it’s clear that many of them are understood."
He then picked up on Carole Goble's presentation on workflows. Here are a few random extracts from Peter's random jottings (his description):
"The great thing about Carole is she’s honest. Workflows are HARD. They are expensive. There are lots of them. None of them does exactly what you want. And so on. [PMR: We did a lot of work - by our standards - on Taverna but found it wasn’t cost-effective at that stage. Currently we script things and use Java. Someday we shall return.]"
"A collaborative site for workflows. You can go there and find what you want (maybe) and find people to talk to. “- bazaar for workflows, encapsulated objects (EMO) single WFs or collections, chemistry data with blogged log book, encapsulated experimental objects, Open Linked Data linked initiative…"
"Scientists do not collaborate - scientists would rather share a toothbrush [...] than gene names (Mike Ashburner)
who gets the credit? - who is allowed to update? Changing metadata rather than data. Versioning. Have to get credit and reputation managed. Scientists are driven by money, fame, reputation, fear of being left behind"
"Annotations are first class citizens"
His second post covers Jane Hunter and Kwok Cheung's presentation on compound document objects (CDOs):
"Increasing pressure to share and publish data while maintaining competitiveness.
Main problem lack of simple tools for recording, publishing, standards.
What is the incentive to curate and deposit? What granularity? concern for IP and ownership"
"Current problems with traditional systems - little semantic relationship, little provenance, little selectivity, interactivity, flexibility and often fixed rendering and interfaces. No multilevel access: either all open or all restricted
usually hardwired presentation"
"Capture scientific provenance through RDF (and can capture events in physical and digital domain)
Compound Digital Objects - variable semantics, media, etc.
Typed relationships within the CDOs. (this is critical)"
"SCOPE [the tool Jane & Kwok have developed is] a simplified tool for authoring these objects. Can create provenance graphs. Infer types as much as possible. RSS notification. Comes with a graphical provenance explorer."
Thanks again, Peter!

Thursday 13 December 2007

Murray-Rust on Digital Curation Conference day 1

Peter Murray-Rust has blogged on our conference in Washington DC:
Overall impressions - optimistic spirit with some speakers being very adventurous about what we can and should do.
He paid particular attention to the national perspective from Australia:
"…Rhys Francis (Australia) . One of the most engaging presentations. Why support ICT? It will change the world. Systems can make decisions, electronic supply chains. humans cannot keep pace. We do not need people to process information.

Who owns the data, and the copyright = physics says, who cares? If it’s good, copy it. Else discard it. Storage is free. [PMR: let’s try this in chemistry…]

What to do about data is a harder question than how to build experiments

FOUR components to infrastructure. I’ve seen this from OZ before… data, compute, interoperation, access. Must do all 4. -

And he also highlighted the importance of domain knowledge in preference to institutional repositories (I go along with that).
And I thought this was interesting, relating to Tim Hubbard's presentation:
"particular snippet: Rfam (RNA family) is now in Wikipedia as primary resource with some enhancement by community annotation. So we are seeing Wikipedia as the first choice of deposition for areas of scholarship."
So it looks like things are going well!

Wednesday 12 December 2007

DCC Forum: Comment on Conference best paper

There is a Conference theme over on the DCC Forum. However, despite its RSS feed, it doesn't appear to be quite in the blogosphere, so I thought I would post some quotes from it here. Bridget Robinson announced the best paper selection (it gets a star spot in the programme):
"The title of best peer-reviewed paper has been awarded to "Digital Data Practises and the Long Term Ecological Research Programme" by Helena Karasti (Department of Information Processing, University of Oulu in Finland), Karen S Baker (Scripps Institution of Oceanography, University of California, USA) and Katharina Schleidt (Department of Data and Diagnoses, Umweltbundesamt, Austria)."
Angus Whyte commented:
"Can't make it to Washington, but looking forward to seeing this paper. I'm familiar with the 'Enriching the Notion of Data Curation in e-Science' paper by the same authors (plus Eija Halkola) in the journal Computer Supported Cooperative Work last year [*] (doi:10.1007/s10606-006-9023-2); and some of Helena Karasti's previous work to integrate ethnographic and participatory design methods in CSCW. Recently colleagues and I at DCC began case studies for the SCARP project, which I hope can draw lessons from their work to document "the inter-related and continuously changing nature of technology, data care, and science conduct" as they put it."
* Volume 15, Number 4 / August, 2006

If there are any comments here I will re-post a summary back on the Forum!

Anticipating day 1 of the International Digital Curation Conference

Starting in 3 hours or so in Washington DC. I'm quite interested in the various different national approaches being discussed in the morning. Two questions I would like to ask:
  • There is a clash between institutional and subject-oriented approaches (I believe the former have potentially better sustainability, and the latter have potentially better domain scientist data curation skills). The report "Towards the Australian Data Commons: A proposal for an Australian National Data Service" seems to suggest a federated approach, which on the face of it could combine the strengths of both approaches. However, given the clashes of value and approach that so often occur in distributed organisations, how well do the speakers believe such approaches would fare in their culture?
  • At the moment it is important to begin establishing a more comprehensive data infrastructure. But how do the speakers envisage sufficient community value to be created for the ongoing bills to be payable (ie how will sustainability be assured in the longer term)? Note in this context the decision by AHRC to cease funding AHDS...
These questions could also perhaps be on the table in the penultimate session of the day, a discussion on Global Infrastructure development...

Tuesday 11 December 2007

3rd International Digital Curation Conference

The 3rd International Digital Curation Conference is getting underway about now with a pre-conference reception at the National Museum of the American Indian in Washington DC. We are very pleased that it is fully booked, and the Programme Committee have put together an excellent programme. I'm hoping that the presentations will be mounted on the web site very quickly[*], and also that some of those present will be blogging the event (if you do, folks, please tag with IDCC3 so that I can find them).

Yes, as you may have guessed, unfortunately I can't be there either this year, which right now makes me hopping mad (madder than a cut snake, as my erstwhile compatriots used to say).
So have a good one, guys!!!

[*Added later: perhaps slides could be added to Slideshare? Again with tag IDCC3...]

Monday 10 December 2007

Two fundamentally different views on data curation

A few months ago, we interviewed two scientists for two quite different posts in the DCC. Both were from a genomics background, and I was very struck by how strongly held, but from my point of view how narrow, was their view of curation. As a result, I’m beginning to realise that there are two fundamentally different approaches to data curation.

The Wikipedia definition of biocurator was quoted by Judith Blake, the keynote speaker [slides] at the 2nd International BioCuration Meeting in San Jose:
“A biocurator is a professional scientist who collects, annotates, and validates information that is disseminated by biological and model organism databases. The role of a biocurator encompasses quality control of primary biological research data intended for publication, extracting and organizing data from original scientific literature, and describing the data with standard annotation protocols and vocabularies that enable powerful queries and biological database inter-operability. Biocurators communicate with researchers to ensure the accuracy of curated information and to foster data exchanges with research laboratories.

Biocurators (also called scientific curators, data curators or annotators) have been recognized as the ‘museum catalogers of the Internet age’ [Bourne & McEntyre]”
In essence, this suggests that curation (let’s call it biocuration for now) is the construction of authoritative annotations linking significant objects (eg genes) with evidence about them in the literature and elsewhere. And this is indeed the sort of thing you see in genomics databases.

In other parts of science, I believe curation is not so much about constructing annotation on objects, but rather about caring for arbitrary datasets (adding descriptions, information about how to use them, transferring them to longer term homes, clarifying conditions of use and provenance information, but yes including annotations, etc), so that they can be used and re-used, by the originators or others, now or in the future. This approach incorporates elements of digital preservation, but is not solely defined by “long term” in the way that digital preservation tends to be. Clearly we will find cases where we need to curate datasets which themselves result from biocuration!
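To caricature the contrast in code (a toy sketch of my own; every field name and value here is an illustrative invention, not any real schema):

```python
# Biocuration: authoritative annotation linking a significant object
# (e.g. a gene) to evidence about it in the literature.
biocuration_record = {
    "object": "gene:EXAMPLE1",              # hypothetical gene identifier
    "annotations": [
        {"term": "DNA repair",              # standard-vocabulary term
         "evidence": "pmid:0000000"},       # placeholder literature link
    ],
}

# Dataset curation: caring for an arbitrary dataset so it can be
# used and re-used, by the originators or others, now or in the future.
dataset_record = {
    "dataset": "fieldwork-2007.csv",        # placeholder dataset name
    "description": "what the data are and how they were gathered",
    "provenance": ["collected 2007", "cleaned with script v2"],
    "conditions_of_use": "contact the originators",
    "how_to_use": "columns documented in an accompanying README",
    "annotations": [],                      # annotation still has a place here
}
```

The first record is organised around the significant object and its evidence; the second around the dataset and its whole context of use – which is the breadth the rest of this post is arguing for.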

The problem is that both terms are well-embedded in their respective communities. I don’t think either of my two interviewees had any idea how much broader our concept was than theirs. We need to watch out for the inevitable misunderstandings that can arise from this clash of meanings! However, both communities are likely to continue to use the terms in their own ways (although perhaps this term biocurator, relatively new to me, is an effort to disambiguate).

Sunday 9 December 2007

Posting gap… and intriguing article on “borrowed data”

Apologies to those of you who were interested in this blog; there has been only one posting since the end of August. Shortly after that I went off-air after an accident, and since then I have been recovering. Although not yet officially back at work, I hope I have now reached the point where I can start making some postings again.

To alleviate potential boredom, a friend gave me some back issues of New Scientist to read. The first one I opened was the issue for 20 January, 2007. On page 14, the first paragraph of an article titled “Loner stakes claim to gravity prize” jumped out at me. It read:
“A lone researcher working with borrowed data may have pipped a $700 million NASA mission to be the first to measure an obscure subtlety of Einstein’s general theory of relativity.”
The thrust of the New Scientist piece is that this “lone researcher” had re-analysed another scientist’s analysis of the orbit of a NASA satellite of Mars, and claimed to have found evidence of the Lense-Thirring effect, a twisting of space-time near a large rotating mass, ahead of NASA’s own expensive mission.

Borrowed data? This had to be worth following up! The relevant article is “Testing frame-dragging with the Mars Global Surveyor spacecraft in the gravitational field of Mars” by Lorenzo Iorio, available from a preprint server. If you check that reference you may note that 3 versions of the article had been deposited by the cover date of the New Scientist piece, but that he is now up to version 10, deposited on 14 May 2007. Checking the “cited by” citations in SLAC-SPIRES HEP shows that this article has been very controversial, with many comments and replies to comments, apparently contributing to the development of the article (although the authors of these comments don’t get any acknowledgments as far as I could see). Nevertheless, it does look like a nice example of the value of an open approach to science. I don’t know how widely the final article is now accepted, although it does seem now to form part of a book chapter (L. Iorio (ed.) The Measurement of Gravitomagnetism: A Challenging Enterprise, Chap. 12.2, NOVA Publishers, Hauppauge (NY), 2007. ISBN: 1-60021-002-3).

What about the borrowed data? The Acknowledgments section has: “I gratefully thank A. Konopliv, NASA Jet Propulsion Laboratory (JPL), for having kindly provided me with the entire MGS data set”. He does not give an explicit data citation. It appears however, that Iorio is not referring to the publicly accessible science results accessible from JPL. In section 2, he writes:
“In [24] six years of MGS Doppler and range tracking data and three years of Mars Odyssey Doppler and range tracking data were analyzed in order to obtain information about several features of the Mars gravity field summarized in the global solution MGS95J. As a by-product, also the orbit of MGS was determined with great accuracy.”

Reference 24 is “Konopliv A S et al 2006 Icarus 182 23”, which appears to be “A global solution for the Mars static and seasonal gravity, Mars orientation, Phobos and Deimos masses, and Mars ephemeris”, by Alex S. Konopliv, Charles F. Yoder, E. Myles Standish, Dah-Ning Yuan and William L. Sjogren., in Icarus Volume 182, Issue 1, May 2006, Pages 23-50. In turn, this paper acknowledges “Dick Simpson and Boris Semenov provided much of the MGS and Odyssey data used in this paper mostly through the PDS archive”, and this time there is a relevant data citation: “Semenov, B.V., Acton Jr., C.H., Elson, L.S., 2004a. MGS MARS SPICE KERNELS V1.0, MGS-M-SPICE-6-V1.0. NASA Planetary Data System”. However this presumably represents the source data from which Konopliv et al did their calculations.

(BTW I have yet to find a formal place where JPL PDS define their requirements for data citations, but there are a couple of examples available, and the citation above sticks to that pattern.)

It looks like the “borrowed data” is in fact provided on a colleague-to-colleague basis. It is clear from some of the replies that there was correspondence between Iorio and Konopliv on some matters of interpretation.

Nevertheless, this seems like a couple of useful examples of science being discovered from the analysis of original and derived datasets established for another purpose, and hence of relevance to data curation. And who knows, maybe the $700 million Gravity Probe B mission launched by NASA will turn out not to have been necessary after all?

BTW I tried to follow this story through New Scientist’s online service, to which my University has a subscription. I was asked for my ATHENS credentials, which I provided, but was then thrown out on an IP address check, despite using a VPN. How not to get good use of your magazine!