Sunday 16 December 2007

L'enfer, c'est les metadata des autres...?

Some reflections and connections on IDCC3, where it was said that l’enfer, c’est les metadata des autres…
As a venue for the IDCC3 pre-conference reception, the National Museum of the American Indian would be hard to surpass, but on Tuesday evening I found myself reflecting on its relevance to an international conference dedicated to the more contemporary challenges of digital curation – assuming, of course, that relevance was a legitimate preoccupation when we were gathered in the midst of such a compelling array of historical artefacts. The connection soon became obvious when, having climbed up through the circuitous galleries and experienced an almost seamless transition from watercraft to weaponry to kitchenware and costume, I felt the sense of continuity in the culture of the Indian nations, preserved here and made so very accessible, forge the link with our own gathering. Here, for example, were ceremonial dresses from past centuries displayed alongside their counterparts from the present day, and the motifs from artwork used to celebrate births, battles and harvests from as long ago as 400 years were repeated in the modern representations of the cultural tradition. The only recognisable difference was the brightness of the new beads and the tautness of the fresh braiding.
So what had sustained this cohesion of style and story across several epochs? – a powerful sense of a shared identity, for one, an identity strengthened as the nations of the Chesapeake joined together in the face of massive change, which was being forced upon them by the invasion from Europe. And then there was the simple opportunity to preserve and replicate the physical totems that were important to them for practical or spiritual reasons, three-dimensional objects that could be passed from one hand to another, along with the thoughts and words that were carried by an unbroken oral tradition.

So is that all it took? – that and the dedication of the Museum’s curators, of course. The benefits of a very visible cultural record, the devotion of a community with an allegiance to that culture – it all seemed a long way from the ‘inhuman’ world described by Rhys Francis the following day, when he referred to the scale of the data deluge as being outside all previous human experience: so vast, yet invisible.

In his own presentation about the Arts & Humanities, later on Day 1, Myron Gutmann made a valuable contribution to the definition of curation when he referred to it as the preparation of data for preservation and re-use. It also marked the difference between what we were discussing and the manner in which the Indian cultural record had survived. There had been no thoughts of preparation there, only a sense of being. Yet while we might surmise that they had observed no pressing need to devise rules and organisation and standards to ensure preservation and re-use, not when everything was all so familiar and tangible, we can celebrate their concern to preserve the metadata – the designs for the war bonnets, their patterns for the rain dance gowns – and this was metadata with a common currency, free from the threat posed in Clifford Lynch’s reworking of Sartre’s definition of hell.

In that closing keynote speech on Day 2, Clifford Lynch exhorted us to be mindful of the several aspects of scale. While there are huge amounts and types of data being generated, we must take care to understand the characteristics and value of data sets irrespective of their individual scale, for one person’s Gigabyte may be another’s Petabyte, and data should be selected for preservation with reference to criteria other than mere size. Perhaps, he suggested, one option that deserves more serious consideration is re-computation as an alternative to preservation – just keep the metadata, in a repository of course, not hell – although that strategy assumes the enabling software will remain functional. In parallel he pointed out that what we hadn’t discussed over the past two days was loss – i.e. what could we afford to lose? It struck me that while that would be a difficult luxury for our own time, it was an even greater imperative for the Indian nations. Reduced and dispersed from their homelands they had already lost much, but there was a feeling that there was yet more that could not be broken.

Carole Goble had opened Day 2 of the conference by declaring that in the realm of digital data we do not have a stable identity or environment. She referred to the traditionally selfish culture of scientific research as Me-Science, and pleaded for a move to We-Science, where there would be true collaboration and sharing, with the reinforcement of tribal bonding and the crossing of tribal boundaries. Had she too been reflecting on the history of the North American Indian, the joining together of the Nanticoke, Piscataway and all those other tribes as the wooden ships from Europe set anchor in the Chesapeake? Or was this simply an axiomatic reflection of the way things always shake down in the end?

At this conference we heard a lot about who should be doing what to ensure the embedding of good curation methodology, standards and processes in the world of e-science. At times it seemed overly top-down and fraught with the risk of being unworkable on the grounds of insufficient investment or buy-in from scientific practitioners. We also heard about smart new solutions bubbling up from within the e-science community: fit for purpose, potentially short term, and with the credential of having been put together by the discipline communities themselves, in order to resolve some very real issues that they were facing.

Oh well, that brings me back to the Indians again…

Friday 14 December 2007

Murray-Rust on Digital Curation Conference day 2

Peter Murray-Rust has written two blog posts (here and here; I'm not sure if those are permanent URLs...) about day 2 of the International Digital Curation Conference in Washington DC. Thanks, Peter.

In the first post, he began:
"There is a definitely an air of optimism in the conference - we know the tasks are hard and very very diverse but it’s clear that many of them are understood."
He then picked up on Carole Goble's presentation on workflows. Here are a few random extracts from Peter's random jottings (his description):
"The great thing about Carole is she’s honest. Workflows are HARD. They are expensive. There are lots of them. Not of them does exactly what you want. And so on. [PMR: We did a lot of work - by our standards - on Taverna but found it wasn’t cost-effective at that stage. Currently we script things and use Java. Someday we shall return.]"
"myExperiment.org. A collaborative site for workflows. You can go there and find what you want (maybe) and find people to talk to. “- bazaar for workflows, encapsulated objects (EMO) single WFs or collections, chemistry data with blogged log book, encapsulatd experimental objects Open Linked Data linked initiative…"
"Scientists do not collaborate - scientists would rather share a toothbrush [...] than gene names (Mike Ashburner)
who gets the credit? - who is allowed to update? Changing metadata rather than data. Versioning. Have to get credit and reputation managed. Scientists are driven by money, fame, reputation, fear of being left behind"
"Annotations are first class citizens"
His second post covers Jane Hunter and Kwok Cheung's presentation on compound document objects (CDOs):
"Increasing pressure to share and publish data while maintaining competitiveness.
Main problem lack of simple tools for recording, publishing, standards.
What is the incentive to curate and deposit? What granularity? concern for IP and ownership"
"Current problems with traditional systems - little semantic relationship, little provenance, little selectivity, interactivity , flexibility and often fixed rendering and interfaces. No multilevel access. either all open or all restricted
usually hardwired presentation"
"Capture scientific provenance through RDF (and can capture events in physical and digital domain)
Compound Digital Objects - variable semantics, media, etc.
Typed relationships within the CDOs. (this is critical)"
"SCOPE [the tool Jane & Kwok have developed is] a simplified tool for authoring these objects. Can create provenance graphs. Infer types as much as possible. RSS notification. Comes with a graphical provenance explorer."
Thanks again, Peter!

Thursday 13 December 2007

Murray-Rust on Digital Curation Conference day 1

Peter Murray-Rust has blogged on our conference in Washington DC:
"Overall impressions - optimistic spirit with some speakers being very adventurous about what we can and should do."
He paid particular attention to the national perspective from Australia:
"…Rhys Francis (Australia) . One of the most engaging presentations. Why support ICT? It will change the world. Systems can make decisions, electronic supply chains. humans cannot keep pace. We do not need people to process information.

Who owns the data, and the copyright = physics says, who cares? If it’s good, copy it. Else discard it. Storage is free. [PMR: let’s try this in chemistry…]

What to do about data is a harder question than how to build experiments

FOUR components to infrastructure. I’ve seen this from OZ before… data, compute, interoperation, access. Must do all 4."

And he also highlighted the importance of domain knowledge in preference to institutional repositories (I go along with that).
And I thought this was interesting, relating to Tim Hubbard's presentation:
"particular snippet: Rfam (RNA family) is now in Wikipedia as primary resource with some enhancement by community annotation. So we are seeing Wikipedia as the first choice of deposition for areas of scholarship."
So it looks like things are going well!

Wednesday 12 December 2007

DCC Forum: Comment on Conference best paper

There is a Conference theme over on the DCC Forum. However, despite its RSS feed, it doesn't appear to be quite in the blogosphere, so I thought I would post some quotes from it here. Bridget Robinson announced the best paper selection (it gets a star spot in the programme):
"The title of best peer-reviewed paper has been awarded to "Digital Data Practises and the Long Term Ecological Research Programme" by Helena Karasti (Department of Information Processing, University of Oulu in Finland), Karen S Baker (Scripps Institution of Oceanography, University of California, USA) and Katharina Schleidt (Department of Data and Diagnoses, Umweltbundesant, Austria)."
Angus Whyte commented:
"Can't make it to Washington, but looking forward to seeing this paper. I'm familiar with the 'Enriching the Notion of Data Curation in e-Science' paper by the same authors (plus Eija Halkola) in the journal Computer Supported Cooperative Work last year [*}. (doi:10.1007/s10606-006-9023-2); and some of Helena Karasti's previous work to integrate ethnographic and participatory design methods in CSCW. Recently colleagues and I at DCC began case studies for the SCARP project, which I hope can draw lessons from their work to document "the inter-related and continuously changing nature of technology, data care, and science conduct" as they put it.
* Volume 15, Number 4 / August, 2006

If there are any comments here I will re-post a summary back on the Forum!

Anticipating day 1 of the International Digital Curation Conference

Starting in 3 hours or so in Washington DC. I'm quite interested in the various different national approaches being discussed in the morning. Two questions I would like to ask:
  • There is a clash between institutional and subject-oriented approaches (I believe the former have potentially better sustainability, and the latter have potentially better domain scientist data curation skills). The report "Towards the Australian Data Commons: A proposal for an Australian National Data Service" seems to suggest a federated approach, which on the face of it could combine the strengths of both approaches. However, given the clashes of value and approach that so often occur in distributed organisations, how well do the speakers believe such approaches would fare in their culture?
  • At the moment it is important to begin establishing a more comprehensive data infrastructure. But how do the speakers envisage sufficient community value to be created for the ongoing bills to be payable (ie how will sustainability be assured in the longer term)? Note in this context the decision by AHRC to cease funding AHDS...
These questions could also perhaps be on the table in the penultimate session of the day, a discussion on Global Infrastructure development...

Tuesday 11 December 2007

3rd International Digital Curation Conference

The 3rd International Digital Curation Conference is getting underway about now with a pre-conference reception at the National Museum of the American Indian in Washington DC. We are very pleased that it is fully booked, and the Programme Committee have put together an excellent programme. I'm hoping that the presentations will be mounted on the web site very quickly[*], and also that some of those present will be blogging the event (if you do, folks, please tag with IDCC3 so that I can find them).

Yes, as you may have guessed, unfortunately I can't be there either this year, which right now makes me hopping mad (madder than a cut snake, as my erstwhile compatriots used to say).
So have a good one, guys!!!

[*Added later: perhaps slides could be added to Slideshare? Again with tag IDCC3...]

Monday 10 December 2007

Two fundamentally different views on data curation

A few months ago, we interviewed two scientists for two quite different posts in the DCC. Both were from a genomics background, and I was very struck by how strongly held, but from my point of view how narrow, was their view of curation. As a result, I’m beginning to realise that there are two fundamentally different approaches to data curation.

The Wikipedia definition of biocurator was quoted by Judith Blake, the keynote speaker [slides] at the 2nd International BioCuration Meeting in San Jose:
“A biocurator is a professional scientist who collects, annotates, and validates information that is disseminated by biological and model organism databases. The role of a biocurator encompasses quality control of primary biological research data intended for publication, extracting and organizing data from original scientific literature, and describing the data with standard annotation protocols and vocabularies that enable powerful queries and biological database inter-operability. Biocurators communicate with researchers to ensure the accuracy of curated information and to foster data exchanges with research laboratories.

Biocurators (also called scientific curators, data curators or annotators) have been recognized as the ‘museum catalogers of the Internet age’ [Bourne & McEntyre]”
In essence, this suggests that curation (let’s call it biocuration for now) is the construction of authoritative annotations linking significant objects (eg genes) with evidence about them in the literature and elsewhere. And this is indeed the sort of thing you see in genomics databases.

In other parts of science, I believe curation is not so much about constructing annotation on objects, but rather about caring for arbitrary datasets (adding descriptions, information about how to use them, transferring them to longer term homes, clarifying conditions of use and provenance information, but yes including annotations, etc), so that they can be used and re-used, by the originators or others, now or in the future. This approach incorporates elements of digital preservation, but is not solely defined by “long term” in the way that digital preservation tends to be. Clearly we will find cases where we need to curate datasets which themselves result from biocuration!
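To make the contrast concrete, here is a deliberately crude sketch of the shape of record each community tends to keep; both records are invented purely for illustration.

```python
# Two invented records, purely to contrast the two senses of "curation".

# Biocuration: an authoritative annotation attached to a significant object (a gene),
# linking it to evidence in the literature (the identifiers below are placeholders).
gene_annotation = {
    "object": "geneX",
    "function": "putative kinase",
    "evidence": ["PMID:0000000"],
    "curated_by": "model organism database team",
}

# Curation in the broader sense: care of an arbitrary dataset so that it can be
# found, understood and re-used later, by the originators or by others.
dataset_record = {
    "title": "Field survey measurements, site A, 2007",
    "description": "what was measured, how, and with which instruments",
    "provenance": "collected by ...; processed with ...",
    "conditions_of_use": "e.g. attribution required",
    "representation_information": "column definitions, units, file format",
    "annotations": [],   # annotations too, but they are one element among many
}
```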

The problem is that both terms are well-embedded in their respective communities. I don’t think either of my two interviewees had any idea how much broader our concept was than theirs. We need to watch out for the inevitable misunderstandings that can arise from this clash of meanings! However, both communities are likely to continue to use the terms in their own ways (although perhaps this term biocurator, relatively new to me, is an effort to disambiguate).

Sunday 9 December 2007

Posting gap… and intriguing article on “borrowed data”

Apologies to those of you who were interested in this blog; there has been only one posting since the end of August. Shortly after that I went off-air after an accident, and since then I have been recovering. Although not yet officially back at work, I hope I have now reached the point where I can start making some postings again.

To alleviate potential boredom, a friend gave me some back issues of New Scientist to read. The first one I opened was the issue for 20 January, 2007. On page 14, the first paragraph of an article titled “Loner stakes claim to gravity prize” jumped out at me. It read:
“A lone researcher working with borrowed data may have pipped a $700 million NASA mission to be the first to measure an obscure subtlety of Einstein’s general theory of relativity.”
The thrust of the New Scientist piece is that this “lone researcher” had re-analysed another scientist’s analysis of the orbit of a NASA satellite around Mars, and claimed to have found evidence of the Lense-Thirring effect, a twisting of space-time near a large rotating mass, ahead of NASA’s own expensive mission.

Borrowed data? This had to be worth following up! The relevant article is “Testing frame-dragging with the Mars Global Surveyor spacecraft in the gravitational field of Mars” by Lorenzo Iorio, available from http://arxiv.org/abs/gr-qc/0701042. If you check that reference you may note that 3 versions of the article had been deposited by the cover date of the New Scientist piece, but that he is now up to version 10, deposited on 14 May 2007. Checking the “cited by” citations in SLAC-SPIRES HEP shows that this article has been very controversial, with many comments and replies to comments, apparently contributing to the development of the article (although the authors of these comments don’t get any acknowledgments as far as I could see). Nevertheless, it does look like a nice example of the value of an open approach to science. I don’t know how much the final article is now accepted, although it does seem now to form part of a book chapter (L. Iorio (ed.) The Measurement of Gravitomagnetism: A Challenging Enterprise, Chap. 12.2, NOVA Publishers, Hauppauge (NY), 2007. ISBN: 1-60021-002-3).

What about the borrowed data? The Acknowledgments section has: “I gratefully thank A. Konopliv, NASA Jet Propulsion Laboratory (JPL), for having kindly provided me with the entire MGS data set”. He does not give an explicit data citation. It appears, however, that Iorio is not referring to the publicly accessible science results available from JPL. In section 2, he writes:
“In [24] six years of MGS Doppler and range tracking data and three years of Mars Odyssey Doppler and range tracking data were analyzed in order to obtain information about several features of the Mars gravity field summarized in the global solution MGS95J. As a by-product, also the orbit of MGS was determined with great accuracy.”

Reference 24 is “Konopliv A S et al 2006 Icarus 182 23”, which appears to be “A global solution for the Mars static and seasonal gravity, Mars orientation, Phobos and Deimos masses, and Mars ephemeris”, by Alex S. Konopliv, Charles F. Yoder, E. Myles Standish, Dah-Ning Yuan and William L. Sjogren, in Icarus Volume 182, Issue 1, May 2006, Pages 23-50. In turn, this paper acknowledges “Dick Simpson and Boris Semenov provided much of the MGS and Odyssey data used in this paper mostly through the PDS archive”, and this time there is a relevant data citation: “Semenov, B.V., Acton Jr., C.H., Elson, L.S., 2004a. MGS MARS SPICE KERNELS V1.0, MGS-M-SPICE-6-V1.0. NASA Planetary Data System”. However this presumably represents the source data from which Konopliv et al did their calculations.

(BTW I have yet to find a formal place where JPL PDS define their requirements for data citations, but there are a couple of examples in http://pds.jpl.nasa.gov/documents/pag/MERDSC.doc, and the citation above sticks to that guideline.)

It looks like the “borrowed data” is in fact provided on a colleague-to-colleague basis. It is clear from some of the replies that there was correspondence between Iorio and Konopliv on some matters of interpretation.

Nevertheless, this seems like a couple of useful examples of science being discovered from the analysis of original and derived datasets established for another purpose, and hence of relevance to data curation. And who knows, maybe the $700 million Gravity Probe mission launched by NASA will turn out not to have been necessary after all?

BTW I tried to follow this story through New Scientist’s online service, to which my University has a subscription. I was asked for my ATHENS credentials, which I provided, but was then thrown out on an IP address check, despite using a VPN. How not to get good use of your magazine!

Wednesday 31 October 2007

Digital Curation Centre/London eScience Centre Collaborative Workshop

A post from Graham Pryor (DCC eScience Liaison)

The format of collaborative workshops having been set before I picked up the DCC’s eScience Liaison mantle in July this year, I was of course eager to match the success of previous events. My first, with the London eScience Centre (LeSC), was scheduled for 16th October, and in the spirit of exchanging knowledge and expertise, a clutch of presentations and demonstrations by DCC staff was assembled that could be expected to showcase the range of activities and services in which we are engaged. But how would they meld with contributions from the LeSC, and how appropriate would the ‘collaboration’ be?

I had no cause for concern. The LeSC, now transformed in name and context into the Imperial College Internet Centre, had welcomed this event, producing a spectrum of topics and speakers ranging from an exposition of the College’s ICT and Library collaboration on digital repositories to Henry Rzepa’s demonstration of a multifaceted digital repository for chemistry and Omer Casher’s ‘Semantic Eye’. Opening the workshop, the Internet Centre’s Director, John Darlington, described its new remit as embracing the gamut of Internet-enabled research and development, a significantly broader mission than that suggested by the superseded title of London eScience Centre. In fact, the agenda had been evolving up to the last minute, emerging as a very full and lengthy event that overturned the format of previous workshops by displacing the concluding plenary discussion; but with questions allowed after each presentation we perhaps enjoyed a greater sense of immediacy and relevance in the lively debates that ensued. (Presentations from the workshop are to be made available at the Internet Centre’s Web pages.)

For the DCC, these discussions proved to have particular relevance to the compelling subject of the project’s sustainability, as we found ourselves dealing with questions regarding the range and level of service provision that could be expected from the DCC – e.g. ‘when a project is at start-up can the investigators rely on the provision of an advisory service covering legal and other core data curation issues?’ – which very much reflects the shift in emphasis of DCC Phase 2.

Closing the workshop, John Darlington judged the event to have been a success, having brought together not only the Internet Centre and the DCC, but also colleagues from within Imperial College who might otherwise not have an opportunity to discuss areas of shared interest. His perception was that the workshop had fully supported the rationale of the Internet Centre, and he was enthusiastic to maintain the impetus for the exchange of knowledge that it had provided. Upon reflection, that is another box to be ticked on the account of DCC service provision, but the larger question – how should we focus our portfolio of future services and support? – is one that I took back with me.

Friday 31 August 2007

Archiving Service

The other day I had an interesting discussion with a group of IT facilities management guys here at Edinburgh, who have been asked to write the requirements for a new version of their Archive Service, capable of handling modern requirements for huge amounts of data. I hope to remain involved in what they are planning (which is a long way from getting funded, I guess). They had some good ideas, with thinking perhaps derived in part from software management systems: checking data in and out, managing versions, etc.

We spoke a bit about OAIS, and about the issues of making data understandable for the long term. They were somewhat reluctant to go there, not surprisingly; the general IT attitude to "archiving" in the past has tended to be something like: "You give me your blob of data, and I'll give you back an identical blob of data at some future time". IT people are pretty comfortable with that approach (although I also mentioned some emerging issues such as those mentioned by David Rosenthal in his blog post about keeping very large data for very long times).
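That "identical blob back" guarantee is at least easy to state precisely: fix a checksum at deposit and verify it again at retrieval. A minimal sketch (the layout and names are invented for illustration):

```python
# Minimal sketch of the "blob in, identical blob out" guarantee:
# record a checksum at deposit time and verify it on the way back out.
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def deposit(path: Path, register: dict) -> None:
    register[path.name] = checksum(path)       # remember exactly what we were given

def retrieve(path: Path, register: dict) -> Path:
    if checksum(path) != register[path.name]:  # bit-level identity, and nothing more
        raise IOError(f"{path.name} is no longer the blob that was deposited")
    return path
```

Of course this says nothing about whether the blob will still be understandable when it comes back, which is where OAIS and Representation Information come in.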

We also discussed the need to make the data accessible from outside the University, to allow it to be linked from publications. This too was a bit outside their previous remit.

It struck me afterwards how much the conversation was affected by the IT services point of view (I'm not trying to denigrate these individuals at all; as far as I can see they took it all on board). I have talked with a lot of people about "digital preservation" or "repositories"; in most of those conversations, the issues about managing the bits themselves are assumed to be pretty much a solved problem, and the conversation takes different forms, about metadata, about formats, about representation information, about emulation versus migration, and so on. The participants tend to be linked to the library, or to some subject area, but rarely from IT.

I wondered if it was possible to imagine the IT guys providing a secure substrate on which a repository service sits, with independence between the two; that would allow everyone to stay in their comfort zone. I don't think OAIS would quite work in those terms, and I think there would be dependencies both ways, although I'll have to think more about this. The only counter example I have heard of is a rumour of Fedora providing high level services based on SRB as a secure storage substrate, although I don't have a reference.

What was also interesting was how few examples we could think of, in universities, of large-scale, long-term archiving systems for records and data, leaving aside the publication-oriented repository movement.

Can any readers give us pointers?

Wednesday 29 August 2007

IJDC again

At the end of July I reported on the second issue of the International Journal of Digital Curation (IJDC), and asked some questions:
"We are aware, by the way, that there is a slight problem with our journal in a presentational sense. Take the article by Graham Pryor, for instance: it contains various representations of survey results presented as bar charts, etc in a PDF file (and we know what some people think about PDF and hamburgers). Unfortunately, the data underlying these charts are not accessible!

"For various reasons, the platform we are using is an early version of the OJS system from the Public Knowledge Project. It's pretty clunky and limiting, and does tend to restrict what we can do. Now that release 2 is out of the way, we will be experimenting with later versions, with an aim to including supplementary data (attached? External?) or embedded data (RDFa? Microformats?) in the future. Our aim is to practice what we may preach, but we aren't there yet."

I didn't get any responses, but over on the eFoundations blog, Andy Powell was taking us to task for only offering PDF:
"Odd though, for a journal that is only ever (as far as I know) intended to be published online, to offer the articles using PDF rather than HTML. Doing so prevents any use of lightweight 'semantic' markup within the articles, such as microformats, and tends to make re-use of the content less easy."
His blog is more widely read than this one, and he attracted 11 comments! The gist of them was that PDF plus HTML (or preferably XML) was the minimum that we should be offering. For example, Chris Leonard [update, not Tom Wilson! See end comments] wrote:
"People like to read printed-out pdfs (over 90% of accesses to the fulltext are of the pdf version) - but machines like to read marked-up text. We also make the xml versions availble for precisely this purpose."
Cornelius Puschmann [update, not Peter Sefton] wrote:
"Yeah, but if you really want semantic markup why not do it right and use XML? The problematic thing with OJS (at least to some extent) is/was that XML article versions are not the basis for the "derived" PDF and HTML, which deal almost purely with visuals. XML is true semantic markup and therefore the best way to store articles in the long term (who knows what formats we'll have 20 years from now?). HTML can clearly never fill that role - it's not its job either. From what I've heard OJS will implement XML (and through it neat things such as OpenOffice editing of articles while they're in the workflow) via Lemon8 in the future."
Bruce D'Arcus [update, not Jeff] says:
"As an academic, I prefer the XHTML + PDF option myself. There are times I just want to quickly view an article in a browser without the hassle of PDF. There are other times I want to print it and read it "on the train."

"With new developments like microformats and RDFa, I'd really like to see a time soon where I can even copy-and-paste content from HTML articles into my manuscripts and have the citation metadata travel with it."
Jeff [update, not Cornelius Puschmann] wrote:
"I was just checking through some OJS-based journals and noticed that several of them are only in PDF. Hmmm, but a few are in HTML and PDF. It has been a couple of years since I've examined OJS but it seems that OJS provides the tools to generate both HTML and PDF, no? Ironically, I was going to do a quick check of the OJS documentation but found that it's mostly only in PDF!

"I suspect if a journal decides not to provide HTML then it has some perceived limitations with HTML. Often, for scholarly journals, that revolves around the lack of pagination. I noticed one OJS-based journal using paragraph numbering but some editors just don't like that and insist on page numbers for citations. Hence, I would be that's why they chose PDF only."
I think in this case we used only PDF because that was all our (old) version of the OJS platform allowed. I certainly wanted HTML as well. As I said before, we're looking into that, and hope to move to a newer version of the platform soon. I'm not sure it has been an issue, but I believe HTML can be tricky for some kinds of articles (Maths used to be a real difficulty, but maybe they've fixed that now).

I think my preference is for XHTML plus PDF, with the authoritative source article in XML. I guess the workflow should be author-source -> XML -> XHTML plus PDF, where author-source is most likely to be MS Word or LaTeX... Perhaps in the NLM DTD (that seems to be the one people are converging towards, and it's the one adopted by a couple of long term archiving platforms)?
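Sketching that workflow in code makes the shape clearer, though the converter commands below are pure placeholders for whatever Word/LaTeX-to-NLM and NLM-to-XHTML/PDF tools a platform actually provides; the point is simply that the XML is the single authoritative source and both renditions are derived from it.

```python
# Sketch of the intended workflow: author source -> NLM-style XML -> XHTML + PDF.
# The converter names are invented stand-ins, not real tools.
import subprocess

PIPELINE = [
    # (output, placeholder command)
    ("article.xml",   ["wordprocessor-to-nlm", "article.doc", "article.xml"]),   # authoritative version
    ("article.xhtml", ["nlm-to-xhtml", "article.xml", "article.xhtml"]),         # derived, for reading online
    ("article.pdf",   ["nlm-to-pdf", "article.xml", "article.pdf"]),             # derived, for print
]

for output, command in PIPELINE:
    # Regenerate the XML first, then both renditions, whenever the source changes.
    subprocess.run(command, check=True)
```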

But I'm STILL looking for more concrete ideas on how we should co-present data with our articles!

[Update: Peter Sefton pointed out to me in a comment that I had wrongly attributed a quote to him (and by extension, to everyone); the names being below rather than above the comments in Andy's article. My apologies for such a basic error, which also explains why I had such difficulty finding the blog that Peter's actual comment mentions; I was looking in someone else's blog! I have corrected the names above.

In fact Peter's blog entry is very interesting; he mentions the ICE-RS project, which aims to provide a workflow that will generate both PDF and HTML, and also bemoans how inhospitable most repository software is to HTML. He writes:
"It would help for the Open Access community and repository software publishers to help drive the adoption of HTML by making OA repositories first-class web citizens. Why isn't it easy to put HTML into Eprints, DSpace, VITAL and Fez?

"To do our bit, we're planning to integrate ICE with Eprints, DSpace and Fedora later this year building on the outcomes from the SWORD project – when that's done I'll update my papers in the USQ repository, over the Atom Publishing Protocol interface that SWORD is developing."
So thanks again Peter for bringing this basic error to my attention, apologies to you and others I originally mis-quoted, and I look forward to the results of your efforts! End Update]

OAIS review: what's happening?

In June 2006, there was an announcement:
In compliance with ISO and CCSDS procedures, a standard must be reviewed every five years and a determination made to reaffirm, modify, or withdraw the existing standard. The “Reference Model for an Open Archival Information System (OAIS)” standard was approved as CCSDS 650.0-B-1 in January 2002 and was approved as ISO standard 14721 in 2003. While the standard can be reaffirmed given its wide usage, it may also be appropriate to begin a revision process. Our view is that any revision must remain backward compatible with regard to major terminology and concepts. Further, we do not plan to expand the general level of detail. A particular interest is to reduce ambiguities and to fill in any missing or weak concepts. To this end, a comment period has been established.
Comments were required by 30 October 2006. The Digital Preservation Coalition and the Digital Curation Centre ran a joint workshop on 13 October in Edinburgh, and as a result submitted joint comments. Some of these comments were minor but some were quite significant. There are 14 general recommendations, and many detailed updates and clarifications were suggested.

Supporters of OAIS (and I am one) often make great play of how open its process is. In that spirit, I have tried to find out what is happening, and to take part. I understood there was to be an open process, with a wiki and telecons, to decide in the first place whether the standard needs revision, and if so to revise it. Despite numerous attempts, I cannot find out what the current state is, or where or how this is taking place. Does anyone know?

This is a very important standard, and it needs revision to make it more useful in today's environment. We need to make it as good as we can.

Wednesday 22 August 2007

Proportion of research output as data or publication?

My colleague Graham Pryor asks in an email:
"Chris - I have been looking for evidence of the proportion of UK research output that can be categorised as scholarly publications and that which is in the form of data. I have found nothing. It is quite possible that no-one has ever tried to work this out. However, on the off-chance, is this a figure (even an estimate) that you might have come across?"
I think this is a great question, even an "Emperor's New Clothes" question. "Everyone knows" that the data proportion has been increasing, but I know of no estimates of the proportions.

Does anyone else know of such estimates? If not, does anyone have any idea how to set about making such an estimate? A very clear definition of data would be needed, to include externally usable data perhaps, rather than raw telemetry...

(Apologies for the gap in posting, I have been sequestered on a Greek island for a couple of weeks, and not thinking of you at all!)

Tuesday 31 July 2007

IJDC Issue 2

I am very happy that Issue 2 of the International Journal of Digital Curation (IJDC) is now out. This open access journal issue contains 7 peer-reviewed papers and 7 general articles.

I am listed as editor, and did write the editorial, but Richard Waller of UKOLN did all the hard editorial work, in between his day job on Ariadne. Not to mention our authors, of course! My heartfelt thanks to all concerned.

We have a few articles in the pipeline for Issue 3, but I do want to get up a good head of steam. So if you, dear reader, care about digital curation, the care and feeding of science data, the relationship of data to publication, the rights and wrongs of access to data, and/or digital preservation, and have some research results to put forward, please do write us a paper or an article!

We are aware, by the way, that there is a slight problem with our journal in a presentational sense. Take the article by Graham Pryor, for instance: it contains various representations of survey results presented as bar charts, etc in a PDF file (and we know what some people think about PDF and hamburgers). Unfortunately, the data underlying these charts are not accessible!

For various reasons, the platform we are using is an early version of the OJS system from the Public Knowledge Project. It's pretty clunky and limiting, and does tend to restrict what we can do. Now that release 2 is out of the way, we will be experimenting with later versions, with an aim to including supplementary data (attached? External?) or embedded data (RDFa? Microformats?) in the future. Our aim is to practice what we may preach, but we aren't there yet.

If anyone knows how to do this, do please get in touch!

Wednesday 25 July 2007

Digital Curation Conference: Remember the Call for Papers

There is less than 3 weeks to go before the Call for Papers for the 3rd International Digital Curation Conference closes on 15 August 2007. I do want to encourage researchers to submit research papers on digital curation to this conference. We had a good set of papers last year in Glasgow, and we are hoping for an even stronger field this year, when the Conference will be held in Washington DC. We are trying to get funding to support some travel, especially for younger speakers.

Tuesday 24 July 2007

Question on approaches to curating textual material

Dave Thompson, Digital Curator at the Wellcome Library asked a question on the Digital Preservation list (which is not well set up for discussion just now). I've replied, but we agreed I would adapt my reply for the blog for any further discussion that might emerge.
"I'm looking for arguments for and against when, and if, digital material should be normalised. I'm thinking about the long term management of textual material in proprietary formats such as MS Word. I see three basic approaches on which I'm seeking the lists comments and thoughts.

The first approach normalises textual material at the point of ingestion, converting all incoming material to a neutral format such as XML immediately. This would create an open format manifestation with the aim of long term sustainable management.

The second approach would be one of 'wait and see', characterised by recognising that if a particular format isn't immediately 'at risk' of obsolescence why touch it until some form of migration becomes necessary at some future point.

The third approach preserves the bitstream as acquired and delivers it in an unmodified form upon request, ie MS Word in – MS Word out.

The first approach requires tools, resources and investment immediately. The second requires these same resources, and possibly more, in the future. The future requirements for the third approach are perhaps unknown aside from that of adequate technical metadata.

I'm interested in ideas about the sustainability of these approaches, the costs of one approach over the other and the perceived risks of moving material to an open format sooner rather than later. I'd be very interested in examples of projects which have taken either approach."
Dave, the questions you ask have been rumbling on for years. The answers, reasonably enough, keep changing: partly depending on who asks and who answers, but also on the time and the context. So that's a lot of help, isn't it?

You might want to look at a posting in David Rosenthal's blog on Format Obsolescence as the Prostate Cancer of Preservation (for younger and/or non-male curators, the reference is that many more men die WITH prostate cancer than die because of it.) Lots of food for thought there, and some of the same themes I was addressing in my Ariadne article a year or so ago.

The simplest answer to your question is "it depends". If you've got lots of money, and given the state of flux right now in the word processing market, I would suggest doing both (1) and (3): that is make sure you preserve your ingested bits un-changed, but also create a "normalised" copy in your favourite open format.

What format should that be? Well for Word at the moment it might be sticky. PDF (strictly PDF/A if we're into preservation) might be appropriate. However as far as ever extracting useful science from the document is concerned, the PDF is a hamburger (as Peter Murray-Rust says; he reports Mike Kay as the origin: "Converting PDF to XML is a bit like converting hamburgers into cow"). PDF is useful where you want to treat something exactly as page images; it is probably much less useful for documents like spreadsheets (where the formulae are important).

Open Document Format is an international standard (ISO/IEC 26300:2006) supported by Open Source code with a substantial user and developer base, so its long term sustainability should be pretty strong. I've heard that there can be glitches in the conversions, but I have no experience (the Mac does not seem to be quite so well served). Office Open XML has been ratified by ECMA, and is moving (haltingly?) towards an ISO standard. Presumably its conversion process will be excellent, but I don't know of much open source code base yet. However the user base is enormous, and MS seems to be getting some messages from its users about sustainability. Nah, right now I would guess ODF wins for preservation.

It may not apply in this case, but often there is a trade-off between the extent of the work you do to ensure preservation (and the complexity and cost of that work), and the amount of stuff you can preserve. Your budget is finite, right? You can't spend the money twice. So if you over-engineer your preservation process you will preserve less stuff. The longevity of the stuff in AHDS, it turns out, was affected much more by a policy change than by any of the excellent work they did preserving it. You need to do a risk analysis to work out what to do (which is not quite the same as a crystal ball; few would have seen the AHRC policy change coming!).

It's also probably true that half or more of the stuff you preserve will not be accessed for a very long time, if ever. Trouble is (as the captains of industry are reported to say about the usefulness of their marketing budgets, or librarians about their acquisitions) you don't know in advance which half.

Greg Janee of the NDIIPP NGDA project gave a presentation at the DCC (PPT) a couple of years ago, in which he introduced Greg's equation:
Item is worth preserving for time duration T if:
(intrinsic value) * ProbT(usage) > SumT(preservation costs) + (cost to use)
... ie given a low probability of usage in time T, preservation has to be very cheap!
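To see how unforgiving that inequality is, here is a toy illustration; the numbers and the helper function are invented, not taken from Greg's presentation.

```python
# A toy reading of Greg's equation: keep an item for duration T only if the expected
# value of future use outweighs the cost of preserving it plus the cost of using it.
# All figures are invented for illustration.

def worth_preserving(intrinsic_value, prob_usage, preservation_costs, cost_to_use):
    return intrinsic_value * prob_usage > preservation_costs + cost_to_use

# A dataset "worth" 10000 (in whatever units), a 1% chance of re-use over the period,
# 20 years of storage at 50 per year, and 30 to make use of it:
print(worth_preserving(10000, 0.01, 50 * 20, 30))   # False: 100 is nowhere near 1030
# The same dataset only clears the bar once preservation is nearly free (3 per year):
print(worth_preserving(10000, 0.01, 3 * 20, 30))    # True: 100 > 90
```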

What I'm arguing for is not putting too much of the cost onto ingest, but leaving as much as is reasonable to the eventual end user. After all, YOU pay the ingest cost. Strangely, so, in a way, does the potential end user whose stuff was not preserved because you spent too much on ingest. You do need to do enough to make sure that end use is feasible, and indeed appropriate in relation to comparator archives (you don't want to be the least-used archive in the world). You also must include, in some sense or other, the Representation Information to make end use possible.

But you don't have to constantly migrate your content to current formats to make it point-and-click available; in fact it may be a disservice to your users to do so. Migration on request has always seemed to me a sensible approach (I think it was first demonstrated by Mellor, Wheatley & Sergeant (Mellor 2002 *) from the CAMiLEON project based on earlier work in the CEDARS project, but also demonstrated by LOCKSS). This seems pretty much your second approach; you just have to ensure you retain a tool that will run in a current (future) environment, able to migrate the information. Unless you have control of the tool, this might suddenly get hard (when the tool vendor drops support for older formats).

I've often thought, for this sort of file type, that something like the OpenOffice.org suite might be the right base for a migration tool. After all, someone's already written the output stage and will keep it up to date. And many input filters have also already been written. If you're missing one, then form a community and write it, and presto, the world has support for another defunct word processor format (yeah I know, it's not quite that easy!).
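As a minimal sketch of how migration on request might look in practice, assuming an OpenOffice/LibreOffice-style command-line converter (the "soffice --headless --convert-to" invocation is one such tool; the paths and identifiers below are invented, and you would substitute whatever converter you actually trust):

```python
# Minimal sketch of migration on request: the ingested bitstream is never touched;
# a copy in an open format is produced only when somebody asks for one.
# Assumes an OpenOffice/LibreOffice-style CLI ("soffice --headless --convert-to") on the PATH.
import subprocess
from pathlib import Path

ARCHIVE = Path("archive")   # originals, kept byte-for-byte as ingested
CACHE = Path("derived")     # disposable conversions, re-creatable at any time

def retrieve(item_id: str, fmt: str = "doc") -> Path:
    original = ARCHIVE / f"{item_id}.doc"
    if fmt == "doc":                       # option 3: MS Word in, MS Word out
        return original
    CACHE.mkdir(exist_ok=True)
    derived = CACHE / f"{item_id}.{fmt}"
    if not derived.exists():               # option 2: migrate only when requested
        subprocess.run(
            ["soffice", "--headless", "--convert-to", fmt,
             "--outdir", str(CACHE), str(original)],
            check=True,
        )
    return derived

# retrieve("deposit-0042", "odt") hands back an ODF copy, converting it on first request;
# retrieve("deposit-0042") hands back the untouched original.
```

The catch, as noted above, is keeping that converter runnable in a future environment.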

I was going to argue against your option 3 (although it's what most repositories do just now). But I think I've talked myself round to it being a reasonable possibility. I would add a watching brief, though: you might decide at some point that the stuff was getting too high risk, and that some kind of migration tool should be provided (in which case you're back to option 2, really).

I get annoyed when I hear people say (what I probably also used to say) that institutional repositories are not for preservation. It's like Not-for-Profit companies; they may not be for profit, but they'd better not be for loss (I used to be on the Board of two). Repositories are not for loss. They keep stuff. Cheaply. And to date, as far as I can see, quite as well as expensive preservation services!

* MELLOR, P., WHEATLEY, P. & SERGEANT, D. (2002) Migration on Request, a Practical Technique for Preservation. Research and Advanced Technology for Digital Libraries: 6th European Conference, ECDL 2002, Rome, Italy, September 16-18, 2002. Proceedings.

Friday 20 July 2007

Subject "versus" institutional repositories

There's a concept in maths called "closed but unbounded". I'm not sure it's exactly to the point (I hope that's a pun), but "subjects" seem a bit like that. You can be pretty sure about most of the stuff that's not in a subject (or "domain"), and most of the stuff that is in it, but you can be very puzzled about some of the edges, and can find yourself in some extremely surprising discussions at times about parts of subjects that challenge most of the ideas you had. So subjects turn out to be very un-bounded. (They also tend to fracture, productively.) Perhaps not surprisingly, subjects don't tend to have assets, bank balances, etc. You might say, in those senses, subjects don't exist! They do nevertheless have very real approaches, common standards, ontologies, methods, vocabularies, literatures... and passionate adherents spread across institutions.

Institutions on the other hand, or at least universities, tend to be very material. They do have assets, bank balances, policies, libraries, employees, continuity on a significant scale, even (in the US at least) endowments. They have temporal stability and mass. They collect scholars and scientists in various domains... even if the scientists give their loyalty to their subjects, and are held together only by salaries and a common loathing of the university car parking policy!

Institutions have continuity, and they have libraries, and archives, which in serious ways express that continuity. Libraries are not about print. Libraries are now squarely about knowledge and information expressed in data, whether they know it or not. And the continuity of valuable data is an important reason for libraries to be involved.

But institutions are generic, and libraries are generic, even in more focused institutions like MIT. The library, the archive, the IR, in different ways, are about collecting elements of the scholarly discourse that contribute both globally and locally. So institutional repositories are about generic continuity of data, as libraries are about continuity of collections. IRs create value for the institution, even if it is only a small piece of value (like most other individual "collections" in an institution). If you don't play, you aren't in the game. You know data has value, just not which bits. You need to disclose your scholarly assets, across the spectrum; you can feel proud of doing so, and make a case for local benefit at the same time. You are an institution taking part in a global system; the value may be in the network, but you are part of that network.

But the way an IR treats data is necessarily generic; if you get data from chemistry, engineering, social sciences and performing arts into this under-funded but potentially valuable repository, you will do your best but it will necessarily be variants of generic practice, at best.

So back to the "subject"; if there is a data repository here, it is likely staffed by "domain experts", capable of taking on a "community proxy" role. They know their stuff. They will treat their data in domain-specific ways; they will know where to seek out data to complement their collection, they will know how to make connections between different parts. They can describe it appropriately, they can develop standards with their colleagues. They will know how to help their colleague scientists extract maximum value. Some subject repository managers are seriously concerned about the problems for disciplines if institutional repositories expand into the data "space".

What subject repositories don't usually have is what institutions have: substantial assets, endowments, bank balances, tenured staff. Usually built around multiple project grants, they find that five-year core funding is a prized goal, won at the price of cheese-paring budgets and mid-term reviews every second year. Subject repositories don't have assured continuity, temporal mass.

The NSB LLDDC (Long-lived digital data collections) report, and now the NSF CyberInfrastructure strategy, are aimed at this area; they have spotted the fragility of these subject data collections. In the UK we have possibly even more of a patchwork of funding mechanisms than was observed in the LLDDC report. JISC used to be a significant funder of subject repositories, but in recent years has been retrenching from them, while building up massive funding in IRs. AHRC, as we have seen, is pulling back from funding the AHDS.

So what would make this better? I'd like to see a substantive discussion about the roles and funding mechanisms of subject and institutional repositories. In the UK, this would have to involve at least the Research Councils, Wellcome Trust and JISC. (Perhaps looks less likely than it did when I first wrote this.)

Secondly, I'd like to see JISC in the final tranche of its capital funding (here's the circular recently closed) explore the bounds of what's possible with the data provider/ service provider combination (maybe OAI/ORE will address this a little? Maybe not!). And what if curation is detached from the repository? What if data continuity/preservation is separated from the curation service? Do these questions even make sense?

Maybe a system or federation of sustainable IRs internally divided into sets on subject lines (and hence externally aggregatable along those lines), with subject-oriented curation activities picking up on "invisible college" volunteerism might work? Splitting curation into generic and domain elements... Or other notions, pushing the skill out into the network, the federation, but retaining the data where the assets and continuity lie?

[This posting is based on an email I sent to a closed JISC Repositories advisory group some time ago; it seems even more relevant today...]

Open Data Licensing: is your data safe?

Over on the Nodalities blog, Rob Styles wrote about some of the aspects of open data licensing, and the tricky questions of copyright versus database right. OK, yawn. Let me put that another way… over on the Nodalities blog, Rob Styles writes about whether you can make your data openly accessible on the web without getting totally ripped off in the process. A bit less of a yawn?

One key quote:
“Without appropriate protection of intellectual property we have only two extreme positions available: locked down with passwords and other technical means; or wide open and in the public-domain. Polarising the possibilities for data into these two extremes makes opening up an all or nothing decision for the creator of a database.

With only technical and contractual mechanisms for protecting data, creators of databases can only publish them in situations where the technical barriers can be maintained and contractual obligations can be enforced.”
It’s true: to put any conditions over the use of our data, we have to have an exclusive right to control it. Copyright gives its owner that right for a text. If I own the Copyright for my works, I can (and try to) put a Creative Commons licence on it, to allow others to use it but to ask them to give me attribution if they do so.

The problem is that there is doubt… OK, more than doubt… whether and/or how Copyright applies to databases. And if Copyright does not apply, you don’t get the exclusive control which allows you to apply a conditional licence like Creative Commons. Just to explore a bit further...

Science Commons was set up to look at helping make science data more openly available. But if you look at their FAQ, you can see some real concerns. They pick out several aspects of a database that might be subject to Copyright, including the structure, but also say:
"In the United States, data will be protected by copyright only if they express creativity. Some databases will satisfy this condition, such as a database containing poetry or a wiki containing prose. Many databases, however, contain factual information that may have taken a great deal of effort to gather, such as the results of a series of complicated and creative experiments. Nonetheless, that information is not protected by copyright and cannot be licensed under the terms of a Creative Commons license."
In a note to me, Mags McGinley, our legal officer, reinforces this and adds:
"Copyright definitely applies to certain elements of a database. Copyright exists in the structure of a database if, by reason of the selection and arrangement, it constitutes the authors own intellectual creation. In addition the contents of database, depending on what they are, may attract their own copyright protection (a simple example might be a database of poems)."
But is there a glimmer of hope? The Science Commons FAQ goes on to say:
"Note - for databases subject to the laws of members of the European Union and certain other countries, the law supplies a special right for databases. Except in the Netherlands and Belgium Creative Commons Licenses, Creative Commons licenses do not apply to this right..."
Rob Styles also reminds us that in Europe we have this other right: “the EU adopted a robust database right in 1996 while the US ruled against such protection in 1991”.
“Database right in the EU is like Copyright. It is a monopoly, but only on that particular aggregation of the data. The underlying facts are still not protected and there is nothing to stop a second entrant from collecting them independently.”
Charlotte Waelde has written a report for the JISC-funded GRADE project on rights that apply to data in geospatial databases. She concluded that Database Copyright does not apply, but the Database Right does apply. She also concluded (my emphasis):
"• Unauthorised taking and making available of substantial parts of the contents of the database will infringe the right of extraction and re-utilisation"
and...
"• A lawful user of the database (e.g. the researcher or teacher in an educational institution) may not be prevented from extracting and re-utilising an insubstantial part of the contents of a database for any purposes whatsoever.
• A researcher or teacher may not be prevented from extracting a substantial part of the contents of the database for the purposes of non-commercial research or illustration for teaching so long as the source is indicated. Re-utilisation may only be enjoined if the output contains a substantial part of the contents of the protected database"
I am not a lawyer and (try as I might) I couldn't grasp all the nuances of what she is trying to say, particularly in the last sentence above; however, Mags tells me:
"The thing there is that there is a difference between extraction and reutilisation which are the two activities that can be prevented by the database right. The fair dealing exceptions for the database right are not as wide as those of copyright and are for some reason limited to the act of extraction."
"So Charlotte is highlighting the maximum you could do in such case where your activities fall within the research/teaching area. This is: extract a substantial part. And then reutilise an insubstantial part (because the database right only limits what you do with substantial parts of the database)."
Rob ends his blog entry by saying of rights:
“They allow inventors to disclose their inventions when they might otherwise have had to keep them secret... That's why we've invested in a license to do this, properly, clearly and in a way that stays Open.”
He is referring to the Talis Community Licence, which attempts to base a conditional open licence on the Database Right. Trust me, I REALLY want this sort of thing to work, but I worry that the Database Right may not be a sufficient underlying protection to make this licence firm. And what law would apply to access FROM a jurisdiction like the US, which has no Database Right?

As I’ve said before, I’m not a lawyer. Can a data-oriented lawyer comment?

Wednesday 18 July 2007

Arts and Humanities Data Service... next steps?

In an earlier post, I mentioned the decision by the AHRC (and later JISC) to cease funding the
AHDS from March 2008. Since then the AHRC has re-affirmed its decision. On 28 June, the Future Histories of the Moving Image Research Network made public an open letter to the AHRC, to no avail, it would appear.

In her response to the announcements, Sheila Anderson (Director of AHDS) wrote:
"In the meantime, and at least until 31st March 2008, the AHDS will continue to give advice and guidance on all matters relating to the creation of digital content arising from or supporting research, teaching and learning across the arts and humanities, including technical and metadata standards and project management. If you have a data creation project, please do not hesitate to contact us for advice.

The AHDS will continue to work with those creating important digital resources to advise on the best methods for keeping these valuable resources available and accessible for the long-term in a form that encourages their further use for answering new research questions, and their use in teaching and learning. This advice will include exploring with content creators and owners suitable repositories in which they might deposit their materials for long term curation and preservation, and how to ensure that their materials can continue to be discovered and used by the wider community. If you are currently in negotiation with the AHDS to deposit your digital collection, please continue to work with us to ensure the future sustainability and accessibility of your resource.

The AHDS will continue to make available its rich collection of digital content for use in research, teaching and learning, and to preserve those collections in its care. The AHDS intends to discuss with the JISC and the AHRC the long term future of these collections beyond April 2008 with the intention of securing their continued preservation and availability."
This is a great expression of commitment, and deserves our support. However, the lack of long-term funding must raise questions of sustainability.

Explicit in the AHRC's decision was the view that the community is mature enough to manage its own resources. There is doubt in many people's minds about this, but we are effectively stuck with it. So what are the implications? There are implications both for existing collections and for future arts and humanities resources. I would like to spend a few paragraphs thinking about the existing collections.

AHDS is not monolithic; it comprises several separate services (I suspect in what follows I may be using historical rather than current names). We already know that AHRC is privileging the Archaeology Data Service (ADS), which will continue to receive some funding (and which has a diverse funding base), so its resources are presumably safe. The History Data Service (HDS) resources are embedded within the UK Data Archive (which has recently received an additional 5 years' funding from ESRC); it would presumably cost more to de-accession those resources than to keep preserving them and making them available, so even if HDS can take no more resources, the existing ones should be safe. Literature, Language and Linguistics is closely related to the Oxford Text Archive; I imagine the same kinds of arguments would apply there.

I have heard suggestions that King's College London might continue to support the AHDS Executive for a period, and it appears there are some discussions with JISC about some kind of support "to ensure the expertise and achievements of the AHDS are not lost to the community".

That leaves Performing Arts and Visual Arts; I can't even surmise what their future might be, since I don't know enough about their funding and local environment.

I appreciate that it's still early days, and no doubt crucial discussions are going on behind the scenes. But if any part of the AHDS resources is in danger of loss, the resource owners need to consider plans for dealing with those resources in the future. This will take some time, particularly since, for more complex resources, it is clear that existing repositories are generally NOT yet fit for purpose. I guess the picture itself will be complex; I can think of at least these categories:
  • Some resources still exist outside of AHDS, and no action may be needed.
  • Some resources will not be felt worth re-homing.
  • Some resources can be re-homed in the time and funding institutions have available.
  • Some resources should be re-homed, provided that time and funding are provided by some external source (this might be for developments on an institutional repository; it might be for work on the resource to fit a new non-AHDS environment).
I tried to find out what resources were deposited by the University of Edinburgh. It's not an easy search, but ignoring a bunch of stuff where the University was only peripherally mentioned, I came up with:
  1. HOTBED (Handing on Tradition By Electronic Dissemination) pa-1028-1
  2. Lemba Archaeological Project arch-279-1
  3. Gateway to the Archives of Scottish Higher Education (GASHE) exec-1003-1
  4. Avant-Garde/Neo-Avant-Garde Bibliographic Research Database lll-2503-1
  5. Survey of Scottish Witchcraft, 1563-1736 hist-4667-1
  6. National Sample from the 1851 Census of Great Britain hist-1316-1
HOTBED appears on further research to be an RSAMD project that ended in 2004; however, a Google search produced a web site (http://www.hotbed.ac.uk/) that did not appear to be functional this afternoon. GASHE still exists as a working resource, so far as I can tell, and anyway is run from Glasgow. Lemba is archaeology, and so maybe safe. The Avant-Garde bibliography still exists separately. The two history resources seem safe.

Are we OK? Is there more? Who knows! I think we need much better tools to tell what is "at risk", so that plans can start being made. Of course, this could be happening; maybe I'm just not "in the loop".

Will AHRC consider bids for funding transitional work? I certainly hope so, although I don't know how this might be done. JISC is (I believe) planning one last round of its Capital Programme. Will they include provisions to enhance repositories so as to take these more complex resources? I certainly hope so!

European e-Science Digital Repository Consultation

Philip Lord wrote to tell me that he and Alison Macdonald are conducting a study for the European Commission, “Towards a European e-Infrastructure for e-Science Digital Repositories” (e-SciDR) – see www.e-SciDR.eu. This is a short study to summarise the situation regarding repositories in Europe and to propose policies to the Commission for their development. As part of the study process the Commission is hosting a public consultation through a questionnaire... The letter inviting participation follows:
"Dear Sir, Dear Madam,

May I invite you, as key stakeholders, to contribute to the development of a knowledge society and digital infrastructure in Europe, by taking part in the Commission’s online public consultation on e-Science Digital Repositories which is available at http://ec.europa.eu/yourvoice/ipm/forms/dispatch?form=eSciDR."
[NB Safari on the Mac appears not to work with this questionnaire, but Firefox does.]

"This consultation forms a key part of the e-SciDR study funded by the Commission into repositories holding digital data and publications for use in the sciences (in the widest sense encompassing disciplines from the humanities and social sciences to the life sciences).

Your answers will help identify needs, priorities and opportunities which the European Union, through the Commission, can help address and drive forward in the FP7 Capacity Programme and will provide an important input to developing future policy initiatives.

I would be grateful if you could respond to the consultation by no later than 30 July 2007.

All answers will be strictly confidential and anonymised.

If you would like to receive a summary of the consultation results, please tick the corresponding box on the questionnaire.

Best regards,

Mário Campolargo

Head of Unit GÉANT & e-Infrastructure"

Tuesday 17 July 2007

A little more on very long term time series

I wrote earlier about a visit to Rothamsted Research, to talk with them about some of their very long term time series of agricultural research data (since 1843... digitised since 1991). Asking around, I find the prevailing wisdom seems to be to break the time series when the nature of the data changes. Keep those original series untouched. Then build your overall time series by a set of transformations on the original datasets, where the actual transformations are well-documented.
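To make that concrete, here is a minimal sketch (mine, not Rothamsted's; the segment names, values and the imperial-to-metric factor are purely illustrative) of keeping each original segment untouched and deriving a combined series through documented transformations:

# A minimal sketch: keep each original time-series segment untouched,
# and derive a combined series by applying explicitly documented
# transformations. Segment names, units and values are invented.

# Original segments, never modified after capture.
segments = [
    {"name": "yield_1843_1967", "unit": "cwt/acre",
     "data": {1843: 18.2, 1844: 17.5}},          # imperial units
    {"name": "yield_1968_2007", "unit": "t/ha",
     "data": {1968: 2.31, 1969: 2.40}},          # metric units
]

# Documented transformations, one per segment, recorded alongside the data.
CWT_PER_ACRE_TO_T_PER_HA = 0.1255   # approximate conversion factor
transformations = {
    "yield_1843_1967": ("convert cwt/acre to t/ha",
                        lambda v: round(v * CWT_PER_ACRE_TO_T_PER_HA, 3)),
    "yield_1968_2007": ("already in t/ha; no change", lambda v: v),
}

def build_combined_series(segments, transformations):
    """Derive one series in common units without touching the originals."""
    combined, provenance = {}, []
    for seg in segments:
        note, fn = transformations[seg["name"]]
        provenance.append(seg["name"] + ": " + note)
        for year, value in seg["data"].items():
            combined[year] = fn(value)
    return combined, provenance

combined, provenance = build_combined_series(segments, transformations)
print(combined)     # the unified series, all in t/ha
print(provenance)   # the documented transformation applied to each segment

The point is that the combined series is always a derived product: if a transformation turns out to be wrong, it can be corrected and the series rebuilt from the untouched originals.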

Of course if, as is perhaps inevitable with such agricultural experiments, the nature of the data changes pretty well every year, then you have to keep a series of one-year data snapshots. And that sounds pretty much like what Rothamsted's batch "sheets" are doing.

Meanwhile, I'm continuing to look for something a bit more authoritative than "prevailing wisdom"! Someone has promised me a reference to some Norwegian economic or price time series kept since the 16th century, so I'm hopeful! Any hints from readers welcome...

The National Archives and Microsoft join forces...

On 4 July, The National Archives and Microsoft announced a Memorandum of Understanding "ensuring preservation of the nation's digital records from the past, present and into the future". Partly this relates to the standardisation of the Office Open XML format for Microsoft's products (I note that O'Reilly appear strongly in favour of this activity, seeing no conflict with the standardisation of Open Document Format). I had thought, through the rumour mill, that MS had provided (or was to provide) the specifications of obsolete file formats. This would fit with TNA's PRONOM file format registry, and their Seamless Flow programme. However, the key paragraph instead seems to be:
"Today´s announcement sees Microsoft provide The National Archives with access to previous versions of Microsoft´s Windows operating systems and Office applications powered by Microsoft Virtual PC 2007. Virtual PC 2007 enables people to run multiple operating systems at the same time on the same computer. This allows The National Archives to configure any combination of Windows and Office from one PC, thereby allowing access to practically any document based on legacy Microsoft file formats. It is estimated that The National Archives will have to manage many terabytes of data in these formats."
This sounds as if TNA must keep licensed versions of all older MS products, but can then run them under this emulation mode. This is a LOT better than nothing, but not as good as open access to the old file format specifications, with the ability to build additional tools that this would imply. But maybe there's more than is apparent in the press release?

Should data people care? Apart from the documentation and other information that many will have stored in obsolete proprietary formats, it turns out that many store their data that way as well. The excellent JISC-funded StORe project did a survey of several disciplines, and got over 350 responses. Although one respondent said “God preserve us from idiots who archive data in proprietary commercial formats (Excel spreadsheets and MS-word documents)!”, 220 said they kept source data in spreadsheets and the same number kept them in word processed documents. These were the largest categories except images (228)!

Let's hope they are keeping them somewhere else as well...
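For what it's worth, here is a minimal sketch of what "keeping them somewhere else as well" might look like: exporting each sheet of a workbook to plain CSV alongside the original. The file names are hypothetical, and the sketch assumes the pandas library (with an Excel reader such as openpyxl) is available.

# A minimal sketch: keep a plain-text copy of spreadsheet data alongside
# the original proprietary file. File names are hypothetical; assumes
# pandas plus an Excel engine (e.g. openpyxl) is installed.
from pathlib import Path
import pandas as pd

source = Path("field_results.xlsx")        # hypothetical source workbook
out_dir = Path("field_results_csv")
out_dir.mkdir(exist_ok=True)

# Read every sheet in the workbook into a dict of DataFrames.
sheets = pd.read_excel(source, sheet_name=None)

# Write each sheet out as an open, plain-text CSV file.
for name, frame in sheets.items():
    frame.to_csv(out_dir / f"{name}.csv", index=False)
    print(f"Exported sheet '{name}' ({len(frame)} rows)")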

Monday 16 July 2007

Open Data... Open Season?

Peter Murray Rust is an enthusiastic advocate of Open Data (the discussion runs right through his blog; this link is just to one of his articles close to the subject). I understand that he wants science data to be openly accessible for scientific use and re-use. It sounds a pretty good thing! Are there significant downsides?

Mags McGinley recently posted in the DCC Blawg about the report "Building the Infrastructure for Data Access and Reuse in Collaborative Research" from the Australian OAK Law project. This report includes a substantial section (Chapter 4) on Current Practices and Attitudes to Data Sharing, which includes 31 examples, many from genomics and related areas. Peter MR wants a very strong definition of Open Access (defined by Peter Suber as BBB, for Budapest, Bethesda and Berlin, which effectively requires no restrictions on reuse, even commercially). Although the licences were often not clear, what could be inferred in these 31 cases would generally not fit the BBB definition.

However, buried in the middle of the report is a cautionary tale. Towards the end of chapter 4, there is a section on risks of open data in relation to patents, following on from experiences in the Human Genome and related projects.
"Claire Driscoll of the NIH describes the dilemma as follows:

It would be theoretically possible for an unscrupulous company or entity to add on a trivial amount of information to the published…data and then attempt to secure ‘parasitic’ patent claims such that all others would be prohibited from using the original public data."
(The reference given is Claire T Driscoll, ‘NIH data and resource sharing, data release and intellectual property policies for genomics community resource projects’ Expert Opin. Ther. Patents (2005) 15(1), 4)

The report goes on:
"Consequently, subsequent research projects relied on licensing methods in an attempt to restrict the development of intellectual property in downstream discoveries based on the disclosed data, rather than simply releasing the data into the public domain."
They then discuss the HapMap (International Haplotype) project, which attempted to make data available while restricting the possibilities for parasitic patenting.
"Individual genotypes were made available on the HapMap website, but anyone seeking to use the research data was first required to register via the website and enter into a click-wrap licence for the use of the data. The licence entered into, the International HapMap Project Public Access Licence, was explicitly modeled on the General Public Licence (GPL) used by open source software developers. A central term of the licence related to patents. It allowed users of the HapMap data to file patent applications on associations they uncovered between particular SNP data and disease or disease susceptibility, but the patent had to allow further use of the HapMap data. The licence specifically prohibited licensees from combining the HapMap data with their own in order to seek product patents..."
Checking HapMap, the Project's Data Release Policy describes the process, but the link to the Click-Wrap agreement says that the data is now open (see also the NIH press release). There were obvious problems, in that the data could not be incorporated into more open databases. The turning point for them seems to have been:
"...advances led the consortium to conclude that the patterns of human genetic variation can readily be determined clearly enough from the primary genotype data to constitute prior art. Thus, in the view of the consortium, derivation of haplotypes and 'haplotype tag SNPs' from HapMap data should be considered obvious and thus not patentable. Therefore, the original reasons for imposing the licensing requirement no longer exist and the requirement can be dropped."
So they don't say that the threat from such open data releases no longer exists in general, only that it was mitigated in this case.

Are there other examples of these kinds of restrictions being imposed? Or of problems ensuing because they have not been imposed, and the data left open? (Note, I'm not at all advocating closed access!)

Government responds to UK Science Funding petition

8,623 people signed the petition to the UK Government on the £68 million funding reduction to science. The Government has just published a response. After some phrases indicating that the science budget continues to rise, the key paragraph is:
"The Department of Trade and Industry had been facing a number of new and historic budgetary pressures which required action to keep within its budgets. Non-ring-fenced budgets had been reduced as far as possible, so the Department then had to consider its ringfenced budgets, including the Science Budget. It was decided to use part of the underspends in the Science budget that had been accumulated in previous years. This decision did not affect either the 2006-07 budget allocations, or the 2007-08 budget allocations , nor did it affect the commitments set out in the 10 Year Science and Investment Framework."
I suppose accumulated underspends might sit in neither the 2006-07 nor the 2007-08 budget, but the impact of the reduction has definitely been felt in the 2007-08 year!

The original petition asked the Government to "revise research funding via the DTI", or alternatively "I wish the Government to review its recent decision outlined [in the text]...". I guess that's a NO to the first but perhaps a yes to the second (a review doesn't necessarily mean a change).

Thursday 12 July 2007

Very long term data

Rothamsted Research is an agricultural research organisation based near Harpenden in England. There are many interesting features of this organisation (only a few of which I know), including its “classical experiments”. One of these, started in 1843, must surely be one of the longest-running experiments with resulting time-series data anywhere. I visited this week, spoke to Chris Rawlings and others handling the data for this “Broadbalk” experiment on wheat yields, and also a couple of scientists working on somewhat younger experiments collecting moths (1930s) and aphids (1960s) on a daily basis (using light traps and vacuum traps respectively).

Digital preservation theory tells us that digital data are at risk from various kinds of changes in the environment. Most often we focus on the media, or the risk of format incompatibility. OAIS rightly asks us to think about contextual metadata of various kinds, but also recognises that there is risk of semantic drift and/or semantic loss rendering once-clear resources incomprehensible (this is the business about the Designated Community and its Knowledge Base that I wrote about recently). The OAIS model seems to me (though some argue against this view) premised on a preservation or perhaps archival view: resources are ingested, preserved within the archive, and then disseminated at a later date. I’ve always had doubts about how easily this fits with a more continuous curation model, where the data are ingested, managed, preserved and disseminated simultaneously.

However, intuitively, even in the curation situation I have described, if a “long enough” time passes, many of the concerns of digital preservation will apply. And the Broadbalk wheat yield experiment since 1843 is certainly long enough. In that time there have been semantic drift (words mean different things), changes in units (imperial to metric), changes in plot size/granularity and plot labelling more than once, changes in what is measured (eg dropping wet hay weight while keeping dry yield) and how it is measured (the whole plot or a sample), changes in accuracy of measurement, and many changes in treatments. These are very serious changes that could significantly affect interpretation and analysis.

In the case of the aphids experiment, I saw a log book in which changes in interpretation etc were meticulously recorded. Unfortunately I don't have a copy of any pages from that book to describe in more detail the kinds of issues it raised. Although weekly aphid bulletins are made available (explicitly not "published"!), the book itself has, I believe, not been published. My guess is that this book forms a critical part of the provenance information by which the quality of the data can be judged.

Of course, the Broadbalk experiment has not been digital since 1843. I didn’t see the records, but the early ones would have been longhand in ledgers or notebooks, and later perhaps typed up. I get the impression it was quite systematic from the beginning, so converting it to an Oracle database in around 1991 was a feasible if major task. Now they are planning to convert it to a new database system and perhaps make the data more widely available, so they are asking questions about how they should better handle these changes.

By the way, in one respect the experiments continue to be decidedly non-digital. Since the beginning, they have been collecting soil samples from the plots, and now have over 10,000 of them! This makes a fascinating physical collection as a reference comparator for the digital records.

I should say that they have done a lot of hard thinking and good work on some of these issues. In particular, they have arranged all their input data into batches (known as “sheets”), each of which is pretty much self-describing, in what is effectively a purpose-built data description language. This means they can roll back and roll forward their databases using different approaches. These sheets are all prepared using the assumptions of their time (remembering that some were prepared more than a hundred years after the data they contain was first recorded), but in theory there is enough information about those assumptions to make decisions. So once they have decided how to deal with some of these changes, they are able to do the best job possible of implementing it.
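I have not seen the sheet format itself, so the following is a purely illustrative sketch of the idea: a batch that carries the assumptions of its time alongside the values, so that later re-interpretations can be applied without altering the sheet.

# Purely illustrative: I have not seen Rothamsted's actual sheet format.
# The idea sketched is a batch ("sheet") that records the assumptions of
# its time alongside the values. All names and values are invented.
sheets = [
    {
        "sheet_id": "broadbalk-1895-grain",
        "prepared": 1993,                       # year of transcription
        "assumptions": {
            "units": "cwt/acre",
            "plot_labelling": "pre-1926 scheme",
            "measurement": "whole plot",
        },
        "observations": [{"plot": "2A", "yield": 16.9}],
    },
    {
        "sheet_id": "broadbalk-1995-grain",
        "prepared": 1996,
        "assumptions": {
            "units": "t/ha",
            "plot_labelling": "current scheme",
            "measurement": "sampled area",
        },
        "observations": [{"plot": "02", "yield": 6.2}],
    },
]

def sheets_needing(assumption, value, sheets):
    """Find the sheets recorded under a given assumption, e.g. all those
    in imperial units, so a documented re-interpretation can be applied."""
    return [s["sheet_id"] for s in sheets if s["assumptions"].get(assumption) == value]

print(sheets_needing("units", "cwt/acre", sheets))   # ['broadbalk-1895-grain']

Selecting sheets by their recorded assumptions, and leaving the sheets themselves untouched, is what makes roll-back and roll-forward under different interpretations possible.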

A Swedish forestry scientist once argued forcefully, in the discussion session after a presentation, that the original observations were sacrosanct: they must be kept and should not be changed. Do what you like with the analysis, he suggested: re-run it with varied parameters, re-analyse with new models, (implicitly) make the sorts of changes suggested here. This approach would suggest starting a new time series dataset whenever there is a change of the sort we have been describing. That's not exactly what has been happening with the Broadbalk experiment, but the sheets approach is, if anything, even finer-grained.

I’m interested to collate experience from other long-running experiments that may have faced these and related issues. So far I have heard of the Harvard Forest project, part of the NSF Long Term Ecological Research (LTER) Network (thanks Raj Bose), and some very long-running birth cohort studies in the social sciences. Further suggestions on comparators and literature to look at would be welcome.

Open Access everyone? Or not?

I occasionally look at the OpenDOAR service, which lists information about repositories, and check out those which claim to include data (the term they use is datasets, although it is possible that “other” might also be applicable!). Last time I looked there were 7 listed in the UK which claimed to collect datasets; 4 of these were institutional repositories, one was the Southampton eCrystals archive, the 6th was the National Digital Archive of Datasets (NDAD), which collects government datasets on behalf of The National Archives, and the 7th was Nature Precedings.

I used to think that was a creative piece of listing by NDAD, since OpenDOAR was created for institutional repositories, wasn't it? But as time has gone on, I've begun to think NDAD are right: the old definitions of repositories were much too tight, and NDAD and many other data archives ought to be considered as repositories, and listed in these kinds of resources. On that basis perhaps AHDS should have been listed (although they might not bother now), and, I thought, the UK Data Archive, part of the Economic and Social Data Service, should also be listed. After all, they are repositories, and open access is a good thing, isn't it?

At the UKDA’s 40th birthday celebrations yesterday, it was clear that Kevin Schurer (Director of UKDA) doesn’t quite share that view. The view he expressed was certainly not a total rejection of Open Access in favour of a commercial approach, but he was arguing for some barriers between some data and the users. In particular, while access to UKDA data is free for certain categories of users (certainly UK academics and probably some others), ALL users are required to register. Kevin made a strong case for this being an advantage: registration means that he can report to his funders who his users are (including how many of them might be independent or Government researchers as well as academic ones), and can also monitor which datasets they are using. He knows, for example, that the user base has grown from 25 or so after the first 5 years to 45,000 (I don't remember the exact figure, but of that order) after the first 40 years.

The same registration mechanism (and the associated authentication and authorisation mechanisms) also allows them to apply greater access controls to more sensitive datasets, including the possibility that the user may have to sign and observe special licences. It’s worth remembering that much of the data they hold is about people, and some is extremely sensitive information, which has been provided (under “informed consent”) for certain specific purposes.

In a very different environment the day before, I met some scientists with very long-term experiments collecting data on insects. For perhaps different reasons, they were happy to make their data available to collaborators, but not openly available on the web. One reason was the risk of misinterpretation, which then requires extra effort to refute. No doubt the prospect of an extra co-authored paper from a collaboration was another motive (not unreasonably).

Are these approaches Open? Are they consistent with the OECD Principles and Guidelines for Access to Research Data that I wrote about earlier? Let’s remember that the Openness Principle was carefully worded:
“Openness means access on equal terms for the international research community at the lowest possible cost, preferably at no more than the marginal cost of dissemination. Open access to research data from public funding should be easy, timely, user-friendly and preferably Internet-based.”
So by that definition the approach looks reasonably consistent, provided access is on “equal terms for the international research community”. I suspect there are other definitions (who said “the nice thing about standards is there are so many to choose from”?) by which these approaches would fail (eg the Berlin Declaration).

What the UKDA approach might do is make certain kinds of comparative work, including automated data mining, more difficult. There was a plea with respect to the former, from an Australian speaker, for more sophisticated international cross-archive access management (code: Shibboleth-enable). But I suspect with respect to the latter (preventing data mining) Kevin might argue “very right and proper too”!

Wednesday 11 July 2007

UKDA 40th Birthday: back to basics?

There are not many digital data management organisations that can claim 40 years of continuous service. This week the UK Data Archive (UKDA, which holds mainly social science datasets) celebrates 40 years since its founding in 1967, with a party yesterday in the House of Commons followed by a small workshop in the UKDA’s fancy new quarters at the University of Essex. The DCC would like to congratulate the UKDA, Director Kevin Schurer and his 6 predecessor Directors, and the UKDA staff, on achieving this milestone.

[Pause to suppress pangs of envy at such longevity. Sigh!]

There were many very interesting aspects to this workshop, but here I will focus on one particular contribution, during the closing sessions, from Myron Gutmann. Myron is Director of ICPSR, a roughly equivalent organisation in the US, based at Michigan. Myron said he wanted to argue for retaining the basics. A data archive, he said, should do (at least) 5 things:
  • Appraise
  • Curate
  • Preserve
  • Train
  • Protect
This is not quite the language of OAIS, but perhaps closer to the language of archives. Appraise, because everything an archive does is expensive, so it has a responsibility to select high quality resources appropriate to its mission (and, as OAIS might suggest, its Designated Community). Curate, because (apparently) even the best resources tend to need extensive work to make them usable (this work includes making the contextual and other metadata about the resource usable by non-insiders; we might perhaps describe this as preservation metadata, maybe as part of its representation information). Preserve for obvious reasons, to make resources available in the long term (although some people at the workshop seemed to be using the word in a special sense: institutional repositories are not about preservation, said someone, although I may have mis-quoted). Train, because these resources are specialist, and often need significant work and extra knowledge on the part of users. And protect, both protecting the intellectual property in the resource and, more importantly, protecting the subjects of the data, whose sensitive data must be made available only under appropriate conditions.

It’s a pretty good summary of the job of an archive, give or take a verb or two. Myron added something like “serve new user bases” and “innovate” in various ways, and it’s hard to argue with that.

There were some sober reflections on sustainability at the workshop, partly relating to the difficult funding position for long-term longitudinal surveys, but partly reflecting the recent AHRC decisions. Paraphrasing Kevin Schurer, we never could or should take things for granted, but now we must be doubly persuasive, even if the UKDA's main funder ESRC regards ESDS (of which UKDA is a key part) as “a jewel in its crown”.

Friday 6 July 2007

On blog authorship and (un)certainty

There's a difference between a blog and an article, and it seems to me it's about certainty. Why should I write this blog? It could be to record trivial events, it could be self-aggrandisement, but I think it's about dealing with uncertainty. If I were fully convinced about the detail, I guess I would write an article and submit it to a journal. But generally I'm speculating a bit; trying to focus my mind by writing something out as clearly as I can for an unknown audience (an audience that has the power to answer back). There's also a nice comment in David Rosenthal's blog: "Taking small, measurable steps quickly is vastly more productive than taking large steps slowly, especially when the value of the large step takes even longer to become evident".

I know I'm writing things that colleagues may not agree with. Sometimes, I expect the process of writing to move me closer to their ideas. Sometimes I might hope they move closer to my ideas. Sometimes we may have to learn to live with our differences. The point for me is to generate some strange kind of conversation, not just with close colleagues but to get some feedback from colleagues interested in data curation across the world.

If I can work out what is causing confusion, articulate it and resolve it, perhaps I can stop being "confused of Kenilworth"!

Representation Information: what is it and why is it important?

Representation Information is a key and often misunderstood concept. To understand it, we need to look at some definitions. First of all, OAIS (CCSDS 2002) defines data thus:
“Data: A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing.”
Second, we have Information:

“Information: Any type of knowledge that can be exchanged. In an exchange, it is represented by data. An example is a string of bits (the data) accompanied by a description of how to interpret a string of bits as numbers representing temperature observations measured in degrees Celsius (the representation information).”
Then we have Representation Information (sometimes abbreviated as RI):
“Representation Information: The information that maps a Data Object into more meaningful concepts. An example is the ASCII definition that describes how a sequence of bits (i.e., a Data Object) is mapped into a symbol.”
As an example, OAIS offers this paragraph:
"Information is defined as any type of knowledge that can be exchanged, and this information is always expressed (i.e., represented) by some type of data. For example, the information in a hardcopy book is typically expressed by the observable characters (the data) which, when they are combined with a knowledge of the language used (the Knowledge Base), are converted to more meaningful information. If the recipient does not already include English in its Knowledge Base, then the English text (the data) needs to be accompanied by English dictionary and grammar information (i.e., Representation Information) in a form that is understandable using the recipient’s Knowledge Base.”
The summary is that “Data interpreted using its Representation Information yields Information”.
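As a trivial concrete illustration of that summary (my own sketch, not the standard's): the same bytes become information only when the right Representation Information, here the character encoding, is applied.

# A trivial sketch of "Data + Representation Information = Information".
# The Data Object is just a sequence of bits; the Representation
# Information here is the knowledge that they are ASCII-encoded text.
data = bytes([0x31, 0x39, 0x2E, 0x35, 0x20, 0x43])     # the raw Data Object

representation_information = "ascii"                    # how bits map to symbols

information = data.decode(representation_information)
print(information)   # "19.5 C" -- though further RI (that this is a temperature
                     # in degrees Celsius) is still needed to make it meaningful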

Now we have a key complication:
“Since a key purpose of an OAIS is to preserve information for a Designated Community, the OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained. The OAIS should then make a decision between maintaining the minimum Representation Information needed for its Designated Community, or maintaining a larger amount of Representation Information that may allow understanding by a larger Consumer community with a less specialized Knowledge Base. Over time, evolution of the Designated Community’s Knowledge Base may require updates to the Representation Information to ensure continued understanding.“
So now we need another couple of definitions:
“Designated Community: An identified group of potential Consumers who should be able to understand a particular set of information. The Designated Community may be composed of multiple user communities.”
“Knowledge Base: A set of information, incorporated by a person or system, that allows that person or system to understand received information.”
So there are several interesting things here. The first is that this obviously enshrines a particular understanding of information; one I couldn’t find in Wikipedia when I last looked (here is the article at that time; maybe it will be there next time!). Floridi suggests there is no commonly accepted definition of information, and that it is polysemantic, and in particular he contrasts information in Shannon’s Mathematical Theory of Communication with the “Standard Definition of Information” (Floridi, 2005). If I understand it rightly, the latter refers to factual information (with some controversy on whether it need be true), but not necessarily to instructional information (“how”).

Secondly, the introduction of the Designated Community and its Knowledge Base may be both helpful and problematic. It may be helpful because it can reduce the amount of Representation Information needed to interpret data (or even eliminate it completely): if the Designated Community is defined as having a Knowledge Base that allows it to understand the data, then nothing more is required. This is obviously never entirely true, and in practice even with a Designated Community that is quite strongly familiar with the data, we should expect to need some RI, perhaps to identify the particular meaning of some variables, etc.

The problematic nature arises because we now have two external concepts, the Designated Community and its Knowledge Base, that influence what we must create, and which will change and must be monitored. I’ve heard the words “precise definition” used in the context of these two terms, but I am sceptical that anyone can define either precisely (although the LOCKSS Statement of Conformance with OAIS has a minimalistic go; it's the only public one I could find, but I would love to see more). My colleague David Giaretta suggests that his huge project CASPAR aims to produce better definitions.

In fact, although they may be useful ideas, both the Designated Community and its Knowledge Base seem to be quite worrying terms. The best we can say is that “chemists” (for example) understand “chemical concepts”, and that the latter have proved pretty stable, at least in their basic forms. But the community of chemists turns out to include a myriad of sub-disciplines, each with its own subtleties of terminology, and not surprisingly introducing new concepts and abandoning old ones all the time. If we have some chemical data in our repository, we have to watch out for these concepts going from current through obsolescent and obsolete to arcane, and in theory we have to add RI at each change, to make up for the increasing gap in understanding.

The third interesting feature is that these definitions say nothing about files or file formats at all, yet “format registries” are the most common response to the need for RI. TNA’s PRONOM and the Harvard/OCLC Global Digital Format Registry (GDFR) are the two best-known examples.

Clearly files and file formats play a critical role in digital preservation. Sometimes I think this has occurred because the roots of much of digital preservation (although not of OAIS) lie in the library and cultural heritage communities, dominated as they are by complex proprietary file formats like Microsoft Word. In science, formats are probably much simpler overall, but other aspects may be more critical to “understanding” (ie using in a computation) the data.

The best example I know to illustrate the difference between file format information and RI is to imagine a social science survey dataset encoded with SPSS. We may have all the capabilities required to interpret SPSS files, but still not be able to make sense of the dataset if we do not know the meaning of the variables, or do not have access to the original questionnaires. Both of the latter would qualify as RI. Database schemas may provide another example of RI.
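To labour the SPSS point with a sketch of my own (the variable names, codes and labels are invented): even once the file format is no obstacle, it is the codebook, which is Representation Information, that turns the values into information.

# Invented example: values we might recover from a survey file once the
# file format itself is no barrier. Without the codebook (Representation
# Information) the numbers are uninterpretable.
records = [
    {"V1": 2, "V7": 1, "V9": 9},
    {"V1": 1, "V7": 3, "V9": 2},
]

# The codebook is part of the Representation Information: variable
# meanings, value labels, and missing-value conventions.
codebook = {
    "V1": {"question": "Sex of respondent",
           "values": {1: "male", 2: "female"}},
    "V7": {"question": "Highest qualification",
           "values": {1: "none", 2: "school", 3: "degree"}},
    "V9": {"question": "Weekly hours worked",
           "values": {9: "not answered"}},   # 9 is a missing-value code
}

def interpret(record, codebook):
    """Apply the codebook to one record: data + RI -> information."""
    return {
        codebook[var]["question"]: codebook[var]["values"].get(value, value)
        for var, value in record.items()
    }

for rec in records:
    print(interpret(rec, codebook))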

Have I shown why or how RI is a useful concept in digital curation? I'm not sure, but at least it's a start. Representation Information, as David Giaretta sometimes says, is useful for interpreting unfamiliar data!

In later posts, I’m going to try to include some specific examples of RI relating to science data. I also intend to try to justify more strongly the role of RI in curation rather than just preservation, ie through life rather than just at the end of it!


* CCSDS (2002) Reference Model for an Open Archival Information System (OAIS). CCSDS (Ed.), NASA.

* FLORIDI, L. (2005) Is Semantic Information Meaningful Data? Philosophy and Phenomenological Research, 70, 351-370. http://www.ingentaconnect.com/content/ips/ppr/2005/00000070/00000002/art00004