Thursday 28 June 2007

OECD Principles and Guidelines for Access to Research Data

Drafts of the OECD’s Principles and Guidelines for Access to Research Data from Public Funding (OECD, 2007) have been available for some time, and the final Recommendation was approved in December 2006. I have only recently had the chance to read the report that details and explains this Recommendation. This is a very important document, which could have a major effect on our scientific information systems.

The arguments they put forward in support of the Recommendation are powerful:
“Effective access to research data, in a responsible and efficient manner, is required to take full advantage of the new opportunities and benefits offered by ICTs. Accessibility to research data has become an important condition in:

• The good stewardship of the public investment in factual information;
• The creation of strong value chains of innovation;
• The enhancement of value from international co-operation.

More specifically, improved access to, and sharing of, data:

• Reinforces open scientific inquiry;
• Encourages diversity of analysis and opinion;
• Promotes new research;
• Makes possible the testing of new or alternative hypotheses and methods of analysis;
• Supports studies on data collection methods and measurement;
• Facilitates the education of new researchers;
• Enables the exploration of topics not envisioned by the initial investigators;
• Permits the creation of new data sets when data from multiple sources are combined.

Sharing and open access to publicly funded research data not only helps
to maximise the research potential of new digital technologies and networks,
but provides greater returns from the public investment in research.”

I had not realised the strength of OECD Recommendations. The report makes clear that a Recommendation is a “legal instrument of the OECD that is not legally binding but through a long-standing practice of the member countries, is considered to have a great moral force”, and calls it a “soft law”. They say “Recommendations are considered to be vehicles for change, and OECD member countries need not, on the day of adoption, already be in conformity. What is expected is that they will seriously work towards attaining the standard or objective within a reasonable time frame considering the extent of difficulty in closing the gap in each member country.”
The actual Recommendation is perhaps not as strong as this might suggest. It states “that member countries should take into consideration the Principles and Guidelines on Access to Research Data from Public Funding set out in the Annex to the Recommendation, as appropriate for each member country, to develop policies and good practices related to the accessibility, use and management of research data.” This falls rather short of a recommendation to implement them. However, they do propose to review implementation after a period.

The message is that when deciding on access arrangements for research data (both terms carefully defined in the document), governments should take account of the principles and guidelines. The Principles are, in summary:

• Openness
• Transparency
• Legal conformity
• Formal responsibility
• Professionalism
• Protection of intellectual property
• Interoperability
• Quality and security
• Efficiency
• Accountability

They are all important, and it’s well worth reading the document to find out more. However, the first is perhaps key. Note how carefully it is worded:
“Openness means access on equal terms for the international research community at the lowest possible cost, preferably at no more than the marginal cost of dissemination. Open access to research data from public funding should be easy, timely, user-friendly and preferably Internet-based.”
I do commend this document to anyone who is working on policy and implementation aspects of research data resources.

OECD (2007) OECD Principles and Guidelines for Access to Research Data from Public Funding. Paris.

Thursday 14 June 2007

Building expertise in digital curation and preservation

It’s curious for something “new” to have been around for 30 years or so, but that’s the paradoxical case with digital curation. The term itself is relatively new, and there is still (as usual) confusion as to what exactly it means. But taking the simple definition (maintaining and adding value to a trusted body of digital information over the life-cycle of scholarly and scientific materials, for current and future use), it is clear that some disciplines have had organisations doing this for many years. The Social Sciences have had archives pretty much as long as they have had computing facilities (eg UKDA, ICPSR), and there are environmental science archives dating from at least the early years of satellite technology (eg National Satellite Land Remote Sensing Data Archive and NOAA). In both these cases, the unrepeatable nature of the observations has been a major factor.

One interesting aspect of this is that in those and other disciplines, the archives invented what is becoming digital curation, separately and for themselves. There was often communication among the separate archives in a discipline, but not necessarily with other disciplines, even if quite closely related.

Another strange factor is how this isolation of mindset carries on in other guises today. One can occasionally hear discussions on how a data archive is not a repository (eg it doesn’t support the OAI-PMH!). There have been several contemporary groups doing related things in almost complete isolation from each other. However, I think this tendency is diminishing, and we certainly hope to aid this process!

In fact we could recognise at least the following as elements in some sort of repository space, relating to the curation of science and research knowledge:

• Physical repositories, such as archives, libraries and museums. There is a physical-digital boundary, but in some ways it ought to be irrelevant to the information user. Information should certainly be accessible from either domain, and the valuable skills of curating the information should be deployed from both.

• Institutional, subject and discipline repositories based on software supporting the Open Archives Initiative (OAI) principles and its Protocol for Metadata Harvesting (OAI-PMH). DSpace, FEDORA and ePrints are often seen as the prime technology candidates here (described in a page from the EThOS project), and there are examples in the UK of institutional repositories built on all of them. Subject/discipline repositories tend to be much more DIY; maybe because they mainly pre-date the OAI phase. However, this is where long term preservation repositories linking themselves to the Open Archival Information System standard (OAIS) also belong, as Trusted Digital Repositories (TDRs). There are no archetypal TDRs at this point, although there are some candidates (eg DAITSS and kopal), and a number of suppliers who will build a TDR to requirements.

• Databases, as in the bio-sciences for example. It’s clear that databases will play an increasing role in the future, whether as relational, object, XML, RDF or in some other form. Much scientific data is held in databases (particularly if we are flexible and count a product like Excel as a primitive database).

• Shared document repositories with their supporting infrastructure. SourceForge is probably the best-known example, heavily used in the open source software field. Google Docs may provide a related capability for more “office-like” objects. Wikis, perhaps bulletin board systems (like the DCC Forum), maybe email list archives and to an extent blogs fit somewhere in this space.

• Perhaps file systems (distributed, networked, shared) can fall into this space as well. While this would neglect many important aspects of curation above the bit stream level, a persistent (and therefore redundantly distributed) file store is a requirement for long term curation of data.

• Web space of some kinds can also qualify (wikis fit here as well). How often do we meet the argument that there’s no point in putting some object in a repository “because it’s on my web site”? One of Andy Powell's arguments at the JISC Repositories Conference was that we should make repositories much more like the good Web 2.0 web sites: the ones that the public like to use, find fun, and see value in.

• Finally perhaps we would include data grids, used for some kinds of applications in e-Science and e-Research more generally. These can be massive distributed data systems, where the user need not know “where” the data are; theoretically an economic decision can be taken on the fly as to whether to move the computation to the data, or the data to the computation. SRB and OGSA-DAI (apples and oranges, I know) are the technology candidates here.

The point of this list is that it is clear that each of these “spaces” has its own traditions and associated skill sets, its adherents and detractors, its own advantages and disadvantages (which will not be the same for all users and cases). It would be helpful if we could abstract the good and problematic qualities from each and inform the others so that problems can be overcome and good practice spread.
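By way of illustration of the OAI-PMH layer that several of these spaces share: a harvesting request is just an HTTP GET carrying a `verb` parameter. This minimal sketch (the repository endpoint URL is invented for illustration) builds a ListRecords request of the kind a harvester would issue:

```python
from urllib.parse import urlencode

def build_listrecords_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint.

    metadata_prefix selects the metadata format (oai_dc, unqualified
    Dublin Core, is the format every OAI-PMH repository must support).
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional OAI set to restrict the harvest
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint, for illustration only
print(build_listrecords_url("http://repository.example.ac.uk/oai"))
```

The response is an XML document of records, which is why generalist services can aggregate across DSpace, FEDORA and ePrints installations without caring which software sits underneath.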

This is clearly a significant role for the Digital Curation Centre. Our activities in assembling resources, building tools and providing events have been building towards this end of identifying and sharing good practice amongst practitioners.

JISC Repositories conference day 2

This is a rather late posting about day 2 of the JISC Repositories Conference, a week or so ago now. I mainly attended the data-oriented stream. I was very interested in the presentations from the StORe and CLADDIER projects, both of which touched on data citation. They go about this in different ways. StORe’s approach is described as “inspired by Web 2.0 approaches”, although I have not quite cracked the extent to which this applies. It appears to depend on a system distinct from both the Source data and the Output document (ie the derived or published paper); this separate system effectively holds the two-way links between the two, and so allows the links to be made without changes to either data or text. It’s maybe similar in this respect to the BioDAS (distributed annotation server) system?

CLADDIER has a different approach. They have spent a lot of time thinking about what citation means in the data world, eg to what extent data should have some status akin to “published” before it’s cited. They have explored OGC-standard format citations, and simpler textual ones. Sam Peploe for CLADDIER spoke of different things that people want from citations:

- the research scientist wants an unambiguous reference to what was used
- the dataset creator wants the dataset cited consistently, so that uses made of it can be discovered
- the data archive wants a clear indication of what it is preserving (I’m not quite clear on this one!).

He said part of the problem is that there is no shared understanding of datasets, and pointed out that people do not cite with precision, writing eg “version 2 of radial products” rather than “dataset ds2033434043”. He mentioned a model which uses “pings” to detect citations.
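To make the contrast concrete, here is a trivial sketch of the kind of consistent citation string both the scientist and the dataset creator want — one that carries an exact version and a persistent identifier. The format and field names are my own invention for illustration, not a CLADDIER proposal:

```python
def cite_dataset(creator, year, title, version, identifier):
    """Format a dataset citation that names an exact version and a
    persistent identifier, so that a vague phrase like 'version 2 of
    radial products' becomes an unambiguous, discoverable reference."""
    return f"{creator} ({year}). {title}, version {version}. {identifier}"

citation = cite_dataset("Example Data Centre", 2007,
                        "Radial products", 2, "dataset ds2033434043")
print(citation)
# Example Data Centre (2007). Radial products, version 2. dataset ds2033434043
```

The point is less the formatting than the discipline: if every citation carries the identifier, the creator can discover uses made of the dataset, and the archive knows exactly which object it is being asked to preserve.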

I found both projects very interesting; there is much more to learn here, and I believe we should be exploring both these approaches, and other alternatives.

David Shotton spoke of his Imageweb work, using semantic web technologies to integrate collections of data including scientific images, to make them more useful and usable. We are working with David through the DCC SPARC project, and I’ll write more about this later.

The data panel session (in which I participated) covered a wide range of topics linked around the theme “new opportunities for repositories to support the research life cycle: what’s missing, what’s next”. I’ve summarised some of this as follows:

- Repositories to support research must exist (reference to the AHDS issue!)
- We need to tackle the question of the domain knowledge required to curate scientific data, versus the ability (or otherwise) of generalist institutional repositories to start stepping up to this issue (only 5 in the UK even claim an interest in dataset content so far)
- Repositories must be easier for scientists to use, both in deposit and in re-use (harking back to Andy Powell’s opening keynote). We need to make them part of the workflow, adding better metadata cheaply; make them sexy; and make them more powerful, with semantic web potential (SPARQL endpoints?)
- Recognising that citations are academic currency, we need to work further on this
- We need more work on curation of research data within university settings (linked to emerging “big research disk capacity” systems)
- We should work harder to get annotation and comment, including options for “citizen science”, eg work being done to add herbarium metadata
- We should take emerging options to exploit “Google effects” such as the map capabilities, by providing better tools for geospatial data depositors and users, eg automatic “bounding box” extraction and presentation.
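The “bounding box” idea in that last point can be sketched very simply: given the point coordinates found in a deposit, derive the smallest enclosing box for presentation on a map. A naive sketch (it deliberately ignores awkward cases such as datasets crossing the 180° meridian):

```python
def bounding_box(points):
    """Compute the minimal lat/lon bounding box enclosing a set of points.

    points: iterable of (latitude, longitude) pairs.
    Returns (min_lat, min_lon, max_lat, max_lon).
    """
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return (min(lats), min(lons), max(lats), max(lons))

# eg three sampling sites in a hypothetical deposit
print(bounding_box([(51.5, -0.1), (55.9, -3.2), (52.2, 0.1)]))
# (51.5, -3.2, 55.9, 0.1)
```

Even something this crude, run automatically at deposit time, would give a repository a geospatial footprint for each dataset that map-based discovery tools could exploit.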

The day ended with an overall panel, which inevitably had quite some discussion about the AHDS issue I have written about earlier.

I didn’t leave the conference thinking that curation of data was in best shape, either in domain data archives, or in institutional repositories. But I did leave believing there was still productive work to be done in both of those areas!

Wednesday 6 June 2007

Arts and Humanities Data Service decision

On 14 May 2007, the UK Arts and Humanities Research Council (AHRC) issued a rather strange press release, announcing that “AHRC Council has decided to cease funding the Arts and Humanities Data Service (AHDS) from March 2008” and at the same time announcing a change to its conditions of grant relating to deposit, removing the condition that material be offered for deposit in AHDS, although “Grant holders must make materials they had planned to deposit with the AHDS available in an accessible depository for at least three years after the end of their grant”.

This is a strange decision from a number of points of view. The first is the way it has been announced, the short period of notice (a call for proposals was open at the time, closing in June), and the sudden-death cutoff of funds. Yes, there was a funding crisis, but the speed of the decision makes any sensible exit strategy for the AHDS difficult or impossible to work through… and it doesn’t appear that the AHRC is going to help create one.

Secondly it is strange because the Arts and Humanities (I’ll shorten to “the Arts” for simplicity) represent discipline areas that use and rely on a wide variety of resource types. Unlike in many of the sciences, journals do not represent the most significant part of output or input in the Arts and Humanities. The monograph has always been important as a research output, but is now under threat through the pressure on the library cost base from the journals pricing crisis, and must change (and the discipline with it) or die. One way it is changing is to become virtual, and in the process to become richer… just the sort of complex resource the AHDS was set up to capture and curate. At the same time, input has come from a wide variety of resources, extending well beyond libraries to include museums, galleries, archives, theatres, dance halls, street corners… wherever the expressive power of human creativity is apparent. Many of the resources studied have been very poorly represented in traditional monographs, and can be much better represented for critical discussion in a rich digital form.

Thirdly it’s strange in its assumption that other “accessible depositories” (a term few of us have heard before; we think they might mean institutional repositories) can take up the missing role. One of the problems for the AHDS has always been the result of what is described in my previous paragraph: the resources offered are sometimes very complex web sites, based on (sometimes arbitrary) use of a wide variety of technologies, and the capture and curation of these has been a challenge which has required the AHDS to build up a considerable skill base. This skill base is just not available to the operators of UK institutional repositories, which so far have struggled to ingest mostly simple text objects (eprints). They will not easily fill this gap; they simply cannot ingest the sort of resources the AHDS has been dealing with, given their current and likely future resources. Only 5 of them even claim to include datasets, and few of those actually do; a slightly larger number claim to support “special” items, but again nothing complex is apparent on inspection. So the alternatives available to Arts researchers are few (or none).

Oddly the AHRC has not yet (I believe) dropped the condition that the AHDS must help applicants with a “Technical Annex”, supposed to ensure that created resources are easier for it to manage once they become available. Will they expect the AHDS to continue to do this once no longer funded by them? If not, who else will do this? Is there any conceivable way that institutional repositories could take up this role? Not obviously to me!

Fourthly, it is strange in its new assumption that 3 years post grant is a sufficient requirement: this in a discipline area that uses resources from hundreds of years ago!

Finally (for now), this decision is worrying because of the precedent it sets. Subject-based data centres have one great advantage: they can be staffed to provide the domain knowledge that is essential for managing and preserving that subject’s or discipline’s data resources. Institutional repositories by contrast are necessarily staffed by generalists, often librarians, who will struggle to deal with the wide variety of data resources they might be required to manage in widely differing ways. IRs have always had (to my mind) a sustainability advantage: they represent the intellectual output of an institution (good promotional and internal value), and they sit within the institutional knowledge infrastructure. Even if one institution decides to drop its IR, the academic knowledge base as a whole is not fatally damaged in any particular discipline. Subject data centres by contrast have always suffered from the disadvantage that their future was uncertain; their sustainability model has always been weak. Now that disadvantage has moved from potential to stark reality in the Arts and Humanities, one of the discipline areas most requiring access to a wide variety of outputs from the past.

If this decision worries you, there is a petition for UK citizens, which I would urge you to sign. We also need to think hard about ways in which we can deal with the effects of the decision. It is difficult to challenge a major funder, but there are times when it needs to be done. We have to get this right!

[I should have added that Seamus Ross, who is responsible for a small part of the AHDS at Glasgow, is a colleague on the DCC Management Team. However, this posting is made in a personal capacity; while I believe my views may be shared, I have not attempted to get agreement from colleagues on the sentiments expressed.]

JISC Repositories conference day 1

Yesterday and today I am at the JISC Repositories Conference in Manchester. This turns out to be (at least in my small section, and the two plenaries so far) a much more interesting event than I expected. There has been a useful focus on the fringes of the repository movement, such as the role of data, as well as an interesting re-exploration of what repositories are, and what they are for.

The keynote was from Andy Powell of Eduserv (his slides are available on Slideshare), talking about some work that he and Rachel Heery did for JISC on a repositories roadmap. Drawing an interesting parallel with today’s GPS systems, which will give you a new route if you go wrong, he wanted to look at the recommendations they had made, one year later, and see if they still stood up. By and large, the answer seemed to be that they did; however, he still had criticisms. Partly this related to an under-estimation of the role of technology in general and the web in particular, against an over-estimation of the role of policy and advocacy. The repository movement is much less successful than Web 2.0 style “equivalents” (sort of) like Flickr, Slideshare, Scribd etc. Why? He felt it was because the technology of the latter was so much better; it afforded (not his word) what people wanted to do in a much better way than typical repositories do. So in response he wanted us to re-think repositories in terms of the web architecture, and to re-think scholarly communication as a social activity (in Web 2.0 terms). So we must build services that users choose to use, not try to make them use services even though they don’t work well for them!

BTW I had trouble with my presentation, done on Powerpoint on a Mac; I couldn’t connect the Mac to the conference centre projector, nor could I get either of the Windows machines to read the file on my memory stick. I mention this because, inspired by Andy, I registered with Slideshare and in a matter of minutes had managed to upload the presentation and have it converted into the Slideshare system. Not until after my (Powerpoint-free) presentation however!

The second keynote was from Keith Jeffery of STFC (which was CCLRC until March 2007). He said many sensible things about the relationship between eprint repositories and data repositories (keep them separate, they are trying to do different things), and the relationship of both to the CRIS (the Current Research Information System). The latter is needed for research management and evaluation, the RAE etc. The interesting thing is that it contains a large amount of contextual information, which if captured automatically can make it much easier to create useful metadata cheaply. Building metadata capture into the end-to-end workflow is one of my interest areas, because the high cost of metadata is a significant barrier to the re-use of data, so I was pleased to hear this stated so coherently. Moreover he was implying that using the CERIF data model (from an EU project, I think), you can tie all these entities together in a way that makes harvesting and re-using them (à la Semantic Web) very much easier.

Two other things I’ll briefly mention. Liz Lyon spoke about her consultancy, which will be available as a report in a month or so, on Rights, Roles, Responsibilities and Relationships in dealing with data. I won’t attempt to summarise, but it looks an interesting piece of work, and one well worth watching out for (truth in advertising: Liz is a DCC colleague, based at Bath where she is also and much more importantly Director of UKOLN).

The other paper that really caught me (on about the 3rd hearing, I guess) was Simon Coles talking about the R4L (Repository for the Laboratory) project. There are so many interesting features of this project:

- the use of a private repository to keep intermediate results
- the use of health and safety plans to provide useful metadata
- the (possibly generic) probity service, where you can register your claim to a discovery
- the use of both a human blogging service (so interim results can be easily discussed within the group) and a machine blogging service (so autonomous machines continue to blog their results, unattended, for perhaps days)
- the ability to annotate results with scribbles, sketches and other marginalia
- the idea of a structured data report based on a template, from which data for publication can be extracted (and which could form the basis of a citable data object)
- the various sustainability options they have in place.

This struck me again as one of the most interesting projects (or perhaps family of projects, along with Smart Tea and the eBank group) around.