This is a rather late posting about day 2 of the JISC Repositories Conference, a week or so ago now. I mainly attended the data-oriented stream. I was very interested in the presentations from the StORe and CLADDIER projects, both of which touched on data citation. They go about this in different ways. StORe’s approach is described as “inspired by Web 2.0 approaches”, although I have not quite cracked the extent to which this applies. It appears to depend on a system distinct from both the Source data and the Output document (ie the derived or published paper); this separate system effectively holds the two-way links between the two, and so allows the links to be made without changes to either data or text. It’s maybe similar in this respect to the BioDAS (distributed annotation server) system?
CLADDIER has a different approach. They have spent a lot of time thinking about what citation means in the data world, eg to what extent data should have some status akin to “published” before it’s cited. They have explored OGC-standard format citations, and simpler textual ones. Sam Pepler for CLADDIER spoke of the different things that people want from citations:
- the research scientist wants an unambiguous reference to what was used
- the dataset creator wants the dataset cited consistently, so that uses made of it can be discovered
- the data archive wants a clear indication of what it is preserving (I’m not quite clear on this one!).
He said part of the problem is that there is no shared understanding of datasets, and pointed out that people do not cite with precision, eg “version 2 of radial products” rather than “dataset ds2033434043”. He also mentioned a model which uses “pings” to detect citations.
I found both projects very interesting; there is much more to learn here, and I believe we should be exploring both these approaches, and other alternatives.
David Shotton spoke of his Imageweb work, using semantic web technologies to integrate collections of data including scientific images, to make them more useful and usable. We are working with David through the DCC SPARC project, and I’ll write more about this later.
The data panel session (in which I participated) covered a wide range of topics linked around the theme “new opportunities for repositories to support the research life cycle: what’s missing, what’s next”. I’ve summarized some of this as follows:
- repositories to support research must exist (reference to the AHDS issue!)
- we need to tackle the question of the requirement for domain knowledge to curate scientific data, versus the ability (or otherwise) of generalist institutional repositories to start stepping up to this issue (only 5 in the UK even claim an interest in dataset content so far).
- Repositories must be easier for the scientists to use, both in deposit and in re-use (harks back to Andy Powell’s opening keynote). We need to make them part of the workflow, adding better metadata cheaply; make them sexy; and make them more powerful, with semantic web potential (SPARQL endpoints?)
- Recognizing that citations are academic currency, we need to work further on data citation
- We need more work on curation of research data within university settings (linked to emerging “big research disk capacity” systems).
- We should work harder to get annotation and comment, including options for “citizen science”, eg work being done to add herbarium metadata
- We should take emerging options to exploit “Google effects” such as the map capabilities, by providing better tools for geospatial data depositors and users, eg automatic “bounding box” extraction and presentation.
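On that last point, a minimal sketch of what automatic “bounding box” extraction might look like, assuming the deposited geospatial data can be reduced to a simple list of latitude/longitude points (the function name and data structure here are illustrative, not taken from any specific repository tool):

```python
def bounding_box(points):
    """Compute (min_lat, min_lon, max_lat, max_lon) for a
    sequence of (lat, lon) points, eg to feed a map display."""
    lats = [lat for lat, lon in points]
    lons = [lon for lat, lon in points]
    return (min(lats), min(lons), max(lats), max(lons))

# Illustrative example: three observation sites in the UK
sites = [(51.5, -0.13), (55.95, -3.19), (52.2, 0.12)]
print(bounding_box(sites))  # (51.5, -3.19, 55.95, 0.12)
```

A repository could extract such a box at deposit time and present it on a map, so users can browse datasets geographically rather than by title alone.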
The day ended with an overall panel, which inevitably included considerable discussion of the AHDS issue I have written about earlier.
I didn’t leave the conference thinking that curation of data was in the best shape, either in domain data archives or in institutional repositories. But I did leave believing there was still productive work to be done in both of those areas!