On the second day of the JISC Innovation Forum, I attended an interesting discussion in the data theme on technical infrastructures. This post derives from that discussion, but is neither a complete reflection of what was said, nor at all in the order of discussion; it reflects some bits I found interesting. My thanks to Matthew Dovey who chaired, and to all who contributed; I’ve identified none rather than only some! The session is also being blogged here…
Although early on we were talking about preservation, we came back to a discussion on immediate re-use and sharing; it seems to me that these uses are more powerful drivers than long-term preservation. If we organise things right, long-term preservation could be a cheap consequence of making things available for re-use and sharing. Motivations quoted here include data as evidence and the validation aspects of the scientific method. We were cautioned that validation from the data is hard; you will need the full data and analysis chain to be effective (more data, providing better provenance…).
Contextual metadata might include (parts of) the original proposal (disclosing purpose and motivation). Funders keep these proposals, as part of their records management systems, but they might not be easily accessible and are likely eventually to be deleted. An action for JISC might be to find ways of making parts of these proposals more appropriately available to support data.
There are issues of the amount of disclosure or discoverability that people want, or maybe should offer whether they want to or not; we touched tangentially on open notebook science. Scary, but so is sharing data, for some! Extending re-usability from re-interpretation through to integration may be a big step.
To what extent does standardisation help? Well, quite a lot, obviously, although science is slow and science is innovative, so many researchers will either have no appropriate standards to use, or not know about the standards, or have poor or no tools to implement/use the standards, or maybe find the standards less than adequate and overload them with their own private extensions or encodings. And to some extent re-use could be about the process as much as the data (someone spoke of preserving workflows, and someone of preserving protocols, although “process” may be much less formal than either…).
We had a recurring discussion about the extent to which this feeds back to being a methodological question. There was a question about the efficacy of research practice; how do we maximise positive outcomes? Should we aim to change science to achieve better curation? While on the large scale this sounds exceedingly unlikely, research being exceedingly resistant to external change, successes were reported. BADC have managed to get scientists to change their methods to collect better metadata, and maybe better data. Maybe some impact could be made via research training curricula? Another opportunity for JISC?
We talked about the selection of useful data; essential, but how can it be achieved? Some projects, such as the LHC, have this built into the design, as the volumes are orders of magnitude too large to deal with otherwise. While low-retention selection might be attractive, others were arguing elsewhere for retaining more of the data that are currently lost, to improve provenance (ie successive refinement stages).
The idea of institutional data audit was mentioned; there is a pilot data audit framework being developed now, and at least one institution in the room was participating in the pilot process. This might be a way of bringing issues to management attention, so that institutions can understand their own needs. Extending this more widely may be useful.
We talked about what a research library could do, or maybe has to start thinking of doing. On the small scale, success in digital thesis deposit is bringing problems in managing associated data files (often of widely varied types: audio, video, surveys, spreadsheets, databases, instrumentation outputs, etc). At the moment the answer is often to ZIP up the supplementary files and cross your fingers (as close to benign neglect as possible!). This is very close to the approach in use for years with paper theses, where supplementary materials were often on portable media in a pocket inside the back cover… close the volume and put it on the shelves. This might be acceptable with a trickle, but not a deluge… What would be a better approach to dealing with this growing problem (exacerbated by issues related to the well-known neologism problems of theses, ie that young researchers will likely not use applicable standards in quite the right way)? It was clear that dealing with this stuff is beyond the expertise of most librarians, and requires some sort of partnership with domain scientists. This whole area could represent an opportunity for JISC to help.
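As a very small illustration of doing slightly better than crossing your fingers (my own sketch, not something proposed in the session), a repository could at least record what is inside each ZIP at deposit time. The Python below assumes the supplementary files arrive as a single ZIP archive; the function name `inventory_zip` and the record fields are hypothetical, and the file type is only a guess from the file extension, not a proper format identification.

```python
import hashlib
import mimetypes
import zipfile


def inventory_zip(zip_path):
    """List each file in a ZIP with its size, guessed type and SHA-256.

    Hypothetical helper for illustration only: the type is guessed from
    the file extension, which is far weaker than real format identification.
    """
    records = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only files
            digest = hashlib.sha256()
            with archive.open(info) as member:
                # Read in chunks so large data files do not fill memory.
                for chunk in iter(lambda: member.read(65536), b""):
                    digest.update(chunk)
            guessed_type, _ = mimetypes.guess_type(info.filename)
            records.append({
                "name": info.filename,
                "bytes": info.file_size,
                "type": guessed_type or "unknown",
                "sha256": digest.hexdigest(),
            })
    return records
```

Calling `inventory_zip("supplement.zip")` (a made-up file name) returns a list of dictionaries that could be stored alongside the thesis record. It says nothing about what the data mean, which is where the domain scientists come in, but it does give a fixity check and a first view of how varied the deposited file types are.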
Since we were thinking of institutional repositories and larger scale subject repositories, we had a skirmish on the extent to which we need a full-blown OAIS, as it were, or a minimal, just-enough effort. It seemed in a sense the answer was: both! It does make sense to think carefully about OAIS, but also to make sure you do just enough, to ensure that high preservation costs in one area are not preventing collection in other areas (selection anyway being so fallible). A good infrastructure should help keep it cheap. There is a question of how you would know whether you were paying too much in effort, or too little. Perhaps repository audit and certification, or perhaps risk assessment, might help.
This brought us on to the issue of tool development. Most existing, large scale data centres use home-built, filestore-based systems, not very suitable for small-scale use. Existing repository software platforms are not well suited to data (square peg in round hole was quoted!). Funding developments to improve the fit for data was seen as a possible role for JISC. Adding some better OAIS capabilities at the same time might also be useful, as might linking to Virtual Research Environments such as Sakai, or to Current Research Information Systems (CRISs). Is the CERIF standard developed by EuroCRIS helpful, or not?
Overall, it was a useful session, and if JISC takes up some of the opportunities for development suggested, it should prove doubly, trebly or even more useful!