Digital Curation Blog: 2008

Tuesday, 30 December 2008

Gibbons, next generation academics, and ir-plus

Merrilee on the hangingtogether blog, with a catchup post about the fall CNI meeting, drew our attention to a presentation by Susan Gibbons of Rochester, on studying next generation academics (ie graduate students), as preparation for enhancements to their repository system, ir-plus.

I shared a platform with Susan only last year in Sydney, and I was very impressed with their approach of using anthropological techniques to study and learn from their user behaviour. In fact, her Sydney presentation probably had a big influence on the ideas for the Research Repository System that I wrote about earlier (which I should have acknowledged at the time).

Now their latest IMLS-funded research project takes this even further, as preparation for collaborative authoring and other tools integrated with the repository. If you're interested in making your repository more relevant to your users, the report on the user studies they have undertaken so far is a Must-Read! I don't know whether the code at http://code.google.com/p/irplus/ is yet complete...

Friday, 19 December 2008

The 12 files of Christmas

(Well, half a dozen, maybe.) OK, it’s Christmas, or the Holiday Season if you really must, and I may have a present for you!

My readers will know that I have an interest in obsolete and obsolescent data and files. I’ve tried in the past to get people to tell me about their stuff that had become inaccessible, with only limited success. So, after our work Christmas lunch, with the warm glow of turkey and Christmas pudding still upon me, and a feeling of general goodwill, I thought I’d offer you an opportunity to give me a challenge.

I will do my best to recover the first half dozen interesting files that I’m told about… of course, what I really mean is that I’ll try and get the community to help recover the data. That’s you!

OK, I define interesting, and it won’t necessarily be clear in advance. The first one of a kind might be interesting, the second one would not. Data from some application common in its time may be more interesting than something hand-coded by you, for you. Data might be more interesting (to me) than text. Something quite simple locked onto strange obsolete media might be interesting, but then again it might be so intractable it stops being interesting. We may even pay someone to extract files from your media, if it’s sufficiently interesting (and if we can find someone equipped to do it).

The only reference for this sort of activity that I know of is (Ross & Gow, 1999, see below), commissioned by the Digital Archiving Working Group.

What about the small print? Well, this is a bit of fun with a learning outcome, but I can’t accept liability for what happens. You have to send me your data, of course, and you are going to have to accept the risk that it might all go wrong. If it’s your only copy, and you don’t (or can’t) take a copy, it might get lost or destroyed in the process. You’ll need to accept that risk; if you don't like it, don't send it. I might not be able to recover anything at all, for many reasons. I’ll send you back any data I can recover, but can’t guarantee to send back any media.

The point of this is to tell the stories of recovering data, so don’t send me anything if you don’t want the story told. I don’t mind keeping your identity private (in fact good practice says that’s the default, although I will ask you if you mind being identified). You can ask for your data to be kept private, but if possible I’d like the right to publish extracts of the data, to illustrate the story.

Don’t send me any data yet! (It’s like those adverts: send no money now!) Send me an email with a description of your data. I’m not including my email address here, but you can find it out easily enough. Include “12 files of Christmas” in your subject line. Tell me about your data: what kind of files, what sort of content, what application, what computer/operating system, what media, roughly what size. Tell me why it’s important to you (eg it’s your thesis or the data that support it, your first address book with long-lost relative’s address on it, etc). Tell me if (no, tell me that) you can grant me the rights to do what’s necessary (making copies or authorising others to make copies, making derivative works, etc).

Is that all? Oh, we need a closing date… that had better be twelfth night, so 11:59 pm GMT on 5 January, 2009 (according to Wikipedia)!

Any final caveats? Yes; I have not run this past any legal people yet, and they may tell me the risks of being sued are too great, or come up with a 12-page contract you wouldn’t want to sign. If that happens, I’ll come back here and ‘fess up.

Ross, S., & Gow, A. (1999). Digital Archaeology: Rescuing Neglected and Damaged Data Resources (Study report). London: University of Glasgow.

Sustainable preservation and access : Interim report

I'm a member of the grandly titled Blue Ribbon Task Force on Sustainable Digital Preservation and Access, which has just released the interim report of its first year's scoping activity. Neil Beagrie spotted the press release, and blogged it earlier.

As the Executive Summary says

"There is general agreement that digital information is fundamental to the conduct of modern research, education, business, commerce, and government. Future accomplishments are accelerated through persistent access to data and digital materials, and their use, re-use, and re-purposing in ways both known and as yet unanticipated.

"There is no general agreement, however, about who is responsible and who should pay for the access to, and preservation of, valuable present and future digital information. "

So that's the focus of the Task Force. It's been a really interesting group to be on; for various reasons, I only managed to get to one of its 3 face to face meetings this year, but there's a wide range of views and skills there (ranging from economists, not at all dismal, to representatives from various digital communities). There was also the interesting approach of taking "testimony" from a number of experts in the field: asking them to come, make short presentations based around some questions we asked them, and to join in the debate. And of course, the inevitable regular telephone conference, as always at an inconvenient time (spanning so many time zones). But, as I say, the conversations have always been interesting, and the discussion yesterday on the methodology for the second year, leading to the final report, was fascinating.

Just one last quote, this time from the Preface:

"The sustainability of the long-term access and preservation of digital materials is a well-known challenge, and discussion frequently focuses on the difficult technical issues. Less clearly articulated are the organizational and economic issues implicit in the notion of sustainability which, at the risk of over-simplification, come down to two questions: How much does it cost? and Who should pay?"

I guess the over-simplified answer to the last question is: if we want it, we all will.

Survey on malformed PDFs?

A DCC-Associates member asks

"Does anyone know there has been a study to estimate how many PDF documents
do not comply with the PDF standards?"

I've not heard of anything, nor can I find one with my best Google searches, but it's a particularly hard question to ask Google! So, if you know of anything that has happened or is in progress, please leave a comment here. Thanks,
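In the absence of a known study, here is a minimal sketch of how one might begin such a count. It is only a crude first-pass screen, not a compliance check: it tests for the `%PDF-` header at the start of the file and an `%%EOF` marker near the end. The function names and the 1024-byte tail window are my own assumptions for illustration; a real survey would need a proper validator.

```python
def looks_like_pdf(data: bytes, tail_window: int = 1024) -> bool:
    """Crude screen: '%PDF-' header at the start and '%%EOF' near the end.

    Passing this test does NOT mean the file complies with the PDF standard;
    failing it only flags the file as a candidate for closer inspection.
    """
    has_header = data.startswith(b"%PDF-")
    # Look for the end-of-file marker only in the last tail_window bytes
    # (the window size is an illustrative assumption).
    has_eof = b"%%EOF" in data[-tail_window:]
    return has_header and has_eof


def malformed_fraction(files: list[bytes]) -> float:
    """Fraction of files that fail the crude screen (0.0 for an empty list)."""
    if not files:
        return 0.0
    failures = sum(1 for data in files if not looks_like_pdf(data))
    return failures / len(files)
```

To use it on real files, read each one in binary mode (for example with `pathlib.Path.read_bytes`) and pass the contents in.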

Saturday, 13 December 2008

What makes up data curation?

Following some discussion at the Digital Curation Conference in Edinburgh, how about this:

Data Curation comprises

Data management
Adding value to data
Data sharing for re-use
Data preservation for later re-use

Is that a good breakdown for you? Should data creation be in there? I tend to think that data creation belongs to the researcher; once created, the data immediately falls into the data management and adding value categories.

Adding value will have many elements. I’m not sure if collecting and associating contextual metadata is part of adding value, or simply good data management!

Heard at the conference: context is the new metadata (thanks Graeme Pow!).

Wednesday, 10 December 2008

IDCC4: DCC Blawg "consent" post

Just a little plug here; my colleague Mags McGeever, who provides legal information for the DCC, has her own blog (The DCC Blawg), and has recently been providing a bit more comment and opinion rather than just information (that's not legal opinion, mind!). Her recent post on the variant of "informed consent" used by Generation Scotland, mentioned by David Porteous in his opening Keynote at the Digital Curation Conference, is interesting, and some of her other recent articles are worth a look, too.

Tuesday, 9 December 2008

Comments on OAIS responses to our comments on OAIS

Yes, it sounds weird and it was, a bit. One of the workshops at the International Digital Curation Conference was to consider the proposed "dispositions" to the DCC/DPC comments on OAIS, made around two years ago! Sarah Higgins of the DCC and Frances Boyle of the DPC had an initial look and tried to work out which proposed dispositions we might have an issue with. A couple of the original participants made written comments in advance, but the rest of us (the original group or their successors, plus one slightly bemused visitor) had less than 3 hours to hammer through and see whether we could improve on the dispositions. The aim was to identify areas where we really felt the disposition was wrong, to the extent that it would seriously weaken the standard, AND we were able to provide comments of the form "delete xxxx, insert yyyy". We did identify a few such, but were also pleased to discover that some areas where we thought there would be no change would in fact receive significant revision (although we haven't yet seen the revision).

We understand that the process now is for the OAIS group (MOIMS-DAI) to finalise their text, and then put it out for a second round of public comment early next year. After that, it will go through the CCSDS review process (ie it gets voted on by Space Agencies interested in Mission Operations Information), before going on to ISO, where it gets voted on by National Bodies involved in its Technical Committee! So, don't hold your breath!

IDCC4 afternoon research paper session

I really want to write about the afternoon research papers at the International Digital Curation Conference, but I start from a great disadvantage. First, I wanted to hear papers in BOTH the parallel sessions, so I ended up dashing between the two (and of course, getting lost)! Then my battery went flat, so no notes, and I left the laptop in the first lecture room while going off to the second… and of course, near the end of that paper, someone came in saying a couple of laptops had been stolen. Panic, mine was OK, thank goodness (when did I last do a proper backup? Ahem!).

I’ve reported on the Australian ARCHER and ANDS topics before; the papers were a little different, but the same theme. The paper I really wanted to hear was Jim Frew on Relay-Supporting Archives. Frew’s co-author was Greg Janee; both visited the DCC, Frew for a year and Janee for a few days, talking about the National Geospatial Data Archive. I thought this was a great paper, but I kept thinking “I’m sure I said that”. Well, maybe I did, but maybe I got the ideas from one or other of them! They have a refreshingly realistic approach, recognising that archive survivability is a potential problem. So if content is to survive, it needs to be able to move from archive to successor archive, hence the relay. An archive may then have to hand on content that it received from another archive, ie intrinsically unfamiliar content. How’s this for a challenging quote: “Resurrection is more likely than immortality”! He looked at too-large standards (eg METS), and too-small standards (eg OAI-ORE), and ended up with the just-right standard: the NGDA data model (hmmm, I was with him right up till then!).

The conclusion was interesting, though:

"Time for a New AIHT
Original was technology neutral
New one should test a specific set of candidate interoperability technologies"

The conference closed with an invigorating summing up from Malcolm Atkinson, e-Science Envoy. Slides not available yet; I’ll update when they are. He sent us off with a rousing message: what you do is important, get to it!

See you next year in London?

Strand B1 research papers at IDCC4

In the morning parallel session B at the International Digital Curation Conference, the ever-interesting Jane Hunter from the University of Queensland began the session speaking about her Aus-e-Lit project (linked to Austlit). The project is based on FRBR, and offers federated search interfaces to related distributed databases. They include tagging and annotation services, and compound object authoring tools… based on an OAI-ORE based tool called Literature Online Research Environment (LORE). These compound objects, published to the semantic web as RDF named graphs, can express complex relationships between authors and works that represent parts of literary history, tracking lineage of derivative works and ideas, scholarly editions and research trails, and can also be used for teaching and learning objects. There were of course problems: the need for unique identifiers for non-information resources; use of local identifiers; use of non-persistent identifiers; community concern about ontology complexity… but also a desire for more complexity (wasn’t it ever thus)!

Hunter was followed by Kai Naumann, from the State Archive of Baden-Württemberg. They have many issues, starting with the very wide range of objects from paper, microfilm etc and beginning digital objects, of which a large number are now being ingested. Need to find resources with finding aids regardless of media and object type. Need to maintain and assure authenticity and integrity. They chose to use the PREMIS approach to representations (which reminded me again how annoying the clash of PREMIS terminology with that of OAIS is!). It seemed to be a very sophisticated approach. Naumann suggested that preservation metadata models should balance 3 aims: instant availability, easy ingest and long-term understandability. For low use (as in real archives), with heterogeneous objects, you need to design relatively simple metadata sets. It’s important to maintain relational integrity between the content and metadata (I’m not sure that has a strict, relational database meaning!). Structural relations between content units can be critical for authenticity.

Jim Downing of Cambridge came next, speaking for Pete Sefton, USQ (although he admitted he was a pale, short, damp substitute for the real thing!). The topic was embedding metadata & semantics in documents; they work together on the JISC-funded ICE-Theorem project. We need semantically rich documents for science, but most documents start in a (business-oriented) word processor. Semantically rich documents enable automation, reduce information loss, have better discovery and presentation. Unfortunately, users (and I think, authors) don’t really distinguish between metadata, semantics and data. Document and metadata creation are not really separable. Documents often have multiple, widely separated authors, particularly in the sciences. Their approach is to make it work with MS Word & OpenOffice Writer, & use Sefton’s ICE system. They need encodable semantics that are round-trip survivable: if created by a rich tool, then processed with a vanilla tool then checked again with the rich tool, the semantics should still be present, for real interoperability. Things they thought of but didn’t do: MS Word Smart Tags, MS Word foreign namespace XML, ODF embedded semantics, anything that breaks WYSIWYG (such as wiki markup in document), new encoding standards. What does work? An approach based on microformats (I thought simply labelling it as microformats was a bit of an over-statement, given the apparent glacial pace of “official” microformat standardisation!). They will overload semantics into existing capabilities, eg tables, styles, links, frames, bookmarks, fields (some still fragile).

Paul McKeown from EMC Corp gave a paper written by Stephen Todd on XAM, a standard newly published by SNIA. It originated from the SNIA 100 year archive survey report. Separating logical and physical data formats, and the need over time to migrate content to new archives. Location-independent object naming (XUID), with rich metadata, and pluggable architecture for storage system support. Application tool talks to XAM layer which abstracts away the vendor implementations. 3 primary objects: XAM library (object factory?), XSet, XStreams, etc. Has a retention model. (Sorry Paul, Stephen, I guess I was getting tired at this point!)

Martin Lewis on University Libraries and data curation

Martin Lewis opened the second day of the International Digital Curation Conference with a provocative and amusing keynote on the possible roles of libraries in curating data. It was very early, with his presentation [large PPT] starting at 8:40 am, and the audience after the conference dinner in the splendid environs of Edinburgh Castle was unsurprisingly thin. However, absentees missed an entertaining and thought-provoking start to the day. Martin is great at the provocative remark, usually tongue in cheek; perhaps not all our visitors caught the ironic tone, a peril of speaking thus to an international audience (eg “this slide shows a spectrum of the disciplines, from the arts and humanities on this side, to the subjects with intellectual rigour over here”!).

He mentioned some common keywords when people think of libraries: conservation, preservation, permanent, but perhaps seen as intimidating, risk-averse, conservative. What can libraries do in relation to data? He promised us 8 things, but then came up with a 9th:

Raise awareness of data issues,
lead policy on data management,
provide advice to researchers about their data management early in life cycle,
work with IT colleagues to develop appropriate local data management capacity,
collectively influence UK policy on research data management,
teach data literacy to research students,
develop the knowledge base of existing staff,
work with LIS educators to identify required workforce skills (see Swan report), and
enrich UG learning experience through access to research data.

Note, this list does not (currently) include much direct, practical support for data curation or preservation: no value-added data repositories, etc. Although a few libraries are venturing gently into this area, without extra support he believes that libraries are too stretched to be able to take on such a challenge. The library sector is stretched very thin, bearing in mind the major changes to accommodate the change to digital publishing, and large investments in improving support for the teaching and learning side of things, including his own library's new building (which is leading to increasing library footfall, and increasing loans of non-digital materials, which must still be managed).

(It struck me later, by the way, that this presentation was in a way quintessentially British, or at least non-American, in the extent of its attention to these non-research elements; American chief librarian colleagues have seemed to be focused on the research library and less interested in the under-graduate library.)

So bearing the resource problems in mind, Lewis took us through the Library/IT response to the English Funding Council’s Shared Services programme: the UKRDS feasibility study. He noted the scale of the data challenge, still poorly served in policy & infrastructure, the major gaps in current data services provision. University Library and IT Services in several institutions are coming under pressure to provide data-oriented services, mainly just now for storage (necessary but not sufficient for curation). He reminded us that the UK eScience core programme was a world leader; we had many reports to refer to, especially including Liz Lyon’s “Dealing with Data” report, but we are getting to the point where we need services not projects.

The feasibility study surprisingly shows low levels of use of national facilities; most data curation and sharing happens within the institution. The feasibility study identified 3 options:

do nothing,
massive central service, or
coordinated national service.

Despite an amusing side excursion exploring the imagined SWAT teams of a massive central service, swarming in from their attack helicopters to gather up neglected data, the last alternative is a clear winner for the study.

There was a strong evidence base of gaps in data curation service provision in the UK. Cost savings were hard to calculate in a new area, but were compared with the potential alternative of ubiquitous provision in universities for their own data. UKRDS would seek to embrace rather than replace existing services, but might provide them with additional opportunities. Next steps were to publish their report, which would recommend (and he hopes extract the funds to enable) a Pathfinder phase (operational, not pilot).

During questions, the inevitable from a representative of a discipline area with already well-established support (and I paraphrase): “how can you do the real business of curation, namely adding value to collected datasets, when you are at best generalists without real contact with the domain scientists necessary for curation?” To which, my response is “the only possible way: in partnership”!

Monday, 8 December 2008

Wilbanks on the Control Fallacy: How Radical Sharing out-competes

Closing the first day of the International Digital Curation Conference, and as a prelude to a substantial audience discussion, John Wilbanks from Science Commons outlined his vision and his group’s plans and achievements. His slides are available on Slideshare and from the IDCC web site.

John suggests that the only real alternative to radical sharing is inefficient sharing! Science (research and scholarship, to be more general) is in a way not unlike a giant Wikipedia, an ever-changing consensus machine, based on publishing (disclosing, making things public); advances by individual action, and by discrete edits (ie small changes to the body of research represented by individual research contributions). Unlike Wikipedia the Science consensus machine is slow and expensive, but it does have strong authentication and trust. It represents an “inefficient and expensive ecosystem of processes to peer-produce and review scholarly content”.

However, disruptive processes can’t be planned for, and when they occur, attract opposition from entrenched interests (open access is one such disruptive system). So a scholarly paper may be thought of as “an advertisement for years of scholarship” (I think he gave a reference but I can’t find it). International research forms a highly stable system, good at resisting change on multiple levels, and in many cases this is a Good Thing. This system includes Copyright (Wilbanks suggested IPR was an unpopular term, new to me but perhaps a US perspective); traditionally, in the analogue world, copyright locks up the container, not the facts. New publishing business models lock up even more rights on “rented” information. If we can deposit our own articles (estimated cost 40 minutes per researcher per year), then we can add further services over our own material (individually or collectively), including tracking use and re-use; then we can out-compete the non-sharers.

Science Commons has been working for the past 2 years focusing their efforts in both the particular (one research domain, building the Neurocommons) and the general; their approach requires sharing. They discovered two things quite early on: first that international rules on intellectual property in data vary so widely that building a general data licence to parallel the Creative Commons licences for text was near impossible, and second that viral licences (like the Creative Commons Share-Alike licences, and the GPL) act against sharing, since content under different viral licences cannot be mixed. So their plan is to try putting their data into the public domain as the most free approach. They use a protocol (not licence) for implementing open data. Reward comes through trademark, badging as part of the same “tribe”. Enforcement doesn’t apply; community norms rule, as they do in other areas of scholarly publishing. For example attribution is a legal term of art (!!), but in scholarly publishing we prefer the use of citations to acknowledge sources, with the alternative for an author being possible accusations of plagiarism, rather than legal action (in most cases).

There was some more, on the specifics of their projects, on trying to get support in ‘omics areas for a semantic web of linked concepts, built around simple, freely alterable ontologies; how doing this will help Google Scholar rank papers better. Well worth a look at the slides (worth while anyway, since I may well have got some of this wrong!). But Wilbanks ended with a resounding rallying cry: Turn those locks into gears! Don’t wait, start now.

Friday, 5 December 2008

Bryan Lawrence on metadata as limit on sustainability

Opening the Sustainability session at the Digital Curation Conference, Bryan Lawrence of the Centre for Environmental Data Archival and the British Atmospheric Data Centre (BADC), spoke trenchantly (as always) on sustainability with specific reference to the metadata needed for preservation and curation, and for facilitation for now and the future. Preservation is not enough; active curation is needed. BADC has ~150 real datasets but thousands of virtual datasets, tens of millions of files.

Metadata, in his environment, represents the limiting factor. A critical part of Bryan’s argument on costs relates to the limits on human ability to do tasks, particularly certain types of repetitive tasks. We will never look at all our data, so we must automate, in particular we must process automatically on ingest. Metadata really matters to support this.

Bryan dashed past an interesting classification of metadata, which from his slides is as follows:

A – Archival (and I don’t think he means PREMIS here: “normally generated from internal metadata”)
B – Browse: context, generic, semantic (NERC developing a schema here called MOLES: Metadata Objects for Linking Environmental Sciences)
C – Character and Citation: post-fact annotation and citations, both internal and external
D – Discovery metadata, suitable for harvesting into catalogues: DC, NASA-DIF, ISO19115/19139 etc
E – Extra: Discipline-specific metadata
O – Ontology (RDF)
Q – Query: defined and supported text, semantic and spatio-temporal queries
S – Security metadata

The critical path relates to metadata, not content; it is important to minimise the need for human intervention, and this means minimising the number of ingestion systems (specific processes for different data types and data streams), and to minimise the types of data transformations required (problem is validating the transformations). So this means that advice from data scientists TO the scientists is critical before creation; hence the data scientist needs domain knowledge to support curation.

Can choose NOT TO TAKE THE DATA (but the act of not taking the data is resource intensive). Bryan showed and developed a cost model based on 6 components; it’s worth looking at his slides for this. But the really interesting stuff was on his conclusions on limits, with 25 FTE:

"• We can support o(10) new TYPES of data stream per year.
• We can support doing something manually if it requires at most:
– o(100) activities of a few hours, or
– o(1000) activities of a few minutes, but even then only when supported by a modicum of automation.
• If we have to [do] o(10K) things, it has to be completely automated and require no human intervention (although it will need quality control).
• Automation takes time and money.
• If we haven’t documented it on ingestion, we might as well not have it …
• If we have documented it on ingestion, it is effectively free to keep it …
• … in most cases it costs more to evaluate for disposal than keeping the data.
• (but … it might be worth re-ingesting some of our data)
• (but … when we have to migrate our information systems, appraisal for disposal becomes more relevant)”
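To make those limits concrete, here is a minimal sketch that sorts a batch of similar ingest tasks into a handling mode. The order-of-magnitude counts come from the slide quoted above; the hard cut-offs and the readings of "a few hours" (up to 180 minutes) and "a few minutes" (up to 5 minutes) are my own assumptions, as is the function name.

```python
def handling_mode(n_activities: int, minutes_each: float) -> str:
    """Rough handling mode for a batch of similar tasks.

    Thresholds are illustrative readings of o(100), o(1000) and o(10K);
    they are not figures from the BADC cost model.
    """
    few_hours = 180   # assumed upper bound for "a few hours", in minutes
    few_minutes = 5   # assumed upper bound for "a few minutes"

    if n_activities >= 10_000:
        # At this scale there can be no human intervention,
        # although quality control is still needed.
        return "fully automated"
    if n_activities <= 100 and minutes_each <= few_hours:
        return "manual"
    if n_activities <= 1_000 and minutes_each <= few_minutes:
        return "manual with some automation"
    return "beyond manual capacity"
```

For example, 50 tasks of two hours each come out as "manual", while 500 tasks of two hours each come out as "beyond manual capacity", which is the point of the slide: the count matters as much as the effort per item.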

Interestingly, they charge for current storage costs for 3 years (only) at ingest time; by then, storage will be “small change” provided new data, with new storage requirements, keep arriving. Often the money arrives first and the data very much later, so they may have to be a “banker” for a long time. They have a core budget that covers administration, infrastructure, user support, and access service development and deployment. Everything changes next year however, with their new role supporting the Inter-Governmental Panel on Climate Change, needing Petabytes plus.
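As a rough illustration of the charging idea (not BADC's actual model), the sketch below computes a one-off ingest charge as three years of storage at current prices, and shows what the same dataset would cost to hold in a later year if the unit cost of storage fell by a fixed fraction annually. The function names, the prices and the rate of decline are all hypothetical.

```python
def ingest_charge(volume_tb: float, cost_per_tb_year: float,
                  years_charged: int = 3) -> float:
    """One-off charge at ingest: current annual storage cost times years charged."""
    return volume_tb * cost_per_tb_year * years_charged


def later_annual_cost(volume_tb: float, cost_per_tb_year: float,
                      annual_decline: float, years_later: int) -> float:
    """Annual cost of holding the same data after unit costs have fallen.

    annual_decline is an assumed fractional fall in cost per TB each year
    (for example 0.5 means the cost halves every year).
    """
    return volume_tb * cost_per_tb_year * (1 - annual_decline) ** years_later
```

With made-up figures of 10 TB at 100 per TB per year, the ingest charge is 3000; if unit costs halved each year, holding the same 10 TB three years on would cost 125 a year, which is the sense in which old data becomes "small change" next to the new arrivals.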

Thursday, 4 December 2008

International Digital Curation Conference Keynote

I'm not sure how much I should be blogging about this conference, given that the DCC ran it, and I chaired quite a few sessions etc. But since I've got the scrappy notes, I may try to turn some of them into blog posts. I've spotted blog postings from Kevin Ashley on da blog, and Cameron Neylon on Science in the Open, so far.

Excellent keynote from David Porteous, articulate and passionate supporter of the major “Generation Scotland” volunteer-family-based study of Scottish Health (notoriously variable, linked to money, diet, and smoking, and in some areas notoriously poor). Some interesting demographic images based on Scottish demography 1911, 1951, 2001 and projected to 2031, show the population getting older as we know. It’s not only the increasing tax burden on a decreasing proportion of workers, but also the rise of chronic disease. The fantastically-named Grim Reaper’s Road Map of Mortality shows the very unequal distribution, with particular black spots in Glasgow. About half of these effects are “nurture” and about half are “nature”, but really it’s about the interplay between them.

He spoke next about genetics, sequencing and screening. Genome sequencing took 13 years and cost $3B, now a few weeks and $500K, next year down to a few $K? Moving to a system where we hope to identify individuals at risk, start health surveillance, understand the genetic effects and target rational drug development, and so, we hope, reduce bad reactions to drugs.

Because of population stability, the health and aging characteristics, and a few legal and practical issues (such as cradle-to-grave health records), it turns out that Scotland is particularly suited to this kind of study. Generation Scotland is a volunteer, family-based study (illustration from the Broons!). There are Centres in Edinburgh, Glasgow, Dundee, Aberdeen; no-one should be more than an hour’s travel away etc. Major emphasis on data collection, curation and integration through an integrated laboratory management system, in turn linking to the health service and its records. Major emphasis on security and privacy. Consent is open consent (rather than informed consent), but all have a right to withdraw from the study, and must be able to withdraw any future use of their data (only 2 out of nearly 14,000 have withdrawn so far!).

This wasn’t a talk about the details of curation, but it was an inspiring example of why we care about our data, and how, when the benefits are great enough and the planning and careful preparation are good enough, even major legal obstacles can be overcome.

Sunday, 23 November 2008

Edinburgh IT futures workshop

I spent Friday morning at a workshop entitled “Research, wRiting and Reputation” at Edinburgh University. Pleasingly, data form quite a large part of the programme (although I’ll miss the afternoon talks, including Simon Coles from Southampton). First speaker is Prof Simon Tett of Geosciences talking about data curation and climate change. Problem is that climate change is slow, relative to changes in the observation systems, not to mention changes in computation and data storage. Baselines tend to be non-digital, eg 19th century observations from East India company shipping: such data need to be digitised to be available for computation. By comparison, almost all US Navy 2nd World War meteorological log books were destroyed (what a loss!). He’s making the point that paper records are robust (if cared for in archives) but digital data are fragile (I have problems with some of these ideas, as you may know: both documents and digital data have different kinds of fragility and robustness… the easy replicability of data compared with paper being a major advantage). Data should be looked after by organisations with a culture of long term curatorship: not researchers but libraries or similar organisations. The models themselves need to be maintained to be usable, although if continually used they can continue to be useful (but not for so long as the data). Not just the data, but the models need curation and provenance.

Paul Anderson of Informatics talking about software packages as research outputs. His main subject is a large scale UNIX configuration system, submitted by Edinburgh as part of the RAE. It’s become an Open Source system, so there are contributions from outside people in it as well. Now around 23K lines of code. Interesting to be reminded how long-lived, how large such projects can be, and how magically sustainable some of the underlying infrastructure appears to be. What would happen if SourceForge or equivalents failed? However, it does strike me that there are lessons to be learned by the data curation community from the software community, particularly from these large scale, long lived open source projects. Quite what these lessons are, I’m not sure yet!

Four speakers followed, three from creative arts (music composition and dance), and the fourth from humanities. The creative arts folks partly had problems with the recognition (or lack of recognition) of their creative outputs as research by the RAE panels. Their problems were compounded by a paucity of peer-reviewed journals, and the static, paper-oriented nature even of eJournals. The composer had established a specialist music publisher (Sumtone), since existing publishers are reluctant to handle electronic music. Interestingly, they offer their music with a Creative Commons licence.

The dance researchers had different problems. Complexity relates to temporality of dance. It’s also a young discipline, only 25 years or so, whereas dance is a very ancient form of art (although of course documentation of that is static). Dance notation is very specialised; many choreographers and many dancers cannot notate! Only a very small set of journals are interested, few books etc. Lots of online resources however. Intellectual property and performance rights are major issues for them, however.

The final researcher in this group was interested in integrating text with visual data and multimodal research objects. Her problems seemed to boil down to limitations of traditional journals in terms of the numbers and nature (colour, size, location etc) of images allowed; these restrictions themselves affected the nature of the argument she was able to make.

OK, these last few are at least partly concerned at the inappropriateness of static journals for their disciplines. Even 99% of eJournals are simply paper pages transported in PDF over the Internet. Why this still should be, 12 years after Internet Archaeology started publishing, as a peer-reviewed eJournal specifically designed to exploit the power of the Internet to enhance the scholarly content, beats me! Surely, just after the last ever RAE is exactly the time to create multimedia peer-reviewed journals to serve these disciplines. They’ll take time to get established, and to build up their citations and impact-factor (or equivalent). Does the new REF really militate so strongly against this? What else have I missed?

Thursday, 20 November 2008

Curation services based on the DCC Curation Lifecycle Model

I’ve had a go at exploring some curation services that might be appropriate at different stages of a research project. I thought it might also be worth trying to explore curation services suggested by the DCC Curation Lifecycle Model, which I also mentioned in a blog post a few weeks ago.

The model claims that it “can be used to plan activities within an organisation or consortium to ensure that all necessary stages are undertaken, each in the correct sequence… enables granular functionality to be mapped against it; to define roles and responsibilities, and build a framework of standards and technologies to implement”. So it does look like an appropriate place to start.

At the core of the model are the data of interest. These should absolutely be determined by the research need, but there are encodings and content standards that improve the re-usability and longevity of your data (an example might be PDF/A, which strips out some of the riskier parts of PDF and which should make the data more usable for longer).

Curation service: collected information on standards for content, encoding etc in support of better data management and curation.
Curation service: advice, consultancy and discussion forums on appropriate formats, encodings, standards etc.

The next few rings of the model include actions appropriate across the full lifecycle. “Description and Representation Information” is maybe a bit of a mouthful, but it does cover an important step, and one that can easily be missed. I do worry that participants in a project in some sense “know too much”. Everyone in my project knows that the sprogometer raw output data is always kept in a directory named after the date it was captured, on the computer named Coral-sea. It’s so obvious we hardly need to record it anywhere. And we all know what format sprogometer data are in, although we may have forgotten that the manufacturer upgraded the firmware last July and all the data from before was encoded slightly differently. But then 2 RAs leave and the PI is laid up for a while, and someone does something different. You get the picture! The action here is to ensure you have good documentation, and to explicitly collect and manage all the necessary contextual and other information (“metadata” in library-speak, “representation information” when thinking of the information needed to understand your data long term).

Curation service: sharable information on formats and encodings (including registries and repositories), plus the standards information above.
Curation service: Guidance on good practice in managing contextual information, metadata, representation information, etc.
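
To make the "know too much" problem concrete, here is a minimal sketch (my own illustration, not a DCC tool or any community standard) of writing down the context everyone assumes, as a small sidecar file sitting next to each day's capture directory. The function name, the file name CONTEXT.json and every field name are hypothetical; a real project would use whatever metadata conventions its domain already has.

```python
import json
from datetime import date
from pathlib import Path


def write_context_record(capture_dir, instrument, host, data_format,
                         firmware, notes=""):
    """Write a small JSON sidecar describing the data in capture_dir.

    All field names here are hypothetical placeholders, chosen only
    to illustrate the kind of context worth recording explicitly.
    """
    capture_dir = Path(capture_dir)
    record = {
        "instrument": instrument,          # e.g. the sprogometer
        "captured_on_host": host,          # e.g. Coral-sea
        "data_format": data_format,        # how the raw output is encoded
        "firmware_version": firmware,      # encoding changed with firmware
        "record_written": date.today().isoformat(),
        "notes": notes,
    }
    sidecar = capture_dir / "CONTEXT.json"
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Calling this once per capture directory costs seconds, and it is exactly the firmware-upgrade detail that nobody remembers after the RAs have left.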

Preservation (and curation) planning are necessary throughout the lifecycle. Increasingly you will be required by your research funder to provide a data management plan.

Curation service: Data Management Plan templates
Curation service: Data Management Plan online wizards?
Curation service: consultancy to help you write your Data Management Plan?

Community Watch and Participation is potentially a key function for an organisation standing slightly outside the research activity itself. The NSB Long-lived data report suggested a “community proxy” role, which would certainly be appropriate for domain curation services. There are related activities that can apply in a generic service.

Curation service: Community proxy role. Participate in and support standards and tools development.
Curation service: Technology watch role. Keep an eye on new developments and, critically, on obsolescence that might affect preservation and curation.

Then the Lifecycle model moves outwards to “sequential actions”, in the lifecycle of specific data objects or datasets, I guess. As noted before, curation starts before creation, at the stage called here “conceptualise: Conceive and plan the creation of data, including capture method and storage options”. Issues here perhaps covered by earlier services?

We then move on to “doing it”, called here “Create and Receive”. Slightly different issues depending on whether you are in a project creating the data, or a repository or data service (or project) getting data from elsewhere. In the former case, we’re back with good practice on contextual information and metadata (see service suggested above); in the latter case, there’s all of these, plus conformance to appropriate collecting policies.

Curation service: Guidance on collection policies?

Appraisal is one of those areas that is less well-understood outside the archival world (not least by me!). You’ve maybe heard the stories that archivists (in the traditional, physical world) throw out up to 97% of what they receive (and there’s no evidence they were wrong, ho ho). But we tend to ignore the fact that this is based on objective, documented and well-established criteria. Now we wouldn’t necessarily expect the same rejection fraction to apply to digital data; it might for the emails and ordinary document files for a project, but might not for the datasets. Nevertheless, on reasonable, objective criteria your lovingly-created dataset may not be judged appropriate for archiving, or selected for re-use. My guess is that this would often be for reasons that might have been avoided if different actions had been taken earlier on. Normally in the science world, such decisions are taken on the basis of some sort of peer review.

Curation service: Guidance on approaches to appraisal and selection.

At this point, the Lifecycle model begins to take on a view as from a Repository/Data Centre. So whereas in the Project Life Course view, I suggested a service as “Guidance on data deposit”, here we have Ingest: the transfer of data in.

Curation service: documented guidance, policies or legal requirements.

Any repository or data centre, or any long-lived project, has to undertake actions from time to time to ensure their data remain understandable and usable as technology moves forward. Cherished tools become neglected, obsolescent, obsolete and finally stop working or become unavailable. These actions are called preservation actions. We’ve all done them from time to time (think of the “Save As” menu item as a simple example). We may test afterwards that there has been no obvious corruption (and will often miss something that turns up later). But we’ve usually not done much to ensure that the data remain verifiably authentic. This can be as simple as keeping records of the changes that were made (part of the data’s provenance).

Curation service: Guidance on preservation actions
Curation service: Guidance on maintaining authenticity
Curation service: Guidance on provenance
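
As a rough sketch of "keeping records of the changes that were made", the following assumes a preservation action that turns one file into another (a "Save As", say) and appends a line to a plain-text log with checksums of both. The function names and log fields are my own invention. Note the limits: a checksum only shows that a file has not changed since the record was made; it says nothing about whether the migration itself lost information.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_preservation_action(source, result, action, log_path):
    """Append one provenance entry (as a JSON line) to log_path.

    The entry layout is hypothetical, not taken from any standard.
    """
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "action": action,                     # free-text description
        "source": str(source),
        "source_sha256": sha256_of(source),   # state before the action
        "result": str(result),
        "result_sha256": sha256_of(result),   # state after the action
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

Re-running sha256_of on the result file at some later date and comparing with the logged value gives a simple check that nothing has silently changed in between.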

I noted previously that storing data securely is not trivial. It certainly deserves thoughtful attention. At worst, failure here will destroy your project, and perhaps your career!

Curation service: Briefing paper on keeping your data safe?

Making your data available for re-use by others isn’t the only point of curation (re-use by yourself at a later date is perhaps more significant). You may need to take account of requirements from your funder or publisher or community on providing access. You will of course have to remain aware of legal and ethical restrictions on making data accessible. The Lifecycle Model calls these issues “Access, Use and Reuse”.

Curation service: Guidance on funder, publisher and community expectations and norms for making data available
Curation service: Guidance on legal, ethical and other limitations on making data accessible
Curation service: Suggestions on possible approaches to making data accessible, including data publishing, databases, repositories, data centres, and robust access controls or secure data services for controlled access
Curation service: provision of data publishing, data centre, repository or other services to accept, curate and make available research data.

Combining and transforming data can be key to extracting new knowledge from them. This is perhaps situation-specific, rather than curation-generic?

The Lifecycle model concludes with 3 “occasional actions”. The first of these is the downside of selection: disposal. The key point is “in accordance with documented policies, guidance or legal requirements”. In some cases disposal may be to a repository, or another project. In some cases, the data are, or perhaps must be, destroyed. This can be extremely hard to do, especially if you have been managing your backup procedures without this in mind. Just imagine tracking down every last backup tape, CD-ROM, shared drive, laptop copy, home computer version etc! Sometimes, especially secure destruction techniques are required; there are standards and approaches for this. I’m not sure that using my 4lb hammer on the hard drives followed by dumping in the furnace counts! (But it sure could be satisfying.)

Curation Service: Guidance on data disposal and destruction

We also have re-appraisal: “Return data which fails validation procedures for further appraisal and reselection”. While this mostly just points back to appraisal and selection again, it does remind me that there’s nothing in this post so far about validation. I’m not sure whether it’s too specific, but…

Curation service: Guidance on validation

And finally in the Lifecycle model, we have migration. To be honest, this looks to me as if it should simply be seen as one of the preservation actions already referred to.

So that’s a second look at possible services in support of curation. I’ve also got a “service-provider’s view” that I’ll share later (although it was thought of first).

[PS what's going on with those colours? Our web editor is going to kill me... again! They are fine in the JPEG on my Mac, but horrible when uploaded. Grump.]

Comments on an obsolescence scale, please

Many people in this business know the famous Jeff Rothenberg quote: “It is only slightly facetious to say that digital information lasts forever--or five years, whichever comes first” (Rothenberg, 1995). Having just re-read the article, it’s clear that at that point he was particularly concerned about the fragility of storage media, something that doesn’t concern me for this post. What does concern me is the subject of much of the rest of his article: obsolescence of the data. There’s another very nice quote from the article, accepted gospel now but put so nicely: “A file is not a document in its own right--it merely describes a document that comes into existence when the file is interpreted by the program that produced it. Without this program (or equivalent software), the document is a cryptic hostage of its own encoding” (ibid).

Now the first Rothenberg quote is often assumed to apply in the second case, but there’s clear evidence emerging that data are readable well beyond 5 years. Most files on my hard disk that have been brought forward from previous computers (with very different operating systems) remain readable even after 15 years. Most, but not all; in my case, the exceptions are my early PowerPoint files, created with PowerPoint 4 on a Mac in the 1990s, unreadable with today’s Mac PowerPoint version. However, these files are not completely inaccessible to me, as a colleague with a Windows machine has a different version of PowerPoint that can read these files and save them in a format that I can handle. It’s a comparatively simple, if tedious, migration.

I have spent some time and asked in various fora for evidence of genuinely obsolete, that is completely inaccessible data (Rusbridge, 2006). There are some apocryphal stories and anecdotes, but little hard evidence so far. But perhaps complete inaccessibility isn’t the point? Perhaps the issue is more about risk to content, on the one hand, or extent of information loss on the other? Maybe this isn’t a binary issue (inaccessible or not), but a graded obsolescence scale? Is there such a thing? I couldn’t find one (there’s a hint of one in an NTIS report abstract (Slebodnick, Anderson, Fagan, & Lisez, 1998), but the text isn’t online, and it appears to be about obsolescence of people)!

So here are a couple of attempts, for comment.

Here’s what might be called a Category approach. Users would be asked to select one only from the following categories (it worries me that they are neither complete nor non-overlapping!):

A Completely usable, not deemed at risk
    Open definition with no significant proprietary elements
    Multiple implementations, at least one Open Source
    High quality treatment in secondary implementations
    Widespread use
B Currently usable, at risk
    Closed definition, or
    Open definition, but with significant proprietary elements
    More than one implementation
    Imperfect treatment in secondary implementations
C Currently usable, at significant risk
    Proprietary/closed definition
    Single proprietary implementation
    Not widespread use
D Currently inaccessible, migration path known
    Proprietary tools don't run on current environment
    Potentially imperfect migration
E Completely inaccessible
    No known method of extraction or interpretation
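
To see how the Category approach might behave in practice, here is a small sketch that turns the criteria into a decision rule. Everything in it is an assumption: the FormatProfile fields are my own shorthand for the criteria, the "quality of treatment in secondary implementations" criterion is left out for simplicity, and the fall-through order (A, then B, then C) is one arbitrary way of resolving the overlaps and gaps already admitted above.

```python
from dataclasses import dataclass


@dataclass
class FormatProfile:
    """Hypothetical summary of a file format's situation today."""
    readable_now: bool                # can any current tool open it?
    migration_path_known: bool        # is there a known route to a usable form?
    open_definition: bool             # is the format definition published?
    proprietary_elements: bool        # significant proprietary parts?
    implementations: int              # number of independent implementations
    open_source_implementation: bool  # at least one Open Source implementation?
    widespread_use: bool


def categorise(p):
    """Return a category letter A-E; the rule order is an assumption."""
    if not p.readable_now:
        # Inaccessible: D if a migration path is known, otherwise E.
        return "D" if p.migration_path_known else "E"
    if (p.open_definition and not p.proprietary_elements
            and p.implementations > 1
            and p.open_source_implementation
            and p.widespread_use):
        return "A"
    if p.implementations > 1:
        # Usable, more than one implementation, but not meeting A.
        return "B"
    # Usable, but effectively a single implementation.
    return "C"


# Example: an old presentation file that the current tool cannot open,
# but which a colleague's different version can still convert.
old_slides = FormatProfile(
    readable_now=False, migration_path_known=True,
    open_definition=False, proprietary_elements=True,
    implementations=1, open_source_implementation=False,
    widespread_use=False,
)
print(categorise(old_slides))  # prints "D"
```

Writing it down this way exposes the worry about completeness: an open, multiply-implemented but little-used format has no obvious home in the list, and here simply falls through to B.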

Here’s a more subjective, but perhaps more complete scale. Users are asked to select a number from 1 to 5 representing where they assess their data lies, on a scale whose ends are defined as

1 Completely usable, not deemed at risk, and
5 Completely inaccessible

And here’s a more Likert-like scale. Users are asked to nominate their agreement with a statement such as “My data are completely inaccessible”, using the Likert scale:

1 Strongly disagree
2 Disagree
3 Neither agree nor disagree
4 Agree
5 Strongly Agree

Hmmm. How could you neither agree nor disagree with that statement? “Excuse me, I haven’t a clue”?

Comments on this blog, please; but if you’re commenting elsewhere, can you use the tag “Obsolescence scale”, and maybe drop me a line? Thanks!

Rothenberg, J. (1995). Ensuring the longevity of digital documents. Scientific American, 272(1), 42. (Accessed through Ebscohost)
Rusbridge, C. (2006). Excuse Me... Some Digital Preservation Fallacies? Ariadne, 46. http://www.ariadne.ac.uk/issue46/rusbridge/
Slebodnick, E. B., Anderson, C. D., Fagan, P., & Lisez, L. (1998). Development of Alternative Continuing Educational Systems for Preventing the Technological Obsolescence of Air Force Scientists and Engineers. Volume I. Basic Study.

Wednesday, 19 November 2008

Research Data Management email list created

At JISC's request, we have created a new list to support discussion amongst those interested in management of research data. The list is Research-Dataman@jiscmail.ac.uk, and can be joined at http://www.jiscmail.ac.uk/lists/RESEARCH-DATAMAN.html. The list description is:

"List to discuss the data management issues arising in and from research projects in UK Higher Education and its partners in the UK research community and internationally, established by the Digital Curation Centre on behalf of the JISC."

I was specifically asked not to include the word "curation" in the description, in case it frightened people off! And yes, it's already been hinted to me that the list-name is sexist. Honest, it really was just short for research data management. But you knew that (;-)!

Not much traffic yet, but this is the first time it has been publicised. Please spread the word!

Monday, 17 November 2008

Keeping the records of science accessible: can we afford it?

There's an excellent summary available of the 2008 conference of the Alliance for Permanent Access to the Records of Science, with the above title. It was in Budapest on 4 November 2008. I wasn't able to go, unfortunately, but it looks like it might have been pretty interesting.

Just sharing

Like several others in the blogosphere, I really enjoyed Scott Leslie’s post on EdTechBlog: Planning to Share versus Just Sharing. It is in the learning domain but I was glad that Andy Powell picked up its relevance to repositories in his follow-up on eFoundations.

Perhaps the key points that Andy picks up are

“The institutional approach, in my experience, is driven by people who will end up not being the ones doing the actual sharing nor producing what is to be shared. They might have the need, but they are acting on behalf of some larger entity.” and

“because typically the needs for the platform have been defined by the collective’s/collaboration’s needs, and not each of the individual users/institutions, what results is a central “bucket” that people are reluctant to contribute to, that is secondary to their ‘normal’ workflow…”

I’ve written before about fitting into the work flow, and endorse this concern.

I would also point to this quote:

“the whole reason many of us are so attracted to blogs, microblogs, social media, etc., in the first place is that they are SIMPLE to use and don’t require a lot of training.”

I suppose the equivalent for the repository world’s sphere of interest (usually, the scholarly article) is the academic’s web page. Somehow, for many (not all) academics, putting papers up on their web page does seem to fit in their normal work flow. They don’t worry about permissions, about the fine print of agreements they didn’t even read, they just put up their stuff to share. Is that a Good Thing? You betcha! Is it entirely a Good Thing? No, there are problems, for example those that arise in the medium term, when the academic moves, retires, or dies.

Looking at it another way, however, why doesn’t the collective repository infrastructure get credit for “Just Sharing”?

What about data? Well, getting data to be available for re-use is a Good Thing, and if there’s a simple infrastructure (eg the academic’s web page), that’s certainly better than leaving it to languish and fade away on some abandoned hard drive or (worse) CD-ROM. My feelings of concern are certainly elevated for data, however; don’t repositories, data services etc act as domain proxies in encouraging appropriate standardization? Well, I suppose they might, but maybe Leslie is right that the intra-domain pressures that would arise from Just Sharing would be even stronger. “It would be much easier for me to re-use your data (and give you a citation for it) if it were consistent with XXXX” is a stronger motivator than “as repository managers, we believe these data should be encoded with XXXX”.

Data repositories (and they can be at the lab or even personal level) should, of course, encourage the collection of contextual and descriptive information alongside the actual data. And again, their potential longevity brings advantage later, as technology change begins to affect usability of the data.

Now I guess the cheap shot here is to point to UKRDS as an example that is spending years Planning to Share, rather than Just Sharing. But what is the Just Sharing alternative to UKRDS?

Project data life course

This blog post is an attempt to explore the “life course” of an arbitrary small to medium research project with respect to data resources involved in the project. (I want to avoid the term life cycle, since we use this in relation to the actual data.)

This seems like a useful exercise to attempt, even if it has to be an over-generalisation, as understanding these interactions may help to define services that support projects in managing and curating their data better, and a number of such services are suggested here. This is about “Small Science”; James M. Caruthers, a professor of chemical engineering at Purdue University, has claimed ‘Small Science will produce 2-3 times more data than Big Science, but is much more at risk’ (Carlson, 2006). Large and very large projects tend to have very particular approaches, and probably represent significant expertise; I’ll not presume to cover them.

Is this generalisation way off the mark? I’d really like to know how to improve it, accepting that it is a generalisation!

The overall life course of a project (I suggest) is roughly:

pre-proposal stage (when ideas are discussed amongst colleagues, before a general outline of a proposal is settled on), followed by
the proposal stage (constrained by funder’s requirements, and by the need to create a workable project balancing expected science outcomes, resources available, and papers and other reputational advantages to be secured). If the project is funded, there follows
an establishment phase, when the necessary resources are acquired, configured and tested, (which overlaps and merges with)
an execution phase when the research is actually done, and
a termination phase as the project draws to a close.
Somewhere during this process, one or more papers will be produced based on the research understandings gained, linked to the data collected and analysed.

My message here is that curation is most impacted by decisions in the first 3 phases, even though it requires attention throughout!

Pre-proposal stage
At the pre-proposal stage, you and your potential partners will mostly be concerned with working out if your research idea is likely to be feasible, how and with roughly what resources. During these conversations, you will need to identify any external or already existing data resources that will be required to enable the project to succeed. You should also start to identify the data that the project would create, or perhaps update, and how these data will support the research conclusions. In doing so, attention must be paid to resource constraints, not necessarily financial, but also legal and ethical issues affecting use of the data. However, discussions are more likely to be generic rather than focused! Thus begins the impact on curation…

Curation service: 5-minute curation introduction, with pointers to more information…
Curation service: Help Desk service. Ask a Curator?

Proposal stage
Apart from the obvious focus on the detailed plan for the research itself, this stage is particularly constrained by funder requirements. Increasingly, these requirements include a data management plan. This is unlikely to be a problem for large projects, which will already have spent some time (often lots of time) considering data management issues, but may well be a new requirement for smaller projects. However, forcing attention to be paid to data management is definitely a Good Thing! I’ll have a closer look at data management plans in a later blog post.

Curation service: Data Management Plan templates
Curation service: Data Management Plan online wizards?
Curation service: consultancy to help you write your Data Management Plan?

Key issues that the project should address, and which should be reflected in the data management plan, include the standards and encodings for analysing, managing and storing data. While these must be determined by a combination of local requirements, such as the availability of licences, skills and experience, and the local IT environment, they must take account of community norms and standards. While in some cases maverick approaches can lead to breakthrough science, they can also lead to silo data, inaccessible and not re-usable (sometimes not re-usable even by their creators, after a short while). This stage begins to have a significant effect on curation, as the decisions that are fore-shadowed here will have major effects later on. Repeat after me: curation begins before creation!

Curation service: Guidance on managing data for re-use in the medium to long term
Curation service: registries and repositories to support re-use, sharing and preservation
Curation service: pointers to relevant standards, vocabularies etc.

[Note, we have concerns about the scalability of the latter, which is why the DCC DIFFUSE service is being re-modelled to be more participative.]

Establishment stage
It’s during the establishment phase that the real work is in sight. In the case of a well-established laboratory or research centre, this may merely be “yet another project” to be fitted into well-known procedures, perhaps with some adaptations. But from my observation and anecdotally, there is often a high degree of individual idiosyncratic decision-making taking place here; the focus is often on “getting down to the research”. The consequences of this for curation may be very significant, but are unlikely to be apparent until much later.

The key curation aim is to set the data management plan into operation. This requires ensuring access to external data required by the project (an activity that should have been started, at least in principle, ahead of project approval). While you’re thinking of access, rights and ownership, negotiating and documenting data ownership and access agreements for the data created in the project would be a smart idea (too often ignored, or left until problems begin to arise). Of course, it’s not just rights that are at issue, but also responsibilities, particularly where data have privacy and ethical implications.

Curation service: Briefing papers on legal, ethical and licence issues.

Meanwhile a critical step is establishing the IT infrastructure. While the focus is likely to be on acquiring and testing the data acquisition and analysis equipment, software, capabilities and workflows, in practice the apparently background activities of ensuring adequate data storage, sensible disaster recovery procedures (or at the very least, proper data backup and recovery), and thoughtful data curation processes will be critical for later re-use.

Curation service: Briefing paper on Keeping your data safe?
Curation service: tools for various tasks, eg data audit, risk assessment
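
One hedged illustration of "proper data backup and recovery": restore a backup to a scratch location and compare it, file by file, with the working copy. The sketch below is my own and not part of any DCC tool; the function names are hypothetical, and it reads each file fully into memory, so it only suits modest file sizes.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file; fine for a sketch only.
            name = path.relative_to(root).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_trees(working, restored):
    """Report how a restored backup differs from the working copy."""
    a = tree_digests(working)
    b = tree_digests(restored)
    shared = set(a) & set(b)
    return {
        "missing_from_restore": sorted(set(a) - set(b)),
        "unexpected_in_restore": sorted(set(b) - set(a)),
        "content_differs": sorted(n for n in shared if a[n] != b[n]),
    }
```

Three empty lists mean the restore matched the working copy at the moment of comparison; anything else is worth finding out about before the disaster rather than after it.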

It’s worth remembering that there are different kinds of data storage. While you will often be most concerned with temporary or project-lifetime storage, you may need to think about storage with greater persistence. There may be a laboratory or institutional or subject repository, external database or data service that will be critical to your project; negotiating with the curation service providers associated with these services should be an important early step. In effect, right at the start you are planning for your exit! But more on this another time…

Curation service: Guidance on data repositories for projects?
Curation service: thought-stimulating discussion papers, blog posts and case studies, journal, conference, discussion fora…

Project execution
Through the lifetime of your project, your focus will rightly be on the research itself. But you should also take time out to test your disaster recovery procedures, and check on your quality assurance. You care about the quality of your materials and syntheses (or your algorithms, or your documentary sources); you need to care as much about the quality of your data, both as-observed and as-processed. You’ll be collecting a lot of data; if some support your hypothesis and some contradict it, you’ll need to ensure you are collecting (perhaps in laboratory notebooks or equivalent) sufficient provenance and context to justify the selection of the former rather than the latter.

Not sure if there is a curation service here, or if these are good practice research guides for your particular domain… but I guess the same could be said for standards and vocabularies. Question is whether domains HAVE good practice guides for data quality?

As the project goes forward, issues may arise in your original data management plan, and you should take the time to refine it (and document the changes), and your associated curation processes. If you did take idiosyncratic decisions in haste at project establishment, at some point you may need to review them. It’s a warning sign when a staffing change is followed by proposals for major architectural changes!

Finally, at some stage you should test any external deposit procedures built into your plan.

Curation service: Guidance on data deposit…

I don’t think I’m claiming that this is the limit of curation issues during project execution; rather that here the course of projects is so varied that generalisations are even less justifiable than elsewhere. Nevertheless, the analysis here should be strengthened, so input would be helpful!

Project termination stage
As the project comes towards an end, you enter another risk phase. Firstly, most management focus is likely to be on the next project (or the one after that!). Secondly, if employment is fixed term, tied to the particular project, then research staff will have been looking for other posts as the end of the project draws near, and the project may start to look denuded (and the data management and informatics folk may be the most marketable, leaving you exposed). And of course, the final reports etc cannot be written until the project is finished, by which time the project’s staffing resources will have disappeared. It’s the PI’s job to ensure these final project tasks are properly completed, but it’s not surprising that there are problems sometimes!

Curation service: Guidance on (or assistance with) evaluation

During this final phase, you must identify any data resources of continuing value (this should have occurred earlier, but does need re-visiting), and carry out the plans to deposit them in an appropriate location, together with the necessary contextual and provenance information. Easy, hey? Perhaps not: but failure here could have a negative impact on future grants, if research funders do their compliance checking properly.

Curation service: Guidance on appraisal, Guidance on data deposit.

Finally, given that data management plans are currently in their infancy, it’s worth commenting on the plan outcomes in your project final report!

Writing papers
This is not really a project phase as such, as often several papers will be produced at different stages of the project. The key issues here are:

Include supplementary data with your paper where possible
Ensure data embedded in the text are machine readable
Cite your data sources
Ensure supportive data are well-curated and kept available (preferably accessible; this is where a repository service may come in useful).

I’ll write more on this in a separate post.

Curation service: Proposals for data citation
Curation service: Suggestions for the Well-supported article

So: the message is that curation should be a constant theme, albeit as background during most of the project execution. But the decisions taken during pre-proposal, proposal and establishment phases will have a big effect on your ability to curate your data, which may affect your research results, and will certainly affect the quality and re-usability of the data you deposit.

Carlson, S. (2006, June 23). Lost in a Sea of Science Data: Librarians are called in to archive huge amounts of information, but cultural and financial barriers stand in the way. The Chronicle of Higher Education, 52, 35.

Friday, 14 November 2008

DCC White Rose day

Fresh from a fascinating day in Sheffield, organised by the DCC and the White Rose e-Science folk. Objectives for the day included building closer relationships between White Rose and DCC, helping us learn more about their approach to e-Science, identifying data issues, and influencing the DCC agenda!

After introductory remarks, the day started with Martin Lewis, Sheffield Librarian, talking about the UK Research Data Service (UKRDS) feasibility study. Martin was keen to emphasise the last two words; it’s important to manage expectations here, and we’re a long way from an operational service. There’ll be a conference to discuss the detailed proposals in February 2009, after which they hope to get funding for several Pathfinder projects. Martin and I argued ferociously, as we always do, but I should be clear that this has the potential to be an extremely valuable development. The really difficult conundrum is getting the balance of sustainability, scalability, persistence and domain science involvement in the curation. Oh, plus an appealing pitch to the potential funders. Not at all an easy balance!

Then we heard from Dr Darren Treanor from Leeds about their virtual slide library for histopathology; he claims it as one of the largest in the world, with over 40K slides digitised, even if only a thousand or so are fully described and publicly available. At 10 Gb or more a slide, these are not small; not just storage but bandwidth becomes a very real issue when accessing them. We had a brief discussion about technologies that optimise bandwidth for viewing very large or high resolution images; think Google Earth as one example, or the amazing Seadragon, created by Blaise Aguera y Arcas and now owned by Microsoft (the technology appears to be incorporated into Photosynth, but I can’t explore it, as it doesn’t work in my OS!).
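The usual trick behind such viewers, as I understand it, is a tile pyramid: the image is stored at successively halved resolutions, each cut into small tiles, and the viewer fetches only the tiles that cover the current view at the current zoom. Here is a rough sketch of the arithmetic; the 256-pixel tile size and the slide dimensions in the usage note are assumptions for illustration, not figures from the talk or from any particular product.

```python
import math


def pyramid_levels(width: int, height: int, tile: int = 256) -> list[tuple[int, int]]:
    """Level sizes from full resolution down to one that fits in a single tile."""
    levels = [(width, height)]
    while width > tile or height > tile:
        # Each level halves both dimensions, rounding up.
        width, height = math.ceil(width / 2), math.ceil(height / 2)
        levels.append((width, height))
    return levels


def tiles_for_viewport(view_w: int, view_h: int, tile: int = 256) -> int:
    """Upper bound on the tiles touching a viewport at any one level."""
    # A viewport not aligned to the tile grid can overlap one extra tile per axis.
    return (math.ceil(view_w / tile) + 1) * (math.ceil(view_h / tile) + 1)
```

For a hypothetical 100,000 by 80,000 pixel slide, `pyramid_levels(100_000, 80_000)` gives ten levels, and `tiles_for_viewport(1920, 1080)` gives at most 54 tiles for a full-screen view, whatever the zoom. The cost of looking at the slide then depends on the size of the screen rather than the size of the slide.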

Sarah Jones from the DCC and HATII at Glasgow spoke about the Data Audit Framework, now being piloted at 4 sites. This is a tool aimed at helping institutions to discover their research data resources, and begin to understand the extent to which they are at risk. More about that one later, I think!

Then we heard about 3 more very different projects, although Virtual Vellum probably had more in common with the virtual slide library than the presenter would have previously expected. This project shows digitised and cleaned up medieval manuscripts, with their TEI-based transcription in parallel. We heard an entertaining rant against SRB (wrong tool for the job, I think), but more worryingly that these projects were strongly constrained by IPR problems (yes, the digitised versions of half-millennium-old manuscripts do acquire new copyright status, and can also be ringed about with contractual restrictions). So much for open scholarship!

We also heard about the CARMEN project, aiming to support sharing and collaboration in neuroscience. Graham Pryor will be releasing a report from his involvement with CARMEN before too long, so I won’t dwell on it, other than to say it is a really good project that has thought a lot of issues through very carefully. It also has one decent and one cringeworthy acronym: MINI, for the Minimum Information for a Neuroscience Investigation, and MIAOWS, for Minimum Information About a Web Service… but maybe Frank Gibson was pulling our legs, since I didn’t find it on a Google search, nor is it in the MIBBI (Minimum Information for Biological and Biomedical Investigations) Portal!

Finally we heard about the University of York Digital Library developments; this is not so much a project as the development phases of a service. They've chosen Fedora plus Muradora as their technology platform, but it was clear again that this was a thoughtful project, recognising that there's much more to it than "build it and they will come".

This was a really interesting day; I hope we gave good value; I certainly learned a lot. Thanks everyone, especially Dr Joanna Schmidt, who organised it with Martin Donnelly.

Thursday, 6 November 2008

Data publishing and the fully-supported paper

Cameron Neylon’s Science in the Open blog is always good value. He’s been posting installments of a paper on aspects of open science, and there’s lots of good stuff there. Of course, Cameron’s focus is indeed on open science rather than data, but data form a large part of that vision. In part 3:

“Making data available faces similar challenges but here they are more profound. At least when publishing in an open access journal it can be counted as a paper. Because there is no culture of citing primary data, but rather of citing the papers they are reported in, there is no reward for making data available. If careers are measured in papers published then making data available does not contribute to career development. Data availability to date has generally been driven by strong community norms, usually backed up by journal submission requirements. Again this links data publication to paper publication without necessarily encouraging the release of data that is not explicitly linked to a peer reviewed paper.”

“In other fields where data is more heterogeneous and particular where competition to publish is fierce, the idea of data availability raises many fears. The primary one is of being ’scooped’ or data theft where others publish a paper before the data collector has had the ability to fully analyse the data.”

In part 4, Cameron discusses the idea of the “fully supported paper”.

“This is a concept that is simple on the surface but very complex to implement in practice. In essence it is the idea that the claims made in a peer reviewed paper in the conventional literature should be fully supported by a publically accessible record of all the background data, methodology, and data analysis procedures that contribute to those claims.” He suggests that “the principle of availability of background data has been accepted by a broad range of publishers. It is therefore reasonable to consider the possibility of making the public posting of data as a requirement for submission.”

Cameron does remind us

“However the data itself, except in specific cases, is not enough to be useful to other researchers. The detail of how that data was collected and how it was processed are critical for making a proper analysis of whether the claims made in a paper to be properly judged.”

This is the content, syntax and semantics idea in another guise. Of course, Cameron wants to go further, and deal with “the problem of unpublished or unsuccessful studies that may never find a home in a traditional peer reviewed paper.” That’s part of the benefit of the open science idea.

Wednesday, 5 November 2008

Some interesting posts elsewhere

I’m sorry for the gap in posting; I’ve been taking a couple of weeks of leave at the end of my trip to Australia. Since return I’ve been catching up on my blog reading, and there are some interesting posts around.

A couple of people (Robin Rice and Jim Downing in particular) have mentioned the post Modelling and storing a phonetics database inside a store, from the Less Talk, More Code blog (Ben O'Steen). This is a practical report on the steps Ben took to put a database into a Fedora Commons-based repository. He details the analysis he went through, the mappings he made, the approaches to capturing representation information, to making the data citable at different levels of granularity, and an interesting approach that he calls “curation by addition”, which appears to be a way of curating the data incrementally, capturing provenance information of all the changes made. It’s a great report, and I look forward to more practical reports of this nature.

Quite a different post on peanutbutter (whose author might be Frank Gibson): The Triumvirate of Scientific Data discusses ideas that he suggests relate to the significant properties of science data. His triumvirate comprises

"content, syntax, and semantics, or more simply put -What do we want to say? How do we say it? What does it all mean?"

Oddly, the discussion associated with this blog post is on Friendfeed rather than associated with the blog itself. Very interesting to see the discussion recorded like that, and in the process see at least one sceptic become more convinced!

To me, there seemed to be strong resonances between his argument and some of the OAIS concepts, particularly Representation Information. However, content, syntax and semantics might be a more approachable set of labels than RepInfo!

Friday, 31 October 2008

Interactive Science Publishing

Open Access News yesterday reported the launch of a new DSpace-based repository for article-related datasets by the Optical Society of America. They're calling this "Interactive Science Publishing" (ISP). The catch? The repository requires a closed-source download to get hold of the data, and the use of further closed-source software to view/manipulate it.

So, one step forward, one step back? Read the original post here.

Tuesday, 14 October 2008

Open Access Day

It's Open Access Day today. I've been at the Australian ARROW Repositories Day, and no-one mentioned it.

Why do I care about Open Access? It's always seemed obvious to me since I first heard of the idea. I've never cared what colour it was, nor how it was done. But I've not published in a toll publication since that day, except for a book chapter where I reserved the rights and put it in the repository. It's fair, it's just; you pay my wages, and you should be able to know what you're getting. But it's easy for me; risk is low, career not at stake, and publishing this way is part of my job. I understand why people with more at stake are more cautious. It's a long road to full acceptance, and we're not doing badly.

Hobbes' Leviathan: "the life of man, solitary, poor, nasty, brutish, and short". Could be a metaphor for open access repositories? Nah. On the other hand, if you start from there, hey, everything begins to look rosy (;-)!

Read some nice posts on various blogs about OA day, some quite long and well-written. I'm keeping this short and nearly empty, like so many repositories. Feels better that way. Sigh...

ARROW Repositories day: 3

Dr Alex Cook from the Australian Research Council (a money man! Important!) talking on the Excellence in Research for Australia (ERA) initiative, the Access Framework and ASHER. ERA appears to be like the UK’s erstwhile RAE, and will use existing HE Research Data Collection rules for publication and research income information where possible. 8 clusters of disciplines have been identified. Currently looking at the bibliometric and other indicators which will be discipline-specific (principles, methodologies and a matrix showing which are used where). Developing the System to Evaluate Excellence of Research (SEER) that will involve institutions uploading their data; sounds like something that MUST (and they say will) work with repositories (RAE in UK too often went for separate bibliographic databases, not interoperable, which meant doing the same thing twice). Copyright still an issue (I wonder if they are brave enough to take an NIH mandate approach? See below… not so explicit, but some pressure that way). Where a research output is required for review purposes, institutions will be required to store and reference their Research Output Digital Asset. Repositories a natural home for this.

Accessibility framework requires research outputs to be made sufficiently accessible to allow maximum use of the research. ASHER a short term (2 year?) funding stream to help institutions make their systems and repositories more suitable for working in this new framework.

Andrew Treloar talking about ANDS, the Australian National Data Service, and its implications for repositories. Mentions again the Australian Code for the Responsible Conduct of Research. Institutions need to think about their obligations under this code, obligations that are quite significant! Good story about Hubble telescope: data must be made available at worst 6 months after capture; most of published data from Hubble is not “first use”! (Would this frighten researchers? I put all the work in, but someone else just does the analysis and gets the credit?)

Structure of ANDS: developing frameworks (eg encouraging moves towards discipline-acceptable default sharing practices), providing utilities (building and delivering national technical services to support the data commons, eg discovery, persistent identifiers, and collections registry), seeding the commons, and building capabilities, plus service development activities. ISO 2146 information model important here (I think I’ve already talked about this stuff from the iPres posts [update: no, it was from the e-Science All Hands Meeting, but the link still points to the right place]).

Australian Strategic Roadmap Review talks about national data fabric, based on institutional nodes, and a coordination component to integrate eResearch activities, and sees expertise as part of infrastructure. There’s also a review of the National Innovation System: ensure data get into repositories, try to get data more freely available.

Implications for repositories: ANDS dependent on repositories, but doesn’t fund repositories! May need range of repository types for storing data. Big R and little r repositories: real scale issues in data repositories. Lots of opportunities for groups to take part in consultation etc. Different but related discovery service. Links to Research Information Systems. Persistent Identifier Service (in collaboration with Digital Education Revolution). (I worry about this: surely persistence requires local commitment, rather than remote services?)

Panel discussion… question about CAIRS, the Consortium of Australian Institutional Repositories, funding left over from ARROW, being run by CAUL, the Council of Australian University Librarians, which now has an Invitation to Offer to find someone to run the proposed service.

Some questions about data, whether they should be in repositories or in filestore. Scale issues (numbers, size, rapidity of deposit and re-use etc) suggest filestore, but there are then maybe issues about integrity and authenticity. There will be ways of fixing these, but they imply new solutions that we don’t yet have. The data world will soon go beyond the current EPrints/DSpace/Fedora realms.

A few more questions, too hard to summarise, on issues such as the importance of early career scientists, on the nature of raw data etc. But overall, an extremely interesting day!

ARROW Repositories day: 2

Lynda Cheshire speaking as part of “the researcher’s view”, talking about the view from a qualitative researcher working with the Australian Social Science Data Archive (ASSDA), based at ANU, established 1981, with about 3,000 datasets. Most notable studies are election studies, opinion polls and social attitudes surveys, mostly from government sources. Not much qualitative data yet, but have grants to expand this, including new nodes, and the qualitative archive (AQuA) to be at UQ. Not just the data but tools as well, based on existing UK and US qualitative equivalents.

Important because much qualitative data is held by researchers on disk, in filing cabinets, lofts, garages, plastic bags! Archiving can support re-use for research, but also for teaching purposes. Underlying issues (she says) are epistemological and philosophical. Eg quantitative about objective measurements, but qualitative about how people construct meaning. Many cases (breadth) vs few cases (depth). Reliability vs authenticity. Detached vs involved.

Recent consultation through focus groups: key findings included epistemological opposition to qualitative archiving (or perhaps re-use), because of loss of context; data are personal and not to be shared (the researcher-subject implied contract); some virtues of archiving were recognised; concerns about ethical/confidentiality challenges; challenges of informed consent (difficult as archiving might make it harder to gather extremely sensitive data, but re-use might avoid having to interview more people about traumatic events); whose data is it (the subject potentially has ownership rights in transcripts, while the researcher’s field notes potentially include personal commentary); access control and condition issues; additional burden of preparing the data for deposit.

The task ahead: develop preservation aspects (focus on near retirees?), and data sharing/analysis under certain conditions. Establish protocols for data access, IPR, ethics etc. Refine ethical guidelines. Assist with project development to integrate this work.

Ashley Buckle from Monash on a personal account of challenges for data-driven biomedical research. Explosion in amount of data available. Raw (experimental) data must be archived (to reproduce the experiment). Need for standardised data formats for exchange. Need online data analysis tools to go alongside the data repositories. In this field, there’s high throughput data, but also reliable annotation on low volume basis by expert humans. Federated solutions as possible approaches for Petabyte scale data sets.

Structural Biology pipeline metaphor; many complex steps involved in the processes, maybe involving different labs. Interested in refolding phase; complex and rate-limiting. They built their own database (REFOLD), with a simple interface for others to add data. Well-cited, but few deposits from outside (<1%). Spotted that the database was in some ways similar to a lab note-book, so started building tools for experimentalists, and capture the data as a sideline (way to go, Ashley!). Getting the data out of journals is inadequate. So maybe the journal IS the database? Many of the processes are the same.

Second issue: crystallography of proteins. Who holds the data? On the one hand, the lab… but they handle it pretty badly (CDs, individuals’ filestore, etc). Maybe the Protein Data Bank? But they want the refined rather than the raw data. Maybe institutional libraries? TARDIS project providing tools for data deposit and discovery, working with ARCHER, ARROW and Monash library... This field does benefit from standards such as MIAME, MIAPE etc, which are quite important in making stuff interoperable. Ashley's working with Simon Coles etc in the UK (who's mostly at the small molecule end).

So how to go forward? Maybe turning these databases into data-oriented journals, with peer review built in etc would be a way to go? Certainly it's a worry to me that the Nucleic Acids field in general lists >1,000 databases; there has to be a better way than turning everything into Yet Another Database...

ARROW Repositories day: 1

I’ve been giving a talk about the Research Repository System ideas at the ARROW repository day in Brisbane, Australia (which is partly why there has been a gap in posting recently). Here are some notes on the other talks.

Kate Blake from ARROW is talking about metadata. Particularly important for data, which cannot speak for itself. Metadata thought of as a compound object that comprises some parts for “library management” issues (things like author, title, keyword) for the whole document and/or its parts, plus University management parts, such as evidence records for research quality management purposes. These link to metadata that applies to the community of practice, eg the appropriate metadata for an X-ray image. Have the content (maybe a PDF), its rich metadata (Kate used MARC/XML as an example, which surprised me, since she also suggested this group was specific to the content), lightweight descriptive metadata, technical metadata (file size, type etc), administrative metadata, eg rights or other kinds of institutional metadata, preservation metadata such as PREMIS, and both internal and external relationship metadata. METS is one way to wrap this complex set of metadata and provide a structural map (there are others). (Worrying that this seems like a very large quantity of metadata for one little object…) Aha, she’s pointing out that aggregating these into repositories and these repositories together across search services leads to problems of duplication, inconsistency, waste of effort, etc. So lots of work trying to unravel this knot, lots of acronyms: RDF, FRBR, SWAP, DCMI AM, SKOS etc…
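To give a feel for what "wrapping" means here, the sketch below builds a very small METS-style package for one PDF: a descriptive section holding lightweight Dublin Core, a file section carrying a little technical metadata, and a structural map tying them together. It is my own illustration rather than anything Kate showed; the function name and the example values are invented, the output has not been validated against the METS schema, and the administrative and preservation metadata (rights, PREMIS) that would sit in an amdSec are left out.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
DC = "http://purl.org/dc/elements/1.1/"
XLINK = "http://www.w3.org/1999/xlink"
for prefix, uri in (("mets", METS), ("dc", DC), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def m(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS}}}{tag}"


def build_package(title: str, creator: str, href: str, size: int) -> ET.Element:
    """Wrap one content file and its metadata (hypothetical helper)."""
    root = ET.Element(m("mets"))

    # Descriptive metadata: lightweight Dublin Core inside a dmdSec.
    dmd = ET.SubElement(root, m("dmdSec"), ID="dmd1")
    wrap = ET.SubElement(dmd, m("mdWrap"), MDTYPE="DC")
    xml_data = ET.SubElement(wrap, m("xmlData"))
    ET.SubElement(xml_data, f"{{{DC}}}title").text = title
    ET.SubElement(xml_data, f"{{{DC}}}creator").text = creator

    # File section: where the content lives, plus basic technical metadata.
    file_sec = ET.SubElement(root, m("fileSec"))
    group = ET.SubElement(file_sec, m("fileGrp"), USE="content")
    content = ET.SubElement(
        group, m("file"), ID="file1", MIMETYPE="application/pdf", SIZE=str(size)
    )
    ET.SubElement(content, m("FLocat"), {"LOCTYPE": "URL", f"{{{XLINK}}}href": href})

    # Structural map: links the description to the file it describes.
    struct_map = ET.SubElement(root, m("structMap"))
    div = ET.SubElement(struct_map, m("div"), TYPE="document", DMDID="dmd1")
    ET.SubElement(div, m("fptr"), FILEID="file1")
    return root


package = build_package(
    "An example report", "A. N. Author", "https://repository.example.org/report.pdf", 123456
)
print(ET.tostring(package, encoding="unicode"))
```

Even this stripped-down version runs to a dozen elements for one file, which rather supports the worry about quantity.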

FRBR making the distinction between the work, its expressions, manifestations and items. SWAP being a profile for scholarly works, as text works, not much use for data.
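For anyone who finds the four FRBR levels hard to keep apart, here is a toy data structure of my own (not taken from the talk or from the FRBR report itself) in which each level simply contains the next. The class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    location: str  # one concrete copy, e.g. a file held in one repository


@dataclass
class Manifestation:
    media_type: str  # one embodiment, e.g. the PDF or the print edition
    items: list[Item] = field(default_factory=list)


@dataclass
class Expression:
    description: str  # one realisation, e.g. a particular version or translation
    manifestations: list[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str  # the abstract intellectual creation
    expressions: list[Expression] = field(default_factory=list)

    def all_items(self) -> list[Item]:
        """Every concrete copy of the work, across all versions and formats."""
        return [
            item
            for expression in self.expressions
            for manifestation in expression.manifestations
            for item in manifestation.items
        ]
```

The same article deposited as a preprint PDF in two repositories would be one work, one expression, one manifestation and two items; the accepted version would add a second expression. It is less obvious what the expression or manifestation of a dataset would be, which may be part of why these text-oriented models are not much use for data.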

Names appear in multiple places in the metadata, and different parts have different rules. Do we have a name agent service (registries)? Need services to gather metadata automatically, that way you might introduce consistency and interoperability.

Kylie Pappalardo from QUT’s OAK Law project on legal issues on managing research data so it can be included in a repository and accessed by others. Government statements in favour of openness (eg Carr: More than one way to innovate, also “Venturous Australia” strategy). To implement these policies we need changes to practice and culture, institutional engagement, legal issues being addressed, etc. Data surrounded by law (!): copyright, contract, patents, policies, confidentiality, privacy, moral rights. Conflicting legal rights: who can do what with the data? QUT has OAK Law and also Legal Framework for e-Research project.

Survey online, May 2007, 176 participants responded. 50 depositing data in database; of those 46% said available openly, 46% required some or complete restrictions. 54% said their organisation did NOT have a data policy at all; where they did have a policy, most were given guidelines. 55% said they prepared plans for data management; two thirds of these at time of proposal, balance later. Should be early, not least because data management costs money and should be part of the proposal, also disputes can be hard to resolve later. 7% felt that clearer info on sharing and re-use would help, and 90% wanted a “plain English” guide (who wouldn’t?). Lawyer language doesn’t help, so researchers make their own informal agreements… maybe OK if nothing goes wrong.

The group has a report: analysis of Legal Context of Infrastructure for Data Access and Re-use in Collaborative Research. Also Practical Data Management: a Legal and Policy Guide. They have some tools, including a “simple” Data Management (legal) toolkit to fill in to gather information about (eg) copyright ownership etc.

Peter Sefton of USQ talking about OAI-ORE, and what it can do for us. Making the point that we build things from a wide variety of standard components, which (mostly) work pretty well together, eg found bricks in a garden wall… OAI-PMH mostly works, moving metadata from one place to another. But it’s just the messenger, not the message. So a harvest of metadata across multiple repositories shows wide variations in the keywords, subjects etc. Problems with XACML for defining access policies: no standardisation on the names of subgroups, so in the end it’s no use for search. Point being that these standards may appear important but not work well in practice.
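The "messenger, not the message" point is easy to see in miniature. The sketch below builds a standard OAI-PMH ListRecords request and then pulls dc:subject values out of some harvested responses, grouping those that differ only in case. The base URL, the trimmed sample response (real records also carry a header) and the subject strings are all made up for illustration; only the verb, the oai_dc prefix and the namespaces come from the protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Trimmed, hypothetical response body; {subject} is filled in per repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:subject>{subject}</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the harvesting request for one repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def subjects(response_xml: str) -> list[str]:
    """Extract every dc:subject value from one response."""
    root = ET.fromstring(response_xml)
    return [el.text.strip() for el in root.iterfind(".//dc:subject", NS) if el.text]


def subject_variants(responses: list[str]) -> dict[str, set[str]]:
    """Group subject strings that differ only in letter case."""
    groups: dict[str, set[str]] = {}
    for response in responses:
        for subject in subjects(response):
            groups.setdefault(subject.casefold(), set()).add(subject)
    return groups


harvest = [
    SAMPLE.format(subject=s)
    for s in ("Digital curation", "digital curation", "Curation, Digital")
]
print(list_records_url("https://repository.example.org/oai"))
print(subject_variants(harvest))
```

Folding case merges the first two spellings but not the inverted form, so the harvest has moved faithfully and the terms still disagree. The protocol delivered the metadata exactly as it was asked to.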

So on to ORE (Object Re-use and Exchange)… Pete asks “shouldn’t it be exchange, then re-use?”. ORE view of a blog post: a resource map describes the post, but in fact it’s an aggregation (compound object) of HTML text, a couple of images, comments in separate HTML, etc. The aggregation does have a URI, but does not have a fetchable reality (the resource map does). Can get complex very rapidly. See Repository Challenge at OpenRepositories 2008 in Southampton, and the ORE Challenge at RepoCamp 2008. USQ participating with Cambridge in JISC-funded TheOREM project, also internal image project called The Fascinator. Based on his ICE system that has been mentioned before, integrated with ORE tools to push stuff into the repository. Can have the repository watching so it can get it for itself.
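The blog post example can be written down as a handful of triples, which is roughly how I picture it. This is a hand-rolled sketch, not Pete's code and not a real serialisation: the three predicates are from the ORE vocabulary, but the function names and all the URIs are hypothetical.

```python
ORE = "http://www.openarchives.org/ore/terms/"

Triple = tuple[str, str, str]


def resource_map(rem_uri: str, aggregation_uri: str, parts: list[str]) -> list[Triple]:
    """Triples for a resource map describing one aggregation."""
    triples = [
        # The resource map is the fetchable document...
        (rem_uri, ORE + "describes", aggregation_uri),
        # ...while the aggregation is only a name that points back to it.
        (aggregation_uri, ORE + "isDescribedBy", rem_uri),
    ]
    triples += [(aggregation_uri, ORE + "aggregates", part) for part in parts]
    return triples


def aggregated(triples: list[Triple], aggregation_uri: str) -> list[str]:
    """List the resources that make up the aggregation."""
    return [
        obj for subj, pred, obj in triples
        if subj == aggregation_uri and pred == ORE + "aggregates"
    ]


post = resource_map(
    "https://blog.example.org/post-1/rem",
    "https://blog.example.org/post-1/aggregation",
    [
        "https://blog.example.org/post-1.html",
        "https://blog.example.org/images/figure-1.png",
        "https://blog.example.org/post-1/comments.html",
    ],
)
for triple in post:
    print(triple)
```

Five triples for a post with one image and a comments page; add authors, dates, types and nested aggregations and you can see how it gets complex very rapidly.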

ORE can: supplement OAI-PMH for moving content around; improve research tools like Zotero; replace use of METS packages; allow “thesis by publication” more elegantly; and pave the way for a repository architecture that understands content models (no more discussion of atomistic versus compound objects).