Tuesday, 30 December 2008

Gibbons, next generation academics, and ir-plus

In a catch-up post about the fall CNI meeting on the hangingtogether blog, Merrilee drew our attention to a presentation by Susan Gibbons of Rochester on studying next generation academics (ie graduate students), as preparation for enhancements to their repository system, ir-plus.

I shared a platform with Susan only last year in Sydney, and I was very impressed with their approach of using anthropological techniques to study and learn from their users' behaviour. In fact, her Sydney presentation probably had a big influence on the ideas for the Research Repository System that I wrote about earlier (which I should have acknowledged at the time).

Now their latest IMLS-funded research project takes this even further, as preparation for collaborative authoring and other tools integrated with the repository. If you're interested in making your repository more relevant to your users, the report on the user studies they have undertaken so far is a Must-Read! I don't know whether the code at http://code.google.com/p/irplus/ is yet complete...

Friday, 19 December 2008

The 12 files of Christmas

(Well, half a dozen, maybe.) OK, it’s Christmas, or the Holiday Season if you really must, and I may have a present for you!

My readers will know that I have an interest in obsolete and obsolescent data and files. I’ve tried in the past to get people to tell me about their stuff that had become inaccessible, with only limited success. So, after our work Christmas lunch, with the warm glow of turkey and Christmas pudding still upon me, and a feeling of general goodwill, I thought I’d offer you an opportunity to give me a challenge.

I will do my best to recover the first half dozen interesting files that I’m told about… of course, what I really mean is that I’ll try and get the community to help recover the data. That’s you!

OK, I define interesting, and it won’t necessarily be clear in advance. The first one of a kind might be interesting, the second one would not. Data from some application common in its time may be more interesting than something hand-coded by you, for you. Data might be more interesting (to me) than text. Something quite simple locked onto strange obsolete media might be interesting, but then again it might be so intractable it stops being interesting. We may even pay someone to extract files from your media, if it’s sufficiently interesting (and if we can find someone equipped to do it).

The only reference for this sort of activity that I know of is (Ross & Gow, 1999, see below), commissioned by the Digital Archiving Working Group.

What about the small print? Well, this is a bit of fun with a learning outcome, but I can’t accept liability for what happens. You have to send me your data, of course, and you are going to have to accept the risk that it might all go wrong. If it’s your only copy, and you don’t (or can’t) take a copy, it might get lost or destroyed in the process. You’ll need to accept that risk; if you don't like it, don't send it. I might not be able to recover anything at all, for many reasons. I’ll send you back any data I can recover, but can’t guarantee to send back any media.

The point of this is to tell the stories of recovering data, so don’t send me anything if you don’t want the story told. I don’t mind keeping your identity private (in fact good practice says that’s the default, although I will ask you if you mind being identified). You can ask for your data to be kept private, but if possible I’d like the right to publish extracts of the data, to illustrate the story.

Don’t send me any data yet! (It’s like those adverts: send no money now!) Send me an email with a description of your data. I’m not including my email address here, but you can find it out easily enough. Include “12 files of Christmas” in your subject line. Tell me about your data: what kind of files, what sort of content, what application, what computer/operating system, what media, roughly what size. Tell me why it’s important to you (eg it’s your thesis or the data that support it, your first address book with long-lost relative’s address on it, etc). Tell me if (no, tell me that) you can grant me the rights to do what’s necessary (making copies or authorising others to make copies, making derivative works, etc).

Is that all? Oh, we need a closing date… that had better be twelfth night, so 11:59 pm GMT on 5 January, 2009 (according to Wikipedia)!

Any final caveats? Yes; I have not run this past any legal people yet, and they may tell me the risks of being sued are too great, or come up with a 12-page contract you wouldn’t want to sign. If that happens, I’ll come back here and ‘fess up.

Ross, S., & Gow, A. (1999). Digital Archaeology: Rescuing Neglected and Damaged Data Resources (Study report). London: University of Glasgow.

Sustainable preservation and access: Interim report

I'm a member of the grandly titled Blue Ribbon Task Force on Sustainable Digital Preservation and Access, which has just released the interim report of its first year's scoping activity. Neil Beagrie spotted the press release, and blogged it earlier.

As the Executive Summary says
"There is general agreement that digital information is fundamental to the conduct of modern research, education, business, commerce, and government. Future accomplishments are accelerated through persistent access to data and digital materials, and their use, re-use, and re-purposing in ways both known and as yet unanticipated.

"There is no general agreement, however, about who is responsible and who should pay for the access to, and preservation of, valuable present and future digital information. "
So that's the focus of the Task Force. It's been a really interesting group to be on; for various reasons, I only managed to get to one of its 3 face-to-face meetings this year, but there's a wide range of views and skills there (ranging from economists, not at all dismal, to representatives from various digital communities). There was also the interesting approach of taking "testimony" from a number of experts in the field: asking them to come, make short presentations based around some questions we asked them, and to join in the debate. And of course, the inevitable regular telephone conference, as always at an inconvenient time (spanning so many time zones). But, as I say, the conversations have always been interesting, and the discussion yesterday on the methodology for the second year, leading to the final report, was fascinating.

Just one last quote, this time from the Preface:
"The sustainability of the long-term access and preservation of digital materials is a well-known challenge, and discussion frequently focuses on the difficult technical issues. Less clearly articulated are the organizational and economic issues implicit in the notion of sustainability which, at the risk of over-simplification, come down to two questions: How much does it cost? and Who should pay?"
I guess the over-simplified answer to the last question is: if we want it, we all will.

Survey on malformed PDFs?

A DCC-Associates member asks
"Does anyone know [if] there has been a study to estimate how many PDF documents do not comply with the PDF standards?"
I've not heard of anything, nor can I find one with my best Google searches, but it's a particularly hard question to ask Google! So, if you know of anything that has happened or is in progress, please leave a comment here. Thanks,
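I don't know of a rigorous compliance study either, but a crude first pass over a collection is easy to sketch. The sketch below (entirely my own, names hypothetical) flags only gross damage: a missing `%PDF-` version header or a missing `%%EOF` marker near the end of the file. Real standards compliance needs a full PDF parser (and a definition of which standard!), so this would badly undercount genuine non-conformance.

```python
def looks_well_formed(path):
    """Coarse well-formedness check only; a real compliance test needs
    a full PDF parser. Flags files missing the %PDF- header or an
    %%EOF marker within the last 1024 bytes."""
    with open(path, "rb") as f:
        head = f.read(1024)
        f.seek(0, 2)                      # seek to end to find file size
        size = f.tell()
        f.seek(max(0, size - 1024))       # re-read the tail of the file
        tail = f.read()
    return head.startswith(b"%PDF-") and b"%%EOF" in tail

def survey(paths):
    """Return (total, suspect_count) over a list of candidate PDF paths."""
    suspects = [p for p in paths if not looks_well_formed(p)]
    return len(paths), len(suspects)
```

Run over a big enough crawl of PDFs, something like this would at least give a lower bound on the proportion of damaged files, which may be all the questioner needs.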

Saturday, 13 December 2008

What makes up data curation?

Following some discussion at the Digital Curation Conference in Edinburgh, how about this:

Data Curation comprises
  • Data management
  • Adding value to data
  • Data sharing for re-use
  • Data preservation for later re-use
Is that a good breakdown for you? Should data creation be in there? I tend to think that data creation belongs to the researcher; once created, the data immediately falls into the data management and adding value categories.

Adding value will have many elements. I’m not sure if collecting and associating contextual metadata is part of adding value, or simply good data management!

Heard at the conference: context is the new metadata (thanks Graeme Pow!).

Wednesday, 10 December 2008

IDCC4: DCC Blawg "consent" post

Just a little plug here; my colleague Mags McGeever, who provides legal information for the DCC, has her own blog (The DCC Blawg), and has recently been providing a bit more comment and opinion rather than just information (that's not legal opinion, mind!). Her recent post on the variant of "informed consent" used by Generation Scotland, mentioned by David Porteous in his opening Keynote at the Digital Curation Conference, is interesting, and some of her other recent articles are worth a look, too.

Tuesday, 9 December 2008

Comments on OAIS responses to our comments on OAIS

Yes, it sounds weird and it was, a bit. One of the workshops at the International Digital Curation Conference was to consider the proposed "dispositions" to the DCC/DPC comments on OAIS, made around two years ago! Sarah Higgins of the DCC and Frances Boyle of the DPC had an initial look and tried to work out which proposed dispositions we might have an issue with. A couple of the original participants made written comments in advance, but the rest of us (the original group or their successors, plus one slightly bemused visitor) had less than 3 hours to hammer through and see whether we could improve on the dispositions. The aim was to identify areas where we really felt the disposition was wrong, to the extent that it would seriously weaken the standard, AND we were able to provide comments of the form "delete xxxx, insert yyyy". We did identify a few such, but were also pleased to discover that some areas where we thought there would be no change would in fact receive significant revision (although we haven't yet seen the revision).

We understand that the process now is for the OAIS group (MOIMS-DAI) to finalise their text, and then put it out for a second round of public comment early next year. After that, it will go through the CCSDS review process (ie it gets voted on by Space Agencies interested in Mission Operations Information), before going on to ISO, where it gets voted on by National Bodies involved in its Technical Committee! So, don't hold your breath!

IDCC4 afternoon research paper session

I really want to write about the afternoon research papers at the International Digital Curation Conference, but I start from a great disadvantage. First, I wanted to hear papers in BOTH the parallel sessions, so I ended up dashing between the two (and of course, getting lost)! Then my battery went flat, so no notes, and I left the laptop in the first lecture room while going off to the second… and of course, near the end of that paper, someone came in saying a couple of laptops had been stolen. Panic, mine was OK, thank goodness (when did I last do a proper backup? Ahem!).

I’ve reported on the Australian ARCHER and ANDS topics before; the papers were a little different, but the same theme. The paper I really wanted to hear was Jim Frew on Relay-Supporting Archives. Frew’s co-author was Greg Janee; both visited the DCC, Frew for a year and Janee for a few days, talking about the National Geospatial Data Archive. I thought this was a great paper, but I kept thinking “I’m sure I said that”. Well, maybe I did, but maybe I got the ideas from one or other of them! They have a refreshingly realistic approach, recognising that archive survivability is a potential problem. So if content is to survive, it needs to be able to move from archive to successor archive, hence the relay. An archive may then have to hand on content that it received from another archive, ie intrinsically unfamiliar content. How’s this for a challenging quote: “Resurrection is more likely than immortality”! He looked at too-large standards (eg METS), and too-small standards (eg OAI-ORE), and ended up with the just-right standard: the NGDA data model (hmmm, I was with him right up till then!).

The conclusion was interesting, though:
"Time for a New AIHT
  • Original was technology neutral
  • New one should test a specific set of candidate interoperability technologies"
The conference closed with an invigorating summing up from Malcolm Atkinson, e-Science Envoy. Slides not available yet; I’ll update when they are. He sent us off with a rousing message: what you do is important, get to it!

See you next year in London?

Strand B1 research papers at IDCC4

In the morning parallel session B at the International Digital Curation Conference, the ever-interesting Jane Hunter from the University of Queensland began the session speaking about her Aus-e-Lit project (linked to AustLit). The project is based on FRBR, and offers federated search interfaces to related distributed databases. They include tagging and annotation services, and compound object authoring tools, based on an OAI-ORE based tool called the Literature Online Research Environment (LORE). These compound objects, published to the semantic web as RDF named graphs, can express complex relationships between authors and works that represent parts of literary history, tracking the lineage of derivative works and ideas, scholarly editions and research trails; they can also be used for teaching and learning objects. There were of course problems: the need for unique identifiers for non-information resources; use of local identifiers; use of non-persistent identifiers; community concern about ontology complexity… but also a desire for more complexity (wasn’t it ever thus)!

Hunter was followed by Kai Naumann, from the State Archive of Baden-Württemberg. They have many issues, starting with the very wide range of objects, from paper and microfilm to digital objects, of which a large number are now being ingested. They need to offer finding aids that work regardless of media and object type, and to maintain and assure authenticity and integrity. They chose to use the PREMIS approach to representations (which reminded me again how annoying the clash of PREMIS terminology with that of OAIS is!). It seemed to be a very sophisticated approach. Naumann suggested that preservation metadata models should balance 3 aims: instant availability, easy ingest and long-term understandability. For low use (as in real archives), with heterogeneous objects, you need to design relatively simple metadata sets. It’s important to maintain relational integrity between the content and metadata (I’m not sure that has a strict, relational-database meaning!). Structural relations between content units can be critical for authenticity.

Jim Downing of Cambridge came next, speaking for Pete Sefton, USQ (although he admitted he was a pale, short, damp substitute for the real thing!). The topic was embedding metadata & semantics in documents; they work together on the JISC-funded ICE-Theorem project. We need semantically rich documents for science, but most documents start in a (business-oriented) word processor. Semantically rich documents enable automation, reduce information loss, and have better discovery and presentation. Unfortunately, users (and I think, authors) don’t really distinguish between metadata, semantics and data. Document and metadata creation are not really separable. Documents often have multiple, widely separated authors, particularly in the sciences. Their approach is to make it work with MS Word & OpenOffice Writer, & use Sefton’s ICE system. They need encodable semantics that are round-trip survivable: if created by a rich tool, then processed with a vanilla tool, then checked again with the rich tool, the semantics should still be present, for real interoperability. Things they thought of but didn’t do: MS Word Smart Tags, MS Word foreign namespace XML, ODF embedded semantics, anything that breaks WYSIWYG (such as wiki markup in the document), new encoding standards. What does work? An approach based on microformats (I thought simply labelling it as microformats was a bit of an over-statement, given the apparent glacial pace of “official” microformat standardisation!). They will overload semantics into existing capabilities, eg tables, styles, links, frames, bookmarks, fields (some still fragile).
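The round-trip survivability criterion can be illustrated with a toy model (entirely my own invention, not the ICE implementation): treat a document as a list of styled text runs plus a separate application-specific side-channel of metadata. A vanilla word processor preserves styles and text but silently drops the side-channel, so semantics overloaded onto style names survive the round trip while side-channel semantics do not.

```python
def vanilla_edit(doc):
    """Simulate a plain word processor: keeps the styled runs and text,
    drops any application-specific side-channel metadata."""
    return {"runs": list(doc["runs"]), "extra": {}}

def extract_semantics(doc):
    """A rich tool recovers semantics from style names (microformat-style
    overloading) and from the side-channel, where present."""
    sem = {}
    for style, text in doc["runs"]:
        if style.startswith("sem-"):       # semantics encoded in the style name
            sem[style[len("sem-"):]] = text
    sem.update(doc["extra"])               # side-channel semantics, if any
    return sem

doc = {
    "runs": [("sem-chem-compound", "H2O"), ("body", "is essential to life")],
    "extra": {"author-orcid": "0000-0000-0000-0000"},
}
after = vanilla_edit(doc)
# style-encoded semantics survive the vanilla round trip;
# side-channel semantics are lost
```

This is why overloading existing capabilities (styles, tables, bookmarks) is attractive despite its fragility: those are exactly the features a vanilla tool is guaranteed to carry through.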

Paul McKeown from EMC Corp gave a paper written by Stephen Todd on XAM, a standard newly published by SNIA. It originated from the SNIA 100-year archive survey report, which highlighted the separation of logical and physical data formats, and the need over time to migrate content to new archives. XAM offers location-independent object naming (the XUID), with rich metadata, and a pluggable architecture for storage system support: the application talks to the XAM layer, which abstracts away the vendor implementations. There are 3 primary objects: the XAM library (an object factory?), XSets and XStreams. It also has a retention model. (Sorry Paul, Stephen, I guess I was getting tired at this point!)

Martin Lewis on University Libraries and data curation

Martin Lewis opened the second day of the International Digital Curation Conference with a provocative and amusing keynote on the possible roles of libraries in curating data. It was very early, with his presentation [large PPT] starting at 8:40 am, and the audience after the conference dinner in the splendid environs of Edinburgh Castle was unsurprisingly thin. However, absentees missed an entertaining and thought-provoking start to the day. Martin is great at the provocative remark, usually tongue in cheek; perhaps not all our visitors caught the ironic tone, a peril of speaking thus to an international audience (eg “this slide shows a spectrum of the disciplines, from the arts and humanities on this side, to the subjects with intellectual rigour over here”!).

He mentioned some common keywords when people think of libraries: conservation, preservation, permanent, but perhaps seen as intimidating, risk-averse, conservative. What can libraries do in relation to data? He promised us 8 things, but then came up with a 9th:
  1. Raise awareness of data issues,
  2. lead policy on data management,
  3. provide advice to researchers about their data management early in life cycle,
  4. work with IT colleagues to develop appropriate local data management capacity,
  5. collectively influence UK policy on research data management,
  6. teach data literacy to research students,
  7. develop KB of existing staff,
  8. work with LIS educators to identify required workforce skills (see Swan report), and
  9. enrich UG learning experience through access to research data.
Note, this list does not (currently) include much direct, practical support for data curation or preservation: no value-added data repositories, etc. Although a few libraries are venturing gently into this area, without extra support he believes that libraries are too stretched to be able to take on such a challenge. The library sector is stretched very thin, bearing in mind the major changes needed to accommodate the shift to digital publishing, and large investments in improving support for the teaching and learning side of things, including his own library's new building (which is leading to increasing library footfall, and increasing loans of non-digital materials, which must still be managed).

(It struck me later, by the way that this presentation was in a way quintessentially British, or at least non-American, in the extent of its attention to these non-research elements; American chief librarian colleagues have seemed to be focused on the research library and less interested in the under-graduate library.)

So bearing the resource problems in mind, Lewis took us through the Library/IT response to the English Funding Council’s Shared Services programme: the UKRDS feasibility study. He noted the scale of the data challenge, still poorly served by policy and infrastructure, and the major gaps in current data services provision. University Library and IT Services in several institutions are coming under pressure to provide data-oriented services, mainly just now for storage (necessary but not sufficient for curation). He reminded us that the UK eScience core programme was a world leader; we had many reports to refer to, especially including Liz Lyon’s “Dealing with Data” report, but we are getting to the point where we need services not projects.

The feasibility study surprisingly shows low levels of use of national facilities; most data curation and sharing happens within the institution. The feasibility study identified 3 options:
  • do nothing,
  • massive central service, or
  • coordinated national service.
Despite an amusing side excursion exploring the imagined SWAT teams of a massive central service, swarming in from their attack helicopters to gather up neglected data, the last alternative is a clear winner for the study.

There was a strong evidence base of gaps in data curation service provision in the UK. Cost savings were hard to calculate in a new area, but were compared with the potential alternative of ubiquitous provision by universities for their own data. UKRDS would seek to embrace rather than replace existing services, but might provide them with additional opportunities. The next steps were to publish their report, which would recommend (and, he hopes, extract the funds to enable) a Pathfinder phase (operational, not pilot).

During questions, the inevitable from a representative of a discipline area with already well-established support (and I paraphrase): “how can you do the real business of curation, namely adding value to collected datasets, when you are at best generalists without real contact with the domain scientists necessary for curation?” To which, my response is “the only possible way: in partnership”!

Monday, 8 December 2008

Wilbanks on the Control Fallacy: How Radical Sharing out-competes

Closing the first day of the International Digital Curation Conference, and as a prelude to a substantial audience discussion, John Wilbanks from Science Commons outlined his vision and his group’s plans and achievements. His slides are available on Slideshare and from the IDCC web site.

John suggests that the only real alternative to radical sharing is inefficient sharing! Science (research and scholarship, to be more general) is in a way not unlike a giant Wikipedia, an ever-changing consensus machine, based on publishing (disclosing, making things public); advances by individual action, and by discrete edits (ie small changes to the body of research represented by individual research contributions). Unlike Wikipedia the Science consensus machine is slow and expensive, but it does have strong authentication and trust. It represents an “inefficient and expensive ecosystem of processes to peer-produce and review scholarly content”.

However, disruptive processes can’t be planned for, and when they occur, they attract opposition from entrenched interests (open access is one such disruptive system). So a scholarly paper may be thought of as “an advertisement for years of scholarship” (I think he gave a reference but I can’t find it). International research forms a highly stable system, good at resisting change on multiple levels, and in many cases this is a Good Thing. This system includes Copyright (Wilbanks suggested IPR was an unpopular term, new to me but perhaps a US perspective); traditionally, in the analogue world, copyright locks up the container, not the facts. New publishing business models lock up even more rights on “rented” information. If we can deposit our own articles (estimated cost 40 minutes per researcher per year), then we can add further services over our own material (individually or collectively), including tracking use and re-use; then we can out-compete the non-sharers.

Science Commons has been working for the past 2 years focusing their efforts in both the particular (one research domain, building the Neurocommons) and the general; their approach requires sharing. They discovered two things quite early on: first, that international rules on intellectual property in data vary so widely that building a general data licence to parallel the Creative Commons licences for text was near impossible, and second, that viral licences (like the Creative Commons Share-Alike licences, and the GPL) act against sharing, since content under different viral licences cannot be mixed. So their plan is to try putting their data into the public domain as the most free approach. They use a protocol (not a licence) for implementing open data. Reward comes through trademark, badging as part of the same “tribe”. Enforcement doesn’t apply; community norms rule, as they do in other areas of scholarly publishing. For example, attribution is a legal term of art (!!), but in scholarly publishing we prefer the use of citations to acknowledge sources, with the alternative for an author being possible accusations of plagiarism, rather than legal action (in most cases).

There was some more, on the specifics of their projects, on trying to get support in ‘omics areas for a semantic web of linked concepts, built around simple, freely alterable ontologies; how doing this will help Google Scholar rank papers better. Well worth a look at the slides (worth while anyway, since I may well have got some of this wrong!). But Wilbanks ended with a resounding rallying cry: Turn those locks into gears! Don’t wait, start now.

Friday, 5 December 2008

Bryan Lawrence on metadata as limit on sustainability

Opening the Sustainability session at the Digital Curation Conference, Bryan Lawrence of the Centre for Environmental Data Archival and the British Atmospheric Data Centre (BADC), spoke trenchantly (as always) on sustainability with specific reference to the metadata needed for preservation and curation, and for facilitation for now and the future. Preservation is not enough; active curation is needed. BADC has ~150 real datasets but thousands of virtual datasets, tens of millions of files.

Metadata, in his environment, represents the limiting factor. A critical part of Bryan’s argument on costs relates to the limits on human ability to do tasks, particularly certain types of repetitive tasks. We will never look at all our data, so we must automate, in particular we must process automatically on ingest. Metadata really matters to support this.

Bryan dashed past an interesting classification of metadata, which from his slides is as follows:
  • A – Archival (and I don’t think he means PREMIS here: “normally generated from internal metadata”)
  • B – Browse: context, generic, semantic (NERC developing a schema here called MOLES: Metadata Objects for Linking Environmental Sciences)
  • C – Character and Citation: post-fact annotation and citations, both internal and external
  • D – Discovery metadata, suitable for harvesting into catalogues: DC, NASA-DIF, ISO19115/19139 etc
  • E – Extra: Discipline-specific metadata
  • O – Ontology (RDF)
  • Q – Query: defined and supported text, semantic and spatio-temporal queries
  • S – Security metadata
The critical path relates to metadata, not content; it is important to minimise the need for human intervention, and this means minimising the number of ingestion systems (specific processes for different data types and data streams), and minimising the types of data transformations required (the problem being validating the transformations). So advice from data scientists TO the scientists is critical before creation; hence the data scientist needs domain knowledge to support curation.

They can choose NOT TO TAKE THE DATA (but the act of not taking the data is itself resource-intensive). Bryan showed and developed a cost model based on 6 components; it’s worth looking at his slides for this. But the really interesting stuff was in his conclusions on limits, with 25 FTE:
"• We can support o(10) new TYPES of data stream per year.
• We can support doing something manually if it requires at most:
– o(100) activities of a few hours, or
– o(1000) activities of a few minutes, but even then only when supported by a modicum of automation.
• If we have to [do] o(10K) things, it has to be completely automated and require no human intervention (although it will need quality control).
• Automation takes time and money.
• If we haven’t documented it on ingestion, we might as well not have it …
• If we have documented it on ingestion, it is effectively free to keep it …
• … in most cases it costs more to evaluate for disposal than keeping the data.
• (but … it might be worth re-ingesting some of our data)
• (but … when we have to migrate our information systems, appraisal for disposal becomes more relevant)”
Interestingly, they charge for current storage costs for 3 years (only) at ingest time; by then, storage will be “small change” provided new data, with new storage requirements, keep arriving. Often the money arrives first and the data very much later, so they may have to be a “banker” for a long time. They have a core budget that covers administration, infrastructure, user support, and access service development and deployment. Everything changes next year however, with their new role supporting the Inter-Governmental Panel on Climate Change, needing Petabytes plus.
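Bryan's order-of-magnitude limits follow from simple staff-hours arithmetic. A back-of-envelope sketch (my own illustrative numbers, not his 6-component cost model) makes the point: against a budget of some fraction of a person-year, o(100) few-hour tasks or o(1000) few-minute tasks fit, while o(10K) tasks of any appreciable duration cannot be done manually.

```python
def staff_hours_needed(n_items, hours_per_item):
    """Total manual effort for n_items repetitive tasks."""
    return n_items * hours_per_item

def feasible_manually(n_items, hours_per_item,
                      fte_available=1.0, hours_per_fte_year=1600):
    """Crude test: does the manual effort fit within the staff-hours
    budget for a year? 1600 hours/FTE-year is an assumed round figure."""
    budget = fte_available * hours_per_fte_year
    return staff_hours_needed(n_items, hours_per_item) <= budget

# o(100) activities of a few hours: ~300 staff-hours -> fits in 1 FTE-year
# o(1000) activities of a few minutes: ~80 staff-hours -> easily fits
# o(10000) activities of even 1 hour each: 10000 staff-hours -> automate it
```

On these numbers the conclusion is forced: anything at the o(10K) scale has to be completely automated, with humans doing quality control only, which is exactly Bryan's point.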

Thursday, 4 December 2008

International Digital Curation Conference Keynote

I'm not sure how much I should be blogging about this conference, given that the DCC ran it, and I chaired quite a few sessions etc. But since I've got the scrappy notes, I may try to turn some of them into blog posts. I've spotted blog postings from Kevin Ashley on da blog, and Cameron Neylon on Science in the Open, so far.

Excellent keynote from David Porteous, articulate and passionate supporter of the major “Generation Scotland” volunteer-family-based study of Scottish health (notoriously variable, linked to money, diet, and smoking, and in some areas notoriously poor). Some interesting demographic images, based on Scottish demography in 1911, 1951 and 2001 and projected to 2031, show the population getting older, as we know. It’s not only the increasing tax burden on a decreasing proportion of workers, but also the rise of chronic disease. The fantastically-named Grim Reaper’s Road Map of Mortality shows the very unequal distribution, with particular black spots in Glasgow. About half of these effects are “nurture” and about half are “nature”, but really it’s about the interplay between them.

He spoke next about genetics, sequencing and screening. Genome sequencing took 13 years and cost $3B, now a few weeks and $500K, next year down to a few $K? Moving to a system where we hope to identify individuals at risk, start health surveillance, understand the genetic effects and target rational drug development, we hope reducing bad reactions to drugs.

Because of population stability, the health and aging characteristics, and a few legal and practical issues (such as cradle-to-grave health records), it turns out that Scotland is particularly suited to this kind of study. Generation Scotland is a volunteer, family-based study (illustration from the Broons!). There are Centres in Edinburgh, Glasgow, Dundee, Aberdeen; no-one should be more than an hour’s travel away etc. Major emphasis on data collection, curation and integration through an integrated laboratory management system, in turn linking to the health service and its records. Major emphasis on security and privacy. Consent is open consent (rather than informed consent), but all have a right to withdraw from the study, and must be able to withdraw any future use of their data (only 2 out of nearly 14,000 have withdrawn so far!).

This wasn’t a talk about the details of curation, but it was an inspiring example of why we care about our data, and how, when the benefits are great enough and the planning and careful preparation are good enough, even major legal obstacles can be overcome.