Digital Curation Blog: July 2008

Tuesday, 29 July 2008

"Digital Preservation" term considered harmful?

Over the past few weeks I have become acutely aware that the term “digital preservation” may be becoming a problem. This issue was first brought out (as far as I’m aware) by James Currall in the espida project:

“The digital preservation community has become very good at talking to itself and convincing ‘paid-up’ members of the value of preserving digital information, but the language used and the way that the discourse is constructed is unlikely to make much impact on either decision-makers or the creators of the digital information (academics, administrators, etc.).”

(espida update report, March 2006, internal communication.)

The espida project drove this reasoning forward towards the creation of persuasive business cases built on a variant of the balanced score card approach. This is very valuable, but my little epiphany was slightly different. It’s not just the language that makes digital preservation unconvincing to the decision maker. Part of the problem is that digital preservation describes a process, and not an outcome.

There are many arcane, backroom processes involved in the familiar concept of “the library”, including accessions, cataloguing, circulation and conservation. In general, neither the library user nor the institutional decision maker need be concerned about these processes. The large set of outcomes (including for example a place for study, or a resource for future scholarship) is what we value.

Similarly, in the digital domain, we should be selling the outcomes. While we have to use "digital preservation" in appropriate contexts, including technical and other in-house discussions, and digital curation is appropriate in other contexts, terms that reflect the outcomes are more persuasive. The outcome of successful digital preservation is that digital resources remain accessible and usable over the long term. This outcome-related notion is immediately understandable to users and decision makers, and most will find something in their past with which it resonates. Even members of the general public "get" the need to ensure their digital photos and other documents remain accessible over time, even if they don't act on this awareness. In contrast, digital preservation has been over-sold as difficult, complex and expensive over the long term, while the term itself contains no notion of its own value.

So I would argue that outcome-related phrases like "long term accessibility" or "usability over time" are better than the process-oriented phrase "digital preservation".

Monday, 28 July 2008

Claimed 200 year media life

Graeme Pow spotted the announcement a few weeks back of Delkin's Archival Blu-ray media:

"New from Delkin Devices is Archival Gold Blu-ray, a recordable disc that guarantees to preserve data safely for over 200 years. In addition to unprecedented longevity standards, Delkin BD-R boast a market-leading read/write speed of 4x, enabling a 25GB burn to be completed in only 23 minutes -- and the disc is coated in ScratchArmour, a scratch-proof coating that claims to be 50 times better than other coatings. "

Italicised part is a direct quote from Delkin's site. Delkin also claim at the site linked above:

"Delkin archival Blu-ray (BD-R) discs offer the longest guaranteed protection over time."

There is a fair amount of scepticism about such claims, for example see the cdfreaks forum thread on archival gold DVD-Rs...

So what does this mean?

Well firstly they clearly haven't run the media for 200 years, so presumably this is the result of some accelerated aging tests extrapolated into "normal" use.
Secondly, if the disk did fail however many years into the future, who would your successors claim from, and what? I don't have a copy of the "guarantee", but my guess is you'd at best get replacement media cost, not the lost content value (a wag on cdfreaks suggested that the guarantee was only available to the original purchaser ;-).
Thirdly, if your successors still have the disk in 200 years, what are the odds of having hardware and software to read it? Close to zero I guess.
Fourthly, would you even know what's on it? Oh, you wrote a label on the outside, did you? Or perhaps wrote in marker on the disk? Or...

It might seem like a good thing to have disks that last for much longer than the devices. After all, that would remove one error source, and would mean you could use your devices to help in the transfer to newer media. However, assuming the media really were robust, this might lead to a temptation to ignore the transfer, until you (or your descendants or successors) suddenly find your last Blu-ray device has failed!

In practice, of course, relying on any such claims would be foolish. It's not a bad idea to use good quality media, but I would want to choose it based on testing from an independent lab rather than manufacturers' claims. And then it should fit into a well-planned strategy of media management. I guess it would be wise to keep some archival media of the same type but non-critical contents, and test them from time-to-time, watching for increased error rates.
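One way to carry out the periodic check suggested above is to record checksums when the test media are written and re-verify them later. The following is only a minimal sketch, assuming the files on the disc are visible as an ordinary directory; the function names and the manifest layout are illustrative, not any standard tool. Note that a checksum mismatch or read error only shows up once the drive's own error correction has already failed, so this is a coarse check; watching raw error rates needs drive-level tools.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return the relative paths that are missing, unreadable or changed."""
    root = Path(root)
    problems = []
    for rel_path, expected in manifest.items():
        try:
            if sha256_of(root / rel_path) != expected:
                problems.append(rel_path)
        except OSError:
            # Missing file, or a read failure reported by the medium.
            problems.append(rel_path)
    return problems
```

The manifest would be built once, when the test disc is burned, and kept somewhere other than on the disc itself; `verify` is then run against the mounted disc at each check.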

Interestingly, I've just re-read (yet again) the OAIS section 5, which covers these issues. I can't say it's a lot of help for most people wanting advice. It refers to the whole issue as one of "Digital Migration" (not the customary use of the word migration in digital preservation circles these days, where it is mostly used in contrast to emulation), and lists 4 types:

Refreshment: A Digital Migration where a media instance, holding one or more AIPs or parts of AIPs, is replaced by a media instance of the same type by copying the bits on the medium used to hold AIPs and to manage and access the medium. As a result, the existing Archival Storage mapping infrastructure, without alteration, is able to continue to locate and access the AIP.
Replication: A Digital Migration where there is no change to the Packaging Information, the Content Information and the PDI. The bits used to convey these information objects are preserved in the transfer to the same or new media-type instance. Note that Refreshment is also a Replication, but Replication may require changes to the Archival Storage mapping infrastructure.
Repackaging: A Digital Migration where there is some change in the Packaging Information.
Transformation: A Digital Migration where there is some change in the Content Information or PDI bits while attempting to preserve the full information content.

So that's clear, then. I think the first two are merely replacement by identical media. Later text makes clear that the third is the case we would be considering here, ie the replacement of one media type with another, while the fourth would represent what we would currently call migration, ie making some change to the information object in order to preserve its information content. However, despite quite a bit of discussion on scenarios of the various digital migration types, I could not see much in the way of good practice advice there.
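To make the distinctions a little more concrete, here is a small sketch that sorts a migration into one of the four types according to what changed. It follows only the wording of the four definitions as quoted above; the later OAIS text may place particular cases differently, and the function and its parameter names are my own shorthand, not anything defined by OAIS.

```python
from enum import Enum


class Migration(Enum):
    REFRESHMENT = "Refreshment"
    REPLICATION = "Replication"
    REPACKAGING = "Repackaging"
    TRANSFORMATION = "Transformation"


def classify(content_or_pdi_changed, packaging_changed,
             media_type_changed, mapping_changed):
    """Pick the migration type implied by what was altered.

    The checks run from the most to the least invasive change, so a
    migration that alters content is a Transformation whatever else
    happened to the packaging or the media.
    """
    if content_or_pdi_changed:
        return Migration.TRANSFORMATION
    if packaging_changed:
        return Migration.REPACKAGING
    if media_type_changed or mapping_changed:
        # Bits preserved, but not a like-for-like media swap.
        return Migration.REPLICATION
    # Same bits, same media type, same storage mapping.
    return Migration.REFRESHMENT
```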

Tuesday, 22 July 2008

How open is that data?

Thanks to the Science Commons blog for drawing this article on Nature Precedings to my attention:

DE ROSNAY, M. D. (2008) Check Your Data Freedom: A Taxonomy to Assess Life Science Database Openness. Nature Precedings. doi:10.1038/npre.2008.2083.1

There is an interesting discussion of the issues affecting accessibility (like "no policy visible on the web site"), and the article includes this good set of questions any data curators should ask themselves:

"A. Check your database technical accessibility

A.1. Do you provide a link to download the whole database?
A.2. Is the dataset available in at least one standard format?
A.3. Do you provide comments and annotations fields allowing users to understand the data?

B. Check your database legal accessibility

B.1. Do you provide a policy expressing terms of use of your database?
B.2. Is the policy clearly indicated on your website?
B.3. Are the terms short and easy to understand by non-lawyers?
B.4. Does the policy authorize redistribution, reuse and modification without restrictions or contractual requirements on the user or the usage?
B.5. Is the attribution requirement at most as strong as the acknowledgment norms of your scientific community?"

All good questions to ask, although in some fields the right answer (for some datasets) to questions such as B.4 should be "No" (ethical considerations, for example, might dictate otherwise).

Saturday, 19 July 2008

Load testing repository platforms

Very interesting post on Stuart Lewis's blog:

"One of the core aims of the ROAD project is to load test DSpace, EPrints and Fedora repositories to see how they scale when it comes to using them as repositories to archive large amounts of data (in the form of experimental results and metadata). According to ROAR, the largest repositories (housing open access materials) based on these platforms are 191,510, 59,715 and 85,982 respectively (as of 18th July 2008). We want to push them further and see how they fare."

Stuart goes on to mention a paper on "Testing the Scalability of a DSpace-based Archive", that pushed it up to 1 million items with "no particular issues"! They have their chunky test hardware in place, and plan to do the stress test using SWORD, so that they can throw the same suite of data at each repository.

He says

"More details will be blogged once we start getting some useful comparative data, however seeing as the report cited above took about 10 days to deposit 1 million items, it may be some weeks before we’re able to report data from meaningful tests on each platform."

This is an extremely worthwhile exercise, particularly for those thinking of using repository software platforms for data, where the numbers of objects can very rapidly mount up.
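To put the quoted figure in perspective, here is a quick back-of-envelope calculation. It assumes the 10 days were continuous running, which the quote does not actually state, and that the three platforms would be loaded one after another at the same rate, which is purely my assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def items_per_second(items, days):
    """Average deposit rate over the whole run."""
    return items / (days * SECONDS_PER_DAY)


def days_needed(items, rate_per_second):
    """Days of continuous running to deposit `items` at the given rate."""
    return items / rate_per_second / SECONDS_PER_DAY


# 1 million items in 10 days, as reported for the DSpace test.
rate = items_per_second(1_000_000, 10)
print(f"{rate:.2f} items per second")              # 1.16 items per second

# Assumption: the same load on three platforms, run in sequence.
print(f"{days_needed(3_000_000, rate):.0f} days")  # 30 days
```

At a little over one deposit per second, "some weeks" looks like a fair estimate.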

Wednesday, 16 July 2008

JIF08 Technical Infrastructure session

At the second day of the JISC Innovation Forum, I attended an interesting discussion in the data theme on technical infrastructures. This post derives from that discussion, but is neither a complete reflection of what was said, nor at all in the order of discussion; it reflects some bits I found interesting. My thanks to Matthew Dovey who chaired, and to all who contributed; I’ve identified none rather than only some! The session is also being blogged here…

Although early on we were talking about preservation, we came back to a discussion on immediate re-use and sharing; it seems to me that these uses are more powerful drivers than long term preservation. If we organise things right, long term preservation could be a cheap consequence of making things available for re-use and sharing. Motivations quoted here include data as evidence, the validation aspects of the scientific method. We were cautioned that validation from the data is hard; you will need the full data and analysis chain to be effective (more data, providing better provenance…).

Contextual metadata might include (parts of) the original proposal (disclosing purpose and motivation). Funders keep these proposals, as part of their records management systems, but they might not be easily accessible and are likely eventually to be deleted. An action for JISC might be to find ways of making parts of these proposals more appropriately available to support data.

There are issues of the amount of disclosure or discoverability that people want, or maybe should offer whether they want or not; we touched tangentially on open notebook science. Scary, but so is sharing data, for some! Extending re-usability through re-interpreting to integration may be big steps.

To what extent does standardisation help? Well, quite a lot, obviously, although science is slow and science is innovative, so many researchers will either have no appropriate standards to use, or not know about the standards, or have poor or no tools to implement/use the standards, or maybe find the standards less than adequate and overload them with their own private extensions or encodings. And to some extent re-use could be about the process as much as the data (someone spoke of preserving workflows, and someone of preserving protocols, although “process” may be much less formal than either…).

We had a recurring discussion about the extent to which this feeds back to being a methodological question. There was a question about the efficacy of research practice; how do we maximise positive outcomes? Should we aim to change science to achieve better curation? While in the large scale this sounds exceedingly unlikely, research being exceedingly resistant to external change, successes were reported. BADC have managed to get scientists to change their methods to collect better metadata, and maybe better data. Maybe some impact could be made via research training curricula? Another opportunity for JISC?

We talked about the selection of useful data; essential, but how can it be achieved? Some projects, such as the LHC, have this built into the design, as the volumes are orders of magnitude too large to deal with otherwise. While low-retention selection might be attractive, others were arguing elsewhere for retaining more data that are currently lost, to improve provenance (ie successive refinement stages).

The idea of institutional data audit was mentioned; there is a pilot data audit framework being developed now, and at least one institution in the room was participating in the pilot process. This might be a way of bringing issues to management attention, so that institutions can understand their own needs. Extending this more widely may be useful.

We talked about what a research library could do, or maybe has to start thinking of doing. On the small scale, success in digital thesis deposit is bringing problems in managing associated data files (often of widely varied types: audio, video, surveys, spreadsheets, databases, instrumentation outputs, etc). At the moment the answer is often to ZIP up the supplementary files and cross your fingers (as close to benign neglect as possible!). This is very close to the approach in use for years with paper theses, where supplementary materials were often in portable media in a pocket inside the back cover… close the volume and put it on the shelves. This might be acceptable with a trickle, but not a deluge… What would be a better approach in dealing with this growing problem (exacerbated by issues related to the well-known neologism problems of theses, ie that young researchers will likely not use applicable standards in quite the right way)? It was clear that dealing with this stuff is beyond the expertise of most librarians, and requires some sort of partnership with domain scientists. This whole area could represent an opportunity for JISC to help.

Since we were thinking of institutional repositories and larger scale subject repositories, we had a skirmish on the extent to which we need a full-blown OAIS, as it were, or a minimal, just-enough effort. It seemed in a sense the answer was: both! It does make sense to think carefully about OAIS, but also to make sure you do just enough, to ensure that high preservation costs in one area are not preventing collection in other areas (selection anyway being so fallible). A good infrastructure should help keep it cheap. There is a question on how you would know if you were paying too much in effort, or too little? Perhaps repository audit and certification, or perhaps risk assessment might help.

This brought us on to the issue of tool development. Most existing, large scale data centres use home-built, filestore-based systems, not very suitable for small-scale use. Existing repository software platforms are not well suited to data (square peg in round hole was quoted!). Funding developments to improve the fit for data was seen as a possible role for JISC. Adding some better OAIS capabilities at the same time might also be useful, as might linking to Virtual Research Environments such as Sakai, or to Current Research Information Systems (CRISs). Is the CERIF standard developed by EuroCRIS helpful, or not?

Overall, it was a useful session, and if JISC takes up some of the opportunities for development suggested, it should prove doubly, trebly or even more useful!

Tuesday, 15 July 2008

IPR and science data integration

Preparing for the JISC Innovation Forum, I have been reading John Wilbanks’ comment piece (Wilbanks, 2008) tracing the reasoning behind the Science Commons approach of abandoning Creative Commons-style licensing for integratable data, in favour of a dedication to the public domain plus codified community norms. To be honest, I was gob-smacked when this came about; it seemed that a desirable outcome (CC-like licensing for data) had been abandoned. However, Wilbanks makes a powerful argument, and as a non-lawyer (albeit someone who has been arguing about the existence and/or nature of any implied licence for web pages since 1995 or so), it seems to me a very convincing one.

It is a critical point that this is the approach of choice where the data are to be offered for integration with other data on the open web. So this approach would be sensible for most of the kinds of data we think of as forming part of the semantic web, for instance.

The arguments clearly do not apply, and are specifically not intended to apply, to all classes of science data or databases. For instance, many science data are collected under grants from funders who require conditions to be placed on subsequent use. Wilbanks’ analysis might suggest that some of these mandates are misguided, and this may be so, but some have very strong ethical and legal bases, particularly (but not only) research producing or using social science or medical data relating to individuals.

The UK Data Archive, for example, requires registration and agreement to an end user licence before access to any datasets. The standard licence, for example, requires me (my emphasis):

“2. To give access to the Data Collections, in whole or in part, or any material derived from the Data Collections, only to registered users who have received permission from the UK Data Archive to use the Data Collections, with the exception of Data Collections supplied for the stated purpose of teaching.”

And…

“5. To preserve at all times the confidentiality of information pertaining to identifiable individuals and/or households that are recorded in the Data Collections where the information contained in the Data Collections was created less than 100 years previously, or where such information is not in the public domain. In addition, where so requested, to preserve the confidentiality of information about, or supplied by, organisations recorded in the Data Collections. In particular I undertake not to use or attempt to use the Data Collections to deliberately compromise or otherwise infringe the confidentiality of individuals, households or organisations. Users are asked to note that, where Data Collections contain personal data, they are required to abide by the current Data Protection Act in their use of such data.”

UKDA are right to take a strong protectionist approach to most of these data (maybe the 1881 census data could get a more wide-ranging exclusion than that implied in the previous paragraph). Problems of cascading and non-inter-operable licence conditions may arise, but probably only on a small scale, and likely to be resolvable on the basis of negotiations, or through emerging facilities specifically aimed at providing access to potentially disclosive data.

Wilbanks recognises these problems:

“There will be significant amounts of data that is not or cannot be made available under this protocol. In such cases, it is desirable that the owner provides metadata (as data) under this protocol so that the existence of the non-open access data is discoverable.”

The UKDA metadata catalogue is both open and interoperates, so it does effectively apply this rule (although perhaps not yet to the letter).

Likewise, BADC distributes some data that it has acquired through NERC (funder) mandates (effectively their data policy, apparently under review), but also some data that it has acquired from sources such as the Met Office Hadley Centre, for whom it might have high commercial value. It's not surprising that some of these funders impose restrictions, although it would be good if they would look long and hard at the value question before doing so.

At the Forum legal session on data, there was a debate on the motion: “Curating and sharing research data is best done where the researcher’s institution asserts IPR claims over the data”. Prior to the speakers, a straw poll suggested that 5 were in favour, 10 against with 7 abstentions. After the debate, the motion was lost with 6 in favour, 14 against and 2 abstentions. What the debate most illuminated, perhaps, was the widespread distrust held for the whole apparatus: institutions, publishers, researchers, even curators… and in addition, we all probably shared a lack of clarity about the nature of data, the requirements of collaboration, the impacts of disciplinary norms, effective business models, etc, etc.

In practice, of course, curation is going to require a partnership of all the stakeholders identified above, and probably more!

WILBANKS, J. (2008) Public domain, copyright licenses and the freedom to integrate science. Journal of Science Communication, 7. http://jcom.sissa.it/

IDCC 4 paper deadline fast approaching

This is a reminder that the closing date is fast approaching for submission of full papers, posters and demos for the conference (details below). The deadline is 25 July 2008. We invite submissions from individuals, organisations and institutions across all disciplines and domains engaged in the creation, use and management of digital data, especially those involved in the challenges of curating and preserving data in eScience and eResearch. Templates for submissions and other details are available at http://www.dcc.ac.uk/events/dcc-2008/ . Remember, accepted papers will be submitted for publication in the International Journal of Digital Curation, http://www.ijdc.net.

The 4th International Digital Curation Conference will be held on 1-3 December 2008 at the Hilton Grosvenor Hotel, Edinburgh, Scotland, UK. The conference will be held in partnership with the National eScience Centre and supported by the Coalition for Networked Information (CNI).

The main themes of the call for papers include:

Research Data Infrastructures
Curation and eResearch
Sustainability
Disciplinary and Inter-disciplinary Curation Challenges
Challenging types of Content
Legal and Licence Issues, and
Capacity Building

… but see the call for more details.

We are sure there are plenty of you out there with great ideas to share; still time to get that draft paper into final shape!

Wednesday, 9 July 2008

Semantic web on the Today programme

Turning on the radio this morning, I was surprised to hear someone discussing data on the Today programme on BBC Radio 4. It turned out to be Sir Tim Berners-Lee talking about the semantic web; he even managed to mention RDF and HTML without confusing the interviewers too much. The interesting 8 and a half minute discussion is available via the BBC iPlayer.

I'm quite keen on understanding better how the semantic web might relate to science data. When people talk about data in relation to the semantic web, they often seem to be thinking of the sort of relatively unitary facts or simple relationships that RDF triples are quite good at expressing, such as the population of China is 1,330,044,605. It's certainly not clear to me how this generalises to express the changing population, let alone how it could express the data from a spectrometer or a crystal structure determination or remote-sensing data. If anyone can point me to a good resource discussing this, I would be grateful!
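As a rough illustration of the difficulty, here is a sketch using plain Python tuples rather than any RDF library. The identifiers (ex:China, ex:population and so on) are made up for the example, and the date attached to the figure is an assumption, since I don't know when the estimate was made. A single fact fits in one triple, but a population at a particular date seems to need an intermediate node to hang the value and the date on:

```python
# A triple is (subject, predicate, object).
simple_fact = ("ex:China", "ex:population", 1_330_044_605)

# A dated figure needs an intermediate "observation" node.
# The "2008" date is an assumption for illustration only.
dated_fact = [
    ("ex:China", "ex:populationObservation", "ex:obs1"),
    ("ex:obs1", "ex:value", 1_330_044_605),
    ("ex:obs1", "ex:asOf", "2008"),
]


def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def population_as_of(triples, country, as_of):
    """Follow the observation nodes to find the value for one date."""
    for obs in objects(triples, country, "ex:populationObservation"):
        if as_of in objects(triples, obs, "ex:asOf"):
            values = objects(triples, obs, "ex:value")
            if values:
                return values[0]
    return None
```

In this arrangement each further date costs another three triples, so a spectrometer trace with thousands of points would multiply up accordingly, which may be part of why the generalisation is not obvious.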

Tuesday, 8 July 2008

Old and not so old peripherals needed

The DCC has started to get a number of help-desk queries from people who want to know how to read information from obsolete media, and therefore want to find someone with a working example of the appropriate peripheral. These days, this can be as simple as finding a working 3.5 inch diskette drive (I'm not sure there is one on this entire floor!). We have also been asked to report the whereabouts of 5.25 inch diskette drives, 80-column punched card readers, open reel magnetic tape drives and Jaz and other "high capacity" portable media drives. I have a bunch of Jaz media myself, although I think (!) that everything important is now on my hard drive AND backed up. It can be surprisingly hard to find the answer to such questions.

In 1999 the eLib Programme commissioned a study that resulted in a useful report: ROSS, S. & GOW, A. (1999) Digital Archaeology: Rescuing Neglected and Damaged Data Resources. Appendix 3 of that report lists then current data recovery companies, some of which are still in operation (perhaps it would be useful for JISC to commission someone to update this report?). These companies would certainly be one place to start, albeit an expensive one (as they are aimed at the corporate disaster recovery market).

Perhaps institutions should have a considered strategy for continued access to the more common examples. As staff retire and remember possible valuable content amongst their pile of obsolete media, it would be valuable to have a simple means at least to check them...

Does anyone know of any register of the less common such peripherals available within the public sector, preferably available for external use at a reasonable fee?

As a starter for 10, I believe there is an 80-column punched card reader at the UK Data Archive...

Monday, 7 July 2008

Middle Earth survey preservation

This isn't, as the title of the post might suggest, a posting about preserving geological or geographical data. The CADAIR repository at Aberystwyth University recently ingested an unusually large audience response survey to the Lord of the Rings film, containing just under 25,000 responses from speakers of 14 different languages. One of the repository workers, Stuart Lewis, has posted the details of this on his blog. In short, two versions of the database have been archived: an MS Access version, and an XML representation created by MS Access. These are accompanied by PDF and DOC versions of a user guide to the LotR survey data, and they are all accessible via the local CADAIR DSpace installation.

It's not clear how long the repository will preserve these items for and I couldn't find a preservation policy to provide any insight on this. We had a quick look at both versions and spotted a few issues that affect usability regardless of intended length of retention, such as:

Several questions contain numeric values as answers, but it's not clear what the values are supposed to represent. Eg, on the gender question, does 1 = male or female? You could probably work them out by cross referencing the data against the copy of the questionnaire included in the accompanying user guide, but this isn't a failsafe technique!

You still need the user guide to understand the content, even for descriptive plain text entries, as column names are only abstracts of the question. For example, it's not clear what the answers in the column 'middle earth' are actually addressing without recourse to the user guide. This is the same in both the XML and MS Access version and highlights a potential issue in relying on the built-in converter tool alone because it doesn't allow you to manipulate the XML and add extra value. For instance, why not include more detailed information about the questions in the XML file?

It would be useful to know from the dataset record entry in CADAIR that it contains text in multiple languages - the dc.language value is english and this is correct insofar as field names go, but not for all of the content.
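As a sketch of what adding extra value to the XML might look like, the fragment below builds a small codebook to sit alongside the data, recording the full question text and the meaning of each coded value. Everything in it is hypothetical: the column names, question wording and value labels are placeholders, not taken from the LotR survey, and the element names are invented for the example rather than drawn from any established schema.

```python
import xml.etree.ElementTree as ET

# Placeholder content only: the real wording and codes would come
# from the questionnaire in the user guide.
codebook = {
    "gender": {
        "question": "PLACEHOLDER: full wording of the gender question",
        "values": {
            "1": "PLACEHOLDER: meaning of code 1",
            "2": "PLACEHOLDER: meaning of code 2",
        },
    },
    "middle_earth": {
        "question": "PLACEHOLDER: full wording of this question",
        "values": {},  # free-text answers, so no coded values
    },
}


def codebook_to_xml(codebook):
    """Build a <codebook> element with one <variable> per column."""
    root = ET.Element("codebook")
    for column, info in codebook.items():
        variable = ET.SubElement(root, "variable", name=column)
        ET.SubElement(variable, "question").text = info["question"]
        for code, label in info["values"].items():
            ET.SubElement(variable, "value", code=code).text = label
    return root


print(ET.tostring(codebook_to_xml(codebook), encoding="unicode"))
```

With something like this stored next to the data, the dataset would be understandable without the separate user guide being to hand.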

There are also possible issues with the independence of the XML created by MS Office, but I don't have much first hand experience of it so wouldn't like to comment too much. If anyone reading this does have such experience, then please share it!

Wednesday, 2 July 2008

Research Repository System persistent storage

This is the seventh and last of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System. I've suggested it should contain these elements:

At a very basic level, the RRS should provide a Persistent Storage service. Completely agnostic as to objects, Persistent Storage would provide a personal, or group-oriented (ie within the institution) or project-oriented (ie beyond the institution) storage service that is properly backed up. There’s no claim that Persistent Storage would last for ever, but it must last beyond the next power spike, virus infection or laptop loss! It has to be easy to use, as simple as mounting a virtual drive (but has to work equally easily for researchers using all 3 common OS environments). Conversely (and this isn’t easy), there must be reliable ways of taking parts of it with you when away from base, so synchronisation with laptops or remote computers is essential. It should support anything: data, documents, ancillary objects, databases, whatever you need. It’s possible that “cloud computing” eg Amazon S3, the Carmen Cloud or other GRID services might be appropriate.

[I'll include the last two shorter parts here.

The RRS might include a full-blown OAIS digital preservation archive. Not many institutions run these at this time, so although I think it should, I hesitate to suggest it must!

Some spinoffs you should get from your RRS would include persistent elements for your personal, department, group or project web pages (even the pages themselves). It should provide support for your CV, eg elements of your bibliography, project history, etc. It will provide you and your group with persistent end-points to link to. And your institution will benefit, first from the fact that it is supporting its researchers in curating their data and supporting verifiability of publication, and also benefiting from the research disclosure aspects.]

I guess that just about wraps it up. So who's going to build one?

Research Repository System data management

This is the sixth of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System. I've suggested it should contain these elements:

Data management support is where this starts to link more strongly back to digital curation. Bear in mind here, this is a Research Repository System; not all of these functions, or the next group, need to be supported by anything that looks like one of the current repository implementations! I’m not quite clear on all of the features you might need here, but we are beginning to talk about a Data Repository.

It is essential that the Data Management elements support current, dynamic data, not just static data. You may need to capture data from instruments, process it through workflow pipelines, or simply sit and edit objects, eg correcting database entries. Data Management also needs to support the opposite: persistent data that you want to keep un-changed (or perhaps append other data to while keeping the first elements un-changed).

One important element could be the ability to check-point dynamic, changing or appending objects at various points in time (eg corresponding to an article). In support of an article, you might have a particular subset available as supplementary data, and other smaller subsets to link to graphs and tables. These checkpoints might be permanent (maybe not always), and would require careful disclosure control (for example, unknown reviewers might need access to check your results, prior to publication).
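For the simplest case, an append-only dataset, check-pointing could be as little as remembering how many records existed at the time, plus a digest to show that they have not changed since. The sketch below is one possible shape for this, assuming records are JSON-serialisable; the class and method names are invented for illustration, and it does nothing about disclosure control or about objects that are edited in place.

```python
import hashlib
import json


class CheckpointedDataset:
    """Append-only records with named, verifiable checkpoints."""

    def __init__(self):
        self._records = []
        self._checkpoints = {}  # label -> (record count, digest)

    def append(self, record):
        self._records.append(record)

    def checkpoint(self, label):
        """Remember the current extent of the data under a label."""
        count = len(self._records)
        self._checkpoints[label] = (count, self._digest(count))

    def snapshot(self, label):
        """Return the records as they stood at the checkpoint."""
        count, digest = self._checkpoints[label]
        if self._digest(count) != digest:
            raise ValueError(f"records changed since checkpoint {label!r}")
        return list(self._records[:count])

    def _digest(self, count):
        payload = json.dumps(self._records[:count], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A checkpoint labelled for an article would then keep returning the same subset however much data is appended afterwards.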

Some parts of Data Management might support laboratory notebook capabilities, keeping records with time-stamps on what you are doing, and automatically providing contextual metadata for some of the captured datasets. Some of these elements might also provide some Health and Safety support (who was doing what, where, when, with whom and for how long).

Research Repository System object disclosure control

This is the fifth of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System. I've suggested it should contain these elements:

Object disclosure control is crucial to this system working. Many digital objects in the system would be inaccessible to the general public (unless you are working in an Open Science or Open Notebook way). You need to be able to keep some objects private to you, some objects private to your project or group (not restricted to your institution, however), and some objects public. There should probably be some kind of embargo support for the latter, perhaps time-based, and/or requiring confirmation from you before release. And since some digital objects here are very likely to be databases, there are some granularity issues, where varying disclosure rules might apply to different subsets of the database. Perhaps this is getting a bit tough!
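
A rough sketch of how the private / group / public levels and a time-based embargo might fit together. All the names here (`Audience`, `DisclosureRule`, `can_view`) are hypothetical, and I have assumed that an object marked public but still embargoed or unconfirmed stays visible to the group only. It does not attempt the database granularity problem.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Audience(Enum):
    OWNER = 1    # private to you
    GROUP = 2    # private to your project or group, across institutions
    PUBLIC = 3   # anyone


@dataclass
class DisclosureRule:
    audience: Audience
    embargo_until: Optional[date] = None  # time-based embargo, if any
    owner_confirmed: bool = True          # set False to require your sign-off


def effective_audience(rule: DisclosureRule, today: date) -> Audience:
    """Assumption: an embargoed or unconfirmed public object falls back to GROUP."""
    if rule.audience is Audience.PUBLIC:
        embargoed = rule.embargo_until is not None and today < rule.embargo_until
        if embargoed or not rule.owner_confirmed:
            return Audience.GROUP
    return rule.audience


def can_view(rule: DisclosureRule, viewer: Audience, today: date) -> bool:
    # A lower value means closer to the owner, so the viewer sees more.
    return viewer.value <= effective_audience(rule, today).value


rule = DisclosureRule(Audience.PUBLIC, embargo_until=date(2008, 9, 1))
print(can_view(rule, Audience.PUBLIC, date(2008, 7, 29)))  # during the embargo
print(can_view(rule, Audience.PUBLIC, date(2008, 9, 1)))   # once it has lapsed
```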

Research Repository System authoring support

This is the fourth of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System. I've suggested it should contain these elements:

Authoring support should include version control, collaboration, possibly publisher liaison, and be integrated with the repository deposit process. It does need object disclosure control, see below. Version control would support ideas, working drafts, pre-prints, working papers, submitted drafts undergoing editorial changes, and refereed and published versions. Collaboration support would need to include support for multiple authors contributing document parts, and assembly of these into larger parts and eventually “complete” drafts. It should also include some kind of multiple author checkout system for updates, something like CVS or SVN, maybe a bit WIKI-like. It must support a wide choice of document editor, eg Word, OpenOffice.org, LaTeX etc (I don’t know how to combine this with the previous requirement!).

Publisher liaison is maybe controversial. But why shouldn’t the RRS staff (or your library) support you in dealing with publishers? The RRS wants your articles and your data, and should help you negotiate and reserve the rights so that they can get them. So publisher liaison would include rights negotiation, submission to the publisher on your behalf of a specific version, support through the editorial revision process, and recovery of metadata from the published version for the RRS records and your own bibliography, web page and CV. Naturally, deposit in the repository would be integrated in this workflow; you only have to authorise opening to the public, or perhaps a more restricted audience.

Research Repository System identity management

This is the third of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System (RRS). I've suggested it should contain these elements:

I didn’t want to talk about identity management so early in the piece, but it turned up in so many parts, I had to take it out and talk about it first. In a good RRS, this should Just Work, but at the moment it probably wouldn’t do what I want! I may see the RRS as a special case of an Institutional Repository (IR), but many if not most research collaborations are cross-institutional. This means that if there is to be support for cross-institutional authoring, there has to be support for members of other institutions to log in to your RRS. And this has to be seamless and easy, ie done without having to acquire new identities.

In addition, Researcher Identity should provide name control, that is, it knows who you are and will fill in a standardised version of your name in appropriate places. It should know your affiliation (institution, department/school, group, project and/or possibly work package). It might know some default tags for your work (eg Chris is normally talking about "digital curation"). However, this naming support must extend beyond your institution, so that collaborators and co-authors can be first-class users of other features. And it should relate to your (and their) standard institutional username and credentials; nothing extra to remember. This implies (I think) something like Shibboleth support.

This is getting kind of complicated, and verging towards another complex realm of Current Research[er] Information Systems (CRIS). These worthy systems also aim to make your life easier by knowing all about you, and linking your identity and work together. But they are complex, have their own major projects and standards, and have been going for years without much impact that I can see, except in a few cases. The RRS should take account of EuroCRIS and CERIF (see Wikipedia page) as far as they might apply.

Research Repository System web orientation

This is one of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System. I've suggested it should contain these elements:

By web orientation, I mean that the RRS should be Web 2.0-like, being user-centred in that it knows who you are, and exploits that knowledge, and is very easy to use. It should be oriented towards sharing, but (as you’ll see) in a very controlled way; that is, items should be sharable with varying groups of people from close colleagues, through unknown reviewers, to the general public. I think simple tagging rather than complex metadata should be supported. Maybe some kind of syndication/publishing support like RSS or Atom. The RRS should have elements with first-class Semantic Web capabilities, supporting RDF. And because research and education environments are so very varied, and so highly tailored to individual circumstances (even if this is only based on the personal preferences of the PhD student who left a few years ago!), the RRS should be highly configurable, based on substitutable components, and able to integrate with any workflow.

Although the idea is linked in my mind to institutional or laboratory repositories, there are aspects that seem to me to require a service at above the institutional level (where researchers from different institutions work together, for example). It could be that some genuine Web 2.0 entrepreneurship might provide this!

Negative Click, Positive Value Research Repository Systems

I promised to be more specific about what I would like to see in repositories that presented more value for less work overall, by offering facilities that allow it to become part of the researcher’s workflow. I’m going to refer to this as “the Research Repository System (RRS)” for convenience.

At the top of this post is a mind map illustrating the RRS. A more complete mind map (in PDF form) is accessible here.

The main elements that I think the RRS should support are (not in any particular order):

Here’s a quick scenario to illustrate some of this. Sam works in a highly cross-disciplinary laboratory, supported by a Research Repository System. Some data comes from instruments in the lab, some from surveys that can be answered in both paper and web form, some from reading current and older publications. All project files are kept in the Persistent Storage system, after the disaster last year when both the PIs lost their laptops from a car overseas, and much precious but un-backed-up data were lost. The data are managed through the RRS Data Management element, and Sam has requested a checkpoint of data in the system because the group is near finalising an article, and they want to make sure that the data that support the article remain available, and are not over-written by later data.

Sam is the principal author, and has contributed a significant chunk of the article, along with a colleague from their partner group in Australia; colleagues from this partner group have the same access as members of Sam’s group. Everyone on the joint author list has access to the article and contributes small sections or changes; the author management and version control system does a pretty good job of ensuring that changes don’t conflict. The article is just about to be submitted to the publisher, after the RRS staff have negotiated the rights appropriately, and Sam is checking out a version to do final edits on the plane to a conference in Chile.

None of the data are public yet, but they are expecting the publisher to request anonymous access to the data for the reviewers they assign. Disclosure control will make selected check-pointed data public once the article is published. Some of the data are primed to flow through to their designated Subject Repository at the same time.

One last synchronisation of her laptop with the Persistent Storage system, and Sam is off to get her taxi downstairs…

This blog post is really too big if I include everything, so I [have released] separate blog posts for each Research Repository System element, linking them all back to this post... and then come back here and link each element above to the corresponding detailed bits.

OK I’m sure there’s more, although I’m not sure a Research Repository System of this kind can be built for general use. Want one? Nothing up my sleeve!

Responses to RAW versus TIFF: compression, error and cost-related

This is the second post summarising responses to the “RAW versus TIFF” post made originally by Dave Thompson of the Wellcome Library. The key elements of Dave’s post were whether we should be archiving using RAW or TIFF (image-related responses to this question are summarized in a separate post). A subsidiary question on whether we should archive both is greatly affected by cost, which is dependent on issues of compression, errors etc. Responses on these topics are covered in this post. As I mentioned before, these responses came on the semi-closed DCC-Associates email list.

Part way through the email list discussion, Marc Fresko of Serco reminded us that there are two questions:

“One is the question addressed by respondents so far, namely what to do in principle. The answer is clear: best practice is to keep both original and preservation formats.

The second question is "what to do in a specific instance". It seems to me that this has not yet been addressed fully. This second question arises only if there is a problem (cost) associated with the best practice approach. If this is the case, then you have to look at the specifics of the situation, not just the general principles.”

Marc’s comment brings us back to costs, and the responses to this are summarised here.

Sean Martin of the British Library picked up Marc’s point on costs:

“The cost of storage has and will get cheaper (typically 30% a year) but storage is not free. While most other examples here have cited raw vs TIFF, I will use an example based on TIFF vs a compressed format, such as JP2K.

(1) While there will be considerable variations, lossless compression with JP2K requires ~1/3 the storage of a TIFF. (I have a specific example of a 135 Mbyte TIFF and a 42 Mbyte lossless JP2K.)

(2) The cost of ownership of bulk commodity storage is currently of the order of £1000 per Tera Byte (However, let's treat this as a guide rather than a precise figure.)

(3) If you are dealing with only a few Tbytes then it will probably not matter much.

(4) However, if you envisage, say 100 Tbytes compressed or 300 Tbytes raw then it probably does matter, with indicative costs of £100K and £300K respectively.

(5) One way to approach this is to assess (in some sense) what value will be created using the cheaper method and what ADDITIONAL VALUE will be created using the more expensive method. This is a form of options appraisal.

(6) In my example, probably only a small amount of additional value is created for the additional expense approaching £200K. This leads to the question "if we had £200K on what would we spend it?" and probably the answer is "not in this way".

(7) However, this line of reasoning can only be applied in the context of a specific situation. Hence, for example, some might rate that considerable additional value is created and is affordable by storing RAW and TIFF, while others might choose to store only JP2K.

(8) This means that different people facing different challenges are likely to come to different conclusions. There is no one answer that applies to everyone.”
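
Sean's arithmetic in points (1) to (6) is simple enough to lay out as a few lines of Python. The figures are his indicative ones; the variable names are mine.

```python
COST_PER_TB_GBP = 1000       # Sean's indicative cost of ownership per TB
TIFF_MB, JP2K_MB = 135, 42   # his example file sizes

size_ratio = JP2K_MB / TIFF_MB   # roughly one third

compressed_tb = 100
uncompressed_tb = 300

compressed_cost = compressed_tb * COST_PER_TB_GBP
uncompressed_cost = uncompressed_tb * COST_PER_TB_GBP
additional_cost = uncompressed_cost - compressed_cost

print(f"JP2K is {size_ratio:.0%} of the TIFF size")
print(f"Compressed:   £{compressed_cost:,}")
print(f"Uncompressed: £{uncompressed_cost:,}")
print(f"Additional:   £{additional_cost:,}")
```

The "additional" figure is the one his options appraisal in point (5) asks you to weigh against the additional value.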

Chris Puttick, CIO of Oxford Archaeology thought that Sean’s cost figures were high:

“But the more nearly precise figure is what affects the rest of the calculation. For protected online storage (NAS/iSCSI, RAID 5 SATA drives with multiple hotspares) you can source 30TB useable space for £10k i.e. nearer to £50/TB [compared to Sean’s £1000 per TB; excludes power and cooling costs].

So using the boxes described above in a mesh array we can do 300TB for ~£120k. Anyone who needs 300TB of storage should find £120k of capital expenditure neither here nor there. After all, if we say the average HQ RAW is around 60MB, go with the 1/3 ratio for the additionally stored derivative image, we need 80MB per image i.e. 300TB=3.75m images... So a cost per stored image of 1p/year for the worst-case lifespan of the kit.”
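
Chris's per-image figure can be reconstructed roughly as follows. I am assuming decimal units (1 TB = 1,000,000 MB) and, since he does not state it, a three-year worst-case lifespan for the kit; that lifespan is my guess at what turns the capital cost into about a penny a year.

```python
RAW_MB = 60                             # Chris's average high-quality RAW
DERIVATIVE_MB = RAW_MB / 3              # the 1/3 ratio for the derivative image
PER_IMAGE_MB = RAW_MB + DERIVATIVE_MB   # 80 MB stored per image

STORAGE_TB = 300
CAPITAL_GBP = 120_000
KIT_LIFESPAN_YEARS = 3                  # my assumption, not stated by Chris

images = STORAGE_TB * 1_000_000 / PER_IMAGE_MB
pence_per_image = CAPITAL_GBP * 100 / images
pence_per_image_year = pence_per_image / KIT_LIFESPAN_YEARS

print(f"{images:,.0f} images")
print(f"{pence_per_image:.1f}p per image in capital cost")
print(f"{pence_per_image_year:.2f}p per image per year")
```

Like his quoted prices, this leaves out power and cooling.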

To Sean’s cost versus value question, Chris responds:

“And in the wider world we have to fight to get digital preservation accepted as an issue, let alone tell people we have to pay for it. Put the question in the way that leads to the favourable response i.e. are you willing to spend a penny per image?”

Responding to the question about whether storing RAW as well as a compressed form is worthwhile, Chris points out:

“[…] many, asked the question in the wrong way, would opt for the cheapest without actually understanding the issues: JP2K is lossless, but was information lost in the translation from the RAW?”

Richard Wright of BBC Future Media & Technology had an interesting point on the potential fragility risk introduced by compression:

“Regarding compression -- there was a very interesting paper yesterday at the Imaging Science & Technology conference in Bern, by Volker Heydegger of Cologne University. He looked at the consequences of bit-level errors. In a perfect world these don't occur, or are recovered by error-correction technology. But then there's the real world.

His finding was that bit and byte losses stay localised in uncompressed files, but have consequences far beyond their absolute size when dealing with compressed files. A one-byte error has a tiny effect on an uncompressed TIFF (as measured by 'bytes affected', because only one byte is affected). But in a lossless JP2 file, 17% of the data in the file is affected! This is because the data in the compressed file are used in calculations that 'restore' the uncompressed data -- and one bad number in a calculation is much worse than one bad number in a sequence of pixel data.

It goes on and on. A 0.1% error rate affects about 0.5% of a TIFF, and 75% of a lossless JP2.

The BBC is very interested in how to cope with non-zero error rates, because we're already making about 2 petabytes of digital video per year in an archive preservation project, and the BBC as a whole will produce significantly larger amounts of born-digital video. We're going for uncompressed storage -- for simplicity and robustness, and because "error-compensation" schemes are possible for uncompressed data but appear to be impossible for compressed data. By error-compensation I mean simple things like identifying where (in an image) an error has occurred, and then using standard technology like "repeating the previous line" to compensate. Analogue technology (video recorders) relied on error-compensation, because analogue video-tape was a messy world. We now need robust digital technology, and compression appears to be the opposite: brittle.”

I do think this is an extremely interesting point, and one that perhaps deserved a blog post all of its own, to emphasise it. In later private correspondence, Richard mentioned there was some controversy about this paper, with one person apparently suggesting that the factor-three compression might allow two other copies to be made, thus reducing the fragility. Hmmmm...
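
The effect Richard describes can be tried at toy scale. This sketch is not Heydegger's method and does not use TIFF or JPEG 2000: I have swapped in Python's standard `zlib` (DEFLATE) as a stand-in lossless compressor, and a synthetic gradient as the "image". It flips one byte in the uncompressed data and one byte in the compressed stream, then counts how many bytes of the picture end up wrong or missing.

```python
import zlib


def bytes_affected(original: bytes, damaged: bytes) -> int:
    """Count bytes that differ, plus any that are missing or extra."""
    differing = sum(a != b for a, b in zip(original, damaged))
    return differing + abs(len(original) - len(damaged))


def decompress_what_we_can(stream: bytes) -> bytes:
    """Feed the stream a byte at a time, keeping output up to the first error."""
    decoder = zlib.decompressobj()
    out = bytearray()
    try:
        for i in range(len(stream)):
            out += decoder.decompress(stream[i:i + 1])
        out += decoder.flush()
    except zlib.error:
        pass  # keep whatever was recovered before the stream broke
    return bytes(out)


# A synthetic 256 x 64 "image": a smooth, highly compressible gradient.
pixels = bytes((x + y) % 256 for y in range(64) for x in range(256))

# One damaged byte in the uncompressed data.
damaged_raw = bytearray(pixels)
damaged_raw[len(damaged_raw) // 2] ^= 0xFF
raw_damage = bytes_affected(pixels, bytes(damaged_raw))

# One damaged byte in the compressed stream.
compressed = zlib.compress(pixels, 9)
damaged_stream = bytearray(compressed)
damaged_stream[len(damaged_stream) // 2] ^= 0xFF
recovered = decompress_what_we_can(bytes(damaged_stream))
compressed_damage = bytes_affected(pixels, recovered)

print(f"uncompressed: {raw_damage} of {len(pixels)} bytes affected")
print(f"compressed:   {compressed_damage} of {len(pixels)} bytes affected")
```

The exact numbers depend on the data and on which byte is hit, so treat this as an illustration of the mechanism rather than a measurement.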

David Rosenthal of Stanford and LOCKSS picked up on this, noting that the real world really does see significant levels of disk errors. He also picked up on Sean’s comments, but unlike Chris, thinks that Sean’s costs are too low (and that Chris was way too optimistic):

“The cost for a single copy is not realistic for preservation. The San Diego Supercomputer Center reported last year that the cost of keeping one disk copy online and three tape copies in a robot was about $3K/Tb/yr.

Amazon's S3 service in Europe costs $2160/TB/yr but it is not clear how reliable the service is. Last time I checked the terms & conditions they had no liability whatsoever for loss and damage to your data. Note also that moving data in or out of S3 costs too - $100/TB in and sliding scale per month for transfers out. Dynamic economic effects too complex to discuss here mean that it is very unlikely that anyone can significantly undercut Amazon's pricing unless they're operating at Amazon/Google/... scale.

On these figures [costs] would be something like $300K per year compressed or $900K per year [un-compressed].

And discussions of preservation with a one-year cost horizon are unrealistic too. At 30% per year decrease, deciding to store these images for 10 years is a commitment to spend $10M compressed or $30M uncompressed over those ten years. Now we're talking serious money.

Sean is right that there is no one answer, but there is one question everyone needs to ask in discussions like this - how much can you afford to pay?”

Reagan Moore amplified:

“Actually the cost at SDSC for storing data is now:

~$420/Tbyte/year for a tape copy
~$1050/Tbyte/year for a disk copy. This cost is expected to come down with the availability of Sun Thumper technology to around $800/Tbyte/year.

As equipment becomes cheaper, the amortized cost for a service decreases. The above costs include labor, hardware and software maintenance, media replacement every 5 years, capital equipment replacement every 5 years, software licenses, electricity, and overhead.”

David responded:

“[…] To oversimplify, what this means is that if you can afford to keep stuff for a decade, you can afford to keep it forever. But this all depends on disk and tape technology continuing to get better at about the rate it has been for at least the next decade. The omens look good enough to use as planning assumptions, but anything could happen. And we need to keep in mind that in digital preservation terms a decade is a blink of an eye.”
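
David's "a decade or forever" point is a geometric series. As a sketch, assuming the 30% annual decrease in storage cost holds every single year and taking the first year's bill as 1 unit:

```python
ANNUAL_DECREASE = 0.30
r = 1 - ANNUAL_DECREASE   # each year costs 70% of the year before


def cost_over(years: int, first_year_cost: float = 1.0) -> float:
    """Total spend over a number of years, in units of the first year's bill."""
    return first_year_cost * sum(r ** n for n in range(years))


decade = cost_over(10)
forever = 1.0 / (1 - r)   # limit of the series as years go to infinity

print(f"Ten years: {decade:.2f} x the first year's cost")
print(f"Forever:   {forever:.2f} x the first year's cost")
print(f"The first decade is {decade / forever:.0%} of the total")
```

On that assumption almost all of the money is spent in the first ten years, which is the sense in which affording a decade means affording forever.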

David Barret-Hague of Plasmon believes that the cost of long term storage “is also a function of the frequency of data migrations - these involve manual interactions which are never going to get cheaper even if the underlying hardware does.” So he was not sure how affording it for 10 years means you can store it forever. (His company also has a white paper on cost of ownership of archival storage, available at http://www.plasmon.com/resources/whitepapers.html.)

Reagan responded agreeing that a long term archive cannot afford to support manual interactions with each record. He added:

“Our current technology enables automation of preservation tasks, including execution of administrative functions and validation of assessment criteria. We automate storage on tape through use of tape silos.

With rule-based data management systems, data migrations can also be automated (copy data from old media, write to new media, validate checksums, register metadata). Current tape capacities are 1 Terabyte, with 6,000 cartridges per silo. With six silos, we can manage 36 Petabytes of data.

We do have to manually load the new tape cartridges into the silo. Fortunately, tape cartridge capacities continue to increase, implying the number of files written per tape increases over time. As long as storage capacities increase by a factor of two each tape generation, and the effective cost of a tape cartridge remains the same, the media cost remains finite (factor of two times the original media cost).

The labor cost, the replacement of the capital equipment, the electricity, software licenses, floor space, etc. are ongoing costs.”
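
Reagan's two numbers, the 36 Petabytes and the "factor of two" media cost, can be checked in a few lines. The sketch assumes decimal units and his stated conditions (capacity doubles each tape generation, cartridge price stays the same); the function name is mine.

```python
TAPE_TB = 1
CARTRIDGES_PER_SILO = 6000
SILOS = 6

total_pb = TAPE_TB * CARTRIDGES_PER_SILO * SILOS / 1000


def media_cost(generations: int, first_cost: float = 1.0) -> float:
    """Cumulative media cost of re-buying tape for the same data each generation.

    If capacity doubles at the same cartridge price, each migration needs
    half as many cartridges as the one before.
    """
    return first_cost * sum(0.5 ** g for g in range(generations))


print(f"{total_pb:.0f} PB across {SILOS} silos")
for generations in (1, 2, 5, 20):
    print(f"{generations:2d} generations: {media_cost(generations):.4f} x original media cost")
```

The total creeps towards twice the original media cost but never reaches it. This covers media only; the ongoing costs he lists in his last paragraph are separate.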

Well this discussion could (and maybe will) go on. But with my head reeling at the thought of one location with 36 Petabytes, I think I’m going to call this a day!

Tuesday, 1 July 2008

Responses to RAW versus TIFF: image-related

This post summarises image-related responses to the “RAW versus TIFF” post made originally by Dave Thompson of the Wellcome Library. The key elements of Dave’s post were whether we should be archiving images using RAW (which contained more camera and exposure-related data) or TIFF (which was more standardized, likely to be accessible for longer, and had more available utilities). A subsidiary question on whether we should archive both is greatly affected by cost; responses related to this element are summarized in a separate post. As I mentioned before, these responses came on the semi-closed DCC-Associates email list. [I forgot to mention that Maureen Pennock re-posted Dave's question to this blog with some remarks of her own, and we also received a useful response which you can view there.]

One comment [on the DCC-Associates list] made by many was that JPEG2000 should be considered. Alan Morris of Morris and Ward asked if it was an ISO standard, to which Shuichi Iwata (former President of CODATA and Professor at the University of Tokyo) responded:

“Yes, it is. Organized by Joint Photographic Experts Group of ISO. It is used widely for compression of images at different levels.

http://www.jpeg.org/”

James Reid of EDINA, University of Edinburgh, mentioned that “the DPC report on JPEG 2000 has relevance”. He also expressed concern:

“RAW images tend to have proprietary and sensor specific aspects that make them less than ideal for a generic preservation format - they are of course lossless but there would be an implicit overhead in deriving sensor specific options to make preservation possible (and these may not be publicly available due to proprietary interests)...”

Larry Murray of the Public Record Office of Northern Ireland mentioned how essential RAW is to him as a keen amateur photographer, allowing him to manipulate images. He added:

“James is quite correct in his comment but personally I think the decision to save the RAW images is one of available resources. The RAW images give the data owner the opportunity to control so many of the characteristics of the original image without data loss or integrity and the ability to bulk process said changes means serious consideration must be given to their value.

For other reference on Open RAW formats see http://www.openraw.org/”

Chris Puttick, CIO of Oxford Archaeology asked:

“RAW is the original - would you store a scan of a document as the primary method of preserving the document? I think RAW + [a common image format] is the best option. It does of course require greater storage space, but storage is cheap and getting cheaper every day.

Software such as DigiKam has libraries that read pretty much all the RAW formats including rarer ones such as that produced by the Foveon sensor (Sigma, Polaroid). Any additional documentation needed for RAW should be easily prised out of the sensor manufacturers if the right pressure is brought to bear...”

Stuart Gordon of Compact Services (Solutions) Ltd agrees:

“In order to ensure that the images are viewable for posterity we must have the primary archive images in a universal lossless format, currently TIFF however as standards evolve this will inevitably change.

The camera RAW formats are proprietary and are therefore not suitable as the primary archival format; however, if space is available then do archive the RAW as well.

As with all questions it depends on what your goals are and in our case it is ensuring that the archival images are stored in a format that will ensure they are accessible for posterity.”

Paul Wheatley, Digital Preservation Manager at the British Library commented:

“As well as the extremes of the camera/manufacturer specific formats and then at the other end of the spectrum, formats like JPEG2000 and TIFF, we also have a number of emerging RAW file “standards”, of which Open RAW is one. I’ve not seen any independent technology watch work to assess all these options in a reasoned way and make recommendations. Perhaps this is a candidate for a DPC Tech Watch piece? If so, I would like to see authorship from a panel of experts to ensure a balanced view. Both an independent viewpoint and careful analysis of the objectives behind the use of different file formats were a little lacking in the last DPC report. As Larry is hinting at here, I’m sure there are different answers depending on exactly what you want to achieve from the storage and preservation. These aims need to be carefully captured as well as the recommendations to which they relate.”

[UPDATE: Rob Berentsen, of Deloitte Consulting B.V.] brings us back to preservation concerns:

“The discussion whether to use Raw, Tiff or both is one that depends on what the (future) user demands will be and the current preservation demands are. As you can see, Larry prefers Raw so he can use the image in an optimal manner as photographer/manipulator. James however might have a different view on the importance of certain aspects of images and might have different future demands in mind.

Looking at the OAIS model, the most raw image should leave all roads open to create a TIFF image (or others). This TIFF might be more accessible, and even the non-lossless format JPG might be good enough for some people to have [the] image easily accessible over the internet.

To cope with this, the OAIS's archival information package (AIP) offers room for more than just one 'manifestation'. You can see one (the Raw version) as the preservation format, and others are possible future preservation formats or accessible formats. The accessible formats are commonly the ones end-users can get (DIP-package).”

Marc Fresko of Serco suggested there are two questions here:

“One is the question addressed by respondents so far, namely what to do in principle. The answer is clear: best practice is to keep both original and preservation formats.

The second question is "what to do in a specific instance". It seems to me that this has not yet been addressed fully. This second question arises only if there is a problem (cost) associated with the best practice approach. If this is the case, then you have to look at the specifics of the situation, not just the general principles. Most importantly:
what will you lose by going from RAW to TIFF (probably nothing that matters for many archival applications, but you will lose something which would matter in other applications)?
over how long do you need to preserve access to the images, compared to the likely lifetime of the software you'll need to access the specific RAW format?
You can easily see that different answers to these two questions will lead you to prefer different formats.”

Marc’s comment brings us back to costs, and there is a further series of emails on this topic which I will include in another blog post.

Colin Neilson of the DCC SCARP Project took the discussion towards science data:

“If you generalise and treat your camera as a "Digital Image Acquisition System" :-) then do you keep the original data from the instrument (camera) or would you prefer only to keep the derived data in some form of standardised output?

I suspect in a scientific context you would want to keep both. The problem is that the steps taken in deriving the data are not standardised or documented and may not be repeatable at a future time e.g. when the particular make of instrument (camera) becomes obsolete. The software & algorithms for processing captured image data to standard format are embedded in the camera and function to adjust the data produced from the proprietary image processor (e.g. Canon's DIGIC chip). Some of the "improvements" produced in image quality may not be welcome in a scientific context (e.g. may have introduced artefacts). Some makes and models of camera offer the option to save both the "raw" image and the derived image at the cost of memory and processing time.

You can see some of the issues developed when looking at open microscopy environments where the issue is image capture with effective data management system for multidimensional images including spatial, temporal, spectral and lifetime dimensions. Slide 13 of this presentation by Kevin Eliceiri sets out the current situation. This is quite an exciting area!

Of course this may all be a bit academic (not to say "over the top") if your main concern is to take the best steps to protect your holiday pictures or personal photography efforts. I have found this article useful in an amateur photography context.”

Finally (for this post), Riccardo Ferrante, who is IT Archivist & Electronic Records Program Director for the Smithsonian Institution Archives agreed with Colin:

“We have scientists here that consider the raw data absolutely critical and whose integrity and authenticity must be preserved (archival package). However, the usefulness of the data comes only when the raw data is filtered. The filtered data (derivation) is also considered an archival package by these scientists since it is reused repeatedly in the future, cited in publications, etc. and therefore needs its authenticity, integrity, and provenance to be preserved. To use OAIS-Reference Model language - the DIP generated from the original AIP (raw data) is itself a derived AIP.

In contrast, we have artists who deal solely in photographs and do not consider the raw image to be an object of record, but more like an ingredient for the object that is finalized after post-processing occurs, i.e. after it is published.

To me, it seems the common point is what the creator considers the object of record and the purpose of the raw data in light of that - this should be a major factor in determining the object of record from a recordkeeping viewpoint.”

I thought this was an extremely useful conversation!
