Wednesday 21 January 2009

DCC Evaluation survey

We, the DCC, would like your help in evaluating our performance, so that we can refine our portfolio of products and services and better meet your needs. If you'd like to help then please take a moment to fill in our public survey at: <http://www.dcc.ac.uk/adding/public_survey/>

As a small token of our appreciation, we are offering one lucky entrant the chance to win an iPod nano (competition rules apply).

Tuesday 20 January 2009

Load testing repositories

One of the issues that has worried me about moving from repositories of e-prints to repositories of data is the increased challenges of scale. Scale could be vastly different for data repositories in several dimensions, including
  • rate of deposit
  • numbers of objects
  • size of objects
  • rate of access
  • rate of change to existing objects...
Now Stuart Lewis is reporting on the first stage of the JISC-funded ROAD project, where they have load-tested a DSpace implementation (on a fairly chunky configuration), loading 300,000 digital objects of 9 MB each.

Stuart reports
  • "As expected, the more items that were in the repository, the longer an average deposit took to complete.
  • On average deposits into an empty repository took about one and a half seconds
  • On average deposits into a repository with three hundred thousand items took about seven seconds
  • If this linear-looking relationship between number of deposits and speed of deposit were to continue at the same rate, an average deposit into a repository containing one million items would take about 19 to 20 seconds.
  • Extrapolate this to work out throughput per day, and that is about 10MB deposited every 20 seconds, 30MB per minute, or 43GB of data per day.
  • The ROAD project proposal suggested we wanted to deposit about 2GB of data per day, which is therefore easily possible.
  • If we extrapolate this further, then DSpace could theoretically hold 4 to 5 million items, and still accept 2GB of data per day deposited via SWORD."
They plan to repeat these tests on the EPrints and Fedora platforms.
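
Out of curiosity, here's a rough back-of-the-envelope sketch (Python, entirely mine) of the extrapolation Stuart describes, assuming the deposit time really does grow linearly with the number of items already held; the 1.5 s and 7 s timings and the 9 MB object size are from his report, everything else is my assumption.

```python
# Rough extrapolation from Stuart's figures, assuming deposit time grows
# linearly with the number of items already held (my assumption).
EMPTY_TIME_S = 1.5        # average deposit time into an empty repository (s)
TIME_AT_300K_S = 7.0      # average deposit time at 300,000 items (s)
ITEM_SIZE_MB = 9.0        # size of each test object (MB)

SLOPE = (TIME_AT_300K_S - EMPTY_TIME_S) / 300_000  # extra seconds per existing item

def deposit_time(items_held: int) -> float:
    """Estimated average deposit time (seconds) at a given repository size."""
    return EMPTY_TIME_S + SLOPE * items_held

def daily_throughput_gb(items_held: int) -> float:
    """Estimated GB depositable in 24 hours at that repository size."""
    deposits_per_day = 86_400 / deposit_time(items_held)
    return deposits_per_day * ITEM_SIZE_MB / 1024

for n in (0, 300_000, 1_000_000, 4_000_000):
    print(f"{n:>9,} items: {deposit_time(n):5.1f} s/deposit, "
          f"~{daily_throughput_gb(n):5.1f} GB/day")
```

Run as-is, this comes out in the same ballpark as Stuart's numbers: roughly 20 seconds per deposit and around 40 GB a day at a million items, and still comfortably more than 2 GB a day at four million.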

Like all such exercises it's an artificial test, but it does give encouragement that DSpace could scale to handle a data repository for some tasks. I don't know whether other issues would be show-stoppers for something like a lab repository, but most of the scale issues seem OK.

Friday 16 January 2009

Kilbride new Director for Digital Preservation Coalition

I'm very pleased that William Kilbride has been appointed Executive Director of the DPC. I've worked with William for several years, touching on preservation and also geospatial matters. Given the kind of small organisation the DPC is, William is an excellent person to engage with its community and to represent it and its views. The official press release follows:
The Digital Preservation Coalition (DPC) is pleased to announce that Dr William Kilbride has been appointed to the post of DPC Executive Director.

William has many years of experience in the digital preservation community. He is currently Research Manager for Glasgow Museums, where he has been involved in digital preservation and access aspects of Glasgow's museum collections, and in supporting the curation of digital images, sound recordings and digital art within the city's museums.

Previously he was Assistant Director of the Archaeology Data Service where he was involved in many digital preservation activities. He has contributed to workshops, guides and advice papers relating to digital preservation.

In the past William has worked with the DPC on the steering committee for the UK Needs Assessment, was a tutor on the Digital Preservation Training Programme and was a judge for the 2007 Digital Preservation Award.

Although the DPC registered office will remain in York, William will be based at the University of Glasgow. He will take up the post on the 23rd February 2009.

Ronald Milne
Chairman
Digital Preservation Coalition

Thursday 8 January 2009

Digital Curation Google Group

An interesting Google Group on Digital Curation was set up a month or so ago, on 25 November 2008 to be exact. Its brief is:
"Intended to be a collaborative space for people involved in the work of digital curation and repository development to share ideas, practices, technology, software, standards, jokes, etc."
There's been mostly techie-level discussion on various topics, including the BagIt data packaging spec and whether it should include forward error correction (to cope with very long transit times: FedEx-style transfers, I think), and, on quite a different level, the concept of "movage", reported by Ed Summers based on conversations at a LoC barcamp with Ryan McKinley (whoever he is) and others. The essence is:
"The only way to archive digital information is to keep it moving."
Which ties in with some thoughts of mine: the best way to preserve information is to keep using it. I don't know how to link to individual posts, but if you go look you'll find it easily enough.
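
Going back to BagIt for a moment: for anyone who hasn't met it, a bag is just a payload directory plus a couple of small tag files carrying checksums, which is most of what makes it attractive for shipping data around. A minimal sketch in Python (the version string and the toy payload are mine, so check against the spec before relying on it):

```python
# Minimal sketch of writing a BagIt-style bag: a data/ payload directory,
# a bagit.txt declaration and an MD5 manifest. Version string and payload
# are illustrative only; consult the BagIt spec for the real details.
import hashlib
from pathlib import Path

def make_bag(bag_dir: Path, payload: dict) -> None:
    """Write {relative name: bytes} out as a simple bag under bag_dir."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in payload.items():
        (data_dir / name).write_bytes(content)
        manifest_lines.append(f"{hashlib.md5(content).hexdigest()}  data/{name}")

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.96\nTag-File-Character-Encoding: UTF-8\n")
    (bag_dir / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")

make_bag(Path("example-bag"), {"readme.txt": b"hello, curation\n"})
```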

One worth watching!

Christmas offers

As the haze of Christmas (or Holiday Season, if you must) goodwill wanes, and New Year Resolutions already begin to fade, I'd better review what I've been offered so far in my Christmas files quest. I'm not overwhelmed by the numbers, I should say, so you still might tempt me with something else. They include
  • a thesis and CV on an Amiga floppy disk
  • some old archaeology project DBs in (probably) dBase format but split over a number of floppies, c.1992
  • thesis data analysed with KaleidaGraph, on 3.5" double sided, double density floppy disks formatted for the Mac
  • optionally also some Word for Mac documents (the text and figures of the above thesis)
  • these disks may also contain the raw data (taken on a 286 using a program this respondent wrote himself, and transferred to the Mac). If it is on the disk, it will be in a format he defined himself, so he thought it might fall into the "less interesting" category I suggested. On the other hand, giving it a go might teach us something.
  • some educational resources produced "some years ago" on an Acorn... they still have the Acorn!
  • another thesis written in LaTeX using Textures, on very ancient Mac disks.
I'm also hoping to get something from a colleague, following up a related conversation a month or so before Christmas, but he's gone on extended leave so I won't know for a while.

Are these "interesting"? Yes, I think so, although I suspect the most challenging part is getting them off the media. Both Amiga drives and old Mac drives use low-level disk formats that are no longer standard, which means you can't easily read the disks on modern systems (I don't know about the Acorn disks yet). At first I thought we might just aim to find someone with working old hardware (and that still might be an option, but see this tale [link added later, got distracted, sorry!]). But it also turns out there's a controller called the CatWeasel that you can add to a PC and connect an ordinary current floppy drive to, and that is supposed to decode Amiga, Atari, Mac and other formats. It's cheap enough to be worth a go.

Oh, and thanks to Cliff Lynch for a pointer to the Digital Lives project and its upcoming conference, see http://www.bl.uk/digital-lives/conference.html; Digital Lives deals with this kind of stuff, and Jeremy Leighton John is just fascinating to listen to. He spoke at one of the DCC workshops, see his slides.

Tuesday 6 January 2009

Specifications again

The previous post was a summary with relatively little comment from me. I really liked David Rosenthal's related blog post, but I feel I do need to make some comments. I'm not sure this isn't yet another case of furiously agreeing!

Near the end of his post, after an extensive argument based partly on his own experience of implementing from specifications in a "clean-room" environment, and a set of postulated explanations of why a specification might be useful (focusing on its potential use to write renderers), David makes the statement that leaves me most uneasy:
"It seems clear that preserving the specification for a format is unlikely to have any practical impact on the preservation of documents in that format."
The suggested scenarios re missing renderers are:
"1. None was ever written because no-one in the Open Source community thought the format worth writing a renderer for...
2. None was ever written because the owner of the format never released adequate specifications, or used DRM techniques to prevent third-party renderers being written...
3. An open source renderer was written, but didn't work well enough because the released specifications weren't adequate, or because DRM techniques could not be sufficiently evaded or broken...
4. An open source renderer was written but didn't work well enough because the open source community lacked programmers good enough to do the job given the specifications and access to working renderers...
5. An open source renderer was written but in the interim was lost...
6. An adequate open source renderer was written, but in the interim stopped working..."
Read David's post for the detail of his arguments. However, I'd just like to suggest a few reasons why preserving specifications might be useful:

  • First, if the specification is available, it is (comparatively) extraordinarily cheap to keep. If it even makes a tiny difference to those implementing renderers (including open source renderers), it will have been worthwhile.
  • Second, David's argument glosses over the highly variable value of the information encoded in these formats. A digital object is (roughly) encrypted information; if no renderer exists but the encrypted information is extremely valuable for some particular purpose, the specification might serve as a key to enable some of that information to be extracted.
  • Third, David's argument assumes, I think, quite complex formats. Many science data formats are comparatively simple, but may currently be accessible only through proprietary software. Having the specification in those cases may well prove useful (OK, I don't have evidence for this as yet, I'll work on it!); see the sketch after this list.
  • Fourth, older formats tend to be simpler, and it would be good to have the specifications in some cases, even to help create open source renderers (is that a re-statement of the first? Maybe).
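
To make the third point a little more concrete: a lot of science data amounts to a short header followed by columns of numbers, and if even a one-page specification of the layout survives, a usable reader is a few minutes' work. A sketch in Python for an invented format (nothing here describes any real instrument's output):

```python
# Hypothetical reader for an invented, spec-described data format:
# "KEY: value" header lines, a blank line, then whitespace-separated
# numeric columns. Purely illustrative; no real format is implied.
def read_simple_format(path: str):
    header, rows = {}, []
    with open(path) as f:
        lines = iter(f)
        for line in lines:
            line = line.strip()
            if not line:
                break                     # blank line ends the header block
            key, _, value = line.partition(":")
            header[key.strip()] = value.strip()
        for line in lines:                # remaining lines are the data table
            if line.strip():
                rows.append([float(x) for x in line.split()])
    return header, rows
```

The point is only that with the spec in hand this is trivial; without it, someone has to reverse-engineer the byte layout first.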

So here's an example to illustrate the last point. I have commented elsewhere that the only files on the disk of the Mac I use to write this that are inaccessible to me are PowerPoint (version 4.0) files created in the 1990s on an earlier Mac.

I noted a comment from David:

"In my, admittedly limited, experience Open Office often does better than the current Microsoft Office at rendering really old documents."

Great, I thought; perhaps Open Office can render my old PowerPoints! And even better, there's now a native implementation of Open Office 3.0 for the Mac. So let's install it (and not talk about how hard it was to persuade it to give back control of my MS Office documents to the original software!). Does it open my errant files? No!

So I would like someone to instigate a legacy documents project in Open Office, and implement as many as possible of the important legacy office document file formats. I think that would be a major contribution to long term preservation. Would it be simplified by having specifications available? Surely, surely it must be! In fact David admits as much:
"Effort should be devoted instead to using the specifications, and the access to a working renderer, to create an open source renderer now."
Well, you surely can't use specifications unless they are accessible and have been preserved...

However, I must stress that I agree with what I take to be David's significant point, re-stated here as: the best Representation Information supporting preservation of information encoded in document formats is Open Source software. So "national libraries should consider collecting and preserving open source repositories". Yes!

Email discussion on the usefulness of file format specifications

This is a summary of an email exchange on the DCC Associates email list over a few days in late November and early December. I thought it was revealing of attitudes to preservation formats and to representation information (in the form of both specifications and running code), so I’ve summarised it here. Email lists are great for promoting discussion, but threads tend to fracture off in various directions, so a summary can be useful. Quotes are reproduced with permission; my thanks to all those involved.

Steve Rankin from the DCC down in Rutherford Labs noticed and drew the list’s attention to the Microsoft pages relating to their binary formats, made available under a so-called “Microsoft Open Specification Promise”.

http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
and http://www.microsoft.com/interop/osp/default.mspx

Chris Puttick of Oxford Archaeology pointed out that the pages had been up for a while (since February 2008 at least). He made a couple of interesting points:
“I have it on excellent authority that the specifications are useful but incomplete […]; secondly that this is not the first time MS have published such information only to take it down again later, [so] anyone interested in them should download them as soon as possible. I have it on slightly less excellent authority that a ‘promise’ as encased in the [Open Specification Promise] is specifically something in US law and may not have any validity outside of the US.”
Kevin Ashley from ULCC/NDAD agreed:
“It's my understanding - from those who have tried - that earlier specs that MS published failed exactly that test. It wasn't possible to use them to write software that dealt with all syntactic and semantic variations.

“It's a fairly fundamental test for network protocols that one can […] get two separate implementations to communicate with each other. The same is true of file formats, to my mind, and one can see the creating application and the reading application as equivalent to the two ends of a network connection, albeit not necessarily in real time.”
David Rosenthal from Stanford and LOCKSS injected some engineering reality from direct experience into the discussion. He has already released a longer blog post based on the discussion and his contribution; effectively he seemed to be aiming to demolish the argument for keeping specifications at all.
“Speaking as someone who has helped implement PostScript from the specifications, I can assure you that published specifications are always incomplete. There is no possibility of specifying formats as complex as CAD or Word so carefully that a clean-room implementation will be perfect. Indeed, there are always minor incompatibilities (sometimes called enhancements, and sometimes called bugs) between different versions of the same code. And there is no possibility of digital preservation efforts being able to afford the investment to do high-quality clean-room implementations of these complex formats. Look at the investment represented by the Open Office suite, for example.

“On the other hand, note that Open Office and other open source office suites in practice do an excellent job of rendering MS formats, and their code thus represents a very high quality specification for these formats. Code is the best representation for preservation metadata.”
Colin Neilson from DCC SCARP wondered what the implications of incomplete specifications were for the concept of Representation Information in OAIS (RepInfo is often associated in examples with specifications).

He wrote:
“I am interested in implications for areas (such as CAD software) where proprietary (secret sauce) formats are historically the norm. Is the legacy of digital working always preservable within an OAIS framework? […] Are there some limits in using an OAIS model if some "specifications" are inadequate or information is not available?”
and in a later message
“Do we need to have "access software" preserved (long term) if the other representation information is less complete in the case where standards for proprietary file formats (say like Microsoft word DOC format) are to a degree incomplete, less adequate or not available (perhaps more so in the case of older versions of file formats)?”
Personally I think one of the advantages of Open Office is that it is not just Access Software, but Open Source Access Software. This should give it much greater longevity. But of course, such alternatives don’t exist in many areas, including many of the CAD formats Colin is concerned about.

Alan Morris from Morris and Ward asked the obvious question:
“Who would even consider utilizing WORD as a preservation format?”
… and got a surprising answer from Peter Murray-Rust, of the eponymous Cambridge research group!
“I would, and I argued this in my plenary lecture at OpenRepositories08. Not surprisingly it generated considerable discussion, from both sides.

“First the disclaimer. I receive research funding (though not personal funding) from Microsoft Research. Some of you may wish to stop reading now! But I don't think it colours my judgment.

“My argument was not that Word2007 should be the only format, but that it should be used in conjunction with formats such as PDF. We have a considerable amount of work on [depositing] born-digital theses and we have recommended that theses should be captured in their original format (OOXML, ODT, LaTeX, etc.) as well as the PDF.

“I am a scientist (chemist) but generally interested in all forms of STM data (for example we collaborated in part of the KIM project mentioned a few emails ago). If you believe that preservation only applied to the holy "fulltext", stop reading now. However I think many readers would agree that much of the essential information in STM work (experiments, data, protocols, code, etc.) is lost in the process of publication and reposition. Very frequently, however, the original born-digital work contains semantic information which can be retrieved. For example OOXML and ODT allow nearly 100% of chemical information (molecular structures) to be retrieved (in certain circumstances), whereas PDF allows 0% by default. (It is possible, though extremely difficult and extremely lossy, to turn PDF primitives back into chemistry)

“Note that we also work on Open Office documents and have a JISC-sponsored collaboration with Peter Sefton [of the Australian Digital Futures Institute at USQ] on his excellent ICE system. We are exploring how easy it is to author chemistry directly into an ODT document and by implication into any compound semantic document (note that XML is the only practical way of holding semantics). […]”

“We've looked into using PDF for archiving chemistry and found that current usage makes this almost impossible. So we work with imperfect material.

“Note that Word2007 can emit OOXML that can be interpreted with Open Source tools. The conversion is not 100%, but whatever is? […]”

“I wonder whether all the detractors of OOXML have looked at it in detail. Yes, it is probably impossible to recreate all the minutiae of typesetting, but it preserves much of the embedded information that less semantic formats (PDF and even LaTeX) do not. If I have no commercial software and someone gives me a PDF of chemistry and someone else gives me OOXML, I'd choose the OOXML. HTML is, in many cases, a better format than PDF.

“So my suggestion is simple. Use more than one document format. After all do we really know what future generations want from the preservation process. It costs almost nothing as we are going to have to address compound documents and packaging anyway.”
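
As a small illustration of Peter's point that OOXML can be interpreted with Open Source tools: a .docx file is just a ZIP of XML parts, and the running text can be pulled out with nothing beyond a standard library. A rough sketch; real documents carry far more structure than this, any embedded chemistry lives in other parts of the package, and the file name is a placeholder:

```python
# Rough sketch: extract the visible text from a .docx (OOXML) file using
# only the Python standard library. Real documents carry far more structure
# (styles, embedded objects, revision history) than this shows.
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_text(path: str) -> str:
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):       # w:p = paragraph
        runs = [t.text or "" for t in para.iter(W + "t")]   # w:t = text run
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)

print(docx_text("thesis.docx"))           # placeholder file name
```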
An anonymous contributor suggested that the appropriate course was to structure AIPs to contain both the original source format and the preservation format. In the future, he asserted, better tools may exist to take the original source format and render a more completely accessible preservation format, particularly bearing in mind scientific notation.

Finally, Geoffrey Brown from Indiana also argued in favour of keeping the original (and against NARA policy):
“The Bush administration, as well as various companies, managed to embarrass themselves with inadvertently leaked information in the form of edit histories in Word documents. Migration will likely (who knows?) discard such information unless special care is taken in developing migration tools.

“I am uncomfortable with the assumption that we can abandon the original documents, as NARA seems to be doing by requiring(?) agencies to submit documents in PDF. The edit histories are part of the historical record; however, it's safe to say that most patrons will be satisfied with the migrated document.

“Digital repositories have an obligation to figure out how to preserve access to documents in any format and not use format as a gatekeeper.”
So… running code is better than specs as representation information, and Open Source running code is better than proprietary running code. And, even if you migrate on ingest, keep BOTH the rich format and a desiccated format (like PDF/A). It won’t cost you much and may win you some silent thanks from your eventual users!
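
Purely to illustrate that last recommendation, here is the kind of thing I mean, sketched in Python; the directory layout, file names and metadata fields are my own invention, not drawn from any particular AIP profile:

```python
# Sketch of packaging the rich original alongside a "desiccated" copy,
# with a small manifest recording roles and checksums. Layout, names and
# fields are invented for illustration, not taken from any AIP standard.
import hashlib
import json
import shutil
from datetime import date
from pathlib import Path

def build_package(pkg_dir: Path, original: Path, desiccated: Path) -> None:
    objects = pkg_dir / "objects"
    objects.mkdir(parents=True, exist_ok=True)
    record = {"created": date.today().isoformat(), "files": []}
    for role, src in (("original", original), ("preservation-copy", desiccated)):
        dest = objects / src.name
        shutil.copy2(src, dest)
        record["files"].append({
            "role": role,
            "name": src.name,
            "md5": hashlib.md5(dest.read_bytes()).hexdigest(),
        })
    (pkg_dir / "manifest.json").write_text(json.dumps(record, indent=2))

# e.g. keep the OOXML original alongside its PDF/A rendition (placeholder names):
build_package(Path("aip-0001"), Path("thesis.docx"), Path("thesis-pdfa.pdf"))
```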