Showing posts with label Digital Preservation.

Thursday, 6 October 2011

Thoughts before "The Future of the Past of the Web"

Tomorrow I'm going to be in London for a joint JISC/DPC event on web archiving, "The Future of the Past of the Web" (hashtag #fpw11, if you're so inclined). It's the third in an occasional series; I gave the closing presentation at the second event and I have been asked to be on a closing panel this time round. One of the things we've been asked to reflect on is what changes have taken place since the last event and how far our expectations have been realised. I thought it would be useful to set my thoughts on this down in advance, partly to help me articulate my own thinking. It will be interesting to see how various views develop during the panel session tomorrow.

Image Courtesy Martin Dodge's Cybergeography collection
Looking back at my concerns in mid-2009 I'm greatly reassured. There were a number of worrying trends apparent in web archives at that time and an apparent lack of bold vision in how we might use web archives in the future - or even in the present. My fear was that the collecting policies, preservation policies and interfaces offered were all taking a very human and document-centric view of what a web archive should do. In OAIS terms, the Designated Community was people who wanted to view individual old web pages having done a search for a particular site, or possibly for a keyword of some sort. The National Archives had taken one incremental but powerful step beyond that, automatically linking archived web pages to 404 pages on government web sites via simple plugins for Apache & IIS, but in the end this still involved serving individual pages for people to read.
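That 404 trick is simple enough to sketch: when a page can no longer be found, redirect the visitor to the same URL as held in the web archive. Here is a minimal illustration in Python; the archive prefix and URL pattern are invented for the example, not TNA's actual scheme:

```python
def archive_redirect_url(requested_url,
                         archive_prefix="https://webarchive.example.gov.uk/+/"):
    """Build a redirect target pointing at an archived copy of a missing page.

    The archive_prefix here is a hypothetical example; real web archives
    each have their own URL scheme for looking up a captured page.
    """
    return archive_prefix + requested_url


# A 404 handler would then issue an HTTP 302 to this address:
print(archive_redirect_url("http://www.example.gov.uk/old/report.html"))
# -> https://webarchive.example.gov.uk/+/http://www.example.gov.uk/old/report.html
```

The Apache and IIS plugins do essentially this at the server level: intercept the 404 and send the browser to the archive's copy instead of a dead end.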

That's a valid use case, but by no means the only one. I set out a few other things we might want to be able to do but could not with the interfaces that web archives gave us.

  • What search results would we have got on the web of 1998 using the search engines of 1998?

  • What results would we have got using current search engines on the web of 1998?

  • How can we visualise the set of links to or from a particular site changing over time?

  • Treating the web as a corpus of text over time, how can we track the emergence of words or concepts, and their movement from specialist vocabulary into general use?

  • As historians of technology, how can we use a web archive to track things like the emergence of PNG as an image format and the decline of XPM (the original icon format for graphical browsers such as Mosaic)?


I also wanted to show how open APIs or RESTful interfaces can allow others to develop innovative ways to view content. Since there weren't any web archives with such interfaces I fell back on demonstrating the point with Flickr, more particularly with the simple visual beauty that is TagGalaxy. TagGalaxy shows how the ability to search and retrieve images and tags lets someone else build a completely different interface to the Flickr repository, one which minimises textual interaction and which encourages serendipitous discovery. It would have been wonderful to be able to do that with a web archive. Similarly, if Brian Kelly had been able to say to the Internet Archive 'give me all the versions of the home page of the University of Bath between these dates' in a single interaction, it would have been much easier for him to build the informative animation he used in his own presentations for JISC PoWR. I could go on, and at the time I did.

Much of what I hoped for then has happened. The architecture of Memento makes it straightforward to view collections of web archives as a single entity from some viewpoints. Projects funded by "Digging Into Data" have shown the power of large web collections in viewing the web as data at many levels. And although most (all?) web archives are not yet offering the APIs or interfaces that would permit us to do some of the things above, I think they at least accept that these are valid aspirations.
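For a flavour of what Memento makes possible, here is a sketch of parsing a Memento TimeMap, the machine-readable list of captures ("mementos") of a page that a conforming archive can serve. The sample TimeMap and archive URLs are made up; the entry format follows the link format the Memento protocol defines:

```python
import re
from datetime import datetime

def parse_timemap(timemap_text):
    """Extract (url, capture-datetime) pairs from a Memento TimeMap.

    TimeMaps are served in the link-format style defined by the Memento
    protocol: each entry is a <url> followed by attributes such as
    rel="memento" and datetime="...".
    """
    mementos = []
    pattern = re.compile(
        r'<([^>]+)>;[^,]*rel="[^"]*memento[^"]*";[^,]*'
        r'datetime="([^"]+)"')
    for url, stamp in pattern.findall(timemap_text):
        when = datetime.strptime(stamp, "%a, %d %b %Y %H:%M:%S GMT")
        mementos.append((url, when))
    return mementos


# A made-up TimeMap with two captures of one page:
sample = (
    '<http://example.org/>; rel="original",\n'
    '<http://archive.example.org/19981203/http://example.org/>; '
    'rel="memento"; datetime="Thu, 03 Dec 1998 21:33:40 GMT",\n'
    '<http://archive.example.org/20090506/http://example.org/>; '
    'rel="memento"; datetime="Wed, 06 May 2009 12:01:30 GMT"\n')
print(parse_timemap(sample))
```

With something like this in hand, "give me all the versions of this page between these dates" becomes a filter over one list, regardless of which archives hold the captures.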

Moreover, web archiving has moved from being a specialist concern to something that appears in the letters pages of national newspapers. That, and the type of talks we're going to hear tomorrow, show how far we've moved in 2 1/2 years. I'm quietly confident that things are getting better.

Tuesday, 9 March 2010

A Blue Ribbon for Sustainability?

When we talk about long term digital preservation, about access for the future, about the digital records of science, or of government, or of companies, or the designs of ships or aircraft, the locations of toxic wastes, and so on being accessible for tens or hundreds of years, we are often whistling in the dark to keep the bogeys at bay. These things are all possible, and increasingly we know how to achieve them technically. But much more than non-digital forms, the digital record needs to be continuously sustained, and we just don’t know how to assure that. Providing future access to digital records needs action now and into that future to provide a continuous flow of the necessary will, community participation, energy and (not least) money. Future access requires a sustainable infrastructure. Ensuring sustainability is one of the major unsolved problems in providing future access through digital preservation.

For the past two years I have been lucky enough to be a member of the grandly named Blue Ribbon Task Force on Sustainable Digital Preservation and Access, along with a stellar cast of experts in preservation, in the library and archives worlds, in data, in movies… and in economics. Co-chaired by Fran Berman (previously of SDSC, now of RPI) and Brian Lavoie of OCLC, the Task Force produced an Interim Report (PDF) a year ago, and has just released its Final Report (Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digital Information, also PDF). (The Task Force was itself sustained by an equally stellar cast of sponsors, including the US National Science Foundation and the Andrew W. Mellon Foundation, in partnership with the Library of Congress, the UK’s JISC, the Council on Library and Information Resources, and NARA.)

Sustainability is often equated to keeping up the money supply, but we think it’s much more than that. The Task Force specifically looks at economic sustainability; it says early in the Executive Summary that it’s about

“… mobilizing resources—human, technical, and financial—across a spectrum of stakeholders diffuse over both space and time.”

If you want a FAQ on funding your project over the long term you won’t find it here. Nor will you find a list of benefactors, or pointers to tax breaks, or arguments for your Provost. Instead you should find a report that helps you think in new ways about sustainability, and apply that new thinking to your particular domain. For one of our major conclusions is that there are no general, across-the-board answers.

One of the great things about this Task Force was its sweeping ambition: not content just to bring together a new economics of sustainable digital preservation, it also thought remarkably broadly. This was never about some few resources, or this Repository or that Archive; it was about the preservation and long-term access of major areas of our intellectual life, like scholarly communication, like research data, like commercially owned cultural content (the movie industry is part of this), and the blogosphere and variants (collectively produced web content). Looking at those four areas holistically rather than as fragments forced us to recognise how different they are, and how much those differences affect their sustainability. They aren’t the only areas, and indeed further work on other areas would be valuable, but they were enough to make the Task Force think differently from any activity I have taken part in before.

The report is, to my mind, exceedingly well written, thanks to Abby Smith Rumsey; it far exceeds the many rather muddled conversations we had during our investigations. It has many quotable quotes; among my favourites is

“When making the case for preservation, make the case for use.”

Reading the report is not without its challenges, as you might expect. It has to marry two technical vocabularies and make them understandable to both communities. I’ve been living partly in this world for two years, and still sometimes stumble over it; I remember many times screwing up my forehead, raising my hand and asking “Tell us again, what’s a choice variable?” And the reader will have to think about things like derived demand for depreciable durable assets, nonrival in consumption, temporally dynamic and path-dependent, not to mention the free rider problem. These concepts are there for a reason, however; get them straight and you’ll understand the game a lot better.

And there are not surprisingly big underlying US-based assumptions in places, although the two resident Brits (myself and Paul Ayris of UCL) did manage to inject some internationalism. Further work grounded in other jurisdictions would be extremely valuable.

Overall I don’t think this report is too big an ask for anyone anywhere who is serious about understanding the economic sustainability of digital preservation and future access to digital materials. I hope you find the great value that I believe exists here.

Monday, 19 October 2009

SUN PASIG: October 2009

As readers of this blog may have guessed, I was in San Francisco for the iPres 2009 Conference (17 blog posts in 2 days is something of a personal record!). This conference was followed by several others, including the Sun Preservation & Archiving SIG (Sun-PASIG), from Wednesday to Friday. I didn't feel quite so moved to blog the presentations as at iPres (and I was also knackered, not to put too fine a point on it). But I did not want to pass it by completely unremarked, particularly as I really like the event. This is the second Sun-PASIG meeting I've attended, following one in Malta in June of this year (see two previous blog posts).

It's a very different kind of meeting from iPres. The agenda is constructed by a small group, forcefully led by Art Pasquinelli of Sun and Michael Keller of Stanford. The presentations are just that: presentations, not papers. This lets them be more playful and pragmatic, and also more up-to-date. Of course, there's a price to pay for a vendor-sponsored conference, although I won't reveal here what it is!

Tom Cramer has put up the slides at Stanford, so you can explore things I was less interested in. In the first session, the presentation that really grabbed me was Mark Leggott from Prince Edward Island (I confess, guiltily, I don't really know where this is) talking about Islandora. This is a munge of Fedora and Drupal, with a few added bits and bobs. It looked like a fantastic example of what a small, committed group with ideas and some technical capability can do. Nothing else on day 1 caught my imagination quite so strongly, although I enjoyed Neil Jeffries' update on activities in Oxford Libraries, and Tom Cramer's own newly pragmatic take on a revised version of the Stanford Digital Repository.

On day 2 there were lots of interesting presentations. Of particular interest perhaps was the description of the University of California Curation Center's new micro-services approach to digital curation infrastructure. I'm not quite sure I get all of this, mainly perhaps because so much was introduced so quickly; however, as I read more about each puzzling micro-service, it seems to make more sense. BTW I congratulate the ex-CDL Preservation Group on their new UC3 moniker! 'Tis pity it came the same week as the New York Times moan about overloading the curation word (here if you are a registered NYT reader)...

I also very much liked the extraordinary presentation by Dave Tarrant of Southampton and Ben O'Steen of Oxford on their ideas for creating a collaborative Cloud. Just shows what can be done if you don't believe you can't! The slides are here but don't give the flavour; you just had to be there.

In a presentation particularly marked by dry style and humour, Keith Webster of UQ talked about Fez, and shortly after Robin Stanton of ANU talked about ANDS; both very interesting. The day ended with a particularly provocative talk by Mike Lesk, once at NSF for the Digital Library Initiatives, now at Rutgers. Mike's aim was to provoke us with increasingly outrageous remarks until we reacted; if he failed to get a pronounced reaction, it was more to do with the time of day and the earlier agenda. But this is a great talk, and mostly accessible from the slides.

On the 3rd day, we had a summing up from Cliff Lynch, interesting as ever, followed by breakouts. I went to the Data Curation group (surprise!), to find a half dozen folk, apparently mostly from IT providers, very concerned about dealing with data at extreme scale. It's a big problem (sorry), but not quite what I'd have put on the agenda. But in a way it typifies Sun-PASIG: never quite what you thought, always challenging and interesting.

Shortly thereafter I had to leave, but in the middle of a fascinating discussion about the future of Sun-PASIG, particularly with the shadow of the Oracle acquisition looming. I certainly believe that the group would be useful to the new organisation, and very much hope that it survives. Next year in Europe?

Tuesday, 6 October 2009

iPres 2009: Pennock on ArchivePress

Blogs are a new medium but an old genre; witness Samuel Pepys’ diaries, for instance (now also a blog!). But since they are web based, aren’t they already archived through web archiving? Only up to a point: web archiving treats blogs simply as web pages, pages that change but in a sense stay the same. It also can’t easily respond to triggers, like RSS feeds announcing new postings. Web archiving approaches are fine, but they don’t treat blogs as first-class objects.

New possibilities can help build new corpora, aggregating blogs to create a preserved set for institutional records and other purposes. ArchivePress is a JISC Rapid Innovation (JISCRI) project, which once completed will be released as open source. The project started with a small 10-question survey, for which the key question was: which parts of blogs should archiving capture? In descending order the answers were posts, comments, tag & category names, embedded objects, and the blog name & URLs. These findings were broadly in agreement with an earlier survey (see paper for reference).

The project set out to find the significant properties of blogs, properties which they see as being in the eye of the stakeholder. The first round includes content (posts, comments, embedded objects), context (including authors & profiles), structure, rendering and behaviour.

To achieve this, they build on the Feed plugin for WordPress, which gathers the content as long as an RSS or Atom feed is available. WordPress is arguably the most widely used blogging platform; it’s open source, it’s GPL, and it has publicly available schemas.

Maureen showed the AP1 demonstrator based on the DCC blogs [disclosure: I’m from the DCC!], including blog posts written today that had already been archived. The AP2 demonstrator (the UKOLN collection) will harvest comments and resolve some rendering and configuration issues from AP1; it will also allow administrators to add new categories (tags?).

It seems to work; there turned out to be more variations in feed content than expected. Configuration is tricky, so they must make it easier.
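The core of the feed-driven harvesting idea is easy to sketch. This is not ArchivePress's actual code (which builds on WordPress and its Feed plugin, in PHP); it is just an illustration in Python of pulling the archivable fields out of an Atom feed:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def harvest_atom(feed_xml):
    """Pull the fields a blog archive cares about out of an Atom feed:
    post title, permalink, publication date and content."""
    root = ET.fromstring(feed_xml)
    posts = []
    for entry in root.findall(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        posts.append({
            "title": entry.findtext(ATOM + "title"),
            "url": link.get("href") if link is not None else None,
            "published": entry.findtext(ATOM + "published"),
            "content": entry.findtext(ATOM + "content"),
        })
    return posts


# A minimal made-up feed, standing in for a real blog's Atom output:
sample_feed = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry>
    <title>First post</title>
    <link href="http://blog.example.org/2009/first-post"/>
    <published>2009-10-05T09:00:00Z</published>
    <content type="text">Hello, archive.</content>
  </entry>
</feed>"""

for post in harvest_atom(sample_feed):
    print(post["title"], post["url"])
```

Because the harvest works from the feed rather than the rendered page, posts and comments arrive as structured records, which is exactly what makes them first-class objects rather than snapshots of HTML.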

Monday, 10 August 2009

Forgetting to remember

After a Sunday Times article prompted yesterday's piece of whimsy, a Tweet from my standard Twitter search ( (digital OR data) AND (preservation OR curation), since you ask) produced an interesting article by Chris O'Brien, a columnist for MercuryNews.com: "Time to clean up your digital closet". He goes quite nicely through the various ways in which our personal digital content is more at risk than we might think (media degradation, device and format obsolescence, and the sheer anonymity of large quantities of digital stuff). But he has a prescription for dealing with some of it, part of which I reproduce here (I hope fair use covers this, since you'll have to go to the original to read the rest!):
"However, all is not lost. There are some strategies for storing your digital archives. But you'll have to do a lot of work. You will need to start thinking like a librarian and become an active curator of your files. That means relentlessly organizing, labeling and tagging, backing up and deleting.

The first and most important thing to do is to begin deleting files. Whittle things down to the essentials. What do you really want to maintain and pass along? You must be ruthless and vigilant.

Next, develop a system for organizing files online and offline. If you're going to store stuff on removable media, like DVDs, place them in cases that have extensive labels, and index them. And don't store files like text documents or photos in proprietary formats that are not widely adopted. Experts recommend photos in JPG format and documents in PDF formats or basic text formats.

Label every file and tag them with as much information as you can. Being obsessive now will pay off in the long run. This is a lot of work, which is why you want to cull your archives as much as possible.

Once that's done, make multiple copies. You can also explore "cloud" backup services..."
Thinking like a librarian? Being an active curator of your files? Sounds like a good place to start. Interesting that he sees deleting as being an important part of remembering! We probably need better tools for the average person for a lot of this (eg tagging files in a filestore), but I suspect there's enough around for any reasonably competent researcher to use. However, laziness, forgetfulness and sheer pressure of work are our enemies here. Will we forget to do the things needed to remember?
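Some of that librarian-ish discipline can be automated. As a sketch (the function and its layout are mine, not O'Brien's prescription), a small script can walk a directory and record each file's path, size, a fixity checksum and some owner-supplied tags, giving you both a crude finding aid and a way to verify those multiple copies later:

```python
import hashlib
import os

def build_manifest(root_dir, tags=()):
    """Walk a directory and record, for each file, its relative path,
    size and a SHA-256 fixity checksum, plus any free-text tags the
    owner supplies. Re-running this over a backup copy and comparing
    checksums is a simple integrity check."""
    manifest = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(65536), b""):
                    digest.update(chunk)
            manifest.append({
                "path": os.path.relpath(path, root_dir),
                "bytes": os.path.getsize(path),
                "sha256": digest.hexdigest(),
                "tags": list(tags),
            })
    return manifest
```

Dumping the result to a JSON or text file stored alongside the copies means the labels travel with the data, rather than living only in someone's memory.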

Sunday, 9 August 2009

Remembering to forget

Are we getting this digital preservation thing all wrong? An article in today's Sunday Times quotes Viktor Mayer-Schonberger that we're creating a "digital memory that vastly exceeds the capacity of our collective human mind". James Harkin reports that Viktor wants us to forget more. Mind you, the main (and earliest) example given concerns an unfortunate photo from as recently as 2006. I'm sure the lady concerned would have remembered it only too well, whether or not the Internet example had come to light.

It's a daft story, but there is an interesting angle. Many preservation "systems" carry the risk that things will be preserved that some would prefer forgotten (eg the famous Bush speech). When the powerful want to change the record, the Web both facilitates and resists them. Web sites (and archives) are generally under some kind of centralised control, and subject to pressure, which they may or may not be able to resist. There are rumours of web-based reports being retrospectively "fixed". But once reports have got out into the wild, as it were, it is much harder to "fix" them, as the example above shows.

This doesn't mean that archives are a bad thing. They bring professional standards to the keeping of history. But perhaps it's a Good Thing that there's an uncontrollable un-system of citizens keeping (probably illegal) copies of some important and uncomfortable records. Even if it does mean that the lady's embarrassing photo stays around longer than she would like!

Tuesday, 28 July 2009

Turmoil in discourse a long term threat?

Lorcan Dempsey mentioned a meeting with Walt Crawford, whom I don't know, in the light of his feeling that "some of the heat had gone out of the blogosphere in general", and reported:
'Walt, whom I was pleased to bump into [...], is probably right to suggest in the comments that some energy around notifications etc has moved to Twitter: "Twitter et al ... have, in a way, strengthened essay-length blogging while weakening short-form blogging (maybe)-and essays have always been harder to do than quick notes"'
That ties in to my experience to some extent. I've just published a blog post from Sun-PASIG in Malta, which ended a month ago (not really an essay, but something where it was hard to get the tone just right), and I have a bunch of other posts in the "part-written" pipeline. Tweets are a lot easier.

But that isn't quite my point here. I'm a little concerned that the new "longevity" threat may not be the encoding of our discourse in obsolete formats, and not even our entrusting it to private providers such as the blog systems (as long as it IS open access, and preferably Creative Commons). The threat may be the way new venues for discourse wax and wane with great rapidity. We can learn to deal with blogs, we can even have a debate on whether the twitterverse is worth saving (or how much of it might be). Do we need to worry about other more social media (MySpace, Facebook, Flickr and so many lesser pals; so heavily fractured)? They're not speech, they're not scholarly works, but they have some significance (particularly in documenting significant events) somewhere in between. We could learn to deal with any small set of them, but by the time we work out how they could be preserved, and how parts might be selected, that set would (as is suggested above for blogs) already be "so last year".

BTW, part of this space is being addressed by the Blue Ribbon Task Force on Sustainable Digital Preservation and Access. I'm attending one of their meetings over the next two days, on my first visit to Ann Arbor, Michigan. Among the things we're looking at are scenarios that currently include social media. I'll try and write a bit more about it, but it's not really the sort of meeting you can blog about freely...

Rosenthal at Sun-PASIG in Malta

I was very pleased to hear David Rosenthal reprise his CNI keynote on digital preservation for the Sun-PASIG meeting in Malta, a few weeks ago now. David is a very original thinker and careful speaker. I’ve fallen into the trap before of mis-remembering him, and then arguing from my faulty version. I even noted two tweets made contemporaneously with his talk, that misquoted him and changed the meaning subtly (see below). Luckily, David has made his CNI presentation available in an annotated version on his blog, so I hope I don’t make the same mistake again.

If you were not able to hear this talk, please go read that blog post. David has some important things to say, pretty much all of which I agree strongly with. No real surprise there, as part of the talk at least echoes concerns I expressed in the “Excuse Me…” Ariadne article (Rusbridge, 2006), which on reflection was probably influenced by earlier meetings with David among others.

So here’s the highly condensed version: Jeff Rothenberg was wrong in his famous 1995 Scientific American article (Rothenberg, 1995). The important digital preservation problems for society are not media degradation or media obsolescence or format obsolescence, because important stuff is online (and more or less independent of media), and widely used formats no longer go obsolescent the way they used to when Jeff wrote the article. The important issue is money, as collecting all we need will be ruinously expensive. Every dollar we spend on non-problems (like protecting against format obsolescence) doesn’t go towards real problems.

And if you are so imbued with conventional preservation wisdom as to think that summary is nonsense, but you haven’t read the blog post, go read it before making up your mind!

David concludes:

"Practical Next Steps

Everyone - just go collect the bits: Not hard or costly to do a good enough job, Please use Creative Commons licenses

Preserve Open Source repositories: Easy & vital: no legal, technical or scale barriers

Support Open Source renderers & emulators
Support research into preservation tech: How to preserve bits adequately & affordably? How to preserve this decade's dynamic web of services? Not just last decade's static web of pages"
So what are the limitations of this analysis? My quick summary from a research data viewpoint:

  • Lots of important/valuable stuff is not online.

  • Quite a lot of this stuff is not readable with common, open-source-compatible software packages.

  • We need to keep contextual metadata as well as the bits for a lot of this stuff… and yes, we do need to learn how to do this in a scalable way.

David clearly concentrates on the online world:

“Now, if it is worth keeping, it is on-line

Off-line backups are temporary”

However, it’s worth remembering Raymond Clarke’s point, in my earlier post from PASIG Malta, about the cost advantages of offline storage. Particularly in the research data world, there is a substantial set of content that exists off-line, or perhaps near-line, and some of the Rothenberg risks still apply to such content. Let’s leave aside for the moment that parallels to the scenario Rothenberg envisaged continue to exist: scholars’ works encoded on obsolete digital media are starting to be ingested into archives. More pressingly, some research projects report that their university IT departments discourage them from using enterprise backup systems for research data, for reasons of capacity; so these data often exist in a ragbag collection of scarcely documented offline media, or may not be backed up at all. In Big Science, data may be better protected, being sometimes held in large hierarchical storage management systems. A concern I have heard from the managers of such large systems is that the time needed to migrate their substantial data holdings from one generation of storage to the next can approach the life of the system, i.e. several years; and data in mid-migration is clearly more exposed to risk.

Secondly, David’s comments about format obsolescence apply specifically to common formats. He says “Gratuitous incompatibility is now self-defeating”, and “Open Source renderers [exist] for all major formats” with “Open Source isn't backwards incompatible”. But unfortunately there are examples where there are valuable resources that remain at risk. There are areas with valuable content not accessible with Open Source renderers (eg engineering and architectural design). There are many cases in research where critical analysis codes are written by non-experts, with poor version control, poorly documented. And even in the mainstream world, format obsolescence can still occur in minority formats, for all sorts of reasons, including bankruptcy, but also including sheer bad design of early versions.

Finally, I’m sure David didn’t really mean “just keep the bits”. Particularly in research, but in many other areas as well, important contextual data and metadata are needed to understand the preserved data, and to demonstrate its authenticity. The task of capturing and preserving these can be the hardest part of curating and preserving the data, precisely because those directly involved need less of the context.

Oh, that double mis-quote? Talking of the difficulty of engaging with costly lawyers, David said “1 hour of 1 lawyer ~ 5TB of disk [-] 10 hours of 1 lawyer could store the academic literature”. One tweet reported this as “Lawyer effects; cost of 10 lawyer hours could save entire academic literature!” and the other as “10 hours of a lawyer's time could preserve the entire academic literature”. See what I mean? Neither save nor preserve means the same as store!

Overall, David does a great job, in his presentation, blog post and other writings, in reminding us not to blindly accept but to challenge preservation orthodoxy. Put simply, we have to think for ourselves.

Rothenberg, J. (1995). Ensuring the longevity of digital documents. Scientific American, 272(1), 42. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=9501173513&site=ehost-live

(yes, that URL IS the "permanent URL" according to Ebsco!)

Rusbridge, C. (2006). Excuse Me... Some Digital Preservation Fallacies? Ariadne, 46. http://www.ariadne.ac.uk/issue46/rusbridge/

Thursday, 4 June 2009

SNIA "Terminology Bridge" report

Quite a nice report has just been published by the Storage Networking Industry Association's Data Management Forum, called "Building a Terminology Bridge: Guidelines for Digital Information Retention and Preservation Practices in the Datacenter". I don't think I'd agree with everything I saw on a quick skim, but overall it looks like a good set of terminology definitions.

The report identifies "two huge and urgent gaps that need to be solved. First, it is clear that digital information is at risk of being lost as current practices cannot preserve it reliably for the long-term, especially in the datacenter. Second, the explosion of the amount of information and data being kept long-term make the cost and complexity of keeping digital information and periodically migrating it prohibitive." (I'm not sure that I agree with their apocalyptic cost analysis, but it certainly deserves some serious thought!)

However, while still addressing these large problems, they found that what "began as a paper focused at developing a terminology set to improve communication around the long-term preservation of digital information in the datacenter based on ILM[*]-practices, has now evolved more broadly into explaining terminology and supporting practices aimed at stimulating all information owning and managing departments in the enterprise to communicate with each other about these terms as they begin the process of implementing any governance or service management practices or projects related to retention and preservation."

It's worth a read!

(* ILM = Information Lifecycle Management, generally not related to the Curation Lifecycle, but oriented towards management of data on appropriate storage media, eg moving less-used data onto offline tapes, etc.)

Thursday, 14 May 2009

OAIS version for public examination

Thanks to David Giaretta for the following information on the state of the revision to OAIS (I have commented earlier on this process):
OAIS version for public examination

Many comments and ideas for clarifications and improvements for OAIS were received as part of its 5 year review process.

These suggestions were reviewed and the proposed dispositions sent to their originators for further comment. This draft version of OAIS contains these and many other improvements and is the candidate for submission to ISO for review. At this stage we are seeking primarily to identify errors rather than further ideas.

The PDF file is available at http://cwe.ccsds.org/moims/docs/MOIMS-DAI/Draft%20Documents/OAIS-candidate-V2-markup.pdf

Please send corrections to oais-support@oais.info by 15 June 2009

(NB there are some cross-reference errors which will be corrected in the final version)

Shortly after this date the corrected OAIS update will be sent to ISO and in due course this will be released for international review at which point further comments may be submitted.

John Garrett (chair) David Giaretta (deputy-chair)
DAI-WG CCSDS

Sunday, 19 April 2009

Amiga disk data recovery: progress and limitations

You may remember that I have been attempting to recover files and content from various sources from 10 or more years ago. One of these was an Amiga disk. On the label is the note: "Dissertation 17/4/96, CV.asc, CV, 29 September 1996".

I’ve described earlier some attempts to get the Catweasel controller to read the disks. After eventually figuring out how to configure the disk-reading program ImageTool3 for the Catweasel, I tried the Amiga disk. It worked fine, with as far as I can see zero errors. From a cursory scan of Google, I reckon this raw disk format is known as ADF, so I renamed it XXXAmiga.adf (.adf was one of the candidate extension names under the selected "Plain" category for the ImageTool3 program).

Now, of course, we have to work out first how to extract files from the disk image, and then how to convert the particular file formats into modern-day equivalents.

Simply opening the raw disk image in Notepad on Windows or TextEdit (on my Mac) shows that there is real text there, which made sense to my colleague (see below)!
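That eyeballing can be automated: a few lines of Python will pull every run of printable ASCII out of a raw image, much as the Unix strings utility does (the sample bytes below are invented for illustration, not taken from the actual .adf image):

```python
import re

def extract_strings(data: bytes, min_len: int = 6):
    """Return runs of printable ASCII at least min_len characters long."""
    return [m.group().decode("ascii")
            for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data)]

# an invented fragment standing in for part of the disk image
image = b"\x00\x00Curriculum Vitae\x00\xff\x01Final Writer document\x00"
print(extract_strings(image))  # ['Curriculum Vitae', 'Final Writer document']
```

Run over a whole 880 KB image this turns up file names, fonts and body text even before the file system is understood.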

A comment from “Euan” on my earlier post suggested that we try the WinUAE Amiga emulator, and my colleague did that. He reported:
“Success. I've not only got the WinUAE Amiga emulator working, but managed to find a copy of the application that I wrote my CV and dissertation in (Final Writer 5) and have been able to read the files off the disk image you sent and display them (screenshots attached).

Not having any luck reading the individual files directly [CR: from his Windows system], though -- other than the odd word related to fonts and colours -- but then they are in native FW5 format.”
Image from the dissertation seen in Final Writer
I asked if he was able to do any "save as" operations in his emulated Final Writer program, to move files from the disk image into the Windows file store. He reported:
“I've tried re-saving my files in another format, if that's what you mean, but the program doesn't do anything -- I can select Save > Save As... from the menu but nothing happens. However, I can see all my individual dissertation files from my PC as the file system is mapped onto a directory.”
Raw image of the same part of the dissertation, grabbed from TextEdit on the Mac
It was remarkable how much could be read directly in the disk image!

Now my colleague was able to read his CV.asc file with Notepad on Windows, but so far we have not been able to convert the dissertation to a modern format, nor to connect the Final Writer program inside the emulator to a printer. Frustratingly close, but still not quite where we would like to be. I did find a demo copy of Final Writer for Windows 95 on the WayBack machine (earliest capture of the site, also 1996), but unfortunately it wouldn't open the existing files unless we upgraded to the full-featured version... and the company appears to have gone bust in 1996-7 or thereabouts!

So what have we learned from this?
  • It is possible to read a 13-year-old floppy disk from an obsolete machine with an apparently incompatible disk format, kept under conditions of less than benign neglect, using cheap hardware on a recent Windows PC.
  • It is possible to access the files from the obsolete operating system using an emulator that appears to have been written by spare time volunteers.
  • It is possible to run the original application that created some of these files, under the emulator, and to read and process them (but not, so far, to save in another format).
  • Using the emulator is valuable, but constraining (in being unfamiliar technology, with few manuals etc) and limiting (in not, so far, being able to do much more with the files). We would now like to migrate them to a modern environment; for my colleague, this means Windows or Linux.
Fascinating!

Wednesday, 1 April 2009

An update on my data recovery efforts

You may remember that after our Christmas party late last year, I wrote a blog post offering to have a go at recovering some old files, if anyone was interested. A half dozen or so people got in touch, one with 30 or so old Mac disks, someone with a LaTeX version of a thesis on an old Mac disk, a colleague (who started all this, really) with a dissertation on an old Amiga disk, and someone with a CD from an Acorn RISC PC, plus a few others.

I frightened some off by giving them a little disclaimer to agree to, but others persisted and sent me their media. So, problem number 1, how to read old Amiga and Apple floppy disks? Both of these have a different structure from PC-compatible floppies, for example the old Mac disks store more information on the outer tracks than the inner ones, thus packing more data onto the disk.
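The variable-speed trick is easy to quantify. As I understand the 800 KB Mac GCR format, the 80 tracks are split into 5 speed zones of 16 tracks each, carrying 12, 11, 10, 9 and 8 sectors of 512 bytes per track respectively, on each of 2 sides:

```python
# Apple 800 KB GCR layout (as I understand it): 5 speed zones of 16 tracks,
# with 12, 11, 10, 9 and 8 sectors per track; 512-byte sectors, 2 sides.
sectors_per_track = [12, 11, 10, 9, 8]  # one entry per zone, outer to inner

total_bytes = sum(2 * 16 * spt * 512 for spt in sectors_per_track)
print(total_bytes, total_bytes // 1024)  # 819200 bytes, i.e. 800 KB
```

A constant-density disk running at the innermost rate would hold only 2 × 80 × 8 × 512 = 640 KB, which is where the extra capacity comes from, and why a plain PC controller cannot read these disks.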

The answer seemed to be a special controller for a Windows computer that links to a standard floppy drive. The controller is called the Catweasel Mk4, from a German company, Individual Computers. We ordered one and it arrived quite quickly, well before I had managed to borrow a Windows box to experiment with. The card didn't physically fit in the first system, and then there was a long wait while we found a spare monitor (progress was VERY slow when I had to disconnect and re-use my Mac monitor). Then bid-writing intervened.

Eventually we got back to it, but had lots of problems configuring the controller properly; the company was quite good at providing advice. A couple of days ago I finally got the config file right.

I wasn't using any of the contributed disks for testing, but instead some old DOS disks that I had from my days in Dundee (1992-4). Early on we did manage to read these, with rather a lot of errors, but lately we have got zero good sectors off these disks. I'm still not sure why; I'm inclined to blame my attempt to read one of the disks using the Windows commands on the same drive (there's a pass-through mode); I never managed to get any DOS disks to read a single track after that!

Well, today I stuck in an old Mac disk I had... and lo and behold, it was reading with a fair proportion of good sectors. So, let's try the Amiga disk: 100% good (well, maybe one bad sector). And the Mac disk with the LaTeX file on it: pretty good, 1481 good sectors out of 1600.

The problems aren't over yet. All you get from the ImageTool3 program that works through this controller is a disk image. So now I'm looking for a Windows (or Mac) utility to mount an Amiga file system, so we can copy the files out of it. And ditto for the Mac file system (written circa 1990 I think; I'm assuming it's HFS, but don't know).
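Short of a full mounting utility, the file-system question can at least be narrowed down by sniffing well-known signatures in the image: an Amiga OFS/FFS boot block starts with the bytes 'DOS', an HFS volume carries the signature 'BD' at byte offset 1024, and a PC/FAT boot sector ends with 0x55 0xAA. A rough sketch (synthetic images below, for illustration only):

```python
def guess_filesystem(image: bytes) -> str:
    """Crude signature sniffing for floppy-sized disk images."""
    if image[:3] == b"DOS":            # Amiga OFS/FFS boot block
        return "amiga"
    if image[1024:1026] == b"BD":      # HFS Master Directory Block signature
        return "hfs"
    if image[510:512] == b"\x55\xaa":  # PC boot-sector signature
        return "fat"
    return "unknown"

print(guess_filesystem(b"DOS\x00" + bytes(2000)))         # amiga
print(guess_filesystem(bytes(1024) + b"BD" + bytes(64)))  # hfs
```

This only identifies the family, of course; extracting the files still needs something that understands the directory structures.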

At that point, depending on the amount of corruption, the Mac job should be pretty much done; my contributor still understands LaTeX and can probably sort out his old macros. For the Amiga files, there will be at least one further stage: identifying the file formats (CV.asc is presumably straight ASCII text, but the dissertation may be in a desktop publishing file format), and then finding a utility to read them.

It's been slow but interesting, and I've been quite despondent at times (checking out the data recovery companies, that sort of thing). But now I'm quite excited!

These two contributors have got their disk images back, and may have further ideas and clues. But can you help with any advice?

Tuesday, 31 March 2009

More on the ICTHES journals

I've had 3 responses by email to yesterday's post on the ICTHES journals (some responding to an associated email from me on the same issue). I'll summarise the two where quote permission was not explicit, and quote the third at length.

Adam Farquhar of the BL told me he had discussed it with their serials processing team under the voluntary scheme for legal deposit of digital material, and they will download the material into the BL's digital archive, where it will become accessible in the reading rooms (in due course, I guess). Wider access to such open access material should be available later under their digital library programme.

Tony Kidd, of the University of Glasgow and UKSG suggested that an OpenLOCKSS type approach might be feasible. This is consistent with the email from Vicky Reich of LOCKSS; she told me I could post her response. So here it is:
"UK-LOCKSS can, and should, preserve the four ICTHES journals.
  • First step: Contact the publisher and ask them to leave the content online long enough for it to be ingested.
  • Second step: Ask the publisher to put online a LOCKSS permission statement.
  • Third step: Someone on the LOCKSS team does a small amount of technical work to get content ingested.
With these minimal actions, the content would be available to those institutions who are preserving it in their LOCKSS box.

If librarians want to rehost this O/A content for others, there are two additional requirements:
  • a) the content has to be licensed to allow re-publication by someone other than the original copyright holder. This is best done via a Creative Commons license.
  • b) institutions who hold the content have to be willing to bear the cost of hosting the journals on behalf of the world.
Librarians, even those who advocate open access have not taken coordinated steps to ensure the OA literature remains viable over the long term. Librarians are motivated to ensure perpetual access to very expensive subscription literature, but ensuring the safety of the OA literature is not a priority because... it's available, and it's free. [...]

When the majority of librarians who think open access is a "good idea" step up and preserve this content (and I don't mean shoving individual articles into institutional repositories), then we will be well on our way to building needed infrastructure"
See also the comment from Gavin Baker to yesterday's post, which I think backs up Vicky's last point:
"I've thought for a while that archiving OA journals should be a goal of the library and OA community, maybe via a consortium which would harvest new issues of journals listed in the DOAJ. (We can treat as separate, for these purposes, the question of short-term archiving in case a journal goes under from the question of long-term preservation.) Is there a reason why this approach isn't undertaken? Do people assume that any OA journal worth archiving is already being archived by somebody somewhere?"
Let's be quite clear: contrary to my simplistic assumptions, the Internet Archive is NOT undertaking this task!

Monday, 30 March 2009

Charity closing, possible loss of 4 OA titles

I note from Gavin Baker's [not Peter Suber's; my mistake- CR] blog entry that the charity ICTHES is closing, and as a result its 4 OA journals, listed below, may disappear. I have checked the Internet Archive and, in case we should be complacent about it as a system of preservation, found that only 1 of the 18 issues across the 4 titles had actually been gathered there.

The Journals are
I see from Suncat that these titles are variously held by BL, Cambridge, Oxford and NLS, so I guess they are regarded as serious titles.

Since UKSG is now in progress, I wondered if I could challenge UKSG on what it (or we, the community) can and/or should and/or will do about this! Would there be any opportunity in the programme to discuss this? (BTW unfortunately I am not able to come to Torquay, so I'm niggling, and indeed watching the #uksg tweets, from a distance.)

Options for action that I can see include:
  • a) some kind of sponsored crawl by the Internet Archive;
  • b) an emergency sponsored crawl by UKWAC or one of its participants (which may of course already have happened);
  • c) an urgent approach by a group of those participating in LOCKSS for the charity to join the programme (which may be stymied by lack of development effort and time); this would, I think, only make the content available to participants;
  • d) ditto for CLOCKSS, which at least might have the resources to make the content publicly available on a continuing basis;
  • e) sponsored ingest into something like Portico; again, only available to participants as I understand it;
  • f) tacitly suggest libraries grab copies of the 18 or so PDFs;
  • g) get a group of libraries to offer to host a historical archive of the titles for the charity;
  • h) appraise the titles as not worth preserving, and consign them to the bitbin of history; or
  • i) ummm, errr, dither...
PS: this blog entry is based on an email sent to the conference organisers and others, unfortunately after the conference had started. I have already had one response, from the BL, suggesting they would discuss it with their journals people...

Tuesday, 10 March 2009

Obsolete drives; sideways thinking?

I’ve been trying to write this post for ages, but the draft never seemed right, so this is starting all over again, blank sheet.

Many people have a bunch of stuff on media for which it’s hard to find a working drive, whether disk or tape, or punched card or paper tape, or… A feature of the response to my “12 files for Christmas” post was that those who responded have their interesting stuff locked away on such media. There may be other challenges to reading it, but the first is getting the content off the media.

We tend to go all gloom and doom about this. I’ve got stuff on Iomega Jaz cassettes from an earlier Mac, so without a Jaz drive I can’t read it. Might as well chuck it in a bin? Folks have stuff on early Mac 3.5” drives, or Amiga drives, neither of which can be easily read on current systems. And 5.25” or 8” drives are even scarcer in working condition.

Is the only answer to find a working drive on a working computer of the day, the "technology preservation" approach? I remember an ancient engineer scolding me years ago for a failure of imagination, on the subject of disappearing 7-track magnetic tape drives. "Young man," he gruffed, "if you really care about this stuff, lay the tape on your kitchen table, cover it with paper, scatter iron filings across the top, give it a tap, and read the bits off with a magnifying glass!" He didn't mention that I had to know the relationship between domains and bits, the parity and other features of character encodings, the block structure of the tapes, and so on; in those days we knew that stuff!

Now I’ve done the maths, and this is NOT feasible with a 3.5”, 2 MByte floppy disk (the iron filings are too big)! It might just be feasible with some of the 8” floppy formats. Different technological approaches (not iron filings, but some other means of making magnetic domains visible, if such means exist) might be feasible for higher densities. In any such case, you would end up with an image of the bits on the disk, in concentric tracks. From here, you have a computational task, or a series of such tasks: identify the tracks, separate into sectors, decode into bytes or characters, decode into directories and file structures, process into files, and now you have something to operate on! Yes, it’s tough to do all that, but you would be able to combine lots of contributions together to do it.

Now, I’m NOT saying that’s the best approach. I AM saying, 8” drives were advanced technology when introduced more than 20 years ago. The requirements for a production 8” drive included high read/write performance (for the time). The requirements now have changed. Performance isn't the issue; scraping every last reliable bit off that drive is!

Today, more than 20 years later, storage engineers in their clean-room high tech environments can build amazingly high performance production drives with previously inconceivable capacities and speeds. But what could you do today with a Masters-level Electronic Engineering lab, some bright students, a few hundred dollars, these ancient media formats, and a much-reduced performance requirement? I don’t care if it takes 10 minutes to read my disk, as long as I can do it!

Is this important? These disks have been stashed away for 20 years, who cares what’s on them? Well, in many cases no-one does. But just think who was using those early drives, and what for. They certainly include authors, poets, scientists, scholars, politicians, philosophers… and many of those people, if not in the first flush of youth then, are moving towards retirement now. Some of these will be candidates for approaches from libraries and archives interested in their “papers”. Previously this meant boxes of paper, photos, diaries etc. Now it includes old media, dropped in the box years ago. Who knows what treasures they may contain? (See the Digital Lives project for examples.)

So, I think there is or will be an emerging interest in these obsolete media and their contents. And at the same time, I think (hope) it would represent an interesting challenge to set students. Perhaps not quite in the same class as building a car to drive across the country on solar power, or robots to play football, but interesting in its own different way.

One of those combined Computing Science and Electronic Engineering schools would be perfect. Would a prize help? Maybe this could factor in something like the Digital Preservation Awards one year? A new kind of Digital Preservation Challenge?

Monday, 9 March 2009

Repository preservation revisited

Are institutional repositories set up and resourced to preserve their contents over the long term? Potentially contradictory evidence has emerged from my various questions related to this topic.

You may remember that on the Digital Curation Blog and the JISC-Repositories JISCmail list on 23 February 2009, I referred to some feedback from two Ideas (here and here) on the JISC Ideascale site last year, and asked 3 further questions relating to repository managers’ views of the intentions of their repositories. Given a low rate of response to the original posting (which asked for votes on the original Ideascale site), I followed this up on the JISC-Repositories list (but through oversight, not on the blog), offering the same 3 questions in a Doodle poll. The results of the several different votes appear contradictory, although I hope we can glean something useful from them.

I should emphasise that this is definitely not methodologically sound research; in fact, there are methodological holes here large enough to drive a Mack truck through! To recap, here are the various questions I asked, with a brief description of their audience, plus the outcomes:
a) Audience: JISC-selected "expert" group of developers, repository managers and assorted luminaries. The second Idea below was put to the same audience, a little later.
  • Idea: “The repository should be a full OAIS [CCSDS 2002] preservation system.” Result 3 votes in favour, 16 votes against, net -13 votes.
  • Idea: “Repository should aspire to make contents accessible and usable over the medium term.” Result: 13 votes in favour, 1 vote against, net +12 votes.
b) Audience JISC-Repositories list and Digital Curation Blog readership. Three Ideas on Ideascale, with the results shown (note, respondents did not need to identify themselves):
  • My repository does not aim for accessibility and/or usability of its contents beyond the short term (say 3 years). Result 2 votes in favour, none against.
  • My repository aims for accessibility and/or usability of its contents for the medium term (say 4 to 10 years). Result 5 votes in favour, none against.
  • My repository aims for accessibility and/or usability of its contents for the long term (say greater than 10 years). Result 8 votes in favour, 1 vote against, net +7 votes.
A further comment was left on the Digital Curation Blog, to the effect that since most repository managers were mainly seeing deposit of PDFs, they felt (perhaps naively) sufficiently confident to assume these would be useable for 10 years.

c) Audience: JISC-Repositories list. Three exclusive options on a Doodle poll, exact wording as in (b), with no option to vote against any option, and the results shown below (note, Doodle asks respondents to provide a name and most did, with affiliation, although there is no validation of the name supplied):
  • My repository does not aim for accessibility and/or usability of its contents beyond the short term (say 3 years). Result 1 vote in favour.
  • My repository aims for accessibility and/or usability of its contents for the medium term (say 4 to 10 years). Result 0 votes in favour.
  • My repository aims for accessibility and/or usability of its contents for the long term (say greater than 10 years). Result 22 votes in favour.
I guess the first thing is to notice the differences between the 3 sets of results. The first would imply that long term is definitely off the agenda, while medium term is reasonable. The second is a 50-50 split between long term and the short/medium-term combination. The third is overwhelmingly in favour of long term (as defined).

By now you can also see at least some of the methodological problems, including differing audiences, differing anonymity, and differing wording (firstly in relation to the use of the term “OAIS”, and secondly in relation to the timescales attached to short, medium and long term). So, you can draw your own conclusions, including that none can be drawn from the available data!

Note, I would not draw any conclusions from the actual numerical votes on their own, but perhaps we can from the values within each group. However, ever hasty if not foolhardy, here are my own tentative interpretations:
  • First, even “experts” are alarmed at the potential implications of the term “OAIS”.
  • Second, repository managers don’t believe that keeping resources accessible and/or usable for 10 years (in the context of the types of material they currently manage in repositories) will give them major problems.
  • Third, repository managers don’t identify “accessibility and/or usability of its contents for the long term” as implying the mechanisms of an OAIS (this is perhaps rather a stretch given my second conclusion).
So, where to next? I’m thinking of asking some further questions, again of the JISC-Repositories list and the audience of the Digital Curation Blog. However, this time I’m asking for feedback on the questions, before setting up the Doodle poll. My draft texts are
  • My repository is resourced and is intended to keep its contents accessible and usable for the long term, through potential technology and community changes, implying at least some of the requirements of an OAIS.
  • My repository is resourced and is intended to keep its contents accessible and usable unless there are significant changes in technology or community, ie it does not aim to be an OAIS.
  • Some other choice, please explain in free text…
Are those reasonable questions? Or perhaps, please help me improve them!

This post is made both to the Digital Curation Blog and to the JISC-repositories list...

OAIS: CCSDS. (2002). Reference Model for an Open Archival Information System (OAIS). Retrieved from http://public.ccsds.org/publications/archive/650x0b1.pdf.

Monday, 2 March 2009

Report on Data Preservation in High Energy Physics

There's a really interesting (if somewhat telegraphic) report by Richard Mount of SLAC on the workshop on data preservation in high energy physics, published in the January 2009 issue of Ariadne. The workshop was held at DESY (Deutsches Elektronen-Synchrotron), Hamburg, Germany, on 26-28 January 2009.
"The workshop heard from HEP experiments long past (‘it’s hopeless to try now’), recent or almost past (‘we really must do something’) and included representatives form experiments just starting (‘interesting issue, but we’re really very busy right now’). We were told how luck and industry had succeeded in obtaining new results from 20-year-old data from the JADE experiment, and how the astronomy community apparently shames HEP by taking a formalised approach to preserving data in an intelligible format. Technical issues including preserving the bits and preserving the ability to run ancient software on long-dead operating systems were also addressed. The final input to the workshop was a somewhat asymmetric picture of the funding agency interests from the two sides of the Atlantic."
There's a great deal to digest in this report. I'd agree with its author on one section:
"Experience from Re-analysis of PETRA (and LEP) Data, Siegfried Bethke (Max-Planck-Institut für Physik)

For [Richard], this was the most fascinating talk of the workshop. It described ‘the only example of reviving and still using 25-30 year old data & software in HEP.’ JADE was an e+e- experiment at DESY’s PETRA collider. The PETRA (and SLAC’s PEP) data are unlikely to be superseded, and improved theoretical understanding of QCD (Quantum ChromoDynamics) now allows valuable new physics results to be obtained if it is possible to analyse the old data. Only JADE has succeeded in this, and that by a combination of industry and luck. A sample luck and industry anecdote:

‘The file containing the recorded luminosities of each run and fill, was stored on a private account and therefore lost when [the] DESY archive was cleaned up. Jan Olsson, when cleaning up his office in ~1997, found an old ASCII-printout of the luminosity file. Unfortunately, it was printed on green recycling paper - not suitable for scanning and OCR-ing. A secretary at Aachen re-typed it within 4 weeks. A checksum routine found (and recovered) only 4 typos.’

The key conclusion of the talk was: ‘archiving & re-use of data & software must be planned while [an] experiment is still in running mode!’ The fact that the talk documented how to succeed when no such planning had been done only served to strengthen the conclusion."
I had heard of this story from a separate source (Ken Peach, then at CCLRC), so it's good to see it confirmed. I think the article that eventuated is
Bethke, S. (2000). Determination of the QCD coupling α_s. J. Phys. G: Nucl. Part. Phys., 26.
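The Aachen checksum routine is not described in the report, but even a toy per-line checksum shows how retyping errors get caught; the sample line below is invented:

```python
def line_checksum(line: str) -> int:
    # toy checksum: sum of the character codes, mod 2**16
    return sum(line.encode("ascii")) % 65536

original = "run 4711  fill 23  lumi 1234.5"
retyped  = "run 4711  fill 23  lumi 1284.5"   # one digit mistyped: 3 -> 8
print(line_checksum(original) == line_checksum(retyped))  # False
```

A plain sum is blind to transposed characters; a position-weighted sum would catch those as well, which is presumably closer to what a real verification routine would use.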
One particularly sad remark came from Amber Boehnlein (US Department of Energy (DOE)):
"Amber was clear about the DoE/HEP policy on data preservation: ‘there isn’t one.’"
The DCC got a mention from David Corney of STFC, who runs the Atlas Petabyte Data Store; however, I can confirm that we don't have 80 staff, or anywhere near that number (just under 13 FTE, if you're interested!). The reporter may have mixed us up with David's group, which I suspect is much larger.

In the closing sessions we heard from Homer Neal,
"who set out a plan for work leading up to the next workshop. In his words:
  • ‘establish the clear justification for Data Preservation & Long Term Analysis
  • establish the means (and the feasibility of these means) by which this will be achieved
  • give guidance to the past, present and future experiments
  • a draft document by the next meeting @SLAC.’"
Well worth a read!

Monday, 23 February 2009

Repositories and preservation

I have a question about how repository managers view their role in relation to long term preservation.

I’m a member of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (hereafter BRTF). At our monthly teleconference last week, we were talking about preservation scenarios, and I suggested the Institutional Repository system, adding that my investigations had shown that repository managers did not (generally) feel they had long term preservation in their brief. There was some consternation at this, and a question as to whether this was based on UK repositories, as there was an expressed feeling that US repositories generally would have preservation as an aim.

My comment was based on a number of ad hoc observations and discussions over the years. More recently, in an analysis of commentary on my Research Repository System ideas, I reported on discussions that took place on Ideascale last year, during preparatory work for a revision of the JISC Repositories Roadmap.

In this Ideascale discussion, I put forward an Idea relating to Long Term preservation: “The repository should be a full OAIS preservation system”, with the text:
“We should at least have this on the table. I think repositories are good for preservation, but the question here is whether they should go much further than they currently do in attempting to invest now to combat the effects of later technology and designated community knowledge base change...”
See http://jiscrepository.ideascale.com/akira/dtd/2276-784. This Idea turned out to be the most unpopular Idea in the entire discussion, now having gathered only 3 votes for and 16 votes against (net -13).

Rather shocked at this, I formulated another Idea, see http://jiscrepository.ideascale.com/akira/dtd/2643-784: “Repository should aspire to make contents accessible and usable over the medium term”, with the text:
“A repository should be for content which is required and expected to be useful over a significant period. It may host more transient content, but by and large the point of a repository is persistence. While suggesting a repository should be a "full OAIS" has not proved acceptable to this group so far, investment in a repository and this need for persistence suggest that repository managers should aim to make their content both accessible and usable over the medium (rather than short) term. For the purposes of this exercise, let's suggest factors of around 3: short term 3 years, medium term around 10 years, long term around 30 years plus. Ten years is a reasonable period to aspire to; it justifies investment, but is unlikely to cover too many major content migrations.

“To achieve this, I think repository management should assess their repository and its policies. Using OAIS at a high level as a yard stick would be appropriate. Full compliance would not be required, but thought to each major concept and element would be good practice.”
This Idea was much more successful, with 13 votes for and only one vote against, for a net positive 12 votes. (For comparison, the most popular Idea, “Define repository as part of the user’s (author/researcher/learner) workflow” received 31 votes for and 3 against, net 28.)

Now it may be that the way the first Idea was phrased was the cause of its unpopularity. It appears that the 4 letters OAIS turn a lot of people off!

So, here are 3 possible statements:

1) My repository does not aim for accessibility and/or usability of its contents beyond the short term (say 3 years)
(http://jiscrepository.ideascale.com/akira/dtd/14100-784 )

2) My repository aims for accessibility and/or usability of its contents for the medium term (say 4 to 10 years)
(http://jiscrepository.ideascale.com/akira/dtd/14101-784 )

3) My repository aims for accessibility and/or usability of its contents for the long term (say greater than 10 years).
(http://jiscrepository.ideascale.com/akira/dtd/14102-784 )

Could repository managers tell me which they feel is the appropriate answer for them? Just click on the appropriate URI and vote it up (you may have to register, I’m not sure).

(ermmm, I hope JISC doesn’t mind my using the site like that… I think it’s within the original spirit!)

(This was also a post to the JISC-Repositories list)

Friday, 13 February 2009

Open Office as a document migration-on-demand tool - again

We’ve seen suggestions in comments on this blog, and on other blogs, that code is better than specifications as representation information, and that well-used, running open source code is better than proprietary code. We’ve also had assertions that documents should be preserved in their original format, rather than migrated on ingest (I’ve some reservations on this in some cases for data, but as long as the original form is ALSO preserved, it’s fine).

The appropriate strategy for documents in obsolete formats would therefore seem to be to preserve the original format and migrate on demand, from the original format to a current format, when an actual customer wants to use it. This process should always be left as late as possible, on the grounds that the migration tool may improve and render the document better in later versions (and it allows the cost to be placed onto the user, not the archive, if appropriate). By the way, this exactly parallels the case in real archives; they don't translate their documents from Old Norse to modern English each time the latter changes. If you want to read them, go learn Old Norse, or hope that someone has earned a brownie point by translating them for publication somewhere...
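Migration on demand might be wired up along the following lines. This is only an illustrative sketch: the table of convertible formats is invented, and the soffice --headless --convert-to invocation is, as far as I know, the batch converter offered by OpenOffice.org's modern descendants (such as LibreOffice); flag support varies by version.

```python
import subprocess
from pathlib import Path

# invented table: extensions we believe our converter can read
CONVERTIBLE = {".doc", ".wpd", ".sxw", ".odt", ".wps"}

def migrate_on_demand(source: Path, target: str = "odt",
                      outdir: Path = Path(".")) -> None:
    """Convert a preserved original to a current format, only when asked.

    The original stays untouched in the archive; the converted copy is
    written to outdir for the requesting user."""
    if source.suffix.lower() not in CONVERTIBLE:
        raise ValueError(f"no migration route for {source.suffix!r}")
    subprocess.run(["soffice", "--headless", "--convert-to", target,
                    "--outdir", str(outdir), str(source)], check=True)
```

The key design point is that nothing is converted at ingest: the archive stores bit-perfect originals, and the dispatcher is run only at access time, with whatever the best available converter then is.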

I have suggested a couple of times that a plausible hypothesis for "office documents" (i.e. text documents, spreadsheets, simple drawings, presentations, simple databases) is that the OpenOffice.org suite should be the migration tool of choice. After all, it supposedly reads files in 180+ different file formats, it is open source, it is widely used, it is actively developed, and it can produce output in at least one internationally standardised format. I've noted already that it isn't perfect; the Mac version, for instance, fails to open my problem PowerPoint version 4 files (to be fair, it doesn't claim to). But perhaps it's worth taking a look again at the range of formats it claims to deal with. All these figures relate to vanilla OpenOffice.org 3.0.0.0 (build 9358) for the Mac (the first fully native, up-to-date Mac version I've been able to lay my hands on).

So, a health warning: this rather long post goes into more detail on something I've covered before!

The single OpenOffice.org application opens 6 main classes of document: text document, spreadsheet, presentation, drawing, database and formula (the list of supported file formats has moved to an appendix of the Getting Started guide). In each case, a majority (or close to it) of the supported formats are OpenOffice native formats or their predecessors, plus a good selection of Microsoft Office formats (Office 6.0/95/97/XP/2000 and the XML versions from Office 2003 and Office 2007). This means that the large majority of documents from the past 10 years or so, during which these have been the dominant office suites, should be readable.

The most well-known (and presumably most widely used) remaining supported word processing format is WordPerfect (.wpd), though it is not clear which versions. Then there are some interesting ones: for example DocBook, the Chinese-developed Uniform Office Format, the Korean Hangul WP 97, AportisDoc for the Palm, Pocket Word, and the Czech T602. Interesting, that: it represents significant investment to ensure these minority but presumably significant formats can be handled.

Similarly for the spreadsheets, as well as the various native and Microsoft formats, there is also support for two earlier significant players: Lotus 1-2-3 and Quattro Pro 6.0, also dBase (.dbf) and Data Interchange Format (DIF), not to mention CSV. It’s interesting that dBase is treated as a spreadsheet rather than a database; I wonder what the limitations are.

Presentations are more limited; apart from the basic OpenOffice and MS Office variants, they include Computer Graphics Metafile (CGM) as a presentation format but not as a drawing format, which is a bit odd. PDF is also included; well, it does do presentations, and PDF presentations seem to have some advantages over PowerPoint ones.

Graphics formats have always been popular in the open source community, so it’s not surprising that a wide range of formats is supported for graphics. Aside from several OpenOffice formats, these include a few surprises such as AutoCAD’s DXF Interchange Format, and Kodak PhotoCD, as well as a large range of usual suspects (GIF, PNG, TIFF, JPEG, many bit-mapped formats).

Finally, the only database format supported is the OpenOffice native database (remembering that dBase is apparently supported, as a spreadsheet, presumably with limitations). I tried to open a Microsoft Access database from a previous computer (Win95?), without success. Old databases do tend to be a bit of a problem; I have heard there are significant compatibility problems even between successive versions of Access. And Formula supports a couple of OpenOffice formats, plus MathML, which should be good for scientific use today.

So, for nothing, you get a migration tool that deals with a substantial proportion of current or recent documents. I don’t have enough experience yet to judge how effective it is. I did try a trivial round-trip test: opening a Microsoft Word 2004 for Mac document in OpenOffice, saving as native OpenOffice, then re-opening and saving as Word again, followed by a document compare in Word; it revealed very small layout differences in nested bullets (which resulted in pagination changes), and a few minor changes in styles. Not quite a fully reversible migration, but the result was a perfectly acceptable rendition of the original.
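For what it's worth, that round trip is easy to script. Another hypothetical sketch: it assumes a `soffice` binary on the PATH and LibreOffice-style `--headless --convert-to` flags, and it leaves the final comparison step to Word (or a diff tool), as in the test just described.

```python
import subprocess
from pathlib import Path

def converted_name(path, target_fmt, outdir):
    """Name the converter gives the output file (pure helper)."""
    return Path(outdir) / (Path(path).stem + "." + target_fmt)

def convert(path, target_fmt, outdir="roundtrip"):
    """One conversion step via a (hypothetical) headless soffice."""
    Path(outdir).mkdir(exist_ok=True)
    subprocess.run(["soffice", "--headless", "--convert-to", target_fmt,
                    "--outdir", str(outdir), str(path)], check=True)
    return converted_name(path, target_fmt, outdir)

def round_trip(doc):
    """Word -> ODF -> Word; compare the result with the original in Word."""
    return convert(convert(doc, "odt"), "doc")
```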

Now a migrate on demand tool is only useful in this role as long as (or if) the original file format is supported. If you are interested in older documents, from what one might call the baroque period of early personal and office documents (say from the invention of microcomputers for home use, through early “personal computers”, up to the big shakeouts of the mid to late 1990s), you will find OpenOffice rather less helpful as a migrate on demand tool. On one argument, this doesn’t matter much, as comparatively speaking such formats represent a tiny minority of surviving documents (unproven but pretty safe assertion!). However, this class of baroque period documents is starting to become important to archives (real archives, not collections of backups, or even digital preservation repositories), as they begin to collect them as part of the “papers” of eminent individuals. See for example, the Digital Lives project mentioned here before.

So, here are two proposals (for both of which, specifications as well as known working code would be useful!):
  • Funders, Foundations etc - please fund efforts to add input filters to OpenOffice for such older document formats, and
  • Computing Science departments - please set group assignments that would result in components of such filters being contributed to the OpenOffice effort.
Collectively, we might suggest the underlying effort here as an OpenOffice Legacy Files Project. Does anyone know how to set up such a project?

BTW after my last posting on this topic, a linkback led me to a post where Leslie Johnston mentioned Conversions Plus as having been a life-saver on several occasions. It’s a commercial tool, so maybe there are licence and survivability issues, but the list of formats it claims is impressive. In the Word Processing area alone, you get:
  • 3 versions of Ami Pro
  • 2 versions of AppleWorks
  • ClarisWorks 1.0 - 5.0
  • 3 versions of MacWrite
  • DCA-RFT
  • 3 versions of Multimate
  • Many versions of Word, back to MS Word DOS 5.5
  • Several versions of MS Works
  • PerfectWorks
  • WordPerfect for DOS and Windows
  • WordPerfect Works
  • WordStar for DOS
  • Several versions of Lotus Word Pro
Is a tool like this a better bet than OpenOffice for migration on demand? In the longer term, I don’t think so, even if it might be more helpful in the short term. You’d have to be convinced that the company will still exist to supply it, and that it will still run on your then current hardware. It might, but the odds seem somewhat better for a very popular open source application like OpenOffice.

But in the end, one way or another, you pays your money and you place your bets!

Friday, 16 January 2009

Kilbride new Director for Digital Preservation Coalition

I'm very pleased that William Kilbride has been appointed Executive Director of the DPC. I've worked with William for several years, touching on preservation and also geospatial matters. Given the kind of small organisation the DPC is, William is an excellent person to engage with its community and to represent it and its views. The official press release follows:
The Digital Preservation Coalition (DPC) is pleased to announce that Dr William Kilbride has been appointed to the post of DPC Executive Director.

William has many years of experience in the digital preservation community. He is currently Research Manager for Glasgow Museums, where he has been involved in digital preservation and access aspects of Glasgow's museum collections, and in supporting the curation of digital images, sound recordings and digital art within the city's museums.

Previously he was Assistant Director of the Archaeology Data Service where he was involved in many digital preservation activities. He has contributed to workshops, guides and advice papers relating to digital preservation.

In the past William has worked with the DPC on the steering committee for the UK Needs Assessment, was a tutor on the Digital Preservation Training Programme and was a judge for the 2007 Digital Preservation Award.

Although the DPC registered office will remain in York, William will be based at the University of Glasgow. He will take up the post on the 23rd February 2009.

Ronald Milne
Chairman
Digital Preservation Coalition