Tuesday, 28 July 2009

Rosenthal at Sun-PASIG in Malta

I was very pleased to hear David Rosenthal reprise his CNI keynote on digital preservation for the Sun-PASIG meeting in Malta, a few weeks ago now. David is a very original thinker and careful speaker. I’ve fallen into the trap before of mis-remembering him, and then arguing from my faulty version. I even noted two tweets, made contemporaneously with his talk, that misquoted him and subtly changed his meaning (see below). Luckily, David has made his CNI presentation available in an annotated version on his blog, so I hope I don’t make the same mistake again.

If you were not able to hear this talk, please go read that blog post. David has some important things to say, pretty much all of which I agree strongly with. No real surprise there, as part of the talk at least echoes concerns I expressed in the “Excuse Me…” Ariadne article (Rusbridge, 2006), which on reflection was probably influenced by earlier meetings with David among others.

So here’s the highly condensed version: Jeff Rothenberg was wrong in his famous 1995 Scientific American article (Rothenberg, 1995). The important digital preservation problems for society are not media degradation or media obsolescence or format obsolescence, because important stuff is online (and more or less independent of media), and widely used formats no longer become obsolete the way they did when Jeff wrote the article. The important issue is money, as collecting all we need will be ruinously expensive. Every dollar we spend on non-problems (like protecting against format obsolescence) is a dollar that doesn’t go towards real problems.

And if you are so imbued with conventional preservation wisdom as to think that summary is nonsense, but you haven’t read the blog post, go read it before making up your mind!

David concludes:

"Practical Next Steps

Everyone - just go collect the bits: Not hard or costly to do a good enough job, Please use Creative Commons licenses

Preserve Open Source repositories: Easy & vital: no legal, technical or scale barriers

Support Open Source renderers & emulators

Support research into preservation tech: How to preserve bits adequately & affordably? How to preserve this decade's dynamic web of services? Not just last decade's static web of pages"

So what are the limitations of this analysis? My quick summary from a research data viewpoint:

Lots of important/valuable stuff is not online

Quite a lot of this stuff is not readable with common, open-source-compatible software packages

We need to keep contextual metadata as well as the bits for a lot of this stuff… and yes, we do need to learn how to do this in a scalable way.

David clearly concentrates on the online world:

“Now, if it is worth keeping, it is on-line

Off-line backups are temporary”

However, it’s worth remembering Raymond Clarke’s point in my earlier post from PASIG Malta about the cost advantages of offline. Particularly in the research data world, there is a substantial set of content that exists off-line, or perhaps near-line. Some of the Rothenberg risks still apply to such content. Let’s leave aside for the moment that parallels to the scenario that Rothenberg envisages continue to exist: scholars’ works encoded in obsolete digital media are starting to be ingested in archives. But more pressingly, some research projects report that their university IT departments discourage them from using enterprise backup systems for research data, for reasons of capacity limitations. So these data often exist in a ragbag collection of scarcely documented offline media (or may not even be backed up at all). In Big Science, data may be better protected, being sometimes held in large hierarchical storage management systems. A concern I have heard from the managers of such large systems is that the time needed to migrate their substantial data holdings from one generation of storage to the next can approximate the life of the system, i.e. several years. And clearly such systems are more exposed to risk while a migration is under way.

Secondly, David’s comments about format obsolescence apply specifically to common formats. He says “Gratuitous incompatibility is now self-defeating”, and “Open Source renderers [exist] for all major formats” with “Open Source isn't backwards incompatible”. But unfortunately there are still valuable resources that remain at risk. There are areas with valuable content not accessible with Open Source renderers (e.g. engineering and architectural design). There are many cases in research where critical analysis codes are written by non-experts, with poor version control and poor documentation. And even in the mainstream world, format obsolescence can still occur in minority formats, for all sorts of reasons, including bankruptcy, but also including sheer bad design of early versions.

Finally, I’m sure David didn’t really mean “just keep the bits”. Particularly in research, but in many other areas as well, important contextual data and metadata are needed to understand the preserved data, and to demonstrate its authenticity. The task of capturing and preserving these can be the hardest part of curating and preserving the data, precisely because those directly involved need less of the context.

Oh, that double mis-quote? Talking of the difficulty of engaging with costly lawyers, David said “1 hour of 1 lawyer ~ 5TB of disk [-] 10 hours of 1 lawyer could store the academic literature”. One tweet reported this as “Lawyer effects; cost of 10 lawyer hours could save entire academic literature!” and the other as “10 hours of a lawyer's time could preserve the entire academic literature”. See what I mean? Neither “save” nor “preserve” means the same as “store”!
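
The arithmetic behind David's figure is worth spelling out, because it is a statement about raw storage and nothing more. A minimal sketch, taking his two numbers at face value (the constant and function names are illustrative, not David's):

```python
# Rough equivalence quoted in the talk: 1 hour of 1 lawyer ~ 5 TB of disk.
# The name of this constant is an illustrative label, not taken from the talk.
TB_PER_LAWYER_HOUR = 5


def disk_equivalent_tb(lawyer_hours):
    """Disk capacity (in TB) that costs about the same as the given lawyer hours."""
    return lawyer_hours * TB_PER_LAWYER_HOUR


# Ten lawyer hours is the figure quoted for storing the academic literature.
print(disk_equivalent_tb(10))  # 50
```

So the claim implies that the academic literature would fit on something like 50TB of disk. It says nothing about what it costs to keep those bits safe and usable over time, which is exactly the gap that “save” and “preserve” paper over.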

Overall, David does a great job, in his presentation, blog post and other writings, in reminding us not to blindly accept but to challenge preservation orthodoxy. Put simply, we have to think for ourselves.

Rothenberg, J. (1995). Ensuring the longevity of digital documents. Scientific American, 272(1), 42. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=9501173513&site=ehost-live

(yes, that URL IS the "permanent URL" according to Ebsco!)

Rusbridge, C. (2006). Excuse Me... Some Digital Preservation Fallacies? Ariadne, 46. http://www.ariadne.ac.uk/issue46/rusbridge/


