Tuesday, 30 September 2008

iPres 2008 Foundations session and closing remarks

David Rosenthal from Stanford on bit preservation: is it a solved problem? He would like to keep a very large amount of data (say 1 PB) for a very long time (say a century) with a reasonable probability of survival (say 50%). Mean-time-to-data-loss predictions are optimistic because they make assumptions (often not published) such as uncorrelated failure, whereas failure is often correlated: how many pictures did we see of arrays of disks in the same room, candidates for correlated failures from several possible causes? Practical tests with large disk arrays show much more pessimistic results. David estimates that meeting his target of a PB for a century with a 50% chance of no loss would require “9 9s” reliability, which is unachievable. I find his logic unassailable and his conclusions unacceptable, but I guess that doesn’t make them wrong! I suppose I wonder what the conclusions might have been about keeping a terabyte for 10 years, had we made the calculation 20 years ago on the best data then available, versus doing those calculations now.

(Later: perhaps I’m thinking of this the wrong way. If David is saying that keeping a PB for a century with a 50% or greater chance of ZERO loss is extraordinarily difficult, even requiring 9 9s reliability, then that’s much, much more plausible, in fact downright sensible! So maybe the practical certainty of loss forces us to look for ways of mitigating that loss, such as Richard Wright’s plea for formats that fail more gracefully.)

Yan Han from Arizona looked at modelling the reliability of storage systems, once you have established data for the components. He presented some Markov chain models that rapidly got very complex, but I think he was suggesting that a 4-component model might approach the reliability requirements put forward by RLG (0.001%? where did that number come from?). It was not clear how this work would be affected by all the other error factors that David was discussing.
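I can’t reproduce Yan Han’s actual models, but the flavour of this kind of Markov reliability calculation is easy to show. The sketch below uses the textbook birth-death model for a mirrored pair, not necessarily his; the failure and repair rates are illustrative assumptions of mine.

```python
HOURS_PER_YEAR = 24 * 365

# Textbook Markov chain for two replicated copies:
# state 2 (both good) -> state 1 (one good) -> state 0 (data lost),
# with repair taking state 1 back to state 2.
lam = 1 / 1e6    # per-copy failure rate: one failure per 10^6 hours (assumed)
mu = 1 / 24      # repair rate: re-replication takes about a day (assumed)

# Closed-form mean time to data loss, starting from the all-good state:
mttdl_hours = (3 * lam + mu) / (2 * lam ** 2)
print(f"MTTDL ~ {mttdl_hours / HOURS_PER_YEAR:.1e} years")
```

The comfortingly huge answer is exactly the kind of optimistic MTTDL prediction David was warning about: the model assumes failures are independent and repairs always succeed.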

Christian Keitel from Ludwigsburg quoted one Chris Rusbridge as saying that digital preservation was “easy and cheap”, and then comprehensively showed that it was at least complicated and difficult. He did suggest that sensible bringing together of digital preservation and non-digital archival approaches (in both directions) could suggest some improvements in practice that do simplify things somewhat. So after the gloom of the first two papers, this was maybe a limited bit of good news.

Finally Michael Hartle of Darmstadt spoke on a logic rule-based approach to format specification, for binary formats. This was quite a complex paper, which I found quite hard to assimilate. A question on whether the number of rules required was a measure of format complexity got a positive response. For a while I thought Michael might be on the road to a formal specification of Representation Information from OAIS, but subsequently I was less confident of this; Michael comes from a different community, and was not familiar with OAIS, so he couldn’t answer that question himself.

In discussion, we felt that the point of David’s remarks was that we should understand that perfection was not achievable in the digital world, as it never was in the analogue world. We have never managed to keep all we wanted to keep (or to have kept) for as long as we wanted to keep it, without any damage.

At the end of the session, I felt perhaps I should either ask the web archivists to withdraw my quoted remarks, or perhaps we should amend them to “digital preservation is easy, cheap, and not very good”! (In self-defence I should point out that the missing context for the first quote was: in comparison with analogue preservation. We only needed to look out of our window at the massive edifice of the British Library, think of its companions across the globe, and compare their staff and budgets for analogue preservation with those applied to digital preservation, to realise there is still some truth in the remark!)

In the final plenary, we had some prizes (for the DPE Digital Preservation Challenge), some thanks for the organisers, and then some closing remarks from Steve Knight of NZNL. He picked up the “digital preservation deprecated” meme that appeared a few times, and suggested “permanent access” as an alternative. He also demonstrated how hard it is to kick old habits (I’ve used the deprecated phrase myself so many times over the past few days). Steve, I think it’s fine to talk about the process amongst ourselves, but we have to talk about the outcomes (like permanent access) when we’re addressing the community that provides our sustainability.

Where I did part company with Steve (or perhaps massively misunderstood his intent) was when he apparently suggested that “good enough”… wasn’t good enough! I tried to argue him out of this position afterwards (briefly), but he was sure of his ground. He wanted better than good enough; he felt that posterity demanded it (this harks back to some of Lynne Brindley’s opening comments about today’s great interest being yesterday’s ephemera, for instance). He thought if we aimed high we would fail high. Fresh from my previous session, I am worried that if we aim too high we will spend too much on unachievable perfection, and as a result keep too little, and ultimately fail low. But Steve knows his business way better than I do, so I’ll leave the last word with him. Digital Preservation: Joined Up and Working (the conference sub-title) was perhaps over-ambitious as a description of the current state, he suggested, but it was a great call to arms!


  1. ...the reliability requirements put forward by RLG (0.001%? where did that number come from?).

    I think I can shed some light on this, although not as much as I would like to be able to. That figure comes from some explanatory text for one of the TRAC requirements. It isn't a figure that repositories are asked to meet - it is an example of how TRAC expects a repository to set standards for itself (in this case, relating to acceptable or anticipated levels of loss) and then be able to demonstrate why it believes it meets those self-imposed standards. My recollection is that this section started out with some members believing that no loss was acceptable; others argued that that was not attainable (and the presentations in this session do an excellent job of demonstrating the truth of that.)

    But we felt that a repository, to deserve the attribute 'trusted', ought to be able to demonstrate some insight into how reliable its storage systems and curation practices were, and how they related to the requirements of its stakeholders.

    The wiki we used to develop TRAC had the entire history of the argument and counter-argument that led to this (and many other decisions.) Unfortunately, when I last checked, its contents had been completely erased and replaced by something else :-( As it was password-protected, the Internet Archive can't help us here.

  2. For a while I thought Michael might be on the road to a formal specification of Representation Information from OAIS, but subsequently I was less confident of this; Michael comes from a different community, and was not familiar with OAIS, so he couldn’t answer that question himself.

    After some reading on the OAIS, I think I can now answer the question positively - the BSG reasoning approach may serve as a formalization of OAIS Representation Information, unless I got something substantially wrong.


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.