Saturday, 5 April 2008

When is revisability a significant property?

I’ve been thinking, off and on, about significant properties, and reading the papers for the significant properties workshop on Monday. Excellent papers they are, too, and suggest a fascinating day (the papers are available from here). I’m not entirely through them yet, so maybe this post is premature, but I didn’t see much mention of a property that came to mind when I was first thinking about the workshop: revisability (which I guess in some sense fits into the “behaviour” group of properties that I mentioned in the previous post).

In the analogue world, few resources that we might want to keep over the long term are revisable. Some can easily be annotated; you can write in the margins of books, and on the backs of photos, although not so easily on films or videos. But the annotations can be readily distinguished from the real thing.

In the digital world, however, most resources are at least plastic and very frequently revisable. By plastic I mean things like email messages or web pages that look very different depending on the tools the reader chooses. By revisable I mean things like word processing documents or spreadsheets. When someone sends me one of the latter as an email attachment, I will almost always open it with a word processor (a machine for writing and revising), rather than a document reader. The same thing happens when I download one from the web. For web pages, the space bar is a shortcut for “page down”, and quite often I find myself attempting to use the same shortcut on a downloaded word processing document. In doing so, I have revised it (even if only in trivial ways). Typically if I want to save the downloaded document in some logical place on my laptop, I’ll use “Save As” from within the word processor, potentially saving my revisions as invisible changes.

Annotation is rarer in the digital world, although many word processors now have excellent comment and edit-tracking facilities (by the way, there’s a nice blog post on da blog which points to an OR08 presentation on annotations, and an announcement on the DCC web site about an annotation product that looks interesting, and much of social networking is about annotation, and…).

My feeling is that there’s a default assumption in the analogue world that a document is not revisable, and an opposite assumption in the digital world.

One of the ways we deal with this when we worry about it, at least for documents designed to be read by humans, is to use PDF. We tend to think of PDF as a non-revisable format, although for those who pay for the tools it is perhaps more revisable than we think. PDF/A, I think, was designed to take out those elements that promote revisability in the documents.

If you are given a digital document and are asked to preserve it, the default assumption nearly always seems to kick in. People talk about preserving spreadsheets, worry about whether they can capture the formulae, or about preserving word processing documents, and worry about whether the field codes will be damaged. In some cases, this is entirely reasonable; in others it doesn’t matter a hoot.

When I read the InSPECT Framework for the definition of significant properties, I was delighted to find the FRBR model referenced, but disappointed that it was subsequently ignored. To my mind, this model is critical when thinking about preservation and significant properties in particular. In the FRBR object model, there are four levels of abstraction:
  • Work (the most abstract view of the intellectual creation)
  • Expression (a realisation of the work, perhaps a book or a film)
  • Manifestation (eg a particular edition of the book)
  • Item (eg a particular copy of the book; possibly less important in the digital world, given the triviality and transience of making copies).
The model does come from the world of cataloguing analogue objects, so it’s maybe not ideal, but it captures some important ideas. I’ll assert (without proof) that for many libraries the work is the thing; they are happy to have the latest edition, maybe the cheaper paperback (special collections are different, of course). Again I’ll assert (without proof) that for archives (with their orientation towards unique objects, such as particular documents with chain of custody etc), the four conceptual levels are more bound together.
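To make the four levels a little more concrete, here is a rough sketch of them as nested records. The class names, field names and the example report are all hypothetical, chosen only for illustration; nothing here is taken from FRBR itself or from the InSPECT framework. The one point it tries to show is that a property such as revisability naturally hangs off the manifestation rather than the work.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A particular copy, e.g. one file in one place (hypothetical field)."""
    location: str


@dataclass
class Manifestation:
    """A particular edition or encoding of an expression."""
    file_format: str
    revisable: bool  # assumed here to be a manifestation-level property
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """A realisation of the work."""
    label: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The most abstract view of the intellectual creation."""
    title: str
    expressions: List[Expression] = field(default_factory=list)


def revisable_formats(work: Work) -> List[str]:
    """List the formats of this work's manifestations that are revisable."""
    return [
        m.file_format
        for e in work.expressions
        for m in e.manifestations
        if m.revisable
    ]


# An invented example: one work, one expression, two manifestations.
report = Work(
    title="Annual report",
    expressions=[
        Expression(
            label="English text",
            manifestations=[
                Manifestation("doc", revisable=True, items=[Item("laptop")]),
                Manifestation("pdf", revisable=False, items=[Item("repository")]),
            ],
        )
    ],
)
```

Calling `revisable_formats(report)` picks out only the word processing manifestation; the work and the expression are the same whichever manifestation a repository decides to keep.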

Why is this digression important? Because many of the significant properties of digital objects are bound to the manifestation level. Preserving them is only important if the work demands it, or the nature of the repository demands it. Comparatively few digital objects have major significant properties at the work level. Some kinds of digital art would have, and maybe software does (I haven’t read the software significant properties report yet). If you focus on the object, you can get hung up on properties that you might not care about if you focus on the work.

Last time, I said the questions raised in my mind included:
  • what properties?
  • of which objects?
  • for whom?
  • for what purposes?
  • when?
  • and maybe where?
Here I’m suggesting that for many kinds of works, revisability is not an important significant property for most users and many repositories. That would mean that for those works, transforming them into non-revisable forms on ingest is perfectly valid, indeed might make much more sense than keeping them in revisable form. This isn’t at all what I would have thought a few years ago!

5 comments:

  1. I added some comments to a non-revisable (but annotatable) read-only snapshot of this web page using the online document annotation tool mentioned (the snapshot doesn't include this comment, of course...).

    I guess the danger to watch out for with doing any kind of file conversion at ingest (if you don't store the original as well) is that you might just lose some useful information and only find out about it later, e.g. if you just store PDF documents and not the original .doc / .tex file then you need to resort to heuristics (or even OCR) simply to extract the plain text. This page from a talk on OCRopus shows what happens when you try and do OCR on scientific papers...

  2. I agree that revisability is not always worthy of preservation, and in some contexts (such as records/archives) it's sensible to lose that property at ingest time. But, as both CEDARS and InSPECT view these things, I don't think that means that it is not an SP - just that, in many cases, it is one with a low score, and therefore we can afford to lose it without destroying the value of the object.

    In many cases, though, although I might be happy that the AIP itself is non-revisable, I might want a DIP which is revisable. Reuse and repurposing is one of the primary reasons for preservation, and it's much more difficult when you have frozen objects. (For instance, I suspect that almost any E-learning objects which are retrieved from repositories are altered in some way before they are used again.)

    To answer Fred Howell's point, I think almost all repositories that are doing conversion on ingest are also retaining original bitstreams in some way as an insurance policy against exactly the type of problem he is referring to. The ones that aren't doing that are brave.

  3. Chris: A new way to promote a good digital chain of custody is to authenticate records with a voice signature, which helps to show who collected the evidence, when it was collected, and that it has not changed since collection.

  4. A nit-pick. You said (with reference to the FRBR model):

    I’ll assert (without proof) that for many libraries the work is the thing;

    I think they care about expressions rather than works. I can't think of any library that would place equal value on Seamus Heaney's Beowulf, Robert Zemeckis's film of Beowulf (http://www.imdb.com/title/tt0442933/; nice to see a name-check for 'Anonymous' in the writing credits there for the original story :-) ) and Cotton MS Vitellius A.XV (http://www.bl.uk/onlinegallery/themes/englishlit/beowulf.html), although they are all the same work.

    But I would agree that most would not place great value on which manifestation of Seamus Heaney's expression they possessed. (Strangely, they care quite a lot about different manifestations of the manuscript expression, though, such as Thorkelin A and B.)

    It doesn't alter the fact that the FRBR model has a lot to tell us about what we preserve and why - I'm in full agreement with you there.

  5. ... er yes. I have to admit my understanding gets fuzzy at the "limits" of expressions, eg whether West Side Story is an expression of Romeo and Juliet, or a new work in its own right (both, maybe?). And I guess the people who really care about the details of different manifestations (and even items, if annotated by someone, or otherwise rare) tend to be Special Collections rather than libraries per se... But you're right, the point is to think about this carefully, and not simply accept the dogma (or the digital item) as given.


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.