Monday, 14 April 2008

Representation information from the planets?

Well, from the PLANETS project actually. A PLANETS report written by Adrian Brown of TNA on Representation Information Registries, drawn to our attention as part of the reading for the Significant Properties workshop, contains the best discussion of representation information I have seen yet (just in case, I checked the CASPAR web site, but couldn’t see anything better there). No doubt nearly all of the information is in the OAIS spec itself, but the spec is often hard to read, with discussion of key concepts scattered across different parts of it.

Just to recap, the OAIS formula is that a Data Object interpreted using its Representation Information yields an Information Object. Examples often cite specifications or standards, e.g. suggesting that the repinfo (I’ll use the contraction instead of “representation information”) for a PDF Data Object might be (or include) the PDF specification.
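
As a toy illustration of that formula (my own sketch, not anything taken from the OAIS spec): the names DataObject, RepInfo and interpret below are invented for the example, and the "repinfo" is nothing more than a character encoding, which is about the simplest structural repinfo I can think of.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataObject:
    """The raw bits, with no meaning attached."""
    data: bytes


@dataclass(frozen=True)
class RepInfo:
    """Hypothetical, minimal repinfo: just the name of a text encoding."""
    encoding: str


@dataclass(frozen=True)
class InformationObject:
    """What you get once the bits have been interpreted."""
    text: str


def interpret(obj: DataObject, repinfo: RepInfo) -> InformationObject:
    # Data Object + Representation Information -> Information Object
    return InformationObject(text=obj.data.decode(repinfo.encoding))


bits = DataObject(b"caf\xe9")
as_latin1 = interpret(bits, RepInfo("latin-1"))
as_cp437 = interpret(bits, RepInfo("cp437"))
```

The same four bytes give two different Information Objects depending on which repinfo is applied, which is really the whole argument for recording it.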

Sometimes there is controversy about repinfo versus format information (often described by the repinfo enthusiasts as “merely structural repinfo”). So it’s nice to read a sensible comparison:
"For the purposes of this paper, the definition of a format proposed by the Global Digital Format Registry will be used:

“A byte-wise serialization of an abstract information model”.

The GDFR format model extends this definition more rigorously, using the following conceptual entities:

• Information Model (IM) – a class of exchangeable knowledge.
• Semantic Model (SM) – a set of semantic information structures capable of realizing the meaning of the IM.
• Syntactic Model (CM) – a set of syntactic data units capable of expressing the SM.
• Serialized Byte Stream (SB) – a sequence of bytes capable of manifesting the CM.

This equates very closely with the OAIS model, as follows:

• Information Model (IM) = OAIS Information Object
• Semantic Model (SM) = OAIS Semantic representation information
• Syntactic Model (CM) = OAIS Syntactic representation information
• Serialized Byte Stream (SB) = OAIS Data Object"
This does seem to place repinfo and format information (by this richer definition) in the same class.
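
To make the four layers a bit more concrete, here is a small sketch of my own (it is not from the report or from GDFR). The weather reading, its field names, and the choice of JSON and UTF-8 are all assumptions picked purely for illustration.

```python
import json

# Semantic Model: a structure whose fields carry the meaning
# (hypothetical example data).
reading = {"station": "Kew", "temp_c": 11.5}

# Syntactic Model: the data units that express it, here JSON text.
syntactic = json.dumps(reading, sort_keys=True)

# Serialized Byte Stream: the bytes that would actually be stored.
byte_stream = syntactic.encode("utf-8")

# Going back up needs knowledge of each layer in turn:
# the encoding (bytes -> text), then the syntax (text -> structure).
recovered = json.loads(byte_stream.decode("utf-8"))
```

Note that even after both steps, nothing in the bytes says that temp_c means degrees Celsius; that is the semantic repinfo, and it has to be recorded somewhere else.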

Time for a short diversion here. I was quite taken by the report on significant properties of software, presented at the workshop by Brian Matthews (not that it was perfect, just that it was a damn good effort at what seemed to me to be an impossible task!). He talked about specifications, source code and binaries as forms of software. Roughly the cost of instantiating goes down as you move across those 3 (in a current environment, at least).

  • In preservation terms, if you only have a binary, you are pretty much limited to preserving the original technology or emulating it, but the result should perform “exactly” as the original.
  • If you have the source code, you will be able to (or have to) migrate, configure and re-build it. The result should perform pretty much like the original, with “small deviations”. (In practice, these deviations could be major, depending on what’s happened to libraries and other dependencies meanwhile.)
  • If you only have the spec, you have to re-write from scratch. This is clearly much slower and more expensive, and Brian suggests it will “perform only gross functionality”. I think in many cases it might be better than that, but in some cases much worse (e.g. some of the controversy about the Microsoft-based OOXML standard with MS internal dependencies).
So on that basis, a spec as repinfo is looking, well, not much help. In order for a Data Object to be “interpreted using” repinfo, the repinfo needs to be something that runs or performs; in Brian’s terms a binary, or at least software that works. The OAIS definitions of repinfo refer to 3 sub-types: structure, semantic and “other”, and the last of these is not well defined. However, Adrian Brown’s report explains there is a special type of “other”:
“…Access Software provides a means to interpret a Data Object. The software therefore acts as a substitute for part of the representation information network – a PDF viewer embodies knowledge of the PDF specification, and may be used to directly access a data object in PDF format.”
This seems to make sense; again, it’s in the OAIS spec, but hard to find. So Brown proposes that:
“…representation information be explicitly defined as encompassing either information which describes how to interpret a data object (such as a format specification), or a component of a technical environment which supports interpretation of that object (such as a software tool or hardware platform).”
Of course the software tool or hardware platform will itself have a shorter life than the descriptive information, so both may be required.
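
Here is how I picture that either/or definition, as a rough sketch only. The class names, the access function and the toy "viewer" are all my own inventions for illustration; nothing here comes from Brown's report or the OAIS spec.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class DescriptiveRepInfo:
    """Information describing how to interpret an object, e.g. a format spec."""
    title: str


@dataclass(frozen=True)
class EnvironmentComponent:
    """A piece of the technical environment, e.g. a viewer, that does the interpreting."""
    name: str
    render: Callable[[bytes], str]


RepInfo = Union[DescriptiveRepInfo, EnvironmentComponent]


def access(data: bytes, repinfo_set: list[RepInfo]) -> str:
    """Use a working tool if there is one; otherwise point at the descriptions."""
    for item in repinfo_set:
        if isinstance(item, EnvironmentComponent):
            return item.render(data)
    titles = [i.title for i in repinfo_set if isinstance(i, DescriptiveRepInfo)]
    raise LookupError(f"no working tool; re-implement from: {titles}")


spec = DescriptiveRepInfo("plain-text encoding note")
viewer = EnvironmentComponent("toy viewer", lambda b: b.decode("ascii"))
result = access(b"hello", [spec, viewer])
```

While the viewer still works, access is cheap. Once it has gone, all that is left is the description, and someone has to do the expensive re-write from it, which is why keeping both seems sensible.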

The bulk of the report, of course, is about representation information registries (including format registries by this definition), and is also well worth a read.

