Tuesday, 17 February 2009

Data Curation for Integrated Science: the 4th Rumsfeld Class!

I gave a talk today at a workshop of NERC (Natural Environment Research Council) data managers, on data curation for integrated science. Integrated science does seem to be a term that comes up a lot in the environmental sciences, for good reasons when contemplating global change. However, there doesn’t seem to be a good definition of integrated science on the web, so I had to come up with my own for the purposes of the talk: “The application of multiple scientific disciplines to one or more core scientific challenges”. The point of this was that scientists doing integrated science MUST be USING unfamiliar data. The implication for data managers, particularly environmental data managers, is that they must make their data available to unfamiliar users. What does this imply?

By some strange mental processes, and a fortuitous Google search, this led me to the Poetry & Philosophy of Donald H Rumsfeld, as exposed to the world by Hart Seely, initially in Slate in April 2003 and now collected in his book “Pieces of Intelligence”. These poems are well worth a look for their own particular delight, but the one I was looking for, which you will probably have heard in various guises, is The Unknown:
‘As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.’
—Feb. 12, 2002, Department of Defense news briefing
Now this insightful (!) poem (set to music by Bryant Kong, available at http://www.stuffedpenguin.com/rumsfeld/lyrics.htm) perhaps defines 3 epistemological classes:
Known knowns,
Known unknowns, and
Unknown unknowns
Logically there should be a 4th Rumsfeld class: the unknown knowns. And I think this class is especially important for data management for unfamiliar users.

The problem is that in many research projects there are too many people who “know too much”; with so much shared knowledge, much goes undocumented. In OAIS terms, we are looking here at a small, tight Designated Community with a shared Knowledge Base, and consequently little need for Representation Information. In integrated science, and particularly in the environmental sciences, as the Community broadens and time passes, the need for Representation Information effectively increases. I’m using the terms very broadly here, and RepInfo can be interpreted in many different ways. But the requirement is to make explicit the tacit knowledge, the unknown knowns implicit in the data creation and acquisition process.
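To make that concrete, here is a toy sketch in Python (the knowledge items and names are mine, invented purely for illustration): the RepInfo an archive must capture explicitly is roughly the knowledge needed to use the data minus the Designated Community’s shared Knowledge Base.

    def required_repinfo(knowledge_needed, knowledge_base):
        # Knowledge the archive must document explicitly as RepInfo.
        return knowledge_needed - knowledge_base

    data_knowledge = {"file format", "variable codings", "instrument calibration",
                      "field campaign conventions"}

    # A tight project team shares almost everything: little RepInfo needed.
    print(required_repinfo(data_knowledge, set(data_knowledge)))  # set()

    # An unfamiliar user shares much less: the unknown knowns must be
    # made explicit as RepInfo.
    print(required_repinfo(data_knowledge, {"file format"}))

As the Community broadens, the set difference grows, and with it the RepInfo burden.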

Interestingly, subsequent speakers seemed to pick up on the idea of making explicit the unknown knowns, so maybe the 4th Rumsfeld Class is here to stay in NERC!

Tuesday, 6 January 2009

Specifications again

The previous post was a summary with relatively little comment from me. I really liked David Rosenthal's related blog post, but I feel I do need to make some comments. I'm not sure this isn't yet another case of furiously agreeing!

Near the end of his post, after an extensive argument based partly on his own experience of implementing from specifications in a "clean-room" environment, and after a set of postulated explanations of why a specification might be useful (focusing on its potential use to write renderers), David writes the statement that makes me most uneasy:
"It seems clear that preserving the specification for a format is unlikely to have any practical impact on the preservation of documents in that format."
The suggested scenarios re missing renderers are:
"1. None was ever written because no-one in the Open Source community thought the format worth writing a renderer for...
2. None was ever written because the owner of the format never released adequate specifications, or used DRM techniques to prevent, third-party renderers being written...
3. A open source renderer was written, but didn't work well enough because the released specifications weren't adequate, or because DRM techniques could not be sufficiently evaded or broken...
4. An open source renderer was written but didn't work well enough because the open source community lacked programmers good enough to do the job given the specifications and access to working renderers...
5. An open source renderer was written but in the interim was lost...
6. An adequate open source renderer was written, but in the interim stopped working..."
Read David's post for the detail of his arguments. However, I'd just like to suggest a few reasons why preserving specifications might be useful:

- First, if the specification is available, it is (comparatively) extraordinarily cheap to keep. If it makes even a tiny difference to those implementing renderers (including open source renderers), it will have been worthwhile.
- Second, David's argument glosses over the highly variable value of information encoded in these formats. A digital object is (roughly) encrypted information; if no renderer exists but the encrypted information is extremely valuable for some particular purpose, the specification might be considered as a key to enable some information to be extracted.
- Thirdly, David's argument assumes, I think, quite complex formats. Many science data formats are comparatively simple, but may be currently accessed with proprietary software. Having the specification in those cases may well prove useful (OK, I don't have evidence for this as yet, I'll work on it!).
- Fourth, older formats tend to be simpler, and it would be good to have the specifications in some cases, even to help create open source renderers (is that a re-statement of the first? Maybe).

So here's an example to illustrate the last point. I have commented elsewhere that the only files on the disk of the Mac I use to write this that are inaccessible to me are PowerPoint (version 4.0) files created in the 1990s on an earlier Mac.

I noted a comment from David:

"In my, admittedly limited, experience Open Office often does better than the current Microsoft Office at rendering really old documents."

Great, I thought; perhaps Open Office can render my old PowerPoints! And even better, there's now a native implementation of Open Office 3.0 for the Mac. So let's install it (and not talk about how hard it was to persuade it to give back control of my MS Office documents to the original software!). Does it open my errant files? No!

So I would like someone to instigate a legacy documents project in Open Office, implementing support for as many as possible of the important legacy office document file formats. I think that would be a major contribution to long term preservation. Would it be simplified by having specifications available? Surely, surely it must be! In fact David admits as much:
"Effort should be devoted instead to using the specifications, and the access to a working renderer, to create an open source renderer now."
Well, you surely can't use specifications unless they are accessible and have been preserved...

However, I must stress that I agree with what I take to be David's significant point, re-stated here as: the best Representation Information supporting preservation of information encoded in document formats is Open Source software. So "national libraries should consider collecting and preserving open source repositories". Yes!

Email discussion on the usefulness of file format specifications

This is a summary of an email exchange on the DCC Associates email list over a few days in late November and early December. I thought it was revealing of attitudes to preservation formats and to representation information (in the form of both specifications and running code), so I’ve summarised it here. Email lists are great for promoting discussion, but threads tend to fracture off in various directions, so a summary can be useful. Quotes are reproduced with permission; my thanks to all those involved.

Steve Rankin from the DCC down in Rutherford Labs noticed and drew the list’s attention to the Microsoft pages relating to their binary formats, made available under a so-called “Microsoft Open Specification Promise”.

http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
and http://www.microsoft.com/interop/osp/default.mspx

Chris Puttick of Oxford Archaeology pointed out that the pages had been up for a while (since February 2008 at least). He made a couple of interesting points:
“I have it on excellent authority that the specifications are useful but incomplete […]; secondly that as is this not the first time MS have published such information only to take it down again later [so] anyone interested in them should download them as soon as possible. I have on slightly less excellent authority that a ‘promise’ as encased in the [Open Specification Promise] is specifically something in US law and may not have any validity outside of the US.”
Kevin Ashley from ULCC/NDAD agreed:
“It's my understanding - from those who have tried - that earlier specs that MS published failed exactly that test. It wasn't possible to use them to write software that dealt with all syntactic and semantic variations.

“It's a fairly fundamental test for network protocols that one can […] get two separate implementations to communicate with each other. The same is true of file formats, to my mind, and one can see the creating application and the reading application as equivalent to the two ends of a network connection, albeit not necessarily in real time.”
David Rosenthal from Stanford and LOCKSS injected some engineering reality from direct experience into the discussion. He has already released a longer blog post based on the discussion and his contribution; effectively he seemed to be aiming to demolish the argument for keeping specifications at all.
“Speaking as someone who has helped implement PostScript from the specifications, I can assure you that published specifications are always incomplete. There is no possibility of specifying formats as complex as CAD or Word so carefully that a clean-room implementation will be perfect. Indeed, there are always minor incompatibilities (sometimes called enhancements, and sometimes called bugs) between different versions of the same code. And there is no possibility of digital preservation efforts being able to afford the investment to do high-quality clean-room implementations of these complex formats. Look at the investment represented by the Open Office suite, for example.

“On the other hand, note that Open Office and other open source office suites in practice do an excellent job of rendering MS formats, and their code thus represents a very high quality specification for these formats. Code is the best representation for preservation metadata.”
Colin Neilson from DCC SCARP wondered what the implications of incomplete specifications were for the concept of Representation Information in OAIS (RepInfo is often associated in examples with specifications).

He wrote:
“I am interested in implications for areas (such as CAD software) where proprietary (secret sauce) formats are historically the norm. Is the legacy of digital working always preservable within an OAIS framework? […] Are there some limits in using an OAIS model if some "specifications" are inadequate or information is not available?”
and in a later message
“Do we need to have "access software" preserved (long term) if the other representation information is less complete in the case where standards for proprietary file formats (say like Microsoft word DOC format) are to a degree incomplete, less adequate or not available (perhaps more so in the case of older versions of file formats)?”
Personally I think one of the advantages of Open Office is that it is not just Access Software, but Open Source Access Software. This should give it much greater longevity. But of course, such alternatives don’t exist in many areas, including many of the CAD formats Colin is concerned about.

Alan Morris from Morris and Ward asked the obvious question:
“Who would even consider utilizing WORD as a preservation format?”
… and got a surprising answer from Peter Murray-Rust of the eponymous Cambridge research group!
“I would, and I argued this in my plenary lecture at OpenRepositories08. Not surprisingly it generated considerable discussion, from both sides.

“First the disclaimer. I receive research funding (though not personal funding) from Microsoft Research. Some of you may wish to stop reading now! But I don't think it colours my judgment.

“My argument was not that Word2007 should be the only format, but that it should be used in conjunction with formats such as PDF. We have a considerable amount of work on [depositing] born-digital theses and we have recommended that theses should be captured in their original format (OOXML, ODT, LaTeX, etc.) as well as the PDF.

“I am a scientist (chemist) but generally interested in all forms of STM data (for example we collaborated in part of the KIM project mentioned a few emails ago). If you believe that preservation only applied to the holy "fulltext", stop reading now. However I think many readers would agree that much of the essential information in STM work (experiments, data, protocols, code, etc.) is lost in the process of publication and reposition. Very frequently, however, the original born-digital work contains semantic information which can be retrieved. For example OOXML and ODT allow nearly 100% of chemical information (molecular structures) to be retrieved (in certain circumstances), whereas PDF allows 0% by default. (It is possible, though extremely difficult and extremely lossy, to turn PDF primitives back into chemistry)

“Note that we also work on Open Office documents and have a JISC-sponsored collaboration with Peter Sefton [of the Australian Digital Futures Institute of USQ in Australia] on his excellent ICE system. We are exploring how easy it is to author chemistry directly into an ODT document and by implication into any compound semantic document (note that XML is the only practical way of holding semantics). […]”

“We've looked into using PDF for archiving chemistry and found that current usage makes this almost impossible. So we work with imperfect material.

“Note that Word2007 can emit OOXML that can be interpreted with Open Source tools. The conversion is not 100%, but whatever is? […]”

“I wonder whether all the detractors of OOXML have looked at it in detail. Yes, it is probably impossible to recreate all the minutiae of typesetting, but it preserves much of the embedded information that less semantic formats (PDF and even LaTeX) do not. If I have no commercial software and someone gives me a PDF of chemistry and someone else gives me OOXML I'd choose the OOXML. HTML is, in many cases, a better format than PDF.

“So my suggestion is simple. Use more than one document format. After all do we really know what future generations want from the preservation process. It costs almost nothing as we are going to have to address compound documents and packaging anyway.”
An anonymous contributor suggested that the appropriate course was to structure AIPs to contain both the original source format and the preservation format. In the future, he asserted, better tools may exist to take the original source format and render a more completely accessible preservation format, particularly bearing in mind scientific notation.

Finally, Geoffrey Brown from Indiana also argued in favour of keeping the original (and against NARA policy):
“The Bush administration as well as various companies managed to embarrass themselves with inadvertently leaked information in the form of edit histories in word documents. Migration will likely (who knows ?) discard such information unless special care is taken in developing migration tools.

“I am uncomfortable with the assumption that we can abandon the original documents (as NARA seems to be doing by requiring(?) agencies to submit documents in PDF). The edit histories are part of the historical record; however, it's safe to say that most patrons will be satisfied with the migrated document.

“Digital repositories have an obligation to figure out how to preserve access to documents in any format and not use format as a gatekeeper.”
So… running code is better than specs as representation information, and Open Source running code is better than proprietary running code. And, even if you migrate on ingest, keep BOTH the rich format and a desiccated format (like PDF/A). It won’t cost you much and may win you some silent thanks from your eventual users!
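As a toy illustration of that last recommendation (a hypothetical layout in Python, not any real repository's schema), an AIP might carry both manifestations side by side, with pointers to the relevant Representation Information:

    # Hypothetical AIP manifest: the rich original AND a desiccated
    # derivative, each with its format named, plus RepInfo pointers.
    aip = {
        "identifier": "aip:2009-0042",
        "datastreams": [
            {"role": "original",   "file": "thesis.docx", "format": "OOXML"},
            {"role": "desiccated", "file": "thesis.pdf",  "format": "PDF/A-1"},
        ],
        "representation_information": [
            "ECMA-376 (OOXML) specification",
            "ISO 19005-1 (PDF/A-1) specification",
        ],
        "provenance": "PDF/A derived from the OOXML original at ingest",
    }

    for ds in aip["datastreams"]:
        print(ds["role"], "->", ds["file"], "(", ds["format"], ")")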

Wednesday, 5 November 2008

Some interesting posts elsewhere

I’m sorry for the gap in posting; I’ve been taking a couple of weeks of leave at the end of my trip to Australia. Since return I’ve been catching up on my blog reading, and there are some interesting posts around.

A couple of people (Robin Rice and Jim Downing in particular) have mentioned the post Modelling and storing a phonetics database inside a store, from the Less Talk, More Code blog (Ben O'Steen). This is a practical report on the steps Ben took to put a database into a Fedora Commons-based repository. He details the analysis he went through, the mappings he made, his approaches to capturing representation information and to making the data citable at different levels of granularity, and an interesting approach he calls “curation by addition”, which appears to be a way of curating the data incrementally, capturing provenance information for all the changes made. It’s a great report, and I look forward to more practical reports of this nature.

Quite a different post, The Triumvirate of Scientific Data on the peanubutter blog (whose author might be Frank Gibson), discusses ideas that he suggests relate to the significant properties of science data. His triumvirate comprises
"content, syntax, and semantics, or more simply put -What do we want to say? How do we say it? What does it all mean?"
Oddly, the discussion associated with this blog post is on FriendFeed rather than on the blog itself. Very interesting to see the discussion recorded like that, and in the process to see at least one sceptic become more convinced!

To me, there seemed to be strong resonances between his argument and some of the OAIS concepts, particularly Representation Information. However, content, syntax and semantics might be a more approachable set of labels than RepInfo!

Monday, 14 April 2008

Representation information from the planets?

Well, from the PLANETS project actually. A PLANETS report written by Adrian Brown of TNA on Representation Information Registries, drawn to our attention as part of the reading for the Significant Properties workshop, contains the best discussion of representation information I have seen yet (just in case, I checked the CASPAR web site, but couldn’t see anything better there). No doubt nearly all of the information is in the OAIS spec itself, but that is often hard to read, with discussion of key concepts scattered across different parts of the spec.

Just to recap, the OAIS formula is that a Data Object interpreted using its Representation Information yields an Information Object. Examples often cite specifications or standards, eg suggesting that the Repinfo (I’ll use the contraction instead of “representation information”) for a PDF Data Object might be (or include) the PDF specification.

Sometimes there is controversy about repinfo versus format information (often described by the repinfo enthusiasts as “merely structural repinfo”). So it’s nice to read a sensible comparison:
"For the purposes of this paper, the definition of a format proposed by the Global Digital Format Registry will be used:

“A byte-wise serialization of an abstract information model”.

The GDFR format model extends this definition more rigorously, using the following conceptual entities:

• Information Model (IM) – a class of exchangeable knowledge.
• Semantic Model (SM) – a set of semantic information structures capable of realizing the meaning of the IM.
• Syntactic Model (CM) – a set of syntactic data units capable of expressing the SM.
• Serialized Byte Stream (SB) – a sequence of bytes capable of manifesting the CM.

This equates very closely with the OAIS model, as follows:

• Information Model (IM) = OAIS Information Object
• Semantic Model (SM) = OAIS Semantic representation information
• Syntactic Model (CM) = OAIS Syntactic representation information
• Serialized Byte Stream (SB) = OAIS Data Object"
This does seem to place repinfo and format information (by this richer definition) in the same class.

Time for a short diversion here. I was quite taken by the report on significant properties of software, presented at the workshop by Brian Matthews (not that it was perfect, just that it was a damn good effort at what seemed to me to be an impossible task!). He talked about specifications, source code and binaries as forms of software. Roughly, the cost of instantiation goes down as you move across those three (in a current environment, at least).

  • In preservation terms, if you only have a binary, you are pretty much limited to preserving the original technology or emulating it, but the result should perform “exactly” as the original.
  • If you have the source code, you will be able to (or have to) migrate, configure and re-build it. The result should perform pretty much like the original, with “small deviations”. (In practice, these deviations could be major, depending on what’s happened to libraries and other dependencies meanwhile.)
  • If you only have the spec, you have to re-write from scratch. This is clearly much slower and more expensive, and Brian suggests it will “perform only gross functionality”. I think in many cases it might be better than that, but in some cases much worse (eg some of the controversy about the Microsoft-based OOXML standard with MS internal dependencies).
So on that basis, a spec as Repinfo is looking, well, not much help. In order for a Data Object to be “interpreted using” repinfo, the latter needs to be something that runs or performs; in Brian’s terms a binary, or at least software that works. The OAIS definitions of repinfo refer to three sub-types: structure, semantic and “other”, and the latter is not well defined. However, Adrian Brown’s report explains there is a special type of “other”:
“…Access Software provides a means to interpret a Data Object. The software therefore acts as a substitute for part of the representation information network – a PDF viewer embodies knowledge of the PDF specification, and may be used to directly access a data object in PDF format.”
This seems to make sense; again, it’s in the OAIS spec, but hard to find. So Brown proposes that:
“…representation information be explicitly defined as encompassing either information which describes how to interpret a data object (such as a format specification), or a component of a technical environment which supports interpretation of that object (such as a software tool or hardware platform).”
Of course the software tool or hardware platform will itself have a shorter life than the descriptive information, so both may be required.

The bulk of the report, of course, is about representation information registries (including format registries by this definition), and is also well worth a read.

Thursday, 20 March 2008

Legacy document formats

On the O'Reilly XML blog, which I always read with interest (particularly in relation to the shenanigans over OOXML and ODF standardisation), Rick Jelliffe writes An Open Letter to Microsoft, IBM/Lotus, Corel and others on Lodging Old File Formats with ISO. He points out that
"Corporations who were market leaders in the 1980s and 1990s for PC applications have a responsibility to make sure that documentation on their old formats are not lost. Especially for document formats before 1990, the benefits of the format as some kind of IP-embodying revenue generator will have lapsed now in 2008. However the responsibility for archiving remains.

"So I call on companies in this situation, in particular Microsoft, IBM/Lotus, Corel, Computer Associates, Fujitsu, Philips, as well as the current owners of past names such as Wang, and so on, to submit your legacy binary format documentation for documents (particularly home and office documents) and media, to ISO/IEC JTC1 for acceptance as Technical Specifications.[...] Handing over the documentation to ISO care can shift the responsibility for archiving and making available old documentation from individual companies, provide good public relations, and allow old projects to be tidied up and closed."
This is in principle a Good Idea. However, ISO documents are not Open Access; the specifications Rick refers to would benefit greatly from being Open. They would form vitally important parts of our effort to preserve digital documents. Instead of being deposited with ISO, they should be regarded as part of the Representation Information for those file types, and deposited in a variety (more than one, for safety's sake) of services such as PRONOM at The National Archives in the UK, the proposed Harvard/Mellon Global Digital Format Registry, the Library of Congress Digital Preservation activity or the DCC's own Registry/Repository of Representation Information.

Tuesday, 24 July 2007

Question on approaches to curating textual material

Dave Thompson, Digital Curator at the Wellcome Library, asked a question on the Digital Preservation list (which is not well set up for discussion just now). I've replied, but we agreed I would adapt my reply for the blog for any further discussion that might emerge.
"I'm looking for arguments for and against when, and if, digital material should be normalised. I'm thinking about the long term management of textual material in proprietary formats such as MS Word. I see three basic approaches on which I'm seeking the lists comments and thoughts.

The first approach normalises textual material at the point of ingestion, converting all incoming material to a neutral format such as XML immediately. This would create an open format manifestation with the aim of long term sustainable management.

The second approach would be one of 'wait and see', characterised by recognising that if a particular format isn't immediately 'at risk' of obsolescence why touch it until some form of migration becomes necessary at some future point.

The third approach preserves the bitstream as acquired and delivers it in an unmodified form upon request, ie MS Word in – MS Word out.

The first approach requires tools, resources and investment immediately. The second requires these same resources, and possibly more, in the future. The future requirements for the third approach are perhaps unknown aside from that of adequate technical metadata.

I'm interested in ideas about the sustainability of these approaches, the costs of one approach over the other and the perceived risks of moving material to an open format sooner rather than later. I'd be very interested in examples of projects which have taken either approach."
Dave, the questions you ask have been rumbling on for years. The answers, reasonably enough, keep changing, partly depending on who asks and who answers, but also depending on the time and the context. So that's a lot of help, isn't it?

You might want to look at a posting on David Rosenthal's blog, Format Obsolescence as the Prostate Cancer of Preservation (for younger and/or non-male curators, the reference is that many more men die WITH prostate cancer than die because of it). Lots of food for thought there, and some of the same themes I was addressing in my Ariadne article a year or so ago.

The simplest answer to your question is "it depends". If you've got lots of money, and given the state of flux right now in the word processing market, I would suggest doing both (1) and (3): that is, make sure you preserve your ingested bits unchanged, but also create a "normalised" copy in your favourite open format.

What format should that be? Well, for Word at the moment it might be sticky. PDF (strictly PDF/A, if we're into preservation) might be appropriate. However, as far as ever extracting useful science from the document is concerned, the PDF is a hamburger (as Peter Murray-Rust says; he reports Mike Kay as the origin: "Converting PDF to XML is a bit like converting hamburgers into cow"). PDF is useful where you want to treat something exactly as page images; it is probably much less useful for documents like spreadsheets (where the formulae are important).

Open Document Format is an international standard (ISO/IEC 26300:2006) supported by Open Source code with a substantial user and developer base, so its long term sustainability should be pretty strong. I've heard that there can be glitches in the conversions, but I have no experience (the Mac does not seem to be quite so well served). Office Open XML has been ratified by ECMA, and is moving (haltingly?) towards an ISO standard. Presumably its conversion process will be excellent, but I don't know of much open source code base yet. However the user base is enormous, and MS seems to be getting some messages from its users about sustainability. Nah, right now I would guess ODF wins for preservation.

It may not apply in this case, but often there is a trade-off between the extent of the work you do to ensure preservation (and the complexity and cost of that work), and the amount of stuff you can preserve. Your budget is finite, right? You can't spend the money twice. So if you over-engineer your preservation process you will preserve less stuff. The longevity of the stuff in AHDS, it turns out, was affected much more by a policy change than by any of the excellent work they did preserving it. You need to do a risk analysis to work out what to do (which is not quite the same as a crystal ball; few would have seen the AHRC policy change coming!).

It's also probably true that half or more of the stuff you preserve will not be accessed for a very long time, if ever. Trouble is (as the captains of industry are reported to say about the usefulness of their marketing budgets, or librarians about their acquisitions) you don't know in advance which half.

Greg Janee of the NDIIPP NGDA project gave a presentation at the DCC (PPT) a couple of years ago, in which he introduced Greg's equation:
An item is worth preserving for time duration T if:
(intrinsic value) × Prob_T(usage) > Sum_T(preservation costs) + (cost to use)
... ie given a low probability of usage within time T, preservation has to be very cheap!
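Spelled out in a few lines of Python (all the numbers below are invented purely for illustration):

    def worth_preserving(intrinsic_value, prob_usage, preservation_costs, cost_to_use):
        # Greg's equation: the value is only realised if the item is used,
        # so discount it by the probability of use within time T.
        return intrinsic_value * prob_usage > preservation_costs + cost_to_use

    print(worth_preserving(10000, 0.01, 50, 20))   # True:  100.0 > 70
    print(worth_preserving(10000, 0.001, 50, 20))  # False:  10.0 < 70

Drop the probability of use by a factor of ten and the affordable preservation cost drops with it.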

What I'm arguing for is not putting too much of the cost onto ingest, but leaving as much as reasonable to the eventual end user. After all, YOU pay the ingest cost. Strangely, in a way, so does the potential end user whose stuff was not preserved because you spent too much on ingest. You do need to do enough to make sure that end use is feasible, and indeed appropriate in relation to comparator archives (you don't want to be the least-used archive in the world). You must also include, in some sense or other, the Representation Information that makes end use possible.

But you don't have to constantly migrate your content to current formats to make it point-and-click available; in fact it may be a disservice to your users to do so. Migration on request has always seemed to me a sensible approach (I think it was first demonstrated by Mellor, Wheatley & Sergeant (Mellor 2002 *) from the CAMiLEON project, building on earlier work in the CEDARS project, but it has also been demonstrated by LOCKSS). This is pretty much your second approach; you just have to ensure you retain a tool, able to migrate the information, that will run in a current (future) environment. Unless you have control of the tool, this might suddenly get hard (when the tool vendor drops support for older formats).
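A minimal sketch of the idea in Python (the function and format names are mine, hypothetical; a real implementation would wrap a maintained converter, perhaps an OpenOffice.org input filter):

    def deliver(stored_bytes, stored_format, requested_format, converters):
        # Option 3: hand back the unchanged bitstream if that is what was asked for.
        if stored_format == requested_format:
            return stored_bytes
        # Option 2: migrate only now, at the moment of request.
        migrate = converters.get((stored_format, requested_format))
        if migrate is None:
            raise LookupError("no migration path from %s to %s"
                              % (stored_format, requested_format))
        return migrate(stored_bytes)

    # The archive's real asset is this table of maintained converters; the
    # risk is a vendor dropping support for the old format.
    converters = {("PowerPoint 4.0", "PDF"): lambda bits: bits}  # placeholder converter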

I've often thought, for this sort of file type, that something like the OpenOffice.org suite might be the right base for a migration tool. After all, someone's already written the output stage and will keep it up to date. And many input filters have also already been written. If you're missing one, then form a community and write it; presto, the world has support for another defunct word processor format (yeah I know, it's not quite that easy!).

I was going to argue against your option 3 (although it's what most repositories do just now). But I think I've talked myself round to it being a reasonable possibility. I would add a watching brief, though: you might decide at some point that the stuff was getting too high risk, and that some kind of migration tool should be provided (in which case you're back to option 2, really).

I get annoyed when I hear people say (what I probably also used to say) that institutional repositories are not for preservation. It's like Not-for-Profit companies; they may not be for profit, but they'd better not be for loss (I used to be on the Board of two). Repositories are not for loss. They keep stuff. Cheaply. And to date, as far as I can see, quite as well as expensive preservation services!

* MELLOR, P., WHEATLEY, P. & SERGEANT, D. (2002) Migration on Request, a Practical Technique for Preservation. Research and Advanced Technology for Digital Libraries: 6th European Conference, ECDL 2002, Rome, Italy, September 16-18, 2002, Proceedings.

Friday, 6 July 2007

Representation Information: what is it and why is it important?

Representation Information is a key and often misunderstood concept. To understand it, we need to look at some definitions. First of all, OAIS (CCSDS 2002) defines data thus:
“Data: A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing.”
Second, we have Information:

“Information: Any type of knowledge that can be exchanged. In an exchange, it is represented by data. An example is a string of bits (the data) accompanied by a description of how to interpret a string of bits as numbers representing temperature observations measured in degrees Celsius (the representation information).”
Then we have Representation Information (sometimes abbreviated as RI):
“Representation Information: The information that maps a Data Object into more meaningful concepts. An example is the ASCII definition that describes how a sequence of bits (i.e., a Data Object) is mapped into a symbol.”
As an example, we have this paragraph:
"Information is defined as any type of knowledge that can be exchanged, and this information is always expressed (i.e., represented) by some type of data. For example, the information in a hardcopy book is typically expressed by the observable characters (the data) which, when they are combined with a knowledge of the language used (the Knowledge Base), are converted to more meaningful information. If the recipient does not already include English in its Knowledge Base, then the English text (the data) needs to be accompanied by English dictionary and grammar information (i.e., Representation Information) in a form that is understandable using the recipient’s Knowledge Base.”
The summary is that “Data interpreted using its Representation Information yields Information”.
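Since that formula is easy to lose in the spec's prose, here it is as a tiny Python sketch, using the ASCII example from the definitions above (the function name is mine, not OAIS's):

    def interpret(data_object, representation_information):
        # "Data interpreted using its Representation Information yields Information"
        return representation_information(data_object)

    # The spec's own ASCII example: RepInfo maps a sequence of bits to symbols.
    ascii_repinfo = lambda bits: bits.decode("ascii")
    print(interpret(b"21.5 degrees Celsius", ascii_repinfo))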

Now we have a key complication:
“Since a key purpose of an OAIS is to preserve information for a Designated Community, the OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained. The OAIS should then make a decision between maintaining the minimum Representation Information needed for its Designated Community, or maintaining a larger amount of Representation Information that may allow understanding by a larger Consumer community with a less specialized Knowledge Base. Over time, evolution of the Designated Community’s Knowledge Base may require updates to the Representation Information to ensure continued understanding.“
So now we need another couple of definitions:
“Designated Community: An identified group of potential Consumers who should be able to understand a particular set of information. The Designated Community may be composed of multiple user communities.”
“Knowledge Base: A set of information, incorporated by a person or system, that allows that person or system to understand received information.”
So there are several interesting things here. The first is that this obviously enshrines a particular understanding of information, one I couldn’t find in Wikipedia when I last looked (here is the article at that time; maybe it will be there next time!). Floridi suggests there is no commonly accepted definition of information, that it is polysemantic, and he particularly contrasts information in Shannon’s Mathematical Theory of Communication with the “Standard Definition of Information” (Floridi, 2005). If I understand it rightly, the latter refers to factual information (with some controversy on whether it need be true), but not necessarily to instructional information (“how”).

Secondly, the introduction of the Designated Community and its Knowledge Base may be both helpful and problematic. It may be helpful because it can reduce the amount of Representation Information needed to interpret data (or even eliminate it completely): if the Designated Community is defined as having a Knowledge Base that allows it to understand the data, then nothing more is required. This is obviously never entirely true, and in practice even with a Designated Community that is quite strongly familiar with the data, we will expect to need some RI, perhaps to identify the particular meaning of some variables, etc.

The problematic nature arises because we now have two external concepts, the Designated Community and its Knowledge Base that influence what we must create, and which will change and must be monitored. I’ve heard the words “precise definition” used in the context of these two terms, but I am sceptical anyone can define either precisely (although the LOCKSS Statement of Conformance with OAIS has a minimalistic go; it's the only public one I could find, but I would love to see more). My colleague David Giaretta suggests that his huge project CASPAR aims to produce better definitions.

In fact, although they may be useful ideas, both the Designated Community and its Knowledge Base seem to be quite worrying terms. The best we can say is that “chemists” (for example) understand “chemical concepts”, and that the latter have proved pretty stable, at least in basic forms. But the community of chemists turns out to include a myriad of sub-disciplines, with their own subtleties of terminology, and not surprisingly introducing new concepts and abandoning old ones all the time. If we have some chemical data in our repository, we have to watch out for these concepts going from current through obsolescent and obsolete to arcane, and in theory we have to add RI at each change, to make up for the increasing gap in understanding.

The third interesting feature is that these definitions say nothing about files or file formats at all, yet “format registries” are the most common response to meeting the need for RI. TNA’s PRONOM and the Harvard/OCLC Global Digital Format Registry (GDFR) are the two best-known examples.

Clearly files and file formats play a critical role in digital preservation. Sometimes I think this has occurred because the roots of much of digital preservation (although not OAIS) lie in the library and cultural heritage communities, dominated as they are by complex proprietary file formats like Microsoft Word. In science, formats are probably much simpler overall, but other aspects may be more critical to “understanding” (ie using in a computation) the data.

The best example I know to illustrate the difference between file format information and RI is to imagine a social science survey dataset encoded with SPSS. We may have all the capabilities required to interpret SPSS files, but still not be able to make sense of the dataset if we do not know the meaning of the variables, or do not have access to the original questionnaires. Both the latter would qualify as RI. Database schemas may provide another example of RI.
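A toy Python version of that SPSS point (the variable names and codings are invented): knowing the format gets you the values, but only the codebook, which is RepInfo, gets you the meaning.

    # Format knowledge (the SPSS structure) has already yielded raw values:
    record = {"V1": 2, "V2": 34}

    # Without this codebook - which is RepInfo, not format information -
    # V1 = 2 means nothing.
    codebook = {
        "V1": ("employment status", {1: "employed", 2: "unemployed"}),
        "V2": ("age in years", None),
    }

    for var, value in record.items():
        label, coding = codebook[var]
        print(var, label, coding[value] if coding else value)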

Have I shown why or how RI is a useful concept in digital curation? I'm not sure, but at least there's a start. Representation Information, as David Giaretta sometimes says, is useful for interpreting unfamiliar data!

In later posts, I’m going to try to include some specific examples of RI that relates to science data. I also intend to try to justify more strongly the role of RI in curation rather than preservation, ie through life rather than just at the end of it!


* CCSDS (2002) Reference Model for an Open Archival Information System (OAIS). CCSDS 650.0-B-1. Washington, DC: CCSDS/NASA.

* FLORIDI, L. (2005) Is Semantic Information Meaningful Data? Philosophy and Phenomenological Research, 70, 351-370. http://www.ingentaconnect.com/content/ips/ppr/2005/00000070/00000002/art00004