Representation Information: what is it and why is it important?

Representation Information is a key and often misunderstood concept. To understand it, we need to look at some definitions. First of all, OAIS (CCSDS 2002) defines data thus:
“Data: A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing.”
Second, we have Information:

“Information: Any type of knowledge that can be exchanged. In an exchange, it is represented by data. An example is a string of bits (the data) accompanied by a description of how to interpret a string of bits as numbers representing temperature observations measured in degrees Celsius (the representation information).”
Then we have Representation Information (sometimes abbreviated as RI):
“Representation Information: The information that maps a Data Object into more meaningful concepts. An example is the ASCII definition that describes how a sequence of bits (i.e., a Data Object) is mapped into a symbol.”
As an example, we have this paragraph:
“Information is defined as any type of knowledge that can be exchanged, and this information is always expressed (i.e., represented) by some type of data. For example, the information in a hardcopy book is typically expressed by the observable characters (the data) which, when they are combined with a knowledge of the language used (the Knowledge Base), are converted to more meaningful information. If the recipient does not already include English in its Knowledge Base, then the English text (the data) needs to be accompanied by English dictionary and grammar information (i.e., Representation Information) in a form that is understandable using the recipient’s Knowledge Base.”
The summary is that “Data interpreted using its Representation Information yields Information”.
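To make that summary concrete, here is a minimal Python sketch of the temperature example from the OAIS definition of Information. The byte layout (big-endian 16-bit signed integers holding tenths of a degree Celsius) and the names `RepresentationInfo` and `interpret` are assumptions made purely for illustration; OAIS prescribes no such structure.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class RepresentationInfo:
    """Hypothetical RI: how to turn raw bytes into temperature observations."""
    struct_format: str  # struct code for one value, e.g. ">h" = big-endian int16
    scale: float        # multiply the stored integer by this to get the value
    unit: str           # what the resulting number means


def interpret(data: bytes, ri: RepresentationInfo) -> list[str]:
    """Data interpreted using its Representation Information yields Information."""
    values = [v[0] * ri.scale for v in struct.iter_unpack(ri.struct_format, data)]
    return [f"{value:.1f} {ri.unit}" for value in values]


# The Data Object: four bytes that mean nothing on their own.
data = bytes([0x00, 0xD7, 0xFF, 0xE7])

# The (assumed) Representation Information for those bytes.
ri = RepresentationInfo(struct_format=">h", scale=0.1, unit="degrees Celsius")

observations = interpret(data, ri)
print(observations)
```

Reading the same four bytes with a different `RepresentationInfo` would yield different Information, which is the point of the definition: the bits alone do not determine what they mean.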

Now we have a key complication:
“Since a key purpose of an OAIS is to preserve information for a Designated Community, the OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained. The OAIS should then make a decision between maintaining the minimum Representation Information needed for its Designated Community, or maintaining a larger amount of Representation Information that may allow understanding by a larger Consumer community with a less specialized Knowledge Base. Over time, evolution of the Designated Community’s Knowledge Base may require updates to the Representation Information to ensure continued understanding.”
So now we need another couple of definitions:
“Designated Community: An identified group of potential Consumers who should be able to understand a particular set of information. The Designated Community may be composed of multiple user communities.”
“Knowledge Base: A set of information, incorporated by a person or system, that allows that person or system to understand received information.”
So there are several interesting things here. The first is that this obviously enshrines a particular understanding of information; one I couldn’t find in Wikipedia when I last looked (here is the article at that time; maybe it will be there next time!). Floridi suggests there is no commonly accepted definition of information, and that it is polysemantic, and particularly contrasts information in Shannon’s Mathematical Theory of Communication with the “Standard Definition of Information” (Floridi, 2005). If I understand it rightly, the latter refers to factual information (with some controversy on whether it need be true), but not necessarily to instructional information (“how”).

Secondly, the introduction of the Designated Community and its Knowledge Base may be both helpful and problematic. It may be helpful because it can reduce the amount of Representation Information needed to interpret data (or even eliminate it completely): if the Designated Community is defined as having a Knowledge Base that allows it to understand the data, then nothing more is required. This is obviously never entirely true, and in practice, even with a Designated Community that is thoroughly familiar with the data, we should expect to need some RI, perhaps to identify the particular meaning of some variables.
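One rough way to picture this reasoning is as a set difference: the concepts needed to understand a dataset, minus those already in the Designated Community's Knowledge Base, is approximately what the RI has to cover. The sketch below is only an illustration; the concept names and the `missing_ri` helper are hypothetical, and real Knowledge Bases are nowhere near this tidy.

```python
def missing_ri(required_concepts: set[str], knowledge_base: set[str]) -> set[str]:
    """Concepts the repository must document because the community lacks them."""
    return required_concepts - knowledge_base


# What someone needs to know to make sense of a (hypothetical) dataset.
required = {"binary encoding", "units of measurement", "variable names"}

# Two hypothetical Designated Communities with different Knowledge Bases.
specialists = {"binary encoding", "units of measurement"}
general_public = set()

print(sorted(missing_ri(required, specialists)))
print(sorted(missing_ri(required, general_public)))
```

Even the specialist community is left with a gap here, which matches the observation that some RI is almost always needed; the broader the community, the larger the set that has to be maintained.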

The problematic nature arises because we now have two external concepts, the Designated Community and its Knowledge Base, that influence what we must create, and which will change and must be monitored. I’ve heard the words “precise definition” used in the context of these two terms, but I am sceptical anyone can define either precisely (although the LOCKSS Statement of Conformance with OAIS has a minimalistic go; it's the only public one I could find, but I would love to see more). My colleague David Giaretta suggests that his huge project CASPAR aims to produce better definitions.

In fact, although they may be useful ideas, both the Designated Community and its Knowledge Base seem to be quite worrying terms. The best we can say is that “chemists” (for example) understand “chemical concepts”, and that the latter have proved pretty stable, at least in their basic forms. But the community of chemists turns out to include a myriad of sub-disciplines, each with its own subtleties of terminology, and, not surprisingly, they introduce new concepts and abandon old ones all the time. If we have some chemical data in our repository, we have to watch out for these concepts going from current through obsolescent and obsolete to arcane, and in theory we have to add RI at each change, to make up for the increasing gap in understanding.

The third interesting feature is that these definitions say nothing about files or file formats at all, yet “format registries” are the most common response to meeting the need for RI. TNA’s PRONOM, and the Harvard/OCLC Global Digital Format Registry (GDFR) are the two best-known examples.

Clearly files and file formats play a critical role in digital preservation. Sometimes I think this has occurred because the roots of much of digital preservation (although not OAIS) lie in the library and cultural heritage communities, dominated as they are by complex proprietary file formats like Microsoft Word. In science, formats are probably much simpler overall, but other aspects may be more critical to “understanding” (i.e. using in a computation) the data.

The best example I know to illustrate the difference between file format information and RI is to imagine a social science survey dataset encoded with SPSS. We may have all the capabilities required to interpret SPSS files, but still not be able to make sense of the dataset if we do not know the meaning of the variables, or do not have access to the original questionnaires. Both the latter would qualify as RI. Database schemas may provide another example of RI.
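The survey example can be sketched in a few lines of Python. Assume the SPSS file has already been decoded successfully into rows of values, so the file-format problem is solved. The variable names `Q1` and `Q2`, the codes, and the `codebook` structure are all invented for illustration. Without the codebook the rows are readable but meaningless; with it they start to become Information.

```python
# Rows as they might come out of a successfully decoded survey file.
rows = [
    {"Q1": 2, "Q2": 5},
    {"Q1": 1, "Q2": 3},
]

# Hypothetical codebook: the RI that format knowledge alone does not supply.
codebook = {
    "Q1": {"label": "Employment status", "values": {1: "Employed", 2: "Unemployed"}},
    "Q2": {"label": "Satisfaction with local services (1-5)", "values": {}},
}


def describe(row: dict, codebook: dict) -> dict:
    """Replace variable names and coded values with their documented meanings."""
    described = {}
    for name, value in row.items():
        entry = codebook[name]
        # Fall back to the raw value when the codebook lists no value labels.
        described[entry["label"]] = entry["values"].get(value, value)
    return described


described_rows = [describe(row, codebook) for row in rows]
for item in described_rows:
    print(item)
```

Note that even this leaves the original questionnaire wording out of the picture; that would be a further piece of RI alongside the codebook.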

Have I shown why or how RI is a useful concept in digital curation? I'm not sure, but at least there's a start. Representation Information, as David Giaretta sometimes says, is useful for interpreting unfamiliar data!

In later posts, I’m going to try to include some specific examples of RI that relate to science data. I also intend to try to justify more strongly the role of RI in curation rather than preservation, i.e. through life rather than just at the end of it!

  1. Thank you for this. The "Designated Community" has bothered me since I first read about it in the NARA/RLG Checklist, which requires that DCs and the steps taken to satisfy them be documented, but offers no hints on how to do so, much less how one poor repository manager is supposed to keep up with the dozens or hundreds of DCs who may have a stake in her repository.

    RI I can live with, as it is amenable to heuristic treatment. SPSS file? Got a codebook and provenance? Okay, fine.

    DCs are a mess, at least as their implementation in practice is currently envisioned.


