Tuesday 6 January 2009

Specifications again

The previous post was a summary with relatively little comment from me. I really liked David Rosenthal's related blog post, but I feel I do need to make some comments. I'm not sure this isn't yet another case of furiously agreeing!

Near the end of his post, following extensive argument based partly on his own experience of implementing from specifications in a "clean-room" environment, and a set of postulated explanations on why a specification might be useful, focusing on its potential use to write renderers, David writes the statement that makes me most uneasy:
"It seems clear that preserving the specification for a format is unlikely to have any practical impact on the preservation of documents in that format."
The suggested scenarios re missing renderers are:
"1. None was ever written because no-one in the Open Source community thought the format worth writing a renderer for...
2. None was ever written because the owner of the format never released adequate specifications, or used DRM techniques to prevent, third-party renderers being written...
3. A open source renderer was written, but didn't work well enough because the released specifications weren't adequate, or because DRM techniques could not be sufficiently evaded or broken...
4. An open source renderer was written but didn't work well enough because the open source community lacked programmers good enough to do the job given the specifications and access to working renderers...
5. An open source renderer was written but in the interim was lost...
6. An adequate open source renderer was written, but in the interim stopped working..."
Read David's post for the detail of his arguments. However, I'd just like to suggest a few reasons why preserving specifications might be useful:

- First, if the specification is available, it is (comparatively) extraordinarily cheap to keep. If it makes even a tiny difference to those implementing renderers (including open source renderers), it will have been worthwhile.
- Second, David's argument glosses over the highly variable value of information encoded in these formats. A digital object is (roughly) encrypted information; if no renderer exists but the encrypted information is extremely valuable for some particular purpose, the specification might be considered as a key to enable some information to be extracted.
- Third, David's argument assumes, I think, quite complex formats. Many science data formats are comparatively simple, but may currently be accessed with proprietary software. Having the specification in those cases may well prove useful (OK, I don't have evidence for this as yet, I'll work on it!).
- Fourth, older formats are simpler, and it would be good to have the specifications in some cases, even to help create open source renderers (is that a re-statement of the first? Maybe).

So here's an example to illustrate the last point. I have commented elsewhere that the only files on the disk of the Mac I use to write this that are inaccessible to me are PowerPoint (version 4.0) files created in the 1990s on an earlier Mac.

I noted a comment from David:

"In my, admittedly limited, experience Open Office often does better than the current Microsoft Office at rendering really old documents."

Great, I thought; perhaps Open Office can render my old PowerPoints! And even better, there's now a native implementation of Open Office 3.0 for the Mac. So let's install it (and not talk about how hard it was to persuade it to give back control of my MS Office documents to the original software!). Does it open my errant files? No!

So I would like someone to instigate a legacy documents project in Open Office, and implement as many as possible of the important legacy office document file formats. I think that would be a major contribution to long term preservation. Would it be simplified by having specifications available? Surely, surely it must be! In fact David admits as much:
"Effort should be devoted instead to using the specifications, and the access to a working renderer, to create an open source renderer now."
Well, you surely can't use specifications unless they are accessible and have been preserved...

However, I must stress that I agree with what I take to be David's significant point, re-stated here as: the best Representation Information supporting preservation of information encoded in document formats is Open Source software. So "national libraries should consider collecting and preserving open source repositories". Yes!


  1. I discuss this post in a comment on my original post, explaining why although I agree that an effort to support legacy formats in Open Office is a great idea, I don't think format specifications are likely to play much of a role in it.

  2. I agree with you on this one, Chris, and would add another argument to your four. While I agree with David that an open source renderer constitutes a form of representation information and, when it exists, should be preserved, the fact of the matter is that it is a nightmarishly roundabout way of documenting a format. Deciphering what a format is like from the software used to interact with it is a bit like determining the shape of a crystal from refracted x-rays in x-ray crystallography -- yes, it works, but it's not exactly direct or simple.

    Having a working renderer allows you to do what the renderer will support, but if I want to do something different with the data in the future, I need to know how the format is really put together. I can decipher that in time using an open-source renderer (assuming the renderer supports all features of the format), but having the specification handy makes the job much easier. As you point out, the relative cost of keeping a specification around, weighed against the potential benefits to someone who needs to write new software to work with a format, seems to me to argue for keeping both specifications and open-source renderers around when they're available.

    I'm also not as sanguine as David on the longevity of open source support for outdated formats. I think we should not confuse the utility of open source-supported formats for preservation with the support of the open source community for digital preservation as an activity. I find it all too easy to imagine scenarios in which support for older formats drops out of an open-source product over the long term, and even easier to imagine ones in which one open source product which supports an older format is abandoned by the open source community in favor of a different, newer one which lacks that support.

    I agree with David on the value of keeping source code for renderers around. Having code for working with a format ready at hand would be invaluable for anyone having to code a renderer in the future. But if I have to bring a renderer for an older format to life at some point in the future, I'd rather have the code and the spec. sitting in front of me when I do it.

  3. And I don't believe in software as a specification. I don't believe any programmer believes in it either. I've discussed this in a post of my own :-)

  4. I agree with everyone on this thread so far, assuming different specific situations or perspectives. When generally applied, though, I favor David's position, especially through an economic lens. Clearly format specifications and rendering tools can be valuable in every situation, but the relative value of one vs. the other depends on the desired future use. We all are engaged in imagining future use scenarios (based on evidence from the past and our own biases), and our imaginations may be somewhat predictive, but ultimately we have convincing evidence that we simply cannot predict future secondary value (and therefore use) of information resources. Therefore we shouldn't pretend that we can accurately predict the relative value of tools versus specs for future secondary use.

    Default emphasis on primary value--i.e., the reason a resource was created in the first place--versus secondary value--i.e., the reason the resource (turns out to be valuable so) gets re-used, seems often to be quite different for those stewarding digital science data vs. those stewarding digital cultural materials. And when manifest as strong unspoken assumptions, this default difference can make these conversations treacherous, so we should attempt to surface it more explicitly. Also complicating matters are assumptions around scope; if a stewardship organization assumes that external forces will effectively preserve primary use, then their emphasis will understandably be on secondary use. However, if we take an overarching systemic view, I suspect most stewards would answer yes to the following question:

    Regardless of how our different perspectives may affect our emphasis on secondary use, is it not true that our overall first priority should be to maintain access to the primary value, since it was recently, demonstrably, valuable? If so, then while recognizing the value in both, we should prioritize our investments in maintaining rendering tools over our investments in documenting format specs.

    Why? Because (unlike with future secondary value) we can identify the primary value latent in a digital resource: it is realized when the resource is consumed as originally intended--whether by human senses or by some algorithmic process. Or in perhaps more applicable wording, the primary value is rendered in the intended primary experience. Rendering tools for this primary experience specifically and directly support access to this primary value in a way that format specifications do not, as David logically demonstrates.

    It may indeed make economic sense to make the investment in obtaining and storing format specs when the required investment is relatively small, when it is the only available course of action, or for other equally logical reasons. However we should be willing to entertain the probability that--from a systemic, aggregate perspective--rendering tools may prove consistently more valuable than format specs, and adjust our policies accordingly.

  5. Very nice post. But I can't believe the Software specification..



Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.