Tuesday, 6 January 2009

Email discussion on the usefulness of file format specifications

This is a summary of an email exchange on the DCC Associates email list over a few days in late November, early December. I thought it was revealing of attitudes to preservation formats and to representation information (in the form of both specifications and running code), so I’ve summarised it here. Emails lists are great for promoting discussion, but threads tend to fracture off in various directions, so a summary can be useful. Quotes are reproduced with permission; my thanks to all those involved.

Steve Rankin from the DCC down in Rutherford Labs noticed and drew the list’s attention to the Microsoft pages relating to their binary formats, made available under a so-called "Microsoft Open Specification Promise”.

http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
and http://www.microsoft.com/interop/osp/default.mspx

Chris Puttick of Oxford Archaeology pointed out that the pages had been up for a while (since February 2008 at least). He made a couple of interesting points:
“I have it on excellent authority that the specifications are useful but incomplete […]; secondly that as is this not the first time MS have published such information only to take it down again later [so] anyone interested in them should download them as soon as possible. I have on slightly less excellent authority that a ‘promise’ as encased in the [Open Specification Promise] is specifically something in US law and may not have any validity outside of the US.”
Kevin Ashley from ULCC/NDAD agreed:
“It's my understanding - from those who have tried - that earlier specs that MS published failed exactly that test. It wasn't possible to use them to write software that dealt with all syntactic and semantic variations.

“It's a fairly fundamental test for network protocols that one can […] get two separate implementations to communicate with each other. The same is true of file formats, to my mind, and one can see the creating application and the reading application as equivalent to the two ends of a network connection, albeit not necessarily in real time.”
David Rosenthal from Stanford and LOCKSS injected some engineering reality from direct experience into the discussion. He has already released a longer blog post based on the discussion and his contribution; effectively he seemed to be aiming to demolish the argument for keeping specifications at all.
“Speaking as someone who has helped implement PostScript from the specifications, I can assure you that published specifications are always incomplete. There is no possibility of specifying formats as complex as CAD or Word so carefully that a clean-room implementation will be perfect. Indeed, there are always minor incompatibilities (sometimes called enhancements, and sometimes called bugs) between different versions of the same code. And there is no possibility of digital preservation efforts being able to afford the investment to do high-quality clean-room implementations of these complex formats. Look at the investment represented by the Open Office suite, for example.

“On the other hand, note that Open Office and other open source office suites in practice do an excellent job of rendering MS formats, and their code thus represents a very high quality specification for these formats. Code is the best representation for preservation metadata.”
Colin Neilson from DCC SCARP wondered what the implications of incomplete specifications were for the concept of Representation Information in OAIS (RepInfo is often associated in examples with specifications).

He wrote:
“I am interested in implications for areas (such as CAD software) where proprietary (secret sauce) formats are historically the norm. Is the legacy of digital working always preservable within an OAIS framework? […] Are there some limits in using an OAIS model if some "specifications" are inadequate or information is not available?”
and in a later message
“Do we need to have "access software" preserved (long term) if the other representation information is less complete in the case where standards for proprietary file formats (say like Microsoft word DOC format) are to a degree incomplete, less adequate or not available (perhaps more so in the case of older versions of file formats)?”
Personally I think one of the advantages of Open Office is that it is not just Access Software, but Open Source Access Software. This should give it much greater longevity. But of course, such alternatives don’t exist in many areas, including many of the CAD formats Colin is concerned about.

Alan Morris from Morris and Ward asked the obvious question:
“Who would even consider utilizing WORD as a preservation format?”
… and got a surprising answer, from Peter Murray-Rust from the eponymous Cambridge research group!
“I would, and I argued this in my plenary lecture at OpenRepositories08. Not surprisingly it generated considerable discussion, from both sides.

“First the disclaimer. I receive research funding (though not personal funding) from Microsoft Research. Some of you may wish to stop reading now! But I don't think it colours my judgment.

“My argument was not that Word2007 should be the only format, but that it should be used in conjunction with formats such as PDF. We have a considerable amount of work on [depositing] born-digital theses and we have recommended that theses should be captured in their original format (OOXML, ODT, LaTeX, etc.) as well as the PDF.

“I am a scientist (chemist) but generally interested in all forms of STM data (for example we collaborated in part of the KIM project mentioned a few emails ago). If you believe that preservation only applied to the holy "fulltext", stop reading now. However I think many readers would agree that much of the essential information in STM work (experiments, data, protocols, code, etc.) is lost in the process of publication and reposition. Very frequently, however, the original born-digital work contains semantic information which can be retrieved. For example OOXML and ODT allow nearly 100% of chemical information (molecular structures) to be retrieved (in certain circumstances), whereas PDF allows 0% by default. (It is possible, though extremely difficult and extremely lossy, to turn PDF primitives back into chemistry)

“Note that we also work on Open Office documents and have a JISC-sponsored collaboration with Peter Sefton [of the Australian Digital Futures Institute of USQ in Australia] on his excellent ICE system. We are exploring how easy it is to author chemistry directly into an ODT document and by implication into any compound semantic document (note that XML is the only practical way of holding semantics). […]”

“We've looked into using PDF for archiving chemistry and found that current usage makes this almost impossible. So we work with imperfect material.

“Note that Word2007 can emit OOXML that can be interpreted with Open Source tools. The conversion is not 100%, but whatever is? […]”

“I wonder whether the all the detractors of OOXML have looked at it in detail. Yes, it is probably impossible to recreate all the minutiae of typesetting, but it preserves much of the embedded information that less semantic formats (PDF and even LaTeX) do not. If I have no commercial software and someone gives me a PDF of chemistry and someone else gives me OOXML I'd choose the OOXML. HTML is, in many cases, a better format than PDF.

“So my suggestion is simple. Use more than one document format. After all do we really know what future generations want from the preservation process. It costs almost nothing as we are going to have to address compound documents and packaging anyway.”
An anonymous contributor suggested that the appropriate course was to structure AIPs to contain both original source format and the preservation format. In the future, he asserted, better tools may exist to take the original source format and render a more completely accessible preservation format, particularly bearing in mind scientific notation.

Finally, Geoffrey Brown from Indiana also argued in favour of keeping the original (and against NARA policy):
“The Bush administration as well as various companies managed to embarrass themselves with inadvertently leaked information in the form of edit histories in word documents. Migration will likely (who knows ?) discard such information unless special care is taken in developing migration tools.

“I am uncomfortable with the assumption that we can abandon the original documents as NARA seems to be doing by requiring(?) agencies to submit documents in PDF). The edit histories are part of the historical record; however, it's safe to say that most patrons will be satisfied with the migrated document.

“Digital repositories have an obligation to figure out how to preserve access to documents in any format and not use format as a gatekeeper.”
So… running code is better than specs as representation information, and Open Source running code is better than proprietary running code. And, even if you migrate on ingest, keep BOTH the rich format and a desiccated format (like PDF/A). It won’t cost you much and may win you some silent thanks from your eventual users!

2 comments:

  1. I would be curious to hear more about acceptable RICH formats for chemical data...Certainly Word, HTML, OOXML, etc. aren't that - nor, as pointed out, is PDF. Is it ChemML? is there something else?

    Work is currently being done by the ISO 32000 committee (the one that is responsible for PDF) to introduce native Math support (via MathML) into future versions of PDF. It would be great to have Chemical information provided in the same way.

    Leonard Rosenthol
    PDF Standards Architect
    Adobe Systems

    ReplyDelete

Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.