Thursday, 20 March 2008

Migration on Request: OpenOffice as a platform?

Following on from my previous post relating to legacy formats, I was thinking again about the problems of dealing with documents in those formats. For some, the answer lies in emulation and perpetual licences for the original software packages, but for me that just doesn't cut the mustard. I won't have access to those packages, but I might want access to the documents. Some of them, for example, might be the PowerPoint 4 presentations created on a predecessor to the Macintosh that I use now, which are unreadable with my current PowerPoint software (I CAN get at them by copying them to a colleague's Windows machine; her version of PowerPoint has input filters unavailable on my Mac).

So I want some form of migration. In the example above, this is known as "Save as"!

However, I know that every time I do a migration I introduce some sort of error. So if I migrate from those PowerPoint 4 files to today's PowerPoint, and then from today's to tomorrow's PowerPoint, and then from tomorrow's to the next great thing, I will introduce cumulative errors whose impact I will only be able to assess at some horribly cringe-making moment, like in the middle of a presentation using a host's machine. So the best way to do migration is to start from the original file and migrate to today's version. Always. It's nuts for Microsoft to drop old file format support from its software (at least from this point of view).
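A toy calculation makes the compounding concrete. The 95% figure below is purely an assumption for illustration, not a measurement of any real converter, and the model (each step independently keeps the same fraction of a document's features) is deliberately crude.

```python
def chained_fidelity(per_step: float, steps: int) -> float:
    """Fraction of features surviving `steps` successive migrations,
    assuming each step independently preserves `per_step` of them."""
    return per_step ** steps


# Assumed, made-up figure: every migration keeps 95% of the features.
PER_STEP = 0.95

for steps in (1, 3, 5):
    print(f"{steps} migration(s): {chained_fidelity(PER_STEP, steps):.3f}")
```

Under these assumptions a single migration from the original always costs one step's worth of loss, whereas a chain keeps losing a little more with every format change.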

This approach of migrating from original version to today's version is called Migration on Request, and was described in a paper by Mellor, Wheatley and Sergeant back in 2002 (I referred to it earlier), but the idea hasn't caught on much. They had some other great ideas, like writing the migration tool in a specially portable version of C with all the nasty bits removed, called C--.
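As a rough illustration of the idea (a sketch under assumptions of my own, not the tool described in the paper): keep the original bytes untouched, and keep a table of converters that each go straight from an original format to whatever is current today. The class names, the "ppt4" label and the toy converter are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical converter type: original bytes in, current-format text out.
Converter = Callable[[bytes], str]


@dataclass(frozen=True)
class StoredDocument:
    name: str
    original_format: str   # illustrative label such as "ppt4"
    original_bytes: bytes  # never replaced by a migrated copy


class MigrationOnRequest:
    """Always converts from the original file, never from an earlier migration."""

    def __init__(self) -> None:
        self._converters: Dict[str, Converter] = {}

    def register(self, original_format: str, converter: Converter) -> None:
        # When today's format changes, only the converters are rewritten.
        self._converters[original_format] = converter

    def render(self, doc: StoredDocument) -> str:
        converter = self._converters.get(doc.original_format)
        if converter is None:
            raise LookupError(f"no converter for {doc.original_format!r}")
        return converter(doc.original_bytes)


def toy_ppt4_to_text(data: bytes) -> str:
    # Stand-in for a real input filter; it only decodes the bytes.
    return data.decode("latin-1")


archive = MigrationOnRequest()
archive.register("ppt4", toy_ppt4_to_text)
doc = StoredDocument("talk-1994", "ppt4", b"Slide 1: Legacy formats")
rendered = archive.render(doc)
```

The design choice worth noticing is that the stored document is immutable: a better converter registered later is applied to the same original bytes, so errors from older converters are not inherited.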

I have wondered from time to time, however, whether for that class of documents we call Office Documents (word processing, spreadsheets, presentations), tacking onto an open source project which has a strong developer community might be a better approach. Something like OpenOffice. I'm not sure how many file formats this already supports (always growing, I guess), but Chapter 3 of the "Getting Started" documentation lists the following:
  • Microsoft Word 6.0/95/97/2000/XP (.doc and .dot)
  • Microsoft Word 2003 XML (.xml)
  • Microsoft WinWord 5 (.doc)
  • StarWriter formats (.sdw, .sgl, and .vor)
  • AportisDoc (Palm) (.pdb)
  • Pocket Word (.psw)
  • WordPerfect Document (.wpd)
  • WPS 2000/Office 1.0 (.wps)
  • DocBook (.xml)
  • Ichitaro 8/9/10/11 (.jtd and .jtt)
  • Hangul WP 97 (.hwp)
  • .rtf, .txt, and .csv
... which is not a bad list (and that's just the word processing bit), and it may well be extended in more up-to-date versions. For interest, their FAQ has a question, "Why does OpenOffice not support the file format my application uses?", with this answer:
"There may be several reasons, for example:
  • The file formats may not be open and available.
  • There may not be enough developers available to do the work (either paid or volunteer).
  • There may not be enough interest in it.
  • There may be reasonable, available workarounds."
Making legacy file formats more open was the subject of my previous post, and I guess we have to wait and see. But there are plenty of legacy word processing formats not on that list (Samna, for example, later to evolve into Lotus Word Pro, as well as formats for obsolete computers like the Atari, such as the German word processor SIGNUM, supposedly very good for mathematical formulae). What about earlier versions of MS Word? Wikipedia lists a bunch of word processors; there must be many documents in obscure locations in these formats.

With a concerted effort, we could gradually build OpenOffice input filters for these obsolete document types, thus bringing them into the preservable digital world. And this is an effort that could bring in that extraordinary community of enthusiasts who do so much to build document converters and other kinds of software, so much ignored by the digital preservation community!


  1. OpenOffice would be OK for some migration, but just because it claims support for all those formats does NOT mean that it supports them all properly.

    Take Writer reading Microsoft Word files. You can import documents but lots of things get messed up (particularly the interaction between lists and styles). Writer also dumps a lot of Word's field codes.

    One of the key things we try to do with our authors is teach them to use features that ARE supported across applications, and to work in such a way that we can render their documents as HTML, which is pretty preservation-friendly. ICE project:

    Trying to support ad hoc use of any-old tool is never going to work 100%; the challenge is to show authors how they can work in a way that will help preserve their work better by making it easier and more palatable than what they do now.

  2. Peter, I absolutely agree with your last paragraph. However, this is always an uphill struggle, and while some will produce better documents that are easier to preserve, many others will produce horrible documents that are important to preserve, so we just have to live with them!

    I also agree to a point with your first comments. On whether the "migration" currently in OpenOffice is bad... well, if we care enough about this in an open source world, we should do things to make it better. Open source software ought to improve over time... and remember that time is something that digital preservers do have!

    I'm not at all sure, however, that I regard dumping field codes as a problem from a digital preservation point of view. Are they long term significant properties?

  3. i really like the "oo as platform" idea, because collecting all the conversion/migration efforts would definitely help.

    the project i work in (livingreviews) deals a lot with latex input, but wants to venture into the non-latex world as well. having a limited tool zoo to support would help tremendously.

    just took a look at ice, too, which seems an interesting option for our users. although i'd be happier if the conversion tool chain had a single shared intermediate format (preferably odf). don't know if this is the case with ice presently.

  4. I agree on the usefulness of OpenOffice. I tested OpenOffice for its PDF/A output, which is quite good, though I found the new add-in for Microsoft Office somewhat superior on some points (all concerned with tagging).

    Two things strike me about this discussion:

    - It is not necessary to choose between migration and emulation. I am working on migration, but I think there is no way that migration can reproduce the original user experience of a document the way emulation can. On the other hand, emulation has disadvantages (you have to have access to the original software and be able to use it).
    Both are imperfect and a good strategy will combine them. In fact they can mutually support each other (migrated file for ready access, emulation for reproducing the "historical sensation").

    - Migration on request is not a solution in itself. Think of how many obsolete file formats we will have to support in 100 years, and how many of the tools we built will have become useless because they convert from one obsolete format to another obsolete format. You need some extensive normalisation to implement migration on request.


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.