So I want some form of migration. In the example above, this is known as "Save as"!
However, I know that every time I migrate I introduce some sort of error. So if I migrate from those PowerPoint 4 files to today's PowerPoint, and then from today's to tomorrow's PowerPoint, and then from tomorrow's to the next great thing, I will introduce cumulative errors whose impact I will only be able to assess at some horribly cringe-making moment, like in the middle of a presentation using a host's machine. So the best way to do migration is to start from the original file and migrate to today's version. Always. It's nuts for Microsoft to drop old file format support from its software (at least from this point of view).
This approach of migrating from the original version to today's version is called Migration on Request, and was described in a paper by Mellor, Wheatley and Sergeant back in 2002 (I referred to it earlier), but the idea hasn't caught on much. They had some other great ideas, like writing the migration tool in a specially portable version of C with all the nasty bits removed, called C--.
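To make the difference between the two strategies concrete, here is a minimal sketch in Python. It is not the tool from the Mellor, Wheatley and Sergeant paper; the class names, the format label "ppt4" and the 95% per-step figure are all made up for illustration. The idea it tries to show is simply that the original bytes are kept untouched, a converter is looked up by the original format each time someone asks for the file, and only the converter gets replaced when "today's" format changes. The little function at the end is a toy model of the cumulative-error worry: if each hop kept 95% of what matters, three chained hops would keep only about 86%.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical converter type: original bytes in, "today's" rendering out.
Converter = Callable[[bytes], str]


@dataclass(frozen=True)
class ArchivedItem:
    name: str
    original_format: str   # e.g. "ppt4"; the labels are invented for this sketch
    original_bytes: bytes  # kept as deposited, never overwritten


class MigrationOnRequest:
    """Converts from the stored original every time, never from an earlier migration."""

    def __init__(self) -> None:
        self._converters: Dict[str, Converter] = {}

    def register(self, source_format: str, converter: Converter) -> None:
        # Registering again for the same format swaps in the newer tool.
        self._converters[source_format] = converter

    def render(self, item: ArchivedItem) -> str:
        try:
            convert = self._converters[item.original_format]
        except KeyError:
            raise LookupError(
                f"no converter registered for {item.original_format!r}"
            ) from None
        return convert(item.original_bytes)


def chained_fidelity(per_step: float, steps: int) -> float:
    """Toy model (an assumption): each migration keeps a fixed fraction of the content."""
    return per_step ** steps
```

In use, you would register one converter per original format, call `render` whenever a reader asks for the item, and replace the converter (not the stored file) when the target format moves on. Comparing `chained_fidelity(0.95, 1)` with `chained_fidelity(0.95, 3)` gives a rough feel for why starting from the original each time is attractive.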
I have wondered from time to time, however, whether for that class of documents we call Office Documents (word processing, spreadsheets, presentations), tacking onto an open source project which has a strong developer community might be a better approach. Something like OpenOffice. I'm not sure how many file formats this already supports (always growing, I guess), but Chapter 3 of the "Getting Started" documentation lists the following:
Microsoft Word 6.0/95/97/2000/XP (.doc and .dot)
Microsoft Word 2003 XML (.xml)
Microsoft WinWord 5 (.doc)
StarWriter formats (.sdw, .sgl, and .vor)
AportisDoc (Palm) (.pdb)
Pocket Word (.psw)
WordPerfect Document (.wpd)
WPS 2000/Office 1.0 (.wps)
DocBook (.xml)
Ichitaro 8/9/10/11 (.jtd and .jtt)
Hangul WP 97 (.hwp)
.rtf, .txt, and .csv
... which is not a bad list (just the word processing bit, too), and maybe extended in more up-to-date versions. For interest, their FAQs have a question, "Why does OpenOffice not support the file format my application uses?"
"There may be several reasons, for example:
- The file formats may not be open and available.
- There may not be enough developers available to do the work (either paid or volunteer).
- There may not be enough interest in it.
- There may be reasonable, available workarounds."
Making legacy file formats more open was the subject of my previous post, and I guess we have to wait and see. But there are plenty of legacy word processing formats not on that list (Samna, for example, later to evolve into Lotus Word Pro, as well as formats for obsolete computers like the Atari, such as the German word processor SIGNUM, supposedly very good for mathematical formulae). What about earlier versions of MS Word? Wikipedia lists a bunch of word processors; there must be many documents in obscure locations in these formats.
With a concerted effort, we could gradually build OpenOffice input filters for these obsolete document types, thus bringing them into the preservable digital world. And this is an effort that could bring in that extraordinary community of enthusiasts who do so much to build document converters and other kinds of software, so much ignored by the digital preservation community!
OpenOffice.org would be OK for some migration, but just because it claims support for all those formats does NOT mean that it supports them all properly.
Take Writer reading Microsoft Word files. You can import documents but lots of things get messed up (particularly the interaction between lists and styles). Writer also dumps a lot of Word's field codes.
One of the key things we try to do with our authors is teach them to use features that ARE supported across applications, and to work in such a way that we can render their documents as HTML, which is pretty preservation-friendly. ICE project: http://ice.usq.edu.au
Trying to support ad hoc use of any old tool is never going to work 100%; the challenge is to show authors how they can work in a way that will help preserve their work better, by making it easier and more palatable than what they do now.
Peter, I absolutely agree with your last paragraph. However, this is always an uphill struggle, and while some will produce better documents that are easier to preserve, many others will produce horrible documents that are important to preserve, so we just have to live with them!
I also agree to a point with your first comments. On whether the "migration" currently in OpenOffice is bad... well, if we care enough about this in an open source world, we should do things to make it better. OS ought to improve over time... and remember that time is something that digital preservers do have!
I'm not at all sure, however, that I regard dumping field codes as a problem from a digital preservation point of view. Are they long term significant properties?
i really like the "oo as platform" idea, because collecting all the conversion/migration efforts would definitely help.
the project i work in (livingreviews) deals a lot with latex input, but wants to venture into the non-latex world as well. having a limited tool zoo to support would help tremendously.
just took a look at ice, too, which seems an interesting option for our users. although i'd be happier if the conversion tool chain had a single shared intermediate format (preferably odf). don't know if this is the case with ice presently.
I agree on the usefulness of OpenOffice. I tested OpenOffice for its PDF/A output, which is quite good, though I found the new add-in for Microsoft Office somewhat superior on some points (all concerned with tagging).
Two things strike me about this discussion:
- It is not necessary to choose between either migration or emulation. I am working on Migration, but I think there is no way that migration is able to reproduce the original user experience of a document like emulation can. On the other hand emulation has disadvantages (you have to have access to the original software and be able to use it).
Both are imperfect and a good strategy will combine them. In fact they can mutually support each other (migrated file for ready access, emulation for reproducing the "historical sensation").
- Migration on request is not a solution in itself. Think of how many obsolete file formats we will have to support in 100 years, and how many of the tools we built will have become useless because they convert from one obsolete format to another obsolete format. You need some extensive normalisation to implement migration on request.