The appropriate strategy for documents in obsolete formats would therefore seem to be to preserve them in the original format and migrate on demand, from the original format to a current format, when an actual customer wants to use them. This migration should always be left as late as possible, since the migration tool may improve and render the document better in later versions (and this allows the cost to be placed onto the user, not the archive, if appropriate). By the way, this exactly parallels the case in real archives; they don't translate their documents from Old Norse to modern English each time the latter changes. If you want to read them, go learn Old Norse, or hope that someone has earned a brownie point by translating them for publication somewhere...
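To make the idea concrete, here is a minimal sketch of migrate on demand in Python. It is only an illustration under stated assumptions: the function names are hypothetical, and the default converter assumes a LibreOffice-style `soffice --headless --convert-to` command line is installed, which may not match the OpenOffice.org build you have. The point it tries to show is that the original is never touched and the converted copy is made fresh at request time, so a later, better version of the tool is picked up automatically.

```python
import subprocess
from pathlib import Path
from typing import Callable

# A converter takes (source file, target extension, output directory)
# and returns the path of the derived copy. Hypothetical type for this sketch.
Converter = Callable[[Path, str, Path], Path]


def soffice_command(source: Path, target_ext: str, out_dir: Path) -> list[str]:
    """Build a headless conversion command.

    Assumption: a LibreOffice-style 'soffice' binary is on the PATH and
    accepts --headless, --convert-to and --outdir.
    """
    return [
        "soffice", "--headless",
        "--convert-to", target_ext,
        "--outdir", str(out_dir),
        str(source),
    ]


def soffice_converter(source: Path, target_ext: str, out_dir: Path) -> Path:
    """Run the assumed soffice command and return the expected output path."""
    subprocess.run(soffice_command(source, target_ext, out_dir), check=True)
    return out_dir / (source.stem + "." + target_ext)


def migrate_on_demand(
    original: Path,
    target_ext: str,
    out_dir: Path,
    converter: Converter = soffice_converter,
) -> Path:
    """Convert only when a user asks for the document.

    The original stays in the archive unmodified; the derived copy is
    disposable and is regenerated on every request.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    return converter(original, target_ext, out_dir)
```

A call such as `migrate_on_demand(Path("archive/report.wpd"), "odt", Path("derived"))` would then be made at the moment a reader asks for the document, not when it is ingested. Passing the converter in as a parameter is a deliberate choice in this sketch: the archive can swap in a different tool for a different format without changing anything else.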
I have suggested a couple of times that a plausible hypothesis for “office documents” (i.e. text documents, spreadsheets, simple drawings, presentations, simple databases) is that OpenOffice.org should be the migration tool of choice. After all, it supposedly reads files in 180+ different file formats, it is open source, it is widely used, it is actively developed, and it can produce output in at least one internationally standardised format. I’ve noted already that it isn’t perfect; the Mac version, for instance, fails to open my problem PowerPoint version 4 files (to be fair, it doesn’t claim to). But perhaps it’s worth taking a look again at the range of formats it claims to deal with. All these figures relate to vanilla OpenOffice.org 220.127.116.11 (build 9358) for the Mac (the first fully native, up to date Mac version I've been able to lay my hands on).
So, a health warning: this rather long post goes into more detail on something I've covered before!
The single OpenOffice.org application opens 6 main classes of document: text document, spreadsheet, presentation, drawing, database and formula (the list of file formats has moved to an appendix of the Getting Started guide). In each case, a majority (or close to it) of the supported formats are OpenOffice native formats or those of its predecessors, plus a good selection of Microsoft Office formats (Office 6.0/95/97/XP/2000 and the XML versions from Office 2003 and Office 2007). This means that the large majority of documents from the past 10 years or so, a period in which these have been the dominant office formats, should be readable.
The best-known (and presumably most widely used) remaining supported word processing format is WordPerfect (.wpd), although it is not clear which versions. Then there are some interesting ones: for example DocBook, the Chinese-developed Uniform Office Format, the Korean Hangul WP 97, AportisDoc for the Palm, Pocket Word, and the Czech T602. Interesting, that: significant investment has gone into ensuring these minority but presumably significant formats can be handled.
Similarly for spreadsheets: as well as the various native and Microsoft formats, there is support for two earlier significant players, Lotus 1-2-3 and Quattro Pro 6.0, plus dBase (.dbf), Data Interchange Format (DIF) and CSV. It’s interesting that dBase is treated as a spreadsheet rather than a database; I wonder what the limitations are.
Presentations are more limited; apart from the basic OpenOffice and MS Office variants, they include Computer Graphics Metafile (CGM) as a presentation format but not a drawing format, which is a bit odd. PDF is also included; well, it does do presentations, and they seem to have some advantages over PowerPoint.
Graphics formats have always been popular in the open source community, so it’s not surprising that a wide range of formats is supported for graphics. Aside from several OpenOffice formats, these include a few surprises such as AutoCAD’s DXF Interchange Format, and Kodak PhotoCD, as well as a large range of usual suspects (GIF, PNG, TIFF, JPEG, many bit-mapped formats).
Finally, the only database format supported is the OpenOffice native database (remembering that dBase is apparently supported as a spreadsheet, presumably with limitations). I tried to open a Microsoft Access database from a previous computer (Win95?), without success. Old databases do tend to be a bit of a problem; I have heard there are significant compatibility problems even between successive versions of Access. And Formula supports a couple of OpenOffice formats, plus MathML, which should be good for scientific use today.
So, for nothing, you get a migration tool that deals with a substantial proportion of current or recent documents. I don’t have enough experience yet to judge how effective it is. I did try a trivial round-trip test: opening a Microsoft Word 2004 for Mac document in OpenOffice, saving as native OpenOffice, then re-opening and saving as Word again, followed by a document compare in Word; it revealed very small layout differences in nested bullets (which resulted in pagination changes), and a few minor changes in styles. Not quite a fully reversible migration, but the result was a perfectly acceptable rendition of the original.
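For anyone wanting to repeat that kind of round-trip check on more than one document, a rough sketch in Python follows. It assumes you have already exported the original and the round-tripped document to plain text by some means (that step is not shown, and the function name is hypothetical). It only compares textual content line by line, so it would not catch the layout and pagination differences I saw in Word's own document compare.

```python
import difflib


def round_trip_report(original_text: str, round_tripped_text: str):
    """Compare two plain-text exports of the same document.

    Returns a similarity ratio between 0.0 and 1.0 and a list of
    unified-diff lines describing what changed.
    """
    before = original_text.splitlines()
    after = round_tripped_text.splitlines()

    # Ratio of matching lines across both versions (1.0 means identical).
    ratio = difflib.SequenceMatcher(None, before, after).ratio()

    # Human-readable list of differences; empty if the texts are identical.
    diff = list(
        difflib.unified_diff(
            before, after, "original", "round-tripped", lineterm=""
        )
    )
    return ratio, diff
```

A ratio below 1.0 with a short diff would correspond to the "perfectly acceptable rendition" case; a long diff suggests the migration lost or altered content and deserves a closer look.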
Now a migrate on demand tool is only useful in this role as long as (or if) the original file format is supported. If you are interested in older documents, from what one might call the baroque period of early personal and office documents (say from the invention of microcomputers for home use, through early “personal computers”, up to the big shakeouts of the mid to late 1990s), you will find OpenOffice rather less helpful as a migrate on demand tool. On one argument, this doesn’t matter much, as comparatively speaking such formats represent a tiny minority of surviving documents (unproven but pretty safe assertion!). However, this class of baroque period documents is starting to become important to archives (real archives, not collections of backups, or even digital preservation repositories), as they begin to collect them as part of the “papers” of eminent individuals. See for example, the Digital Lives project mentioned here before.
So, here are two proposals (for both of which, specifications as well as known working code would be useful!):
- Funders, Foundations etc - please fund efforts to add input filters to OpenOffice for such older document formats, and
- Computing Science departments - please set group assignments that would result in components of such filters being contributed to the OpenOffice effort.
BTW after my last posting on this topic, a linkback led me to a post where Leslie Johnston mentioned Conversions Plus as having been a life-saver on several occasions. It’s a commercial tool, so maybe there are licence and survivability issues, but the list of formats it claims is impressive. In the Word Processing area alone, you get:
- 3 versions of Ami Pro
- 2 versions of AppleWorks
- ClarisWorks 1.0 - 5.0
- 3 versions of MacWrite
- 3 versions of Multimate
- Many versions of Word, back to MS Word DOS 5.5
- Several versions of MS Works
- WordPerfect for DOS and Windows
- WordPerfect Works
- WordStar for DOS
- Several versions of Lotus Word Pro
But in the end, one way or another, you pays your money and you place your bets!