Friday, 13 February 2009

Open Office as a document migration on demand tool - again

We’ve seen suggestions in comments on this blog, and on other blogs, that code is better than specifications as representation information, and that well-used, running open source code is better than proprietary code. We’ve also had assertions that documents should be preserved in their original format, rather than migrated on ingest (I’ve some reservations on this in some cases for data, but as long as the original form is ALSO preserved, it’s fine).

The appropriate strategy for documents in obsolete formats would therefore seem to be to preserve in the original format and migrate on demand, from the original format to a current format, when an actual customer wants to use it. This process should always be left as late as possible, based on the possibility that the migration tool will improve and render the document better with later versions (and allowing the cost to be placed onto the user, not the archive, if appropriate). By the way, this exactly parallels the case in real archives; they don't translate their documents from old Norse to modern English each time the latter changes. If you want to read them, go learn old Norse, or hope that someone has earned a brownie point by translating it for publication somewhere...
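As a rough illustration of the "preserve the original, migrate on demand" policy described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function name `migrate_on_demand`, the cache layout, and the `converter` callable (a stand-in for whatever tool actually does the conversion) are all hypothetical. The point it tries to show is simply that the original is never touched, that nothing is converted until someone asks, and that a newer converter version yields a fresh rendition rather than reusing an older one.

```python
import hashlib
from pathlib import Path
from typing import Callable

# Hypothetical converter signature: (source file, destination file) -> None.
# In practice this would wrap whatever migration tool the archive trusts.
Converter = Callable[[Path, Path], None]


def migrate_on_demand(original: Path, target_ext: str, cache_dir: Path,
                      converter: Converter, converter_version: str) -> Path:
    """Return a rendition of `original` in the target format.

    The original file is only ever read. Renditions are stored under the
    converter version, so a later (hopefully better) converter produces a
    new rendition instead of serving the old one.
    """
    # Key the rendition on the content of the original, not its name.
    digest = hashlib.sha256(original.read_bytes()).hexdigest()[:16]
    rendition = cache_dir / converter_version / f"{digest}{target_ext}"
    if not rendition.exists():
        rendition.parent.mkdir(parents=True, exist_ok=True)
        converter(original, rendition)  # conversion happens only on request
    return rendition
```

A caller would pass in the stored original and the format the user wants, for example `migrate_on_demand(Path("letter.doc"), ".odt", Path("renditions"), my_converter, "3.0.0")`, where `my_converter` is whatever wrapper the archive has written around its chosen tool.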

I have suggested a couple of times that a plausible hypothesis for “office documents” (ie text documents, spreadsheets, simple drawings, presentations, simple databases) is that OpenOffice.org should be the migration tool of choice. After all, it supposedly reads files in 180+ different file formats, it is open source, it is widely used, it is actively developed, and it can produce output in at least one internationally standardised format. I’ve noted already that it isn’t perfect; the Mac version, for instance, fails to open my problem PowerPoint version 4 files (to be fair, it doesn’t claim to). But perhaps it’s worth taking a look again at the range of formats it claims to deal with. All these figures relate to vanilla OpenOffice.org 3.0.0.0 (build 9358) for the Mac (the first fully native, up to date Mac version I've been able to lay my hands on).

So, a health warning: this rather long post goes into more detail on something I've covered before!

The single OpenOffice.org application opens 6 main classes of document: text document, spreadsheet, presentation, drawing, database and formula (the list of file formats has moved to an appendix of the Getting Started guide). In each case, a majority (or close to it) of the supported formats are OpenOffice native formats or those of its predecessors, plus a good selection of Microsoft Office formats (Office 6.0/95/97/XP/2000 and the XML versions from Office 2003 and Office 2007). Since these have been the dominant office suites, the large majority of documents from the past 10 years or so should be readable.

The most well-known (and presumably widely-used) remaining supported word processing format is WordPerfect; not clear which versions (.wpd). Then there are some interesting ones: for example DocBook, and the Chinese-developed Uniform Office Format, the Korean Hangul WP 97, AportisDoc for the Palm, Pocket Word, and the Czech T602. Interesting that: significant investment to ensure these minority but presumably significant formats can be handled.

Similarly for the spreadsheets, as well as the various native and Microsoft formats, there is also support for two earlier significant players, Lotus 1-2-3 and Quattro Pro 6.0, as well as dBase (.dbf) and Data Interchange Format (DIF), not to mention CSV. It’s interesting that dBase is treated as a spreadsheet rather than a database; I wonder what the limitations are.

Presentations are more limited; apart from the basic OpenOffice and MS Office variants, they include Computer Graphics Metafile (CGM) as a presentation format but not a drawing format, which is a bit odd. PDF is also included; well, it does do presentations, and they seem to have some advantages over PowerPoint.

Graphics formats have always been popular in the open source community, so it’s not surprising that a wide range of formats is supported for graphics. Aside from several OpenOffice formats, these include a few surprises such as AutoCAD’s DXF Interchange Format, and Kodak PhotoCD, as well as a large range of usual suspects (GIF, PNG, TIFF, JPEG, many bit-mapped formats).

Finally the only database supported is the OpenOffice native database (remembering that dBase is apparently supported, as a spreadsheet, presumably with limitations). I tried to open a Microsoft Access database from a previous computer (Win95?), without success. Old databases do tend to be a bit of a problem; I have heard there are significant compatibility problems even between successive versions of Access. And Formula supports a couple of OpenOffice formats, plus MathML, which should be good for scientific use today.

So, for nothing, you get a migration tool that deals with a substantial proportion of current or recent documents. I don’t have enough experience yet to judge how effective it is. I did try a trivial round-trip test: opening a Microsoft Word 2004 for Mac document in OpenOffice, saving as native OpenOffice, then re-opening and saving as Word again, followed by a document compare in Word; it revealed very small layout differences in nested bullets (which resulted in pagination changes), and a few minor changes in styles. Not quite a fully reversible migration, but the result was a perfectly acceptable rendition of the original.
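A round-trip test like this could be made a little more repeatable with a small script. The sketch below is only an assumption about how one might do it, not what I actually did: it assumes the paragraph text of the original and of the round-tripped document has already been extracted into lists of strings by some other means, and the name `round_trip_report` is hypothetical. It compares text only, so it would not catch the layout and pagination differences mentioned above.

```python
import difflib
from typing import Dict, List


def round_trip_report(original: List[str], round_tripped: List[str]) -> Dict:
    """Compare paragraph text before and after a round-trip migration."""
    matcher = difflib.SequenceMatcher(a=original, b=round_tripped, autojunk=False)
    # Every opcode that is not "equal" marks a run of changed paragraphs.
    changed = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    return {
        "similarity": matcher.ratio(),      # 1.0 means identical text
        "changed_blocks": len(changed),
        "diff": list(difflib.unified_diff(original, round_tripped, lineterm="")),
    }


if __name__ == "__main__":
    before = ["Title", "• one", "  • nested", "End"]
    after = ["Title", "• one", "• nested", "End"]
    report = round_trip_report(before, after)
    print(report["similarity"], report["changed_blocks"])
    print("\n".join(report["diff"]))
```

A similarity below 1.0 only says that something changed; the diff still has to be read by a person to decide whether the rendition is acceptable.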

Now a migrate on demand tool is only useful in this role as long as (or if) the original file format is supported. If you are interested in older documents, from what one might call the baroque period of early personal and office documents (say from the invention of microcomputers for home use, through early “personal computers”, up to the big shakeouts of the mid to late 1990s), you will find OpenOffice rather less helpful as a migrate on demand tool. On one argument, this doesn’t matter much, as comparatively speaking such formats represent a tiny minority of surviving documents (unproven but pretty safe assertion!). However, this class of baroque period documents is starting to become important to archives (real archives, not collections of backups, or even digital preservation repositories), as they begin to collect them as part of the “papers” of eminent individuals. See for example, the Digital Lives project mentioned here before.

So, here are two proposals (for both of which, specifications as well as known working code would be useful!):
  • Funders, Foundations etc - please fund efforts to add input filters to OpenOffice for such older document formats, and
  • Computing Science departments - please set group assignments that would result in components of such filters being contributed to the OpenOffice effort.
Collectively, we might suggest the underlying effort here as an OpenOffice Legacy Files Project. Does anyone know how to set up such a project?

BTW after my last posting on this topic, a linkback led me to a post where Leslie Johnston mentioned Conversions Plus as having been a life-saver on several occasions. It’s a commercial tool, so maybe there are licence and survivability issues, but the list of formats it claims is impressive. In the Word Processing area alone, you get:
  • 3 versions of Ami Pro
  • 2 versions of AppleWorks
  • ClarisWorks 1.0 - 5.0
  • 3 versions of MacWrite
  • DCA-RFT
  • 3 versions of Multimate
  • Many versions of Word, back to MS Word DOS 5.5
  • Several versions of MS Works
  • PerfectWorks
  • WordPerfect for DOS and Windows
  • WordPerfect Works
  • WordStar for DOS
  • Several versions of Lotus Word Pro
Is a tool like this a better bet than OpenOffice for migration on demand? In the longer term, I don’t think so, even if it might be more helpful in the short term. You’d have to be convinced that the company will still exist to supply it, and that it will still run on your then current hardware. It might, but the odds seem somewhat better for a very popular open source application like OpenOffice.

But in the end, one way or another, you pays your money and you place your bets!

3 comments:

  1. An important issue regarding Open Office, though, is whether developer support for it will continue. If it doesn't, then it will become too out of date to be useful in a few years.

    I ran across a reference to this problem at the Coding Horror site (http://www.codinghorror.com/blog/archives/001215.html), where he linked to this site (http://www.gnome.org/~michael/blog/ooo-commit-stats-2008.html) on the developer stats for Open Office:

    "Crude as they are - the statistics show a picture of slow disengagement by Sun, combined with a spectacular lack of growth in the developer community. In a healthy project we would expect to see a large number of volunteer developers involved, in addition - we would expect to see a large number of peer companies contributing to the common code pool; we do not see this in OpenOffice.org. Indeed, quite the opposite we appear to have the lowest number of active developers on OO.o since records began: 24, this contrasts negatively with Linux's recent low of 160+. Even spun in the most positive way, OO.o is at best stagnating from a development perspective. "

  2. You are probably aware of Docvert. I've used a similar approach (since I wasn't aware of Docvert back then) and integrated the TEI tools for OpenOffice to create TEI files from MS Word.

  3. Isn't the decline in development engagement with OO in part just indicative of the fact that OO has reached a level where it is now a serious competitor against other software products? And does the decline really mean it won't continue to be useful over the longer term? I don't think so. a) it will remain useful for formats it currently supports, and b) as a software package, it's sufficiently accessible that the preservation community could, if they wish, 'sponsor' ongoing development activities to support new or alternative formats - which is what Chris suggests in part of his post. This would seem a logical activity - seize on currently available tools and tailor them to our purposes, rather than start from scratch.


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.