Monday, 28 April 2008

PDF: Preserves Data Forever? Hmm…

It’s always good to see new papers coming out of the DPC. Some fantastic work has been undertaken under the DPC banner over the years, and the organisation has done a great job of raising awareness and contributing to approaches addressing digital preservation. But I was somewhat concerned to read the press release announcing their latest Technology Watch – stating that ‘PDF should be used to preserve information for the future’ and ‘the already popular PDF file format adopted by consumers and business alike is one of the most logical formats to preserve today’s electronic information for tomorrow’.

This is a fairly controversial statement to make. Yes, PDF can have its uses in a preservation environment, particularly for capturing the appearance characteristics of a document. New versions of the PDF reader tend to render old files in the same way as older viewers did. It has the advantage of being an open standard, despite being proprietary, and conversion tools are freely and widely available. But it is not a magic bullet and there are several potential shortcomings – for example, PDFs are commonly created by non-Adobe applications, which produce PDF files of varying quality and functionality; it’s not useful for preserving other types of digital records such as emails, spreadsheets, websites or databases; it’s not great for machine parsing (as Owen pointed out in a previous comment on this blog); and there are several issues with the PDF standard which even led to the development of PDF/A – PDF for Archiving.

To be fair, the press release does later say that the report suggests adopting PDF/A as a potential solution to the problem of long term digital preservation. And the report itself also focuses more on ‘electronic documents’ than electronic information per se, a generic ‘catch all’ phrase that includes types of information for which PDF is just not suitable.

So what is the report all about? Well, essentially it’s an introduction to the PDF family – PDF/A, PDF/X, PDF/E, PDF/Healthcare, and PDF/UA – in a fairly lightweight preservation context. There is some discussion of alternative formats – including TIFF, ODF, and the use of XML (particularly with regard to XPS, Microsoft's XML Paper Specification) – and an overview of current PDF standards development activities. It’s good to have such an easily digestible overview of the general PDF/A specs and the PDF family. But what I really missed in the report was an in-depth discussion of the practical issues surrounding use of PDF for preservation. For example, how can you convert standard PDF files to PDF/A? How do you convert onwards from PDF/A into another format? In which contexts may PDF/A be unsuitable, for example, in light of a specific set of preservation requirements? What if you required external content links to remain functional, as PDF/A does not allow external content references – what would you get instead? Would the file contain accessible information to tell you that a given piece of text previously used to link to an external reference? And what exactly is the definition of external reference here – external to the document, or external to the organisation? Should links to an external document in the same records series actually be preserved with functionality intact, and is it even possible?

Speaking of preservation requirements, it would have been particularly useful if the report had included a discussion of preservation requirements for formats – this would have informed any subsequent selection or rejection of a format, especially in the section on ‘technologies’. The final section on recommendations hints at this, but does not go into detail. There are also a few choice statements that simply left me wondering – one that really caught my eye was ‘this file format may be less valuable for archival purposes as it may be considered to be a native file format’ (p17), which seems to discount the value of native formats altogether. Perhaps the benefit of submitting native formats alongside PDF representations is a subject which deserves more discussion, particularly for preserving structural and semantic characteristics.

I wholeheartedly agree with perhaps the most pertinent comment in the press release – that PDF ‘should never be viewed as the Holy Grail. It is merely a tool in the armoury of a well thought out records management policy’ (Adrian Brown, National Archives). PDF can have its uses, and the report has certainly encouraged more debate in the organisations I work with as to what those uses are. Time will tell whether the debate will become broader still.


