Wednesday, 16 April 2008

Thoughts on conversion issues in an Institutional Repository

A few people from a commercial repository solution visited UKOLN last week to talk about their brand and the services they offer. This was a useful opportunity to explore the issues around using commercial repository solutions rather than developing a system in-house, which is where most of my experience with institutional repositories has lain to date.

One thing that particularly caught my interest was the fact that their service accepts deposits from authors in the native file format. These are then converted to an alternative format for dissemination. In the case of text documents, for example, this format is PDF. The source file is still retained, but it’s not accessible to everyday users. The system is therefore storing two copies of the file – the native source and a dissemination (in this case PDF) representation.

This is a pretty interesting feature and one that I haven’t come across much so far in other institutional repository systems, particularly for in-house developments. But, as with every approach, there are pros and cons. So what are the pros? Well, as most ‘preservationists’ would agree, storage of the source file is widely considered to be A Good Thing. We don’t know what we’re going to be capable of in the future, so storing the source file enables us to be flexible about the preservation strategy implemented, particularly for future emulations. Furthermore, it can also be the most reliable file from which to carry out migrations: if each iterative migration results in a small element of loss, then the cumulative effect of successive migrations can turn a small loss into a big one; the most reliable file from which to start a migration and minimise the effect of loss should therefore be the source file.
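To put some rough numbers on the cumulative loss argument, here is a minimal sketch. It assumes a deliberately simplistic model that is mine rather than anything measured: every migration loses the same fixed fraction of the content, and the losses compound. The function names and the 2% figure are purely illustrative.

```python
def retained_after_chain(loss_per_step: float, steps: int) -> float:
    """Fraction of content left after `steps` chained migrations,
    assuming each step loses the same fraction (a toy model)."""
    return (1.0 - loss_per_step) ** steps


def retained_from_source(loss_per_step: float) -> float:
    """Fraction left after a single migration made directly from the source file."""
    return 1.0 - loss_per_step


# Hypothetical figures: 2% loss per migration, five format generations.
chained = retained_after_chain(0.02, 5)
direct = retained_from_source(0.02)
print(f"chained: {chained:.3f}, from source: {direct:.3f}")
# prints "chained: 0.904, from source: 0.980"
```

Under those assumptions, five chained migrations lose nearly 10% of the content, while going back to the source each time loses only 2%. Real migrations won’t behave this neatly, but the direction of the effect is the point.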

However, there are obvious problems (or cons) in this as well. The biggest contender is the simple problem of technological obsolescence that we’re trying to combat in the first place. Given that our ability to reliably access the file contents (particularly proprietary ones) is under threat with the passage of time, there’s no guarantee that we’ll be able to carry out future migrations from the source file if we don’t also ensure we have the right metadata and retain ongoing access to appropriate and usable software (for example). And knowing which software is needed can be a job in itself – just because a file has a *.doc suffix doesn’t mean it was created using Word, and even if it was, it’s not necessarily easy to figure out a) which version created the file and b) what unexpected features the author has included in the file that may impact on the results of a migration. This latter point is an issue not just in the future, but now.
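As a small illustration of why the suffix alone can’t be trusted, the sketch below looks at the leading bytes of a file (its ‘magic number’) rather than its name. The function name and the tiny signature table are my own for illustration; a real repository would more likely lean on a dedicated format identification tool, and even a matching signature says nothing about which software or version produced the file.

```python
# Leading bytes of a few common formats. Far from exhaustive.
SIGNATURES = {
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE2",  # container used by legacy binary Word files
    b"%PDF-": "PDF",
    b"{\\rtf": "RTF",
    b"PK\x03\x04": "ZIP",  # container used by OOXML and ODF
}


def sniff_format(data: bytes) -> str:
    """Guess a container format from the first bytes of a file (illustrative helper)."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"


# A file named 'thesis.doc' whose content is actually RTF:
print(sniff_format(b"{\\rtf1\\ansi Hello"))  # prints "RTF"
```

In practice you would pass in the first few bytes read from the deposited file and compare the result with what the suffix claims, flagging any mismatch for a closer look.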

Thinking about this led me to consider the issue of responsibility. It’s not unreasonable to think that by accepting source formats and carrying out immediate conversions for access purposes, the repository (or rather the institution) is therefore assuming responsibility for checking and validating conversion outcomes. If it goes wrong and unnoticed errors creep in from such a conversion, is the provider (commercial or institutional) to blame for potentially misrepresenting academic or scientific works? Insofar as immediate delivery of objects from an institutional repository goes, this should - at the very least - be addressed in the IR policies.

It’s impossible to do justice to this issue in a single blog post. But these are interesting issues – not just responsibility but also formats and conversion – that I suspect we’ll hear more and more about as our experience with IRs grows.


  1. Another issue around the format is the ability to use data mining tools on items in the repository. The SPECTRa-T project has found just how poor PDF is as a format for machine parsing - using the source file for text-based materials at least would be a benefit here.

    I think submission in the original format, with conversion to a dissemination format, is the best we can hope for, and is certainly better than the PDF submission situation we are generally in at the moment.

  2. So the approach of offering (say) both a Word version (for mining) and a PDF version (for expected longevity) sounds like it might be useful. I'm seeing that more often in repositories these days...


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.