Monday, 7 July 2008

Middle Earth survey preservation

This isn't, as the title of the post might suggest, a posting about preserving geological or geographical data. The CADAIR repository at Aberystwyth University recently ingested an unusually large audience response survey to the Lord of the Rings film, containing just under 25,000 responses from speakers of 14 different languages. One of the repository workers, Stuart Lewis, has posted the details of this on his blog. In short, two versions of the database have been archived: an MS Access version, and an XML representation created by MS Access. These are accompanied by PDF and DOC versions of a user guide to the LotR survey data, and they are all accessible via the local CADAIR DSpace installation.

It's not clear how long the repository will preserve these items, and I couldn't find a preservation policy that sheds any light on this. We had a quick look at both versions and spotted a few issues that affect usability regardless of the intended length of retention, such as:
  • Several questions contain numeric values as answers, but it's not clear what the values are supposed to represent. E.g., on the gender question, does 1 = male or female? You could probably work them out by cross-referencing the data against the copy of the questionnaire included in the accompanying user guide, but this isn't a fail-safe technique!
  • You still need the user guide to understand the content, even for descriptive plain text entries, as column names are only abbreviated summaries of the questions. For example, it's not clear what the answers in the column 'middle earth' are actually addressing without recourse to the user guide. This is the same in both the XML and MS Access versions, and it highlights a potential issue in relying on the built-in converter tool alone: it doesn't allow you to manipulate the XML and add extra value. For instance, why not include more detailed information about the questions in the XML file?
  • It would be useful to know from the dataset record entry in CADAIR that it contains text in multiple languages - the dc.language value is English and this is correct insofar as field names go, but not for all of the content.
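
To make the first two points concrete, here is a rough sketch of how the exported XML could be enriched after conversion so that each answer carries its question wording and, for coded answers, a readable label. Everything specific in it is an assumption for illustration only: the element names, the sample records, the question wording and the code-to-label mapping (including 1 = male) are invented, and the real values would have to come from the user guide.

```python
import xml.etree.ElementTree as ET

# Hypothetical codebook: the column names, question wording and code labels
# below are illustrative assumptions, NOT taken from the real survey or guide.
CODEBOOK = {
    "gender": {
        "question": "What is your gender?",
        "labels": {"1": "male", "2": "female"},
    },
    "middle_earth": {
        "question": "Where, and when, is Middle-earth for you?",
        "labels": {},  # free-text answer, so no coded values
    },
}

# Hypothetical fragment: one element per response, one child per column.
SAMPLE = """<dataroot>
  <responses><gender>1</gender><middle_earth>A mythical past</middle_earth></responses>
  <responses><gender>2</gender><middle_earth>New Zealand</middle_earth></responses>
</dataroot>"""


def annotate(root, codebook):
    """Attach question text and value labels to every known column element."""
    for record in root:
        for field in record:
            entry = codebook.get(field.tag)
            if entry is None:
                continue  # leave columns missing from the codebook untouched
            field.set("question", entry["question"])
            label = entry["labels"].get((field.text or "").strip())
            if label is not None:
                field.set("label", label)
    return root


root = annotate(ET.fromstring(SAMPLE), CODEBOOK)
print(ET.tostring(root, encoding="unicode"))
```

The original coded values are left in place and the extra information is added as attributes, so nothing in the archived data is overwritten. A separate codebook file deposited alongside the data would achieve much the same thing without touching the export at all.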
There are also possible issues with the independence of the XML created by MS Office, but I don't have much first-hand experience of it, so I wouldn't like to comment too much. If anyone reading this does have such experience, then please share it!


