Monday 28 April 2008

PDF: Preserves Data Forever? Hmm…

It’s always good to see new papers coming out of the DPC. Some fantastic work has been undertaken under the DPC banner over the years, and the organisation has done a great job of raising awareness and contributing to approaches addressing digital preservation. But I was somewhat concerned to read the press release announcing their latest Technology Watch report, stating that ‘PDF should be used to preserve information for the future’ and ‘the already popular PDF file format adopted by consumers and business alike is one of the most logical formats to preserve today’s electronic information for tomorrow’.

This is a fairly controversial statement to make. Yes, PDF can have its uses in a preservation environment, particularly for capturing the appearance characteristic of a document. New versions of the PDF reader tend to render old files in the same way as older viewers did. It has the advantage of being an openly published specification, despite being under proprietary control, and conversion tools are freely and widely available. But it is not a magic bullet, and there are several potential shortcomings. For example, PDFs are commonly created by non-Adobe applications that produce files of varying quality and functionality; PDF is not useful for preserving other types of digital records such as emails, spreadsheets, websites or databases; it’s not great for machine parsing (as Owen pointed out in a previous comment on this blog); and there are enough issues with the PDF standard that they led to the development of PDF/A – PDF for Archiving.

To be fair, the press release does later say that the report suggests adopting PDF/A as a potential solution to the problem of long-term digital preservation. The report itself also focuses more on ‘electronic documents’ than on ‘electronic information’ per se – a generic catch-all phrase that includes types of information for which PDF is simply not suitable.

So what is the report all about? Well, essentially it’s an introduction to the PDF family – PDF/A, PDF/X, PDF/E, PDF/Healthcare, and PDF/UA – in a fairly lightweight preservation context. There is some discussion of alternative formats – including TIFF, ODF, and the use of XML (particularly with regard to XPS, Microsoft's XML Paper Specification) – and an overview of current PDF standards development activities. It’s good to have such an easily digestible overview of the general PDF/A specs and the PDF family. But what I really missed in the report was an in-depth discussion of the practical issues surrounding use of PDF for preservation. For example, how can you convert standard PDF files to PDF/A? How do you convert onwards from PDF/A into another format? In which contexts might PDF/A be unsuitable, for example in light of a specific set of preservation requirements? What if you required external content links to remain functional, given that PDF/A does not allow external content references – what would you get instead? Would the file contain accessible information to tell you that a given piece of text previously linked to an external reference? And what exactly is the definition of an external reference here – external to the document, or external to the organisation? Should links to an external document in the same records series actually be preserved with functionality intact, and is it even possible?
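
The report doesn’t go into the mechanics either, so just to make the first of those questions concrete: one route people take is batch conversion with a tool such as Ghostscript. The sketch below is only illustrative – it assumes Ghostscript is on the PATH, the exact flags vary between versions (full compliance usually also needs a PDFA_def.ps definition file and an ICC profile), and the output would still need checking with a PDF/A validator.

# A minimal sketch of one possible PDF -> PDF/A conversion route, calling
# Ghostscript from Python. Assumptions: Ghostscript is installed and on the
# PATH; exact flags vary between versions; the result should still be
# validated against the PDF/A specification.
import subprocess

def convert_to_pdfa(src: str, dest: str) -> None:
    """Attempt to re-write src as a PDF/A file at dest."""
    subprocess.run(
        [
            "gs",
            "-dPDFA",                      # request PDF/A output
            "-dBATCH", "-dNOPAUSE",        # run non-interactively
            "-sDEVICE=pdfwrite",           # use the PDF-writing device
            "-dPDFACompatibilityPolicy=1", # drop forbidden features rather than abort
            f"-sOutputFile={dest}",
            src,
        ],
        check=True,
    )

# Placeholder filenames for illustration only.
convert_to_pdfa("report.pdf", "report_pdfa.pdf")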

Speaking of preservation requirements, it would have been particularly useful if the report had included a discussion of preservation requirements for formats – this would have informed any subsequent selection or rejection of a format, especially in the section on ‘technologies’. The final section on recommendations hints at this, but does not go into detail. There are also a few choice statements that simply left me wondering – one that really caught my eye was ‘this file format may be less valuable for archival purposes as it may be considered to be a native file format’ (p17), which seems to discount the value of native formats altogether. Perhaps the benefit of submitting native formats alongside PDF representations is a subject that deserves more discussion, particularly for preserving structural and semantic characteristics.

I wholeheartedly agree with perhaps the most pertinent comment in the press release – that PDF ‘should never be viewed as the Holy Grail. It is merely a tool in the armoury of a well thought out records management policy’ (Adrian Brown, National Archives). PDF can have its uses, and the report has certainly encouraged more debate in the organisations I work with as to what those uses are. Time will tell as to whether the debate will become broader still.

Friday 25 April 2008

A Thousand Open Molecular Biology Databases

In January of each year, Nucleic Acids Research (NAR) publishes a special issue on databases for molecular biology research. To be considered, databases have to be open access (they specifically mean browsable without a username, password or payment, although it is possible there are conditions). The staggering thing is, in the past year the number of such databases passed 1,000!

My institution has a subscription to NAR, but the nice thing about the Database Issue is that it has itself been Open Access for the past 4 years or so. So you can check it out for yourself if you are interested.

I thought it might be interesting to go back and trace how they managed to get to 1,000+ databases in 15 years or so. It turns out to be relatively easy to check, back to about the year 2001; prior to that, as far as I can tell, you have to count them yourself. I did the count for 1999, so here’s a little picture of the growth since then (see Burks, 1999):


The growth is quite staggering.

In recent years, the compilation article has been written by Michael Y. Galperin of the National Center for Biotechnology Information, US National Library of Medicine. He reports in the 2008 article that the complete list and summaries of 1078 databases are available online at the Nucleic Acids Research web site, http://www.oxfordjournals.org/nar/database/a.

This year’s article has some interesting comments on databases; I particularly liked this one on Deja Vu, which uses a tool called eTBLAST “to find highly similar abstracts in bibliographic databases, including MEDLINE…. Some highly similar publications, however, come from different authors and look extremely suspicious” (Galperin, 2008). Curious! As usual he also reports on databases that appear to be no longer maintained and have been dropped from the list (around two dozen this year). Sometimes this is related to (perhaps because of, or maybe the cause of) the content being available in other databases.

There has in fact been relatively little attrition; they claim not to re-use accession numbers, and the highest accession number so far is 1176, implying that just 98 databases have been dropped from the list! Galperin suggests “that the databases that offer useful content usually manage to survive, even if they have to change their funding scheme or migrate from one host institution to another. This means that the open database movement is here to stay, and more and more people in the community (as well as in the financing bodies) now appreciate the importance of open databases in spreading knowledge. It is worth noting that the majority of database authors and curators receive little or no remuneration for their efforts and that it is still difficult to obtain money for creating and maintaining a biological database. However, disk space is relatively cheap these days and database maintenance tools are fairly straightforward, so that a decent database can be created on a shoestring budget, often by a graduate student or as a result of a postdoctoral project. […] Subsequent maintenance and further development of these databases, however, require a commitment that can only be applauded.” (Galperin, 2005)

So is this vanity databasing (this from a blog author, mind)? “In the very beginning of the genome sequencing era, Walter Gilbert and colleagues warned of 'database explosion', stemming from the exponentially increasing amount of incoming DNA sequence and the unavoidable errors it contains. Luckily, this threat has not materialized so far, due to the corresponding growth in computational power and storage capacity and the strict requirements for sequence accuracy.” (Galperin, 2004)

It’s not clear from the quote above how worthwhile or well-used the databases are. In the 2006 article, Galperin began looking at measures of impact, using the Science Citation Index. We can see from his reference list that he expects the NAR paper to stand proxy for the database. The highest cited were Pfam, GO, UniProt, SMART and KEGG, all highly used “instant classics” (with >100 citations each in 2 years!). However, he writes: “On the other side of the spectrum are the databases that have never been cited in these 2 years, even by their own authors. This does not mean, of course, that these databases do not offer a useful content but one could always suggest a reason why nobody has used this or that database. Usually these databases were too specific in scope and offered content that could be easily found elsewhere.” (Galperin, 2006)

In the 2007 article, Galperin returned to this issue of how well databases are used. “However, citation data can be biased; e.g. in many articles use of information from publicly available databases is acknowledged by providing their URLs, or not acknowledged at all. Besides, some databases could be cited on the web sites and in new or obscure journals, not covered by the ISI Citation Index.” (Galperin, 2007) He then goes on to describe some alternative measures he has investigated as proxies for this citation problem. This is a real issue, I think; data and dataset citations are not made as often or as consistently as they should be, and the advice is often conflicting, as are the standards themselves (of which perhaps the best is the NLM standard). Indeed, the NAR articles describing databases seem to stand proxy for the databases: “the user typically starts by finding a database of interest in PubMed or some other bibliographic database, then proceeds to browse the full text in the HTML format. If the paper is interesting enough, s/he would download its text in the PDF format. Finally, if the database turns to be useful, it might be acknowledged with a formal citation.”

This is probably enough for one blog post, but I’ll return, I think, to have a look at some of these databases in a bit more detail.

BURKS, C. (1999) Molecular Biology Database List. Nucleic Acids Research, 27, 1-9. http://nar.oxfordjournals.org/cgi/content/abstract/27/1/1

GALPERIN, M. Y. (2004) The Molecular Biology Database Collection: 2004 update. Nucleic Acids Research, 32, D3-D22. http://nar.oxfordjournals.org/cgi/content/abstract/32/suppl_1/D3

GALPERIN, M. Y. (2005) The Molecular Biology Database Collection: 2005 update. Nucleic Acids Research, 33.

GALPERIN, M. Y. (2006) The Molecular Biology Database Collection: 2006 update. Nucleic Acids Research, 34. http://nar.oxfordjournals.org/cgi/content/full/34/suppl_1/D3

GALPERIN, M. Y. (2007) The Molecular Biology Database Collection: 2007 update. Nucleic Acids Research, 35, D3-D4. http://nar.oxfordjournals.org/cgi/content/abstract/35/suppl_1/D3

GALPERIN, M. Y. (2008) The Molecular Biology Database Collection: 2008 update. Nucleic Acids Research, 36, D2-D4. http://dx.doi.org/10.1093/nar/gkm1037

Monday 21 April 2008

RLUK launched... but relaunch flawed?

Neil Beagrie reminds us that after 25 years, the Consortium of University (and?) Research Libraries (CURL) has relaunched itself as RLUK:
"On Friday 18th April the Consortium of Research Libraries (CURL) celebrated its 25th anniversary and launched it new organisational title: Research Libraries UK (RLUK). A warm welcome to RLUK and best wishes for the next 25 years!"
Congratulations to them... well, maybe. I had a quick look for some key documents; here's a URL I forwarded to my colleagues a year or so ago: http://www.curl.ac.uk/about/E-ResearchNeedsAnalysisRevised.pdf. Or, more recently, try something on their important HEFCE UKRDS Shared Services Study: http://www.curl.ac.uk/Presentations/Manchester%20November%2007/SykesHEFCEStudy2.pdf. Both give me a big fat "Page not found". In the latter case, when I find their tiny search box, and search for UKRDS, I get "Your search yielded no results".

I am really, desperately sad about this. Remember all the fuss about URNs? Remember all we used to say about persistent IDs? Remember "Cool URIs don't change"? The message is, persistent URIs require commitment; they require care. They don't require a huge amount of effort (it's simply a redirection table, after all). But libraries should be at the forefront of making this work. I have emailed RLUK, without response so far. Come on guys, this is IMPORTANT!
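
To show just how little machinery is involved, here's a rough sketch of a redirection table using nothing but the Python standard library – the old path is taken from one of the broken CURL URLs above, and the target URL is invented purely for illustration.

# A minimal sketch of the kind of redirection table that keeps old URIs alive
# after a site relaunch, using only Python's standard library. The target URL
# below is invented for illustration; the old path is from the broken link above.
from http.server import BaseHTTPRequestHandler, HTTPServer

REDIRECTS = {
    "/about/E-ResearchNeedsAnalysisRevised.pdf":
        "http://www.example.org/new-home/e-research-needs-analysis.pdf",
}

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = REDIRECTS.get(self.path)
        if target:
            self.send_response(301)           # permanent redirect to the new home
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_error(404, "Page not found")

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectHandler).serve_forever()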

Oh and just in case you think this is isolated, try looking for that really important, seminal archiving report, referenced everywhere at http://www.rlg.org/ArchTF/. I had something to do with the RLG merger into OCLC that caused that particular snafu, and after making my feelings known have been told that "We're taking steps to address not only DigiNews links but those of other pages that are still referred to from other sites and personal bookmarks". The sad thing about that particular report, you might discover, is that it doesn't appear to be archived on the Wayback Machine either, I suspect because it had an FTP URL.

[UPDATE The report does now appear on the OCLC website at http://www.oclc.org/programs/ourwork/past/digpresstudy/default.htm. When I first searched for Waters Garrett from the OCLC home page a few weeks ago, I couldn't find it. I guess they haven't quite got round to building the redirection table yet... but that can take time.]

Grump!

Institutional Repository Checklist for Serving Institutional Management

DCC News (http://www.dcc.ac.uk/, news item visible on 21 April 2008) draws our attention to this interesting paper:
Comments are requested on a draft document from the presenters of the "Research Assessment Experience" session at the EPrints User Group meeting at OR08. The "Institutional Repository Checklist for Serving Institutional Management" lists 13 success criteria for repository managers to be aware of when making plans for their repositories to provide any kind of official research reporting role for their institutional management.
Find out more (note that there are at least three versions, so comments have already been incorporated).

I liked this: "The numeric risk factors in the second column reflect the potential consequences of failure to deliver according to an informal nautical metaphor: (a) already dead in the water, (b) quickly shipwrecked and (c) may eventually become becalmed."

This kind of work is important; repositories have to be better at being useful tools for all kinds of purposes before they will become part of the researcher's workflow...

Wednesday 16 April 2008

Thoughts on conversion issues in an Institutional Repository

A few people from a commercial repository solution provider visited UKOLN last week to talk about their brand and the services they offer. This was a useful opportunity to explore the issues around using commercial repository solutions rather than developing a system in-house, which is where most of my experience with institutional repositories has lain to date.

One thing that particularly caught my interest was the fact that their service accepts deposits from authors in the native file format. These are then converted to an alternative format for dissemination. In the case of text documents, for example, this format is PDF. The source file is still retained, but it’s not accessible to everyday users. The system therefore stores two copies of the file – the native source and a dissemination (in this case PDF) representation.

This is a pretty interesting feature and one that I haven’t come across much so far in other institutional repository systems, particularly in-house developments. But, as with every approach, there are pros and cons. So what are the pros? Well, as most ‘preservationists’ would agree, storage of the source file is widely considered to be A Good Thing. We don’t know what we’re going to be capable of in the future, so storing the source file enables us to be flexible about the preservation strategy implemented, particularly for future emulation. Furthermore, it can also be the most reliable file from which to carry out migrations: if each iterative migration results in a small element of loss, then the cumulative effect of migrations can turn small loss into big loss; the most reliable file from which to start a migration and minimise the effect of loss should therefore be the source file.

However, there are obvious problems (or cons) with this as well. The biggest is the very problem of technological obsolescence that we’re trying to combat in the first place. Given that our ability to reliably access file contents (particularly proprietary ones) is under threat with the passage of time, there’s no guarantee that we’ll be able to carry out future migrations from the source file if we don’t also ensure we have the right metadata and retain ongoing access to appropriate and usable software (for example). And knowing which software is needed can be a job in itself – just because a file has a *.doc suffix doesn’t mean it was created using Word, and even if it was, it’s not necessarily easy to figure out a) which version created the file and b) what unexpected features the author has included in the file that may affect the results of a migration. This latter point is an issue not just in the future, but now.
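
To illustrate the *.doc point: identification tools work from a file's leading bytes (its signature) rather than its extension. The sketch below is a toy version with just two signatures hard-coded; a real tool such as DROID consults a registry like PRONOM covering hundreds of formats and versions.

# A rough sketch of signature-based (rather than extension-based) format
# identification. It knows only two well-known signatures; real tools such as
# DROID consult a registry (e.g. PRONOM) and can also distinguish versions.
SIGNATURES = {
    bytes.fromhex("D0CF11E0A1B11AE1"): "OLE2 compound document (e.g. legacy .doc/.xls)",
    b"PK\x03\x04": "ZIP container (e.g. OOXML .docx, ODF)",
}

def identify(path: str) -> str:
    with open(path, "rb") as f:
        header = f.read(8)          # read just enough bytes to compare signatures
    for magic, label in SIGNATURES.items():
        if header.startswith(magic):
            return label
    return "unknown - the extension alone tells us very little"

# Placeholder filename for illustration only.
print(identify("submission.doc"))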

Thinking about this led me to consider the issue of responsibility. It’s not unreasonable to think that by accepting source formats and carrying out immediate conversions for access purposes, the repository (or rather the institution) is therefore assuming responsibility for checking and validating conversion outcomes. If it goes wrong and unnoticed errors creep in from such a conversion, is the provider (commercial or institutional) to blame for potentially misrepresenting academic or scientific works? Insofar as immediate delivery of objects from an institutional repository goes, this should - at the very least - be addressed in the IR policies.

It’s impossible to do justice to this issue in a single blog post. But these are interesting issues – not just responsibility but also formats and conversion – that I suspect we’ll hear more and more about as our experience with IRs grows.


Tuesday 15 April 2008

Wouldn't it be nice...

... if we had some papers for the 4th International Digital Curation Conference in Edinburgh (dinner in the Castle, anyone?) next December, created in a completely open fashion by a group of people who have perhaps never met, using data that's publicly available, and where the data and data structures in the paper are tagged (semantic web???), attached as supplementary materials, or deposited so as to be publicly available.

No guarantees, folks, since the resulting paper has to get through the Programme Committee rather than just me. But it would have a wonderful feeling of symmetry...

Monday 14 April 2008

Representation information from the planets?

Well, from the PLANETS project actually. A PLANETS report written by Adrian Brown of TNA on Representation Information Registries, drawn to our attention as part of the reading for the Significant Properties workshop, contains the best discussion on representation information I have seen yet (just in case, I checked the CASPAR web site, but couldn’t see anything better there). No doubt nearly all of the information is in the OAIS spec itself, but it’s often hard to read, with discussion of key concepts separated in different parts of the spec.

Just to recap, the OAIS formula is that a Data Object interpreted using its Representation Information yields an Information Object. Examples often cite specifications or standards, eg suggesting that the Repinfo (I’ll use the contraction instead of “representation information”) for a PDF Data Object might be (or include) the PDF specification.

Sometimes there is controversy about repinfo versus format information (often described by the repinfo enthusiasts as “merely structural repinfo”). So it’s nice to read a sensible comparison:
"For the purposes of this paper, the definition of a format proposed by the Global Digital Format Registry will be used:

“A byte-wise serialization of an abstract information model”.

The GDFR format model extends this definition more rigorously, using the following conceptual entities:

• Information Model (IM) – a class of exchangeable knowledge.
• Semantic Model (SM) – a set of semantic information structures capable of realizing the meaning of the IM.
• Syntactic Model (CM) – a set of syntactic data units capable of expressing the SM.
• Serialized Byte Stream (SB) – a sequence of bytes capable of manifesting the CM.

This equates very closely with the OAIS model, as follows:

• Information Model (IM) = OAIS Information Object
• Semantic Model (SM) = OAIS Semantic representation information
• Syntactic Model (CM) = OAIS Syntactic representation information
• Serialized Byte Stream (SB) = OAIS Data Object"
This does seem to place repinfo and format information (by this richer definition) in the same class.

Time for a short diversion here. I was quite taken by the report on significant properties of software, presented at the workshop by Brian Matthews (not that it was perfect, just that it was a damn good effort at what seemed to me to be an impossible task!). He talked about specifications, source code and binaries as forms of software. Roughly speaking, the cost of instantiating the software goes down as you move across those three forms (in a current environment, at least).

  • In preservation terms, if you only have a binary, you are pretty much limited to preserving the original technology or emulating it, but the result should perform “exactly” as the original.
  • If you have the source code, you will be able to (or have to) migrate, configure and re-build it. The result should perform pretty much like the original, with “small deviations”. (In practice, these deviations could be major, depending on what’s happened to libraries and other dependencies meanwhile.)
  • If you only have the spec, you have to re-write from scratch. This is clearly much slower and more expensive, and Brian suggests it will “perform only gross functionality”. I think in many cases it might be better than that, but in some cases much worse (eg some of the controversy about the Microsoft-backed OOXML standard with its MS-internal dependencies).
So on that basis, a spec as Repinfo is looking, well, not much help. In order for a Data Object to be “interpreted using” repinfo, the latter needs to be something that runs or performs; in Brian’s terms a binary, or at least software that works. The OAIS definitions of repinfo refer to three sub-types: structure, semantic and “other”, and the latter is not well defined. However, Adrian Brown’s report explains there is a special type of “other”:
“…Access Software provides a means to interpret a Data Object. The software therefore acts as a substitute for part of the representation information network – a PDF viewer embodies knowledge of the PDF specification, and may be used to directly access a data object in PDF format.”
This seems to make sense; again, it’s in the OAIS spec, but hard to find. So Brown proposes that:
“…representation information be explicitly defined as encompassing either information which describes how to interpret a data object (such as a format specification), or a component of a technical environment which supports interpretation of that object (such as a software tool or hardware platform).”
Of course the software tool or hardware platform will itself have a shorter life than the descriptive information, so both may be required.
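
Just to fix the idea in my own head, here's a toy sketch of that broadened definition – repinfo as either descriptive information or a component of a technical environment, with each item able to carry further repinfo of its own (the network). The class and field names are mine, not anything from OAIS or the PLANETS report.

# A toy sketch of Brown's broadened notion of representation information: a
# repinfo item is either descriptive (a specification) or a component of a
# technical environment (access software, a hardware platform), and each item
# can itself carry further repinfo, forming a network. Names are my own.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RepInfo:
    name: str
    kind: str                                   # "specification" | "access software" | "hardware platform"
    further_repinfo: List["RepInfo"] = field(default_factory=list)

@dataclass
class DataObject:
    identifier: str
    repinfo: List[RepInfo]

pdf_doc = DataObject(
    identifier="report-2008.pdf",
    repinfo=[
        RepInfo("PDF 1.4 specification", "specification"),
        RepInfo("PDF viewer binary", "access software",
                further_repinfo=[RepInfo("x86 platform description", "specification")]),
    ],
)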

The bulk of the report, of course, is about representation information registries (including format registries by this definition), and is also well worth a read.

Thursday 10 April 2008

4th Digital Curation Conference: Call for papers

The call is now open at http://www.dcc.ac.uk/events/dcc-2008/.
"The first day of the conference will focus on three key topics:

  • Radical sharing, new ways of doing science e.g. large scale research networks, mass collaboration, dynamic publishing tools, wikis, blogs, social networks, visualisations and immersive environments
  • Sustainability of curation
  • Legal issues including privacy, confidentiality and consent, intellectual property rights and provenance
"The second day of the conference will be dedicated to research and development and will feature peer-reviewed papers in themed parallel sessions."
There are several main themes:

  • Research Data Infrastructures (covering research data across all disciplines)
  • Curation and e-Research
  • Sustainability: balancing costs and value of digital curation and preservation
  • Disciplinary and Inter-disciplinary Curation Challenges
  • Challenging types of content
  • Legal Issues
  • Capacity Building
Key dates:
  • Submission of papers for peer-review: 25 July 2008
  • Submission of abstracts poster/demos/workshops for peer-review: 25 July 2008
  • Notification of authors: 19 September 2008
  • Final papers deadline: 14 November 2008
  • Submission of poster PDFs: 14 November 2008
We need good papers, so please get your thinking caps on!

UKOLN is 30

I’m at the UKOLN 30th birthday bash at the British Library, and this is my first effort at blogging while at the event. Not going well so far; it’s hard to listen, take notes and think of any critical summaries at the same time. I'll have to put the links in later...

Several initial talks have been celebratory and retrospective. We were treated to an interesting digression into naval warfare from an ex-commander of the UK aircraft carrier (or through-deck cruiser, as procured) Invincible, now laid up after 30 years (this ex-commander now runs MLA).

Cliff Lynch spoke about the changes in the last 15 years of UKOLN, as the focus changed from doing traditional library tasks better (library automation) to supporting changes in scholarly practices, leading to “social-scale changes”. He was particularly upbeat about some of the recent UKOLN projects engaging with data, including eBank UK and eCrystals, and about UKOLN’s role as part of the DCC.

Lorcan Dempsey gave a typically wide-ranging talk, “free of facts or justification”. He did attempt to show us his first PowerPoint presentation, but instead was only able to show us the error report saying that PowerPoint could not open xxx.ppt! I think everyone should do this… Microsoft, please wake up, you’re doing yourself serious damage. He took us through the pressures of concentration and diffusion in the web world, the big squeeze and the big switch. This is the story of “moving to the network level”. Many library activities are starting to be focused externally, but there is too much duplication and redundancy. Libraries, he said, face three challenges: providing specific local value, the “one big library on the web” idea (the acronym OBLOW shows it is not entirely serious), and the old one of sorting out library logistics so that they truly support the new scholarly process.

Well that’s it up to the break, I’m off for my cup of tea…

Tuesday 8 April 2008

Seriously Seeking Significance

Yesterday's event at the British Library on Significant Properties in digital objects was a real gem. The programme was packed to the hilt, with eleven different speakers plus 45 minutes for audience discussion at the end of the day. It was a fantastic opportunity to hear more about the SigProp research that JISC has funded, as well as how significant properties are being explored in a number of other contexts, such as in the PLANETS research into preservation characterisation.

Andrew Wilson started the day with a great and very targeted keynote that explored the background to significant properties research and what significant properties means to him. I was heartened when he almost immediately kicked off by talking about authenticity and significant properties, because authenticity is a big deal for me insofar as preservation is concerned! But this isn't a post about what we mean by authenticity in a digital environment; I shall save that for another day/time/place. Hot on his heels came Stephen Grace from CeRch, with an overview of the InSPECT project and future work, closely followed by David Duce presenting the conclusions of the Vector Images significant properties study. Interestingly, this used a slightly different measure of significance to the InSPECT project, which used a scale of 0 to 10, whereas the vector images study used 0 to 9. This may be a minor deviation at the moment but might well assume more relevance when dealing with automated processes for migrating and assessing collections of mixed object types. Mike Stapleton then presented the results of the Moving Images study and, after a short break, on came Brian Matthews with the results of the Software study and Richard Davis with the results of the eLearning Objects study.

It quickly became clear that there was some variance in people's understanding of significant properties, particularly when one speaker stated that for different preservation approaches they would need different significant properties to achieve the desired level of performance. This is different to how I perceive significant properties. For me - and for several other speakers - significant properties define the essence of the object and are those elements that must be preserved in order to retain the ability to reproduce an authentic version of the object. To select different significant properties based on a given preservation approach surely means there is a different underlying understanding and use of the term significant properties.

What does this all go to show? That it's a case of different strokes for different folks? Well, to some extent, yes. It was widely accepted several years ago that different sectors had different requirements insofar as preservation was concerned - I remember attending an ERPANET workshop in Amsterdam in 2004, for example, that clearly illustrated just this point. And yesterday's audience and speakers represented an array of sectors with different requirements for preservation. So the whole concept of significant properties and use of the term across different sectors is something that I think we'd benefit from returning to discuss some more.

The afternoon's sessions were, I think, intended to put a different perspective on the day. We heard from the PLANETS project, Barclays Bank, the DCC SCARP project, the SIGPROPS project from Chapel Hill, and a presentation on the relationship between Representation Information and significant properties. Cal Lee's presentation (SIGPROPS) on preserving attachments from email messages was fascinating, and I suspect I'm not alone in wishing we'd had time to hear more from him, but there simply wasn't. I suspect we would have had much more discussion if the programme had been spread out over two days - the content certainly justified it.

As always, the presentations will be available from the DPC website in due course. Keep an eye out, though, for the final report of the InSPECT project - they're not finished yet due to the change from the AHDS to CeRch, but I expect it will be a fascinating read when they are done.

Saturday 5 April 2008

When is revisability a significant property?

I’ve been thinking, off and on, about significant properties, and reading the papers for the significant properties workshop on Monday. Excellent papers they are, too, and suggest a fascinating day (the papers are available from here). I’m not entirely through them yet, so maybe this post is premature, but I didn’t see much mention of a property that came to mind when I was first thinking about the workshop: revisability (which I guess in some sense fits into the “behaviour” group of properties that I mentioned in the previous post).

In the analogue world, few resources that we might want to keep over the long term are revisable. Some can easily be annotated; you can write in the margins of books, and on the backs of photos, although not so easily on films or videos. But the annotations can be readily distinguished from the real thing.

In the digital world, however, most resources are at least plastic and very frequently revisable. By plastic I mean things like email messages or web pages that look very different depending on the tools the reader chooses. By revisable I mean things like word processing documents or spreadsheets. When someone sends me one of the latter as an email attachment, I will almost always open it with a word processor (a machine for writing and revising), rather than a document reader. The same thing happens when I download one from the web. For web pages, the space bar is a shortcut for “page down”, and quite often I find myself attempting to use the same shortcut on a downloaded word processing document. In doing so, I have revised it (even if only in trivial ways). Typically if I want to save the downloaded document in some logical place on my laptop, I’ll use “Save As” from within the word processor, potentially saving my revisions as invisible changes.

Annotation is rarer in the digital world, although many word processors now have excellent comment and edit-tracking facilities (by the way, there’s a nice blog post on da blog which points to an OR08 presentation on annotations, and an announcement on the DCC web site about an annotation product that looks interesting, and much of social networking is about annotation, and…).

My feeling is that there’s a default assumption in the analogue world that a document is not revisable, and an opposite assumption in the digital world.

One of the ways we deal with this when we worry about it, at least for documents designed to be read by humans, is to use PDF. We tend to think of PDF as a non-revisable format, although for those who pay for the tools it is perhaps more revisable than we think. PDF/A, I think, was designed to take out those elements that promote revisability in the documents.

If you are given a digital document and are asked to preserve it, the default assumption nearly always seems to kick in. People talk about preserving spreadsheets, worry about whether they can capture the formulae, or about preserving word processing documents, and worry about whether the field codes will be damaged. In some cases, this is entirely reasonable; in others it doesn’t matter a hoot.

When I read the InSPECT Framework for the definition of significant properties, I was delighted to find the FRBR model referenced, but disappointed that it was subsequently ignored. To my mind, this model is critical when thinking about preservation and significant properties in particular. In the FRBR object model, there are 4 levels of abstraction:
  • Work (the most abstract view of the intellectual creation)
  • Expression (a realisation of the work, perhaps a book or a film)
  • Manifestation (eg a particular edition of the book)
  • Item (eg a particular copy of the book; possibly less important in the digital world, given the triviality and transience of making copies).
The model does come from the world of cataloguing analogue objects, so it’s maybe not ideal, but it captures some important ideas. I’ll assert (without proof) that for many libraries the work is the thing; they are happy to have the latest edition, maybe the cheaper paperback (special collections are different, of course). Again I’ll assert (without proof) that for archives (with their orientation towards unique objects, such as particular documents with chain of custody etc), the four conceptual levels are more bound together.

Why is this digression important? Because many of the significant properties of digital objects are bound to the manifestation level. Preserving them is only important if the work demands it, or the nature of the repository demands it. Comparatively few digital objects have major significant properties at the work level. Some kinds of digital art would have, and maybe software does (I haven’t read the software significant properties report yet). If you focus on the object, you can get hung up on properties that you might not care about if you focus on the work.
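
To make that concrete, here's a purely illustrative sketch that tags some candidate significant properties with the FRBR level they seem to be bound to; the assignments are my own guesses for the sake of argument, not anything from the InSPECT framework.

# An illustrative sketch only: tagging candidate significant properties with
# the FRBR level they appear to be bound to. The assignments are my own guesses
# for the sake of argument, not taken from the InSPECT framework.
FRBR_LEVELS = ("work", "expression", "manifestation", "item")

candidate_properties = {
    "intellectual content (text, argument)": "work",
    "chapter/section structure":             "expression",
    "page layout and pagination":            "manifestation",
    "embedded field codes / formulae":       "manifestation",
    "revisability":                          "manifestation",
    "ownership marks, annotations":          "item",
}

def properties_worth_preserving(levels_we_care_about):
    """Keep only the properties bound to the FRBR levels a repository cares about."""
    return [p for p, level in candidate_properties.items()
            if level in levels_we_care_about]

# A library focused on the work might accept losing manifestation-level
# properties such as revisability on ingest:
print(properties_worth_preserving({"work", "expression"}))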

Last time, I said the questions raised in my mind included:
"* what properties?
* of which objects?
* for whom?
* for what purposes?
* when?
* and maybe where?"
Here I’m suggesting that for many kinds of works, revisability is not an important significant property for most users and many repositories. That would mean that for those works, transforming them into non-revisable forms on ingest is perfectly valid, indeed might make much more sense than keeping them in revisable form. This isn’t at all what I would have thought a few years ago!

Friday 4 April 2008

Adding Value through SNEEPing

I'm really quite taken by the SNEEP plug-in for ePrints that was showcased at OR08. It enables users to add their comments/annotations to an item stored in an eprints repository. These annotations are then made public and items can have numerous annotations added by different users. I don't know if there's a limit...? This is a great example of one way that value can be added to data collections in the context of digital curation - though admittedly the value of the added comments and annotations will be debatable! SNEEP can be downloaded from the SNEEP eprints installation at ULCC.

Thursday 3 April 2008

A Question of Authenticity

I’m just on my way back from Wigan, having given a presentation this afternoon on the role of the records manager in digital preservation to a group of, you’ve guessed it, records managers. I was really encouraged to see that the group had decided to dedicate their entire meeting today to tackling the thorny issue of digital preservation. It doesn’t happen very often – usually there’s just the odd session on digital preservation at this type of event – but it’s really great that they recognised from the start that digital preservation can’t be covered in 45 minutes! That’s not to say it can be covered in a day of course, but at least it gave them the opportunity to hear from a number of different speakers, all of whom approached the subject from a different perspective.

One of the things I decided to focus on was the issue of authenticity. It’s a real interest of mine and I’ve been wondering for a while if perhaps we might not be paying as much attention to it as we ought to be, particularly insofar as office records such as text documents, spreadsheets and emails are concerned. The immediate value of these types of records lies, to a great extent, in their evidential value. Their evidential value rests on their authenticity. If the authenticity of the records is compromised then their evidential value is too, and then we run into all sorts of issues like legal accountability and so on.

One of the reasons I think it’s such a relevant issue for records managers is the fact that even small migrations through different applications and application versions have the potential to impact on the authenticity of a record. Such migrations are known to be able to alter, for example, automatically generated content such as date and author fields in a text document. In a spreadsheet that derives cell content from embedded formulae, a migration of this sort can also affect the end calculation – particularly if the spreadsheet has been badly constructed or has errors in it.

The problem is, of course, that migrations on this scale can take place fairly frequently. An organisation may decide to ‘upgrade’ the software they use, or a user may decide to access and develop a shared file with a different application to the one that’s commonly used, with unexpected results. If there’s no attempt to control these circumstances (and I’m not necessarily saying they can be controlled but that we should at least make an attempt to control them) then the risks to a record’s authenticity are increased. And all this before it even reaches an archive.

We talk a lot in our community about Trusted Digital Repositories and preserving digital objects once in the archive. But I’d really love to see some more discussion of how we can ensure records are maintained in an authentic way before they are actually ingested. This is where the records managers – and the records creators – come in. Because otherwise, we run the risk of preserving records whose authenticity could be in question despite their storage/preservation in a TDR. And, because they are stored/preserved in a TDR, their authenticity prior to the point of ingest may never even be questioned and non-authentic records therefore find their way into use.

Wednesday 2 April 2008

PLATTER

You may already have had a chance to look through the recent offering from DPE - the PLATTER tool. If you haven't already done so, then it's well worth a look. At fifty-odd pages long you'll need a bit more than five minutes, but it's a really interesting proposal for approaching planning for a new repository.

PLATTER stands for the Planning Tool for Trusted Electronic Repositories. The approach defines a series of nine Strategic Objective Plans that address areas considered by the DPE team to be essential to the process of establishing trust. Each plan is accompanied by a series of key objectives and goals. Achieving these goals enables a repository to meet the 'ten core principles of trust' defined by the DCC/DPE/NESTOR/CRL early in 2007. Working these goals into the planning stage of the repository - and of course achieving them - should therefore put the repository in a good position to be recognised as a Trusted Digital Repository at a later date. This is of course dependent on carrying out a TDR audit using an approach such as DRAMBORA, NESTOR or TRAC.

PLATTER is a really comprehensive and well thought-out checklist which has been designed to be flexible enough for use with a range of different types of repository, from IRs to national archives. So whilst it may look a bit daunting at first, it should be adaptable enough that different institutions - which may perceive themselves to have different approaches and requirements for trustworthiness - can use it. However, given that many smaller repositories (particularly IRs) have already decided that preservation - and by implicit association, trust - is something that can be put to one side and addressed at a later date, I find myself wondering just how many of them will actually use this tool when still at the planning stage.

Open Repositories 2008

The 2008 Open Repositories conference (OR08) started yesterday in Southampton. I promised Chris I'd make a blog post whilst I was there but this fell through when my laptop started playing up. Now that I'm back (and preparing to go to Wigan to speak at a different event tomorrow), I wanted to make a quick post about the session on sustainability.

There were two speakers in this session: Warwick Cathro from the National Library of Australia, and Libby Bishop from the Universities of Essex and Liverpool (yes, she works for both). Warwick gave a fascinating overview of the Australian Digital Library Service framework, which sets out twenty-nine services involved in managing and building digital libraries and repositories. He went into more detail on a number of them, including preservation and in particular the function of obsolescence notification. The Automated Obsolescence Notification System (AONS) that was established by APSR a few years ago will play a key part in the preservation service. The AONS toolkit includes add-ons for different repository software to build format profiles, and the intention is that this will eventually link in to file format registries such as PRONOM and the GDFR to function as part of a migration service (I think). It appears complementary to the web-based format profiling service that PRESERV established in conjunction with PRONOM - the PRONOM-ROAR service for e-print repositories. Warwick noted, however, that file format registries need more work before they can comprehensively provide this level of functionality, particularly in providing more structured file format data.

Libby Bishop introduced the Timescapes project and gave a comprehensive overview of the way the research data is collected and used. The project uses a disaggregated preservation service, essentially using the LUDOS repository at Leeds (previously from the MIDESS project) alongside data services provided by the UK Data Archive at the University of Essex. Libby had a lot to say and I'm interested to know more about the level of preservation service that is provided by the UKDA, but I didn't manage to ask whilst I was there yesterday.

There is another sustainability session running today at OR08.

Tuesday 1 April 2008

National Statistics no joke

Today the UK's new independent National Statistics Authority began work, replacing the Office for National Statistics. This is clearly an important agency; on the Today programme this morning they were claiming that they would make a clear distinction between the statistics and their political interpretation. I thought it was worth a look at their web site.

Two things were initially discouraging: first, a Google search for National Statistics Agency [sic] produces a web page for the old ONS at http://www.statistics.gov.uk/default.asp; this is ALMOST but not quite the same URL as the new authority at http://www.statistics.gov.uk/... but what about the apparently similar but possibly different http://www.statisticsauthority.gov.uk/? Well, teething troubles no doubt. Second, there's a prominent link on the (first of the above) home page saying that "ONS independence comes into effect on 1 April", and the link is broken. More teething...

A quick explore led me to the UK snapshot, with lots of interesting web pages summarising data. As an example, there is a page headed "Acid Rain" under the "Environment" section; you get a graph and a few paragraphs of text, eg "Emissions of chemicals that can cause acid rain fell by 53.8 per cent between 1990 and 2005, from 6.9 million tonnes to 3.2 million tonnes." In fact, the page doesn't tell us whether rain became less acidic during this period, but again that's a quibble.

But I was looking for data, not summaries. I found some under "Time Series data", but had to go through a complicated sequence of selections to find an actual dataset. I selected Share Ownership, then "Total market value by sector of beneficial owner: end-2006", then "DEYQ, SRS: Ungrossed: Total Market Value:Individuals" before I got a download button. The 3 download options were:
  • View on Screen
  • Download CSV
  • Download Navidata
Navidata is a tool they make available. View on screen gives me a nicely formatted web page displaying a two-column table. Downloading the CSV gave me a cryptically named CSV file, which looks as follows (Crown Copyright, from the National Statistics Authority, reproduced under the terms of the Click-Use Licence):
,"DEYQ",
" 1998",154.8,
" 1999",163.3,
" 2000",181.0,
" 2001",148.4,
" 2002",104.0,
" 2003",136.0,
" 2004",122.3,
" 2005",..,
" 2006",155.8,

"DEYQ","SRS: Ungrossed: Total Market Value:Individuals"
,"Not seasonally adjusted"
,"Updated on 8/ 6/2007"
All fine, I think... except I could not find any way to do this automatically. Maybe they have an API I haven't found, maybe they have plans not yet come to fruition. Anyway, some good stuff here, but perhaps room for improvement?
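
In the absence of an API, the download can at least be handled programmatically once you know the file's quirks – a header row naming the series, year/value rows with ".." for missing values, and trailing metadata rows. Here's a rough sketch in Python; the filename is just whatever cryptic name the download gives you.

# A rough sketch of parsing the CSV layout shown above: a header row with the
# series code, year/value rows (".." marks a missing value), then trailing
# metadata rows. The filename below is a placeholder for the cryptic name the
# site actually supplies.
import csv

def read_series(path: str) -> dict:
    series = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 2 or not row[0].strip().isdigit():
                continue                      # skip the header and metadata rows
            year = int(row[0])
            value = row[1].strip()
            series[year] = None if value == ".." else float(value)
    return series

print(read_series("deyq.csv"))   # e.g. {1998: 154.8, ..., 2005: None, 2006: 155.8}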