Monday, 28 April 2008
PDF: Preserves Data Forever? Hmm…
This is a fairly controversial statement to make. Yes, PDF can have its uses in a preservation environment, particularly for capturing the appearance characteristics of a document. New versions of the PDF reader tend to render old files in the same way as old viewers did, its specification is openly published despite being proprietary, and conversion tools are freely and widely available. But it is not a magic bullet, and there are several potential shortcomings. For example, PDFs are commonly created by non-Adobe applications, which produce files of varying quality and functionality; it’s not useful for preserving other types of digital records such as emails, spreadsheets, websites or databases; it’s not great for machine parsing (as Owen pointed out in a previous comment on this blog); and there are enough issues with the PDF standard that they led to the development of PDF/A – PDF for Archiving.
To be fair, the press release does later say that the report suggests adopting PDF/A as a potential solution to the problem of long-term digital preservation. And the report itself also focuses more on ‘electronic documents’ than on electronic information per se – a generic ‘catch-all’ phrase that covers types of information for which PDF is simply not suitable.
So what is the report all about? Well, essentially it’s an introduction to the PDF family – PDF/A, PDF/X, PDF/E, PDF/Healthcare and PDF/UA – in a fairly lightweight preservation context. There is some discussion of alternative formats – including TIFF, ODF, and the use of XML (particularly with regards to XPS, Microsoft's XML Paper Specification) – and an overview of current PDF standards development activities. It’s good to have such an easily digestible overview of the general PDF/A specs and the PDF family. But what I really missed in the report was an in-depth discussion of the practical issues surrounding the use of PDF for preservation. For example, how can you convert standard PDF files to PDF/A? How do you convert onwards from PDF/A into another format? In which contexts might PDF/A be unsuitable, for example in light of a specific set of preservation requirements? What if you require external content links to remain functional, given that PDF/A does not allow external content references – what would you get instead? Would the file contain accessible information to tell you that a given piece of text previously linked to an external reference? And what exactly is the definition of an external reference here – external to the document, or external to the organisation? Should links to an external document in the same records series actually be preserved with their functionality intact, and is that even possible?
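On the first of those questions, one practical route is to push files through a converter such as Ghostscript, which can attempt PDF/A output. A minimal sketch, assuming Ghostscript is installed and its bundled PDFA_def.ps prologue is to hand; the paths are illustrative, and the result still needs checking with a PDF/A validator:

import subprocess

def to_pdfa(source_pdf, target_pdf, pdfa_def="PDFA_def.ps"):
    """Attempt a PDF to PDF/A conversion with Ghostscript (illustrative, not guaranteed valid PDF/A)."""
    subprocess.run([
        "gs",
        "-dPDFA",                       # ask the pdfwrite device for PDF/A output
        "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-dPDFACompatibilityPolicy=1",  # warn about unconvertible features rather than aborting
        "-sOutputFile=" + target_pdf,
        pdfa_def,                       # PostScript prologue defining the PDF/A output intent
        source_pdf,
    ], check=True)

# to_pdfa("report.pdf", "report_pdfa.pdf")

Even then, a validator has the final word on whether the output really is PDF/A – which is rather the point about the gap between ‘can be converted’ and ‘has been preserved’.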
Speaking of preservation requirements, it would have been particularly useful if the report had included a discussion of preservation requirements for formats – this would have informed any subsequent selection or rejection of a format, especially in the section on ‘technologies’. The final section on recommendations hints at this, but does not go into detail. There are also a few choice statements that simply left me wondering – one that really caught my eye was ‘this file format may be less valuable for archival purposes as it may be considered to be a native file format’ (p17), which seems to discount the value of native formats altogether. Perhaps the benefit of submitting native formats alongside PDF representations is a subject that deserves more discussion, particularly for preserving structural and semantic characteristics.
I wholeheartedly agree with perhaps the most pertinent comment in the press release – that PDF ‘should never be viewed as the Holy Grail. It is merely a tool in the armoury of a well thought out records management policy’ (Adrian Brown, National Archives). PDF can have its uses, and the report has certainly encouraged more debate in the organisations I work with as to what those uses are. Time will tell as to whether the debate will become broader still.
Friday, 25 April 2008
A Thousand Open Molecular Biology Databases
My institution has a subscription to NAR (Nucleic Acids Research), but the nice thing about the Database Issue is that it has itself been Open Access for the past 4 years or so. So you can check it out for yourself if you are interested.
I thought it might be interesting to go back and trace how they managed to get to 1,000+ databases in 15 years or so. It turns out to be relatively easy to check, back to about the year 2001; prior to that, as far as I can tell, you have to count them yourself. I did the count for 1999, so here’s a little picture of the growth since then (see Burks, 1999):
It is quite staggering growth.
In recent years, the compilation article has been written by Michael Y. Galperin of the National Center for Biotechnology Information, US National Library of Medicine. He reports in the 2008 article that the complete list and summaries of 1078 databases are available online at the Nucleic Acids Research web site, http://www.oxfordjournals.org/nar/database/a.
This year’s article has some interesting comments on databases; I particularly liked this one on Deja Vu, which uses a tool called eTBLAST “to find highly similar abstracts in bibliographic databases, including MEDLINE…. Some highly similar publications, however, come from different authors and look extremely suspicious” (Galperin, 2008). Curious! As usual he also reports on databases that appear to be no longer maintained and have been dropped from the list (around two dozen this year). Sometimes this is related to the content being available in other databases – perhaps because of that, or maybe as the cause of it.
There has in fact been relatively little attrition; they claim not to re-use accession numbers, and the highest accession number so far is 1176, implying that just 98 databases have been dropped from the list! Galperin suggests “that the databases that offer useful content usually manage to survive, even if they have to change their funding scheme or migrate from one host institution to another. This means that the open database movement is here to stay, and more and more people in the community (as well as in the financing bodies) now appreciate the importance of open databases in spreading knowledge. It is worth noting that the majority of database authors and curators receive little or no remuneration for their efforts and that it is still difficult to obtain money for creating and maintaining a biological database. However, disk space is relatively cheap these days and database maintenance tools are fairly straightforward, so that a decent database can be created on a shoestring budget, often by a graduate student or as a result of a postdoctoral project. […] Subsequent maintenance and further development of these databases, however, require a commitment that can only be applauded.” (Galperin, 2005)
So is this vanity databasing (this from a blog author, mind)? “In the very beginning of the genome sequencing era, Walter Gilbert and colleagues warned of 'database explosion', stemming from the exponentially increasing amount of incoming DNA sequence and the unavoidable errors it contains. Luckily, this threat has not materialized so far, due to the corresponding growth in computational power and storage capacity and the strict requirements for sequence accuracy.” (Galperin, 2004)
It’s not clear from the quote above how worthwhile or well-used the databases are. In the 2006 article, Galperin began looking at measures of impact, using the Science Citation Index. We can see from his reference list that he expects the NAR paper to stand proxy for the database. The most highly cited were Pfam, GO, UniProt, SMART and KEGG, all highly used “instant classics” (with >100 citations each in 2 years!). However, he writes: “On the other side of the spectrum are the databases that have never been cited in these 2 years, even by their own authors. This does not mean, of course, that these databases do not offer a useful content but one could always suggest a reason why nobody has used this or that database. Usually these databases were too specific in scope and offered content that could be easily found elsewhere.” (Galperin, 2006)
In the 2007 article, Galperin returned to this issue of how well databases are used. “However, citation data can be biased; e.g. in many articles use of information from publicly available databases is acknowledged by providing their URLs, or not acknowledged at all. Besides, some databases could be cited on the web sites and in new or obscure journals, not covered by the ISI Citation Index.” (Galperin, 2007) He then goes on to describe some alternative measures he has investigated as proxies for this citation problem. This is a real issue, I think; data and dataset citations are not made as often or as consistently as they should be, and the advice on how to make them is often conflicting and itself conflicts with the various competing standards (of which perhaps the best is the NLM standard). Indeed, the NAR articles describing databases seem to stand proxy for the databases themselves: “the user typically starts by finding a database of interest in PubMed or some other bibliographic database, then proceeds to browse the full text in the HTML format. If the paper is interesting enough, s/he would download its text in the PDF format. Finally, if the database turns to be useful, it might be acknowledged with a formal citation.”
This is probably enough for one blog post, but I’ll return, I think, to have a look at some of these databases in a bit more detail.
BURKS, C. (1999) Molecular Biology Database List. Nucleic Acids Research, 27, 1-9. http://nar.oxfordjournals.org/cgi/content/abstract/27/1/1
GALPERIN, M. Y. (2004) The Molecular Biology Database Collection: 2004 update. Nucleic Acids Research, 32, D3-D22. http://nar.oxfordjournals.org/cgi/content/abstract/32/suppl_1/D3
GALPERIN, M. Y. (2005) The Molecular Biology Database Collection: 2005 update. Nucleic Acids Research, 33.
GALPERIN, M. Y. (2006) The Molecular Biology Database Collection: 2006 update. Nucleic Acids Research, 34. http://nar.oxfordjournals.org/cgi/content/full/34/suppl_1/D3
GALPERIN, M. Y. (2007) The Molecular Biology Database Collection: 2007 update. Nucleic Acids Research, 35, D3-D4. http://nar.oxfordjournals.org/cgi/content/abstract/35/suppl_1/D3
GALPERIN, M. Y. (2008) The Molecular Biology Database Collection: 2008 update. Nucleic Acids Research, 36, D2-D4. http://dx.doi.org/10.1093/nar/gkm1037
Monday, 21 April 2008
RLUK launched... but relaunch flawed?
"On Friday 18th April the Consortium of Research Libraries (CURL) celebrated its 25th anniversary and launched it new organisational title: Research Libraries UK (RLUK). A warm welcome to RLUK and best wishes for the next 25 years!"Congratulations to them... well, maybe. I had a quick look for some key documents; here's a URL I forwarded to my colleagues a year or so ago: http://www.curl.ac.uk/about/E-ResearchNeedsAnalysisRevised.pdf. Or, more recently, try something on their important HEFCE UKRDS Shared Services Study: http://www.curl.ac.uk/Presentations/Manchester%20November%2007/SykesHEFCEStudy2.pdf. Both give me a big fat "Page not found". In the latter case, when I find their tiny search box, and search for UKRDS, I get "Your search yielded no results".
I am really, desperately sad about this. Remember all the fuss about URNs? Remember all we used to say about persistent IDs? Remember "Cool URIs don't change"? The message is, persistent URIs require commitment; they require care. They don't require a huge amount of effort (it's simply a redirection table, after all). But libraries should be in the forefront of making this work. I have emailed RLUK, without response so far. Come on guys, this is IMPORTANT!
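To labour the point, a redirection table really is all it takes: a map from old paths to new homes, consulted before the server gives up and returns a 404. In practice you would do this with a rewrite rule in the web server configuration, but here's a toy sketch in Python (the target URLs are invented placeholders, since I don't know where the documents now live):

from http.server import BaseHTTPRequestHandler, HTTPServer

# Hand-maintained redirection table: old CURL paths -> wherever the documents live now
# (the right-hand sides here are invented examples).
REDIRECTS = {
    "/about/E-ResearchNeedsAnalysisRevised.pdf":
        "http://www.example.org/reports/e-research-needs-analysis.pdf",
    "/Presentations/Manchester%20November%2007/SykesHEFCEStudy2.pdf":
        "http://www.example.org/presentations/ukrds-shared-services-study.pdf",
}

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        target = REDIRECTS.get(self.path)
        if target:
            self.send_response(301)              # permanent redirect, so old links keep working
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_error(404, "Page not found")

if __name__ == "__main__":
    HTTPServer(("", 8080), Redirector).serve_forever()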
Oh, and just in case you think this is isolated, try looking for that really important, seminal archiving report, referenced everywhere at http://www.rlg.org/ArchTF/. I had something to do with the RLG merger into OCLC that caused that particular snafu, and after making my feelings known have been told that "We're taking steps to address not only DigiNews links but those of other pages that are still referred to from other sites and personal bookmarks". The sad thing about that particular report, you might discover, is that it doesn't appear to be archived on the Wayback Machine either, I suspect because it had an ftp URL.
[UPDATE The report does now appear on the OCLC website at http://www.oclc.org/programs/ourwork/past/digpresstudy/default.htm. When I first searched for Waters Garrett from the OCLC home page a few weeks ago, I couldn't find it. I guess they haven't quite got round to building the redirection table yet... but that can take time.]
Grump!
Institutional Repository Checklist for Serving Institutional Management
Comments are requested on a draft document from the presenters of the "Research Assessment Experience" session at the EPrints User Group meeting at OR08. The "Institutional Repository Checklist for Serving Institutional Management" lists 13 success criteria for repository managers to be aware of when making plans for their repositories to provide any kind of official research reporting role for their institutional management. Find out more (note that there are at least 3 versions, so comments have already been incorporated).
I liked this: "The numeric risk factors in the second column reflect the potential consequences of failure to deliver according to an informal nautical metaphor: (a) already dead in the water, (b) quickly shipwrecked and (c) may eventually become becalmed."
This kind of work is important; repositories have to be better at being useful tools for all kinds of purposes before they will become part of the researcher's workflow...
Wednesday, 16 April 2008
Thoughts on conversion issues in an Institutional Repository
A few people from a commercial repository solution provider visited UKOLN last week to talk about their brand and the services they offer. This was a useful opportunity to explore the issues around using commercial repository solutions rather than developing a system in-house, which is where most of my experience with institutional repositories has lain to date.
One thing that particularly caught my interest was the fact that their service accepts deposits from authors in the native file format. These are then converted to an alternative format for dissemination; in the case of text documents, for example, this format is PDF. The source file is still retained, but it’s not accessible to everyday users. The system is therefore storing two copies of the file – the native source and a dissemination (in this case PDF) representation.
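As a rough sketch of what such an ingest step might look like (not the vendor's actual implementation, which I haven't seen): keep the deposited file untouched and derive a PDF alongside it, here by shelling out to an office suite in headless mode. The directory layout and the choice of converter are my assumptions:

import shutil, subprocess
from pathlib import Path

def ingest(deposit: Path, store: Path) -> None:
    """Keep the native source file and derive a PDF dissemination copy (illustrative only)."""
    native_dir = store / "native"
    dissemination_dir = store / "dissemination"
    native_dir.mkdir(parents=True, exist_ok=True)
    dissemination_dir.mkdir(parents=True, exist_ok=True)

    # The untouched source copy, kept for whatever future preservation strategy needs it.
    shutil.copy2(deposit, native_dir / deposit.name)

    # The access copy; any command-line office-to-PDF converter would do here.
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(dissemination_dir), str(deposit)],
        check=True,
    )

# ingest(Path("thesis.doc"), Path("/var/repository"))

The point is less the tooling than the layout: the source copy is never touched again, and everything users see is derived from it.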
This is a pretty interesting feature and one that I haven’t come across much so far in other institutional repository systems, particularly in-house developments. But, as with every approach, there are pros and cons. So what are the pros? Well, as most ‘preservationists’ would agree, storage of the source file is widely considered to be A Good Thing. We don’t know what we’re going to be capable of in the future, so storing the source file lets us stay flexible about the preservation strategy we implement, particularly for future emulation. Furthermore, it can also be the most reliable file from which to carry out migrations: if each iterative migration results in a small element of loss, then the cumulative effect of migrations can turn small loss into big loss; the most reliable file from which to start a migration and minimise the effect of loss should therefore be the source file.
However, there are obvious problems (or cons) with this as well. The biggest is the simple problem of technological obsolescence that we’re trying to combat in the first place. Given that our ability to reliably access file contents (particularly proprietary ones) is under threat with the passage of time, there’s no guarantee that we’ll be able to carry out future migrations from the source file if we don’t also ensure we have the right metadata and retain ongoing access to appropriate and usable software (for example). And knowing which software is needed can be a job in itself – just because a file has a *.doc suffix doesn’t mean it was created using Word, and even if it was, it’s not necessarily easy to figure out a) which version created the file and b) what unexpected features the author has included in the file that may affect the results of a migration. This latter point is an issue not just in the future, but now.
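A first line of defence is to identify the format from the file's contents rather than trusting the suffix, ideally with a signature-based tool such as DROID against PRONOM, or more crudely with libmagic. A minimal sketch, assuming the python-magic bindings are available; note that this only gives you a broad format family, not the creating application or its version:

import magic  # the python-magic wrapper around libmagic - an assumption, not part of the standard library

def describe(path: str) -> str:
    """Return libmagic's best guess at the file's format, ignoring its extension."""
    return magic.from_file(path)

print(describe("submission.doc"))
# Something like "Composite Document File V2 Document ..." - which still leaves the
# version and the author's more exotic features to be discovered by other means.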
Thinking about this led me to consider the issue of responsibility. It’s not unreasonable to think that by accepting source formats and carrying out immediate conversions for access purposes, the repository (or rather the institution) is therefore assuming responsibility for checking and validating conversion outcomes. If it goes wrong and unnoticed errors creep in from such a conversion, is the provider (commercial or institutional) to blame for potentially misrepresenting academic or scientific works? Insofar as immediate delivery of objects from an institutional repository goes, this should - at the very least - be addressed in the IR policies.
It’s impossible to do justice to this issue in a single blog post. But these are interesting issues – not just responsibility but also formats and conversion – that I suspect we’ll hear more and more about as our experience with IRs grows.
Tuesday, 15 April 2008
Wouldn't it be nice...
No guarantees, folks, since the resulting paper has to get through the Programme Committee rather than just me. But it would have a wonderful feeling of symmetry...
Monday, 14 April 2008
Representation information from the planets?
Just to recap, the OAIS formula is that a Data Object interpreted using its Representation Information yields an Information Object. Examples often cite specifications or standards, eg suggesting that the Repinfo (I’ll use the contraction instead of “representation information”) for a PDF Data Object might be (or include) the PDF specification.
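The formula is compact enough to caricature in a few lines of code; this is purely a toy illustration of the relationship, not anything drawn from the OAIS specification itself:

class DataObject:
    """The bits themselves, e.g. the bytes of a PDF file."""
    def __init__(self, bits: bytes):
        self.bits = bits

class RepresentationInformation:
    """Whatever is needed to interpret the bits, e.g. the PDF specification."""
    def __init__(self, description: str):
        self.description = description

class InformationObject:
    """What you get when a data object is interpreted using its representation information."""
    def __init__(self, data: DataObject, repinfo: RepresentationInformation):
        self.data = data
        self.repinfo = repinfo

# Data Object + Representation Information -> Information Object
pdf_bits = DataObject(b"%PDF-1.4 ...")  # stand-in for the real bytes
pdf_spec = RepresentationInformation("PDF Reference, version 1.4")
paper = InformationObject(pdf_bits, pdf_spec)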
Sometimes there is controversy about repinfo versus format information (often described by the repinfo enthusiasts as “merely structural repinfo”). So it’s nice to read a sensible comparison:
"For the purposes of this paper, the definition of a format proposed by the Global Digital Format Registry will be used:This does seem to place repinfo and format information (by this richer definition) in the same class.
“A byte-wise serialization of an abstract information model”.
The GDFR format model extends this definition more rigorously, using the following conceptual entities:
• Information Model (IM) – a class of exchangeable knowledge.
• Semantic Model (SM) – a set of semantic information structures capable of realizing the meaning of the IM.
• Syntactic Model (CM) – a set of syntactic data units capable of expressing the SM.
• Serialized Byte Stream (SB) – a sequence of bytes capable of manifesting the CM.
This equates very closely with the OAIS model, as follows:
• Information Model (IM) = OAIS Information Object
• Semantic Model (SM) = OAIS Semantic representation information
• Syntactic Model (CM) = OAIS Syntactic representation information
• Serialized Byte Stream (SB) = OAIS Data Object"

This does seem to place repinfo and format information (by this richer definition) in the same class.
Time for a short diversion here. I was quite taken by the report on significant properties of software, presented at the workshop by Brian Matthews (not that it was perfect, just that it was a damn good effort at what seemed to me to be an impossible task!). He talked about specifications, source code and binaries as forms of software. Roughly the cost of instantiating goes down as you move across those 3 (in a current environment, at least).
- In preservation terms, if you only have a binary, you are pretty much limited to preserving the original technology or emulating it, but the result should perform “exactly” as the original.
- If you have the source code, you will be able to (or have to) migrate, configure and re-build it. The result should perform pretty much like the original, with “small deviations”. (In practice, these deviations could be major, depending on what’s happened to libraries and other dependencies meanwhile.)
- If you only have the spec, you have to re-write from scratch. This is clearly much slower and more expensive, and Brian suggests it will “perform only gross functionality”. I think in many cases it might be better than that, but in some cases much worse (eg some of the controversy about the Microsoft-backed OOXML standard, with its MS-internal dependencies).
“…Access Software provides a means to interpret a Data Object. The software therefore acts as a substitute for part of the representation information network – a PDF viewer embodies knowledge of the PDF specification, and may be used to directly access a data object in PDF format.”

This seems to make sense; again, it’s in the OAIS spec, but hard to find. So Brown proposes that:
“…representation information be explicitly defined as encompassing either information which describes how to interpret a data object (such as a format specification), or a component of a technical environment which supports interpretation of that object (such as a software tool or hardware platform).”

Of course the software tool or hardware platform will itself have a shorter life than the descriptive information, so both may be required.
The bulk of the report, of course, is about representation information registries (including format registries by this definition), and is also well worth a read.
Thursday, 10 April 2008
4th Digital Curation Conference: Call for papers
"The first day of the conference will focus on three key topics:There are several main themes:"The second day of the conference will be dedicated to research and development and will feature peer-reviewed papers in themed parallel sessions."
- Radical sharing, new ways of doing science e.g. large scale research networks, mass collaboration, dynamic publishing tools, wikis, blogs, social networks, visualisations and immersive environments
- Sustainability of curation
- Legal issues including privacy, confidentiality and consent, intellectual property rights and provenance
- Research Data Infrastructures (covering research data across all disciplines)
- Curation and e-Research
- Sustainability: balancing costs and value of digital curation and preservation
- Disciplinary and Inter-disciplinary Curation Challenges
- Challenging types of content
- Legal Issues
- Capacity Building
- Submission of papers for peer-review: 25 July 2008
- Submission of abstracts poster/demos/workshops for peer-review: 25 July 2008
- Notification of authors: 19 September 2008
- Final papers deadline: 14 November 2008
- Submission of poster PDFs: 14 November 2008
UKOLN is 30
Several initial talks have been celebratory and retrospective. We were treated to an interesting digression into naval warfare from an ex-commander of the UK aircraft carrier (or through-deck cruiser, as procured) Invincible, now laid up after 30 years (this ex-commander now runs MLA).
Cliff Lynch spoke about the changes in the last 15 years of UKOLN, as the library changed from being focused on doing traditional library tasks better (library automation) to supporting changes in scholarly practices, leading to “social-scale changes”. He was particularly upbeat about some of the recent UKOLN projects with data engagement, including eBank UK and eCrystals, and their role as part of the DCC.
Lorcan Dempsey gave a typically wide-ranging talk, “free of facts or justification”. He did attempt to show us his first PowerPoint presentation, but instead was only able to show us the error report saying that PowerPoint could not open xxx.ppt! I think everyone should do this… Microsoft, please wake up, you’re doing yourself serious damage. He took us through the pressures of concentration and diffusion in the web world, the big squeeze and the big switch. This is the story of “moving to the network level”. Many library activities are starting to be focused externally, but there is too much duplication and redundancy. Libraries, he said, face three challenges: giving specific local value, the “one big library on the web” idea (the acronym OBLOW shows it’s not entirely serious), and the old one of sorting out library logistics so they truly become supportive of the new scholarly process.
Well that’s it up to the break, I’m off for my cup of tea…
Tuesday, 8 April 2008
Seriously Seeking Significance
Andrew Wilson started the day with a great and very targeted keynote that explored the background to significant properties research and what significant properties mean to him. I was heartened when he almost immediately kicked off by talking about authenticity and significant properties, because authenticity is a big deal for me insofar as preservation is concerned! But this isn't a post about what we mean by authenticity in a digital environment, I shall save that for another day/time/place. Hot on his heels came Stephen Grace from CeRch, with an overview of the InSPECT project and future work, closely followed by David Duce presenting the conclusions of the Vector Images significant properties study. Interestingly, this used a slightly different measure of significance to the InSPECT project: InSPECT used a scale of 0-10, whereas the vector images study used 0-9. This may be a minor deviation at the moment, but it might well assume more relevance when dealing with automated processes for migrating and assessing collections of mixed object types. Mike Stapleton then presented the results of the Moving Images study and, after a short break, on came Brian Matthews with the results of the Software study and Richard Davis with the results of the eLearning Objects study.
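If different tools and studies really are going to feed automated assessment, a mismatch like this is trivial to paper over, but only once somebody has noticed it. A throwaway sketch of the sort of normalisation that would be needed (the scale ranges are as I noted them from the talks, so treat them as illustrative):

def normalise(score: float, scale_max: float) -> float:
    """Map a significance score from a 0..scale_max scale onto 0..1."""
    if not 0 <= score <= scale_max:
        raise ValueError("score falls outside the declared scale")
    return score / scale_max

# A raw score of 7 means slightly different things on the two scales.
print(normalise(7, 10))  # 0.7   (InSPECT-style 0-10 scale)
print(normalise(7, 9))   # 0.778 (vector images-style 0-9 scale)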
It quickly became clear that there was some variance in people's understanding of significant properties, particularly when one speaker stated that for different preservation approaches they would need different significant properties to achieve the desired level of performance. This is different to how I perceive significant properties. For me - and for several other speakers - significant properties define the essence of the object and are those elements that must be preserved in order to retain the ability to reproduce an authentic version of the object. To select different significant properties based on a given preservation approach surely means there is a different underlying understanding and use of the term significant properties.
What does this all go to show? That it's a case of different strokes for different folks? Well, to some extent, yes. It was widely accepted several years ago that different sectors had different requirements insofar as preservation was concerned - I remember attending an ERPANET workshop in Amsterdam in 2004, for example, that clearly illustrated just this point. And yesterday's audience and speakers represented an array of sectors with different requirements for preservation. So the whole concept of significant properties and use of the term across different sectors is something that I think we'd benefit from returning to discuss some more.
The afternoon's sessions were, I think, intended to put a different perspective on the day. We heard from the PLANETS project, Barclays Bank, the DCC SCARP project, the SIGPROPS project from Chapel Hill, and a presentation on the relationship between Representation Information and significant properties. Cal Lee's presentation (SIGPROPS) on preserving attachments from email messages was fascinating, and I suspect I'm not alone in wishing we'd had time to hear more from him, but there simply wasn't any. I suspect we would have had much more discussion if the programme had been spread out over two days - the content certainly justified it.
As always, the presentations will be available from the DPC website in due course. Keep an eye out, though, for the final report of the InSPECT project - it's not finished yet due to the change from the AHDS to CeRch, but I expect it will be a fascinating read when it is done.
Saturday, 5 April 2008
When is revisability a significant property?
In the analogue world, few resources that we might want to keep over the long term are revisable. Some can easily be annotated; you can write in the margins of books, and on the backs of photos, although not so easily on films or videos. But the annotations can be readily distinguished from the real thing.
In the digital world, however, most resources are at least plastic and very frequently revisable. By plastic I mean things like email messages or web pages that look very different depending on the tools the reader chooses. By revisable I mean things like word processing documents or spreadsheets. When someone sends me one of the latter as an email attachment, I will almost always open it with a word processor (a machine for writing and revising), rather than a document reader. The same thing happens when I download one from the web. For web pages, the space bar is a shortcut for “page down”, and quite often I find myself attempting to use the same shortcut on a downloaded word processing document. In doing so, I have revised it (even if only in trivial ways). Typically if I want to save the downloaded document in some logical place on my laptop, I’ll use “Save As” from within the word processor, potentially saving my revisions as invisible changes.
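One cheap way to catch these invisible revisions is a fixity check: hash the file when it arrives and again before you pass it on or deposit it. A sketch using nothing beyond the Python standard library (the filename is just an example):

import hashlib

def sha1_of(path: str) -> str:
    """Hex SHA-1 digest of a file's contents."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

before = sha1_of("minutes.doc")   # recorded when the attachment was saved
# ... open it, press the space bar once too often, "Save As" into a sensible folder ...
after = sha1_of("minutes.doc")
if before != after:
    print("This is no longer quite the document you were sent.")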
Annotation is rarer in the digital world, although many word processors now have excellent comment and edit-tracking facilities (by the way, there’s a nice blog post on da blog which points to an OR08 presentation on annotations, and an announcement on the DCC web site about an annotation product that looks interesting, and much of social networking is about annotation, and…).
My feeling is that there’s a default assumption in the analogue world that a document is not revisable, and an opposite assumption in the digital world.
One of the ways we deal with this when we worry about it, at least for documents designed to be read by humans, is to use PDF. We tend to think of PDF as a non-revisable format, although for those who pay for the tools it is perhaps more revisable than we think. PDF/A, I think, was designed to take out those elements that promote revisability in the documents.
If you are given a digital document and are asked to preserve it, the default assumption nearly always seems to kick in. People talk about preserving spreadsheets, worry about whether they can capture the formulae, or about preserving word processing documents, and worry about whether the field codes will be damaged. In some cases, this is entirely reasonable; in others it doesn’t matter a hoot.
When I read the InSPECT Framework for the definition of significant properties, I was delighted to find the FRBR model referenced, but disappointed that it was subsequently ignored. To my mind, this model is critical when thinking about preservation and significant properties in particular. In the FRBR object model, there are 4 levels of abstraction:
- Work (the most abstract view of the intellectual creation)
- Expression (a realisation of the work, perhaps a book or a film)
- Manifestation (eg a particular edition of the book)
- Item (eg a particular copy of the book; possibly less important in the digital world, given the triviality and transience of making copies).
Why is this digression important? Because many of the significant properties of digital objects are bound to the manifestation level. Preserving them is only important if the work demands it, or the nature of the repository demands it. Comparatively few digital objects have major significant properties at the work level. Some kinds of digital art would have, and maybe software does (I haven’t read the software significant properties report yet). If you focus on the object, you can get hung up on properties that you might not care about if you focus on the work.
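To make that concrete, imagine tagging each candidate property with the FRBR level it is bound to and filtering on what the repository has decided it cares about. The properties and their levels below are invented examples for illustration, not anything taken from the InSPECT framework:

# Invented examples: candidate significant properties tagged with the FRBR level they are bound to.
CANDIDATE_PROPERTIES = [
    ("argument and narrative structure", "work"),
    ("wording of the text", "expression"),
    ("pagination and page breaks", "manifestation"),
    ("embedded fonts and layout behaviour", "manifestation"),
    ("file system timestamp", "item"),
]

def worth_preserving(levels_that_matter: set) -> list:
    """Keep only the properties bound to the FRBR levels this repository cares about."""
    return [name for name, level in CANDIDATE_PROPERTIES if level in levels_that_matter]

# A repository focused on the work and its expression can let manifestation-level detail go.
print(worth_preserving({"work", "expression"}))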
Last time, I said the questions raised in my mind included:
"* what properties?Here I’m suggesting that for many kinds of works, revisability is not an important significant property for most users and many repositories. That would mean that for those works, transforming them into non-revisable forms on ingest is perfectly valid, indeed might make much more sense than keeping them in revisable form. This isn’t at all what I would have thought a few years ago!
* of which objects?
* for whom?
* for what purposes?
* when?
* and maybe where?"
Friday, 4 April 2008
Adding Value through SNEEPing
Thursday, 3 April 2008
A Question of Authenticity
I’m just on my way back from Wigan, having given a presentation this afternoon on the role of the records manager in digital preservation to a group of, you’ve guessed it, records managers. I was really encouraged to see that the group had decided to dedicate their entire meeting today to tackling the thorny issue of digital preservation. It doesn’t happen very often – usually there’s just the odd session on digital preservation at this type of event – but it’s really great that they recognised from the start that digital preservation can’t be covered in 45 minutes! That’s not to say it can be covered in a day of course, but at least it gave them the opportunity to hear from a number of different speakers, all of whom approached the subject from a different perspective.
One of the things I decided to focus on was the issue of authenticity. It’s a real interest of mine and I’ve been wondering for a while whether we might not be paying as much attention to it as we ought to, particularly insofar as office records such as text documents, spreadsheets and emails are concerned. The immediate value of these records lies, to a great extent, in their evidential value. Their evidential value rests on their authenticity. If the authenticity of the records is compromised then their evidential value is too, and we run into all sorts of issues like legal accountability and so on.
One of the reasons I think it’s such a relevant issue for records managers is that even small migrations through different applications and application versions have the potential to affect the authenticity of a record. Such migrations are known to be able to alter, for example, automatically generated content such as date and author fields in a text document. In a spreadsheet that derives cell contents from embedded formulae, a migration of this sort can also affect the end calculation – particularly if the spreadsheet has been badly constructed or has errors in it.
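This is the sort of thing that can at least be spot-checked mechanically: pull out the formulae and the last-calculated values before and after a migration and compare them. A rough sketch assuming the openpyxl library and .xlsx files (the older binary formats would need a different reader, and the filenames are mine):

from openpyxl import load_workbook

def formulas_and_values(path: str) -> dict:
    """Map each non-empty cell to its stored formula/content and its last-calculated value."""
    formulas = load_workbook(path, data_only=False).active  # formulae as stored
    values = load_workbook(path, data_only=True).active     # cached calculation results
    cells = {}
    for row in formulas.iter_rows():
        for cell in row:
            if cell.value is not None:
                cells[cell.coordinate] = (cell.value, values[cell.coordinate].value)
    return cells

before = formulas_and_values("budget_original.xlsx")
after = formulas_and_values("budget_migrated.xlsx")
for coordinate, content in before.items():
    if after.get(coordinate) != content:
        print(coordinate, "changed:", content, "->", after.get(coordinate))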
The problem is, of course, that migrations on this scale can take place fairly frequently. An organisation may decide to ‘upgrade’ the software they use, or a user may decide to access and develop a shared file with a different application to the one that’s commonly used, with unexpected results. If there’s no attempt to control these circumstances (and I’m not necessarily saying they can be controlled but that we should at least make an attempt to control them) then the risks to a record’s authenticity are increased. And all this before it even reaches an archive.
We talk a lot in our community about Trusted Digital Repositories and about preserving digital objects once they are in the archive. But I’d really love to see some more discussion of how we can ensure records are maintained in an authentic way before they are actually ingested. This is where the records managers – and the records creators – come in. Because otherwise we run the risk of preserving records whose authenticity could be in question despite their storage/preservation in a TDR. And, because they are stored/preserved in a TDR, their authenticity prior to the point of ingest may never even be questioned, and non-authentic records may therefore find their way into use.
Wednesday, 2 April 2008
PLATTER
PLATTER stands for the Planning Tool for Trusted Electronic Repositories. The approach defines a series of nine Strategic Objective Plans that address areas considered by the DPE team as essential to the process of establishing trust. Each plan is accompanied by a series of key objectives and goals. Achieving these goals enables a repository to meet the 'ten core principles of trust' defined by the DCC/DPE/NESTOR/CRL early in 2007. Working these goals into the planning stage of the repository - and of course achieving them - should therefore put the repository in a good position to be recognised as a trusted Digital Repository at a later date. This is of course dependent on carrying out a TDR audit such as DRAMBORA, NESTOR or TRAC.
PLATTER is a really comprehensive and well thought-out checklist which has been designed to be flexible enough for use with a range of different types of repository, from IRs to national archives. So whilst it may look a bit daunting at first, it should be adaptable enough that different institutions - which may perceive themselves to have different approaches and requirements for trustworthiness - can use it. However, given that many smaller repositories (particularly IRs) have already decided that preservation - and by implicit association, trust - is something that can be put to one side and addressed at a later date, I find myself wondering just how many of them will actually use this tool when still at the planning stage.
Open Repositories 2008
There were two speakers in this session: Warwick Cathro from the National Library of Australia, and Libby Bishop from the Universities of Essex and Liverpool (yes, she works for both). Warwick gave a fascinating overview of the Australian Digital Library Service framework, which sets out twenty-nine services involved in managing and building digital libraries and repositories. He went into more detail on a number of them, including preservation and in particular the function of obsolescence notification. The Automated Obsolescence Notification System (AONS) that was established by APSR a few years ago will play a key part in the preservation service. The AONS toolkit includes add-ons for different repository software to build format profiles, and the intention is that this will eventually link in to file format registries such as PRONOM and the GDFR to function as part of a migration service (I think). It appears complementary to the web-based format profiling service that PRESERV established in conjunction with PRONOM - the PRONOM-ROAR service for e-print repositories. Warwick noted, however, that file format registries need more work before they can comprehensively provide this level of functionality, particularly in providing more structured file format data.
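The basic idea of a format profile is simple enough to sketch: walk the repository's files, identify each one's format, and tally the results so a registry lookup (PRONOM, GDFR or whatever) has something to chew on. A crude, extension-based sketch follows; a real profiler such as DROID identifies formats by signature instead, and the repository path is invented:

from collections import Counter
from pathlib import Path

def format_profile(root: str) -> Counter:
    """Tally files under root by extension - a crude stand-in for signature-based identification."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts

for extension, count in format_profile("/var/repository").most_common():
    print(extension, count)
# The next step would be mapping each extension (better: each signature) to a registry
# identifier and asking the registry which of those formats are at risk of obsolescence.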
Libby Bishop introduced the Timescapes project and gave a comprehensive overview of the way the research data is collected and used. The project uses a disaggregated preservation service, essentially using the LUDOS repository at Leeds (previously from the MIDESS project) alongside data services provided by the UK Data Archive at the University of Essex. Libby had a lot to say, and I'm interested to know more about the level of preservation service that is provided by the UKDA, but didn't manage to ask whilst I was there yesterday.
There is another sustainability session running today at OR08.
Tuesday, 1 April 2008
National Statistics no joke
Two things were initially discouraging. First, a Google search for National Statistics Agency [sic] produces a web page for the old ONS at http://www.statistics.gov.uk/default.asp; this is ALMOST but not quite the same URL as the new authority at http://www.statistics.gov.uk/... but what about the apparently similar but possibly different http://www.statisticsauthority.gov.uk/? Well, teething troubles no doubt. Second, there's a prominent link on the (first of the above) home page saying "ONS independence comes into effect on 1 April", and the link is broken. More teething...
A quick explore led me to the UK snapshot, with lots of interesting web pages summarising data. As an example, there is a page headed "Acid Rain" under the "Environment" section; you get a graph and a few paragraphs of text, eg "Emissions of chemicals that can cause acid rain fell by 53.8 per cent between 1990 and 2005, from 6.9 million tonnes to 3.2 million tonnes." In fact, the page doesn't tell us whether rain actually became less acidic during this period, but that's a quibble.
But I was looking for data, not summaries. I found some under "Time Series data", but had to go through a complicated sequence of selections to find an actual dataset. I selected Share Ownership, then "Total market value by sector of beneficial owner: end-2006", then "DEYQ, SRS: Ungrossed: Total Market Value:Individuals" before I got a download button. The 3 download options were:
- View on Screen
- Download CSV
- Download Navidata
,"DEYQ",All fine, I think... except I could not find any way to do this automatically. Maybe they have an API I haven't found, maybe they have plans not yet come to fruition. Anyway, some good stuff here, but perhaps room for improvement?
" 1998",154.8,
" 1999",163.3,
" 2000",181.0,
" 2001",148.4,
" 2002",104.0,
" 2003",136.0,
" 2004",122.3,
" 2005",..,
" 2006",155.8,
"DEYQ","SRS: Ungrossed: Total Market Value:Individuals"
,"Not seasonally adjusted"
,"Updated on 8/ 6/2007"