Wednesday 21 October 2009

New issue of IJDC

The latest issue (volume 4, issue 2) of the International Journal of Digital Curation is now available. It's a bumper issue, with two letters to the editor (a whiff of controversy there!), 8 peer-reviewed papers (originating from last year's International Digital Curation Conference), and 6 general articles (two of which came from last year's iPres08 conference). I'm really pleased with this issue, which as always is extremely interesting.

This is the last issue to be produced by Richard Waller as Managing Editor, and I'd like to pay tribute to his dedication in making IJDC what it is today. He has sourced most of the general articles himself, and those who have worked with him as authors will know the courteous detail with which he has edited their work. They may not know the sheer blood, sweat and tears that have been involved, nor the extraordinarily long hours that Richard has put in to make IJDC what it is, alongside his "day job" of editing Ariadne. Thank you so much, Richard.

We will have a new Production Editor for the next issue, whom I will introduce when that comes out (we hope at about the same time as this year's International Digital Curation Conference in London... have you registered yet?). We have some interesting plans to develop IJDC in volume 5, next year.

Update: I thought I should have said a bit more about the contents, so the following is abridged from the Editorial.

Two papers are linked by their association with data on the environment. Baker and Yarmey develop their viewpoint with environmental data as background, but their emphasis is more on arrangements for data stewardship. Jacobs and Worley report on experiences in NCAR in managing its “small” Research Data Archive (only around 250 TB!).

Halbert also looks at elements of sustainability, in distributed approaches that are cooperatively maintained by small cultural memory organizations. Naumann, Keitel and Lang report on work developing and establishing a well-thought-out preservation repository dedicated to a state archive. Sefton, Barnes, Ward and Downing address metadata, plus embedded semantics; their viewpoint is that of the document author. Gerber and Hunter similarly address metadata and semantics, this time from the viewpoint of compound document objects.

Finally, we have two papers loosely linked through standards, though from different points on the spectrum of the general to the particular, as it were. At the particular end, Todd describes XAM, a standard API for storing fixed content; while from the more general end, Higgins provides an overview of continuing efforts to develop standards frameworks.

Moving on to general articles, in this case I would like to mention first my colleagues Pryor and Donnelly, who present a white (or possibly green?) paper on developing curation skills in the community.

Next, I would highlight two very interesting articles that originated from iPres 2008. These are Dappert and Farquhar, who look at how explicitly modelling organisational goals can help define the preservation agenda. Woods and Brown describe how they have created a prototype virtual collection of 100 or so of the thousands of CD-ROMs published from many sources, including the US Government Printing Office. Shah presents the second part of his interesting independently-submitted work on preserving ephemeral digital videos. Finally, Knight reports from a Planets workshop on its preservation approach, while Guy, Ball and Day report from a UK web archiving workshop.

Monday 19 October 2009

SUN PASIG: October 2009

As readers of this blog may have guessed, I was in San Francisco for the iPres 2009 Conference (17 blog posts in 2 days is something of a personal record!). This conference was followed by several others, including the Sun Preservation & Archiving SIG (Sun-PASIG), from Wednesday to Friday. I didn't feel quite so moved to blog the presentations as at iPres (and I was also knackered, not to put too fine a point on it). But I did not want to pass it by completely unremarked, particularly as I really like the event. This is the second Sun-PASIG meeting I've attended, following one in Malta in June of this year (see two previous blog posts).

It's a very different kind of meeting from iPres. The agenda is constructed by a small group, forcefully led by Art Pasquinelli of Sun and Michael Keller of Stanford. The presentations are just that; not papers. This lets them be more playful and pragmatic, also more up-to-date. Of course, there's a price to pay for a vendor-sponsored conference, although I won't reveal here what it is!

Tom Cramer has put up the slides at Stanford, so you can explore things I was less interested in. In the first session, the presentation that really grabbed me was Mark Leggott from Prince Edward Island (I confess, guiltily, I don't really know where this is) talking about Islandora. This is a munge of Fedora and Drupal, with a few added bits and bobs. It looked like a fantastic example of what a small, committed group with ideas and some technical capability can do. Nothing else on day 1 caught my imagination quite so strongly, although I enjoyed Neil Jeffries' update on activities in Oxford Libraries, and Tom Cramer's own newly pragmatic take on a revised version of the Stanford Digital Repository.

On day 2 there were lots of interesting presentations. Of particular interest perhaps was the description of the University of California Curation Center's new micro-services approach to digital curation infrastructure. I'm not quite sure I get all of this, mainly perhaps as so much was introduced so quickly; however as I read more about each puzzling micro-service, it seems to make more sense. BTW I congratulate the ex-CDL Preservation Group on their new UC3 moniker! 'Tis pity it came the same week as the New York Times moan about overloading the curation word (here if you are a registered NYT reader)...

I also very much liked the extraordinary presentation by Dave Tarrant of Southampton and Ben O'Steen of Oxford on their ideas for creating a collaborative Cloud. Just shows what can be done if you don't believe you can't! The slides are here but don't give the flavour; you just had to be there.

In a presentation particularly marked by dry style and humour, Keith Webster of UQ talked about Fez, and shortly after Robin Stanton of ANU talked about ANDS; both very interesting. The day ended with a particularly provocative talk by Mike Lesk, once at NSF for the Digital Library Initiatives, now at Rutgers. Mike's aim was to provoke us with increasingly outrageous remarks until we reacted; if he failed to get a pronounced reaction, it was more to do with the time of day and the earlier agenda. But this is a great talk, and mostly accessible from the slides.

On the 3rd day, we had a summing up from Cliff Lynch, interesting as ever, followed by breakouts. I went to the Data Curation group (surprise!), to find a half dozen folk, apparently mostly from IT providers, very concerned about dealing with data at extreme scale. It's a big problem (sorry), but not quite what I'd have put on the agenda. But in a way it typifies Sun-PASIG: never quite what you thought, always challenging and interesting.

Shortly thereafter I had to leave, but in the middle of a fascinating discussion about the future of Sun-PASIG, particularly with the shadow of the Oracle acquisition looming. I certainly believe that the group would be useful to the new organisation, and very much hope that it survives. Next year in Europe?

Wednesday 7 October 2009

iPres 2009: van Horik on MIXED framework for curation of file formats

Scholars in the Netherlands can deposit or search information in a repository system called DANS EASY, containing about 500,000 files, with a wide diversity of formats. How do I deal with a file called cars.DBF, now in an obsolete format? Their system can read such formats and convert them to the XML-based MIXED format, which identifies the data type and contains information on structure and content. So this was a smart conversion from the binary, obsolete dBase file to a reusable XML file. In the future it can be converted from this format to a current format of choice. This process (allegedly) does not require multiple migrations…
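By way of illustration only (my sketch, not the MIXED tool itself), a dBase-to-XML conversion of the kind described might look like this in Python, assuming the third-party dbfread library; the file and element names are placeholders:

    # Sketch of a dBase-to-XML conversion of the kind MIXED performs.
    # Not the MIXED tool; assumes the third-party dbfread library.
    import xml.etree.ElementTree as ET
    from dbfread import DBF

    def dbf_to_xml(dbf_path, xml_path):
        table = DBF(dbf_path)                      # parses the obsolete binary DBF format
        root = ET.Element("table", source=dbf_path)
        for record in table:                       # each record is an ordered field->value mapping
            row = ET.SubElement(root, "row")
            for name, value in record.items():
                ET.SubElement(row, "field", name=name).text = "" if value is None else str(value)
        ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)

    dbf_to_xml("cars.dbf", "cars.xml")             # illustrative file names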

They have an SDFP community model for spreadsheet and tabular data. They have created some code, available on SourceForge, for DBF and DataPerfect formats that they had to reverse engineer; this is a very labour-intensive activity, and really should be a community effort.

Question: does reverse engineering expose to risk? Don’t know…

iPres 2009: Brown on font problems

They have a very large collection of documents, some of which had Texas Instrument calculator fonts, which had maths symbols in them, but didn’t always render properly with font substitutions. Several other examples, including barcode fonts (where font substitution can give the numeric value, not losing information but losing functionality).

The top 10 fonts in a collection tend to be the same; it's the long tail of up to 3,000 or so that might be the problem. Font names help a bit, but there are huge variations in font names, eg 50+ for Arial alone! In fact, it's quite difficult to get useful matches from font names against fonts in font tables, some of which have very weak information content. Times New Roman satisfies about 38% of documents in their collection; Windows XP + Word satisfies about 80% of the documents in the collection; the large collection of fonts they assembled would satisfy about 95% of the collection, and many more would be needed to build that up higher.
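To make the matching problem concrete, here is a rough sketch of my own (not theirs) of fuzzy font-name matching using Python's standard difflib; the font list and threshold are invented for illustration:

    # Sketch of matching document font names against an available-font table.
    # Fuzzy matching via the standard library; the font names are illustrative.
    import difflib

    available = {"Times New Roman", "Arial", "Courier New", "Symbol"}

    def normalise(name):
        # lower-case and drop spaces/hyphens so vendor variants compare cleanly
        return name.lower().replace("-", "").replace(" ", "")

    def best_match(requested, cutoff=0.75):
        table = {normalise(f): f for f in available}
        hits = difflib.get_close_matches(normalise(requested), list(table), n=1, cutoff=cutoff)
        return table[hits[0]] if hits else None    # None means a substitution decision is needed

    print(best_match("TimesNewRomanPSMT"))         # probably "Times New Roman"
    print(best_match("Glasnost-Light"))            # None: no safe substitute in the table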

Worst example was a Cyrillic font, called Glasnost-light but rendered as ASCII; the problem was related to the pre-Unicode code space in some way I didn’t understand. A font substitution looked hopeful; it produced Cyrillic, but unfortunately not Russian, as the encoding was different.

Comment: this is a difficult problem much dealt with in the commercial community, who have secret tables. But even Adobe only deals with a couple of thousand fonts.

Tuesday 6 October 2009

iPres 2009: Tarrant, the P2 Registry, Where the Semantic Web and Web 2.0 meet format risk management

The P2 Registry is a demo of what we can do if we publish in a Web 2.0 fashion. The mainstream here is the web, for the community.

Linked data: every slide has links to where the stuff comes from. See the Linked Data graph; let's get into that graph. Using linked data reduces redundancy, facilitates re-use and maximises discovery. The community is not just consumers, but also publishers. Because of links to namespaces, this contributes to building trust.

The main node is DBpedia, which is in fact Wikipedia marked up as RDF. Lots of people reference it and link to it. Give URIs to things: Tarrant has a URI; his home page is not him, so it has a URL that's not the same (but related).

4 rules of linked data: use URIs as the names of things; use HTTP URIs so they can be looked up; when someone looks them up, provide useful information; and include links to other useful things.

Here, data are facts, and facts are represented as triples, in RDF. OWL & RDFS provide means to represent your RDF model. It's machine readable and validatable. Importing data from multiple domains, you can use OWL to say a thing in one domain is the same as another thing in a different domain. They used PRONOM and Wikipedia to build a small ontology that describes what can be done by different software. The underlying registry is a triple store, which understands RDF, so 19 possible answers are turned into 70 with some data alignment. They then used these data to perform a basic risk analysis on PDF.
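To make the alignment idea concrete, here is a small sketch of my own (not the P2 Registry code) using the rdflib library; the URIs are placeholders rather than real PRONOM or DBpedia identifiers:

    # Sketch of the owl:sameAs alignment idea, assuming the rdflib library.
    # The URIs below are placeholders, not real PRONOM/DBpedia identifiers.
    from rdflib import Graph, URIRef, Namespace
    from rdflib.namespace import OWL

    P2 = Namespace("http://example.org/p2/")       # hypothetical registry namespace
    g = Graph()

    pronom_pdf = URIRef("http://example.org/pronom/fmt-pdf")
    dbpedia_pdf = URIRef("http://example.org/dbpedia/PDF")
    viewer = URIRef("http://example.org/software/some-viewer")

    # Facts from two different sources about "the same" format...
    g.add((viewer, P2.renders, dbpedia_pdf))
    g.add((pronom_pdf, OWL.sameAs, dbpedia_pdf))   # ...aligned with a single sameAs triple

    # After alignment, a question about the PRONOM URI can also pick up the DBpedia facts.
    for s, p, o in g.triples((None, OWL.sameAs, None)):
        print(s, "is the same as", o)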

Take home message: data hidden in registries is not easily discoverable so is little used, so publish it on the web and it can be much more widely used.

Trust seems an issue across so many namespaces, but hopefully it all works out…

iPres 2009: Kirschenbaum & Farr on digital materiality: access to the computers

This seems to be about the digital equivalent of literary personal papers; an urgency based on the recent deaths of authors like John Updike & others. Based on planning grant funding from NEH, resulting in a deliverable as a White Paper.

Digital objects in this case are artefacts, not just records; both the physical and the virtual require materiality. Some of this is regarding the computers as important parts of the creative context.

Recommendation: keep the hardware and storage media. You can tell things from the hand-writing on diskette labels, etc.

Recommendation: Image disks (both pictorial images, but also forensic imaging), see Jeremy Leighton John.

Recommendation: computer forensics (see forthcoming CLIR/Mellon report on Computer Forensics in Cultural Heritage, expected to be available next fall).

Recommendation: document the original environment, eg 360 degree views.

Recommendation: there is value in interviewing the donors themselves.

Recommendation: since they are balancing lots of needs, they need to put careful thought into interface development.

Recommendation: scholarly communication needs: new tools and methodologies are needed for citation (eg of a tracked change in a Word document), reproduction, and copyright and IP issues. White paper available at http://www.neh.gov/ODH

There is a time window open now that may not stay open for long, for computers from the early 1980s!

iPres 2009: Guttenbrunner on Digital Archaeology, recovering digital objects from audio waveforms

Early home computers often used audio cassettes as data media. Quite a bit of such data still exist in audio tapes in various archives, getting in worse and worse condition. Can they migrate the data without the original system in the future?

The system they used is the Philips Videopac+ G7400, basically a video game system released in 1983… and another one (!).

Data are encoded in bitstreams, which in turn are encoded in analogue waveforms (via a microphone/headphone socket pair and an audio cassette system!). They worked out how the waveforms responded to changes in the data (basically reverse-engineering the data encodings; would not have been so easy without a working computer). As a result, they were able to write a migration tool from the audio streams to non-obsolete formats.
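Purely to illustrate the idea (my sketch, not their tool; real encodings vary by system and the threshold below is invented), recovering a bitstream from pulse lengths in a WAV capture might look roughly like this, using only the Python standard library:

    # Sketch of recovering bits from an audio capture of a data tape.
    # Many home-computer encodings distinguish 0 and 1 by short versus long pulses.
    import array
    import wave

    def wav_to_bits(path, threshold_frames=20):    # threshold is illustrative
        with wave.open(path, "rb") as wf:
            assert wf.getsampwidth() == 2 and wf.getnchannels() == 1   # 16-bit mono assumed
            samples = array.array("h", wf.readframes(wf.getnframes()))

        # Measure the distance between successive zero crossings and classify each pulse.
        bits, last_crossing = [], 0
        for i in range(1, len(samples)):
            if (samples[i - 1] < 0) != (samples[i] < 0):               # sign change = zero crossing
                pulse_len = i - last_crossing
                last_crossing = i
                bits.append(0 if pulse_len < threshold_frames else 1)
        return bits

    bits = wav_to_bits("tape_side_a.wav")          # illustrative file name
    print(len(bits), "pulses decoded")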

It turned out there was already a solution that worked where there was a good signal from the tape, but these were often very old tapes in poor condition, so they implemented a different approach, which worked better.

Using old tapes, the other system found no files. The actual system recovered 6 out of 23. Their new implementation recovered 22 out of 23 files, in some cases with errors. They checked by re-encoding the recovered files (on new tapes) and reloading them into the actual system; most had minor errors that could be fixed if you knew what you were doing.

They think their findings are valid for all systems that use audio encodings, although there will be wide variations in encodings and file types, but it’s not extensible to other media types.

iPres 2009: Pennock on ArchivePress

Blogs are a new medium but an old genre, witness Samuel Pepys’ diaries for instance (now also a blog!). But since they are web based, aren’t they already archived through web archiving? However, simple web archiving treats blogs simply as web pages; pages that change but in a sense stay the same. Web archiving also can’t easily respond to triggers, like RSS feeds relating to new postings. Web archiving approaches are fine, but don’t treat the blogs as first class objects.

New possibilities can help build new corpora for aggregating blogs to create a preserved set for institutional records and other purposes. ArchivePress is a JISC Rapid Innovation (JISCRI) project, which once completed will be released as open source. The project started with a small 10-question survey, for which the key question was: which parts of blogs should archiving capture? In descending order the answers were posts, comments, tag & category names, embedded objects, and the blog name & URLs. These findings were broadly in agreement with an earlier survey (see paper for reference).

They set out to find the significant properties of blogs, which they see as being in the eye of the stakeholder. In the first round these include content (posts, comments, embedded objects), context (including authors & profiles), structure, rendering and behaviour.

To achieve this, they build on the Feed plugin for WordPress, which gathers the content as long as an RSS or Atom feed is available. WordPress is arguably the most widely used blog platform; it's open source, it's GPL, and it has publicly available schemas.
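For flavour, feed-based harvesting of this kind can be sketched in a few lines of Python (my illustration, not ArchivePress code), assuming the feedparser library; the feed URL is a placeholder:

    # Sketch of feed-based blog harvesting of the kind ArchivePress does.
    # Not ArchivePress code; assumes the feedparser library.
    import json
    import feedparser

    def harvest(feed_url, out_path):
        feed = feedparser.parse(feed_url)
        posts = [{
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "published": entry.get("published", ""),
            "content": entry.get("summary", ""),    # how much content appears depends on the feed
        } for entry in feed.entries]
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(posts, f, indent=2)           # captured posts, ready for ingest elsewhere
        return len(posts)

    print(harvest("http://example.org/blog/feed/atom/", "harvested_posts.json"), "posts captured")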

Maureen showed the AP1 demonstrator based on the DCC blogs [disclosure: I'm from the DCC!], including blog posts written today that had already been archived. The AP2 demonstrator (the UKOLN collection) will harvest comments, resolve some rendering and configuration issues from AP1, and allow administrators to add new categories (tags?).

It seems to work; there turned out to be more variation in feed content than expected. Configuration is tricky, so they must make it easier.

iPres 2009: Collaboration

iPres 2009: Martha Anderson on Enabling Collaboration for Digital Preservation

Collaboration is what you do when you can’t solve a problem by yourself. Digital Preservation is such a problem. That was Martha’s summary of her very interesting presentation recapping the NDIPP so far, and giving some excellent guidelines relating to modes of collaboration. She spoke also about an upcoming National Digital Stewardship collaboration, which if I understood it is based round organisations (government?) taking some shared responsibility for the future of their data.

iPres 2009: Panel on challenges on distributed digital preservation

All the speakers participate in Private LOCKSS Networks (PLNs), although there are others eg Chronopolis. Meta Archive Cooperative is growing slowly, recent new members include Hull in the UK, but has a list of up to 40 potential associates. Alabama Digital Preservation Network (ADPnet?) focuses particularly on being simple and cheap. Canadian library consortium (COPL?) has a PLN with 8 members out of 12 in the consortium.

Organisational challenges on starting up, eg creation of Meta Archive West as new startup versus extension of existing. Issues are the same: organisational, technology and sustainability, of which the first and last are the parts of the iceberg under the water! Some very interesting points made about many aspects of these networks.

iPres 2009: Micah Altman Keynote on Open Data

Open Data is at the intersection of scientific practice, technology, and library/archival practice. Claims that data are at the nucleus of scientific collaboration, and data are needed for scientific replication. Science is not just scientific; it becomes science after community acceptance. Without the data, the community can’t work.

Open data also support new forms of science & education: data-intensive science, which also promotes inter-disciplinarity. Open data also democratise science: crowd-sourcing, citizen science, developing country re-use, etc. Mentions Open Lab Notebook (Jean-Claude Bradley), Galaxy Zoo etc.

Open data can be scientific insurance; that little extra bit of explanation makes your own data more re-usable, and can give your project extended life after initial funding ends.

Data access is key to understanding social policy. Governments attempt to control data access “to evade accountability”.

Why do we need infrastructure? [Huh?] While many large data sets are in public archives, many datasets are hard to find. Even problems in professional data archives: links, identifiers, access control, etc. So, core requirements…

  • Stakeholder incentives
  • Dissemination inc metadata & documentation
  • Access control
  • Provenance: chain of control, verification of metadata & the bits
  • Persistence
  • Legal protection
  • Usability
  • Business model…

Institutional barriers: no-one (yet?) gets tenure for producing large datasets [CR: not sure that’s right, in some fields eg genomics etc data papers amongst highest cited]. Discipline versus institutional loyalties for deposit. Funding is always an issue, and potential legal issues raise their heads: copyright, database rights, privacy/confidentiality etc.

Social Science was amongst the first disciplines to establish shared data archives (eg ICPSR, UKDA etc), in the 1960s [CR: I believe as an access mechanism originally: to share decks of cards!]. Mostly traditional data, not far beyond quantitative data. More recently community data collections have been established, eg Genbank etc; success varies greatly from field to field. Institutional repositories mostly preserve outputs rather than data, and most only have comparatively small collections. They provide so far only bit-level preservation, mostly not designed to capture tacit knowledge, and have limited support for data. More recently still, virtual hosted archives are happening: institutionally supported but depositor-branded (?), eg Dataverse Network at Harvard; Data360, Swivel. Some of these have already gone out of business; what does that do to trust re persistence of service & data; can you self-insure through replication?

Cloud computing models are interesting, but mostly Beta, and often dead on arrival or soon after. What about storing data in social networks (which are often in/on the cloud)? Mostly they don’t really support data (yet), but they do “leverage” that allegiance to a scientific community.

Altman illustrated a wide range of legal issues affecting data; not just intellectual property, but also open access, confidentiality, privacy, defamation, contract. A traditional way of handling some of this was de-identification of data; unfortunately this is working less and less well, with several cases of re-identification published recently (eg the Netflix problem, Narayanan et al). [CR: refreshing to hear a discussion that is realistic about the impossibility of complete openness!]

So instead of de-identifying at the end, we’re going to have to build in confidentiality (of access) from the beginning! Current Open Access licences don’t cover all IP rights (as they vary so widely), don’t protect against 3rd-party liability, and are often mutually incompatible.

Altman ended on issues at intersections, starting with data citation: “a real mess”. At the least there should be some form of persistent identifier. UNF serves as a robust, coded data integrity check (approximation, normalisation, fingerprinting, representation). Technology can facilitate persistent identifiers [CR: not a technology issue!], deep citation (subsets), and versioning. Scientific practices evolve: replication standards, scientific publication standards.
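To give a flavour of the normalise-then-hash idea behind such fingerprints (my own rough approximation, emphatically not the actual UNF specification):

    # Rough approximation of a normalise-then-hash data fingerprint.
    # NOT the UNF specification: just round to a fixed number of significant
    # digits, serialise canonically, then hash the result.
    import base64
    import hashlib

    def rough_fingerprint(values, sig_digits=7):
        canonical = []
        for v in values:
            if isinstance(v, float):
                canonical.append(format(v, f".{sig_digits - 1}e"))   # canonical exponential form
            else:
                canonical.append(str(v))
        digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).digest()
        return base64.b64encode(digest).decode("ascii")[:22]          # short printable fingerprint

    # The same data, however it is stored, yields the same fingerprint.
    print(rough_fingerprint([1.0000000001, 2.5, "label"]))
    print(rough_fingerprint([1.0, 2.5, "label"]))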

There is a virtuous circle here: publish data, get data cited, encourages more data publication and citation!

Next BKN, which sounds like a Mendeley/Zotero/Delicious-like system, transforming treatment of bibliographies & structured lists of information.

The Dataverse network: an open source, federated web 2.0 data network, a gateway to >35,000 social science studies. Now being extended towards network data. Has endowed hosting.

DataPASS, a broad-based collaboration for preservation.

Syndicated Storage Project: replication ameliorates institutional risk to preservation. Virtual organisations need policy-based, auditable, asymmetric replication commitments. Formalise these commitments, and layer on top of LOCKSS. Just funded by IMLS to take the prototype, make it easier to configure, open source etc.

Prognostication: archiving workflow must extend backwards to research data collection [CR: Yeah!!!]. Data dissemination & preservation increasingly hybrid approach. Strengthening links from publication to data, makes science more accountable. Effective preservation & dissemination is a co-evolutionary process: technology, institution & practice all change in reaction to each other!

Question: what do you mean by extending backwards? Archiving often captured when the research is done; becomes another chore, lose opportunity to capture context. So if the archive can tap into the research grid, the workflow can be captured in the archive.

Question (CR): depositor/re-user asymmetry? It does exist; data citation can help this!

iPres 2009: Kejser on Danish cost model on migration

[CR: missed the start of this while posting the last one!]

They are using a cost model for digital curation, based on the functional breakdown from OAIS. Repeatedly break down activities until you get to costable components; it looks rather frightening. They have a use case for digital migration. Cost factors include format interpretation and software provision (development of reader, writer & translator). Interesting data in person weeks for development of migration, eg TIFF to PDF/A at 34.7 person weeks (!!)
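The break-down-until-costable idea itself is easy to sketch (my illustration; the activity names and person-week figures below are invented, not the Danish model’s):

    # Sketch of breaking activities down until each leaf is a costable component.
    # Activity names and person-week figures are invented for illustration only.
    breakdown = {
        "Ingest": {
            "format interpretation": 6.0,
            "software provision": {
                "develop reader": 10.0,
                "develop writer": 8.0,
                "develop translator": 12.0,
            },
        },
        "Bit-stream preservation": {"storage administration": 4.0},
    }

    def total_person_weeks(node):
        # Leaves are costable components; branches are summed recursively.
        if isinstance(node, dict):
            return sum(total_person_weeks(child) for child in node.values())
        return node

    print(total_person_weeks(breakdown), "person weeks in total")   # 40.0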

Reporting results of some earlier work: A-archives dating 1968-1998, very heterogeneous; B & C archives more recent and more homogeneous. Shows results from model predictions and actual costs; the differences arose mostly because the A archives were so hard. Also, for the better archives, the model did well overall but under-estimated some parts and over-estimated other parts.

Second test case was migration of 6 TB of data in 2000 files (very big ones: 300 MByte each). They bought software; the model over-estimated the “development” time on this basis, but under-estimated the processing, perhaps because of the very big files; throughput was very low.

Overall, they found that detailed cost factors make the model not an accurate predictor (but still useful). Precision an issue; models are inaccurate per se, but sometimes give impression of accuracy.

Searching for studies on format life expectancy and migration frequency [longer and less in my view].

Question: how about software re-use? They cost on a first mover basis. Also migration tools do also become obsolete.

Question: why did you think migrating from PDF was necessary? Hardly a format at risk. Turns out to be a move from proprietary to non-proprietary.

Question on scaling: from thousands to hundreds of millions of objects; will these models apply? Answer was that they will. [CR: doubt this; the biggest flaw in LIFE so far has been devastating scaling problems.]

Monday 5 October 2009

iPres 2009: Wheatley on LIFE3

Paul reviews the two earlier phases of LIFE; LIFE3 is UCL, BL & HATII [disclosure: DCC partner; disclosure: I’m also an “expert” on LIFE panels etc] at Glasgow University. They defined a lifecycle approach to costing, creating a generic model of the digital preservation lifecycle. LIFE3 is now trying to create a costing tool based on costing models based on stages of the digital lifecycle. It will use previous LIFE data, and also data from the Keeping Research Data Safe project.

Tool inputs: content profile, organisational profile and context. Lifecycle stages are creation/purchase, acquisition, ingest, bit-stream preservation, content preservation, and access. Where possible, exploit existing work, eg PLANETS work building on DROID, also FITS tool (?), also looking at DRAMBORA & DAF, plus PLANETS tool PLATO.

A template approach lowers the barrier for non-digital preservation people.

Context: still very much a hybrid world; analogue as well as digital. Non-digital not dying, but usage increasing. Also greater variety of digital content, eg video etc. Resources are currently 20:1 on preservation of non-digital to digital, but will need to move more towards 1:1. Need to think about the risk elements as well as cost elements.

LIFE is also expected in the BL to support preservation planning, eg in purchase/acquire/digitise, and in selecting appropriate preservation strategies. Finally, need to budget for resulting costing [CR: the feedback from prediction versus actual could be very interesting!]

Challenges and request for help: they had a simple categorisation of content type & complexity. This has been criticised, but without a better example being offered. Help, please. They also need more costing data. Finally, they will be trialling the models, and would like to hear from anyone who might want to participate in this.

iPres 2009: Conway on a Preservation Analysis Methodology

Based on DCC SCARP [disclosure: I'm at the DCC] & CASPAR projects. Need to do preliminary analysis of data holdings, then do a stakeholder and archive analysis. Eg a project started in the 1920s, which started from radio, through radar, later ionosphere studies. Then define a preservation objective, which should be well-defined, actionable, measurable, realistic. Assess this against a particular designated community (DC).

From this design preservation information flows; there are always important elements beyond the actual data that are important, eg software, documentation, database technologies, etc. Then do a cost/benefit/risk analysis. Interesting issue about the nature of the relationship between archivist and the science community (producing and consuming).

They seem not to want to define objectives in science discovery terms (eg gravity wave research from wind profile data) but much more specifically in terms of 11 specific parameters. Describes a rather over-the-top AIP including FORTRAN manuals, to read NetCDF files (maybe I misunderstood this bit).

They then find that this homework makes it easier to interface with DRAMBORA & TRAC for audit & certification, and the PLATTER tool from PLANETS. Work may also help to build business cases for preservation of these data.

Question: How well does this archivist/community relationship scale? It does not require that relationship, but exploits it where it exists. The point is to use all the assets you have.

Question: Different types of infrastructure, eg computer centres; have any taken initiatives themselves? Mostly at present it’s a “found” situation rather than a designed one.

Comment: worth looking at the DRIVER project, with concept of enhanced publication, ie data plus supporting documentation.

iPres 2009: Pawletko on TIPR’s progress towards interoperability

Motivation is to distribute data not just geographically but also across different technologies. Also preserving through software changes; forward migrate to later versions, or replacements. Also to have a succession plan for the case where the repository fails.

TIPR is defining a common exchange format. Involves FCLA using DAITSS, Cornell using ADORE but migrating to FEDORA, NYU using DSpace. FCLA have one AIP per intellectual entity, and they retain the first and the latest representation. Cornell hold one AIP for each representation. NYU also has one AIP (didn’t catch how it works).

Format is called the Repository Exchange Package (RXP) based on METS and PREMIS. Need to work with multiple sources, but contain sufficient data for the receiving repository to do what it needs. Minimal structure is 4 files in a directory. A METS document about the source repository, plus provenance and optional rights, plus the actual representations in the package. The second file contains information about provenance. Then two more PREMIS files (?); finally a files manifest (cf BAGIT). [I’m not sure I’m capturing this well, best look at the PPTs later. But why are the slides blue and yellow mixed up???]
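Since I’m unsure of the exact layout, the following is only a generic sketch (mine, in Python) of how a receiving repository might check that an exchange package directory contains the expected files; the file names are placeholders, not the real RXP specification:

    # Generic sketch of validating an exchange-package directory on receipt.
    # The file names are placeholders, not the real RXP specification.
    import os

    EXPECTED = ["mets.xml", "premis-events.xml", "premis-objects.xml", "manifest.txt"]  # hypothetical

    def check_package(package_dir):
        missing = [name for name in EXPECTED
                   if not os.path.isfile(os.path.join(package_dir, name))]
        if missing:
            raise ValueError(f"incomplete package, missing: {missing}")
        return True       # safe to hand over to the receiving repository's ingest process

    check_package("incoming/rxp-0001")   # illustrative path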

Transfer tests: a broadcast transfer, and a ring transfer. In the latter case, each RXP is ingested, then disseminated and sent on to the next, until it gets back to the first. They have built a lot of stuff, and implemented the broadcast transfer test. Next steps: the ring test, and try different (wacky!) RXPs.

Question: why use METS/PREMIS but not RDF & ORE? Familiarity!

Question: will this work with Bagit? Yes; they use Bagit right now…

iPres 2009: Schmidt on a framework for distributed preservation workflows

Schmidt is associated with the EU PLANETS project, building an integrated system for development & evaluation of preservation strategies. Environment based on service-oriented architecture, with platform, language and location independence.

Basic building blocks are preservation interfaces (the verbs). These define atomic preservation activities: low-level concepts & actions, light-weight & easy to implement. More than 50 tools are wrapped up as PLANETS services. Plus digital objects (the nouns): a generic data abstraction for modelling digital entities. It is a minimal & generic model for data management, with no serialisation schema, so you might perhaps create it from DC/RDF and serialise it with METS etc.
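As a rough illustration of the verbs-acting-on-nouns idea (my sketch, not the actual PLANETS API; all names are invented):

    # Sketch of "preservation interfaces (verbs) acting on digital objects (nouns)".
    # Not the actual PLANETS API; all names here are invented.
    from abc import ABC, abstractmethod
    from dataclasses import dataclass, field

    @dataclass
    class DigitalObject:                 # the noun: a minimal, generic data abstraction
        content: bytes
        metadata: dict = field(default_factory=dict)

    class Migrate(ABC):                  # a verb: one atomic, easy-to-implement interface
        @abstractmethod
        def migrate(self, obj: DigitalObject, target_format: str) -> DigitalObject: ...

    class UpperCaseTextMigration(Migrate):
        """Toy service wrapped behind the generic interface."""
        def migrate(self, obj, target_format):
            return DigitalObject(obj.content.upper(),
                                 {**obj.metadata, "format": target_format})

    result = UpperCaseTextMigration().migrate(DigitalObject(b"hello", {"format": "text"}), "TEXT")
    print(result.metadata)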

Digital Object Managers map from a source (eg OAI-PMH) to PLANETS DOs. There are PLANETS registry services. There is a workflow engine driven by templates. Developers create workflow fragments; experimenters select fragments and assemble, configure & execute them. Workflows are implemented by the workflow execution engine (WEE: a level 2 abstraction), which looks after access management etc. (?)

This ends up being not an out-of-the-box solution but an extensible network of services, capable of public deployment to allow sharing of resources and results.

Question: are services discoverable by format? Use PRONOM as format registry, also building an ontology that defines which property can be preserved.

Question: Does PLANETS assume a particular preservation strategy? Have tools for emulation and migration.

Question: are tools deployed outside the project? Not yet, but they are trying to figure out how.

iPres 2009: Wilkes on preservation in product lifecycle management

Wolfgang Wilkes is part of the EU Shaman project, running from 2008 to 2011, working in libraries & archives, e-science, and engineering environments; this talk covers the latter. Different phases of a product’s life generate different data; lots of this is required to maintain a product through its life. This is important for long-lived products: cars, aeroplanes, process plants. Many jurisdictions have strong legal requirements to keep data; there may also be contractual requirements. There are also economic reasons: a long-lived item needs modifications through its life, which will be helped by such information.

However, these data are very complex, structured data, often in tools with strongly proprietary data formats. Also the players in different phases of the lifecycle are very different. So ingest becomes a process, not an event. And close control over access will be essential because of high IP value.

The project’s focus is to look at the interaction of the Product Lifecycle Management systems and the digital preservation system. So there is a Shaman information lifecycle.

Pre-ingest: creation and assembly, important for capturing metadata and data. May need to transform proprietary information into standards-based (may lose some information, but that’s better than losing all of it!).

Post access: adoption and re-use. May need to transform back from standards to tool-specific formats.

They need to use the PLM system (which captures stuff in its own repository). The preservation system can’t work on its own; it needs preservation extensions to the core PLM system. So they need additional PLM functions, but also additional DP functions. Open research topics: detailed spec of the DP service interface, dealing with distributed archives, capturing & generation of metadata, linking to external ontologies, etc.

iPres 2009: Lowood on why Virtual Worlds are History

Starts with a couple of stories about the end of virtual worlds, and one about a player whose death was accompanied by outpourings of grief; it later turned out that even the player (and her death) were virtual.

Can replay files help? Can relive the actions of a long-dead game by a long-dead player. But even if we save every such replay, we still don’t save the virtual world.

Events in the world that start then end can leave no record; they get deleted and are no longer there to preserve. No newspapers will blow in the wind, no records in dusty digital filing cabinets. Context will have gone, even if you manage a perfect replay reconstruction of the game.

Time to get positive. The “How They Got Game” project at Stanford started with an artefact donation, and has now developed (with others, & NDIPP funding) into the “Preserving Virtual Worlds” project.

Should we try to preserve the game as digital artefact, or the documentation of context? Better not to take this either-or attitude, but it may be forced on us.

Replays can depend on the exact version of game software, which is constantly changing.

Project is taking a multi-stranded approach: saving player movies, crawling sites for documentation, etc. Perhaps can use the facilities of the virtual worlds themselves, eg virtual world coordinates to navigate; can you participate as a player and use this to capture stuff? [I’m not sure I’m getting this right, on the fly!]

What about access? Suggests it’s not a core concern for preservation [wrong??!!]. But they have some techniques that might help here.

Evaluating open world game platform, Sirikata, as a mechanism for preserving some virtual worlds, by moving maps and objects from one game space to another. Exported files from Quake to OpenVRML, then can import into other worlds like Sirikata.

iPres 2009 Conference 2: BRTF-SDPA panel

Brian Lavoie introduces the Blue Ribbon Task Force on Sustainable Digital Preservation and Access. Sustainability is much more than a technical issue; very much an economic issue. The Report of the task force (full disclosure: I’m a member of this TF) is due early next year.

Definition of economic sustainability was in the interim report, published last year. Requires:

  • Recognition of benefits [demand-side]
  • Incentives to preserve [supply-side]
  • Selection to match means to ends
  • Mechanisms to support ongoing allocation of resources
  • Appropriate organisation & governance

Digital preservation is a process where active intervention/investment is needed to reduce the risks to the assets. Mostly this has been seen in technical terms, but we also need to reduce the economic risks, and ensure active support by decision-makers.

Abby Smith moderating the panel, which includes Martha Anderson, Paul Courant (both on the TF), and Tricia Cruse (not on the TF but here acting as control). Although the TF is anglo-centric (albeit US-focused), we think the problem is universal. Questions first on demand-side.

Tricia Cruse on how the CDL Preservation program serves UC & the UC Libraries; the Libraries certainly see the value, but they are now taking the issue out to the academics. Phrase from the Climate Change debate: the challenge to preserve in a dynamic environment. The problem is that a 3- or 5-year grant doesn’t encourage long-term thinking. Need to re-articulate preservation as a way to help with their current problem, rather than the long term.

Martha Anderson on issues that are persuasive to Congress & national government. The most important argument is to demonstrate value to the nation. Again, this is value for now, rather than value for the future. Use for education is often a winning argument. NDIPP has conditions, one of which is “demonstrate they have used the money well” so gaining trust is an issue. Funding subject to changing administration, which does mean arguments have to be slightly longer term.

Paul Courant on how to deal with 5 years being the new forever. Except for those here present (and following remotely), almost nobody cares! Waxing rhapsodic doesn’t work well. Preservation is what major institutions are about? There’s a need to express non-monetised value. Libraries grew because bigger collections attract better faculty, but that compact is breaking down; I don’t need to be at an institution to use their resources. Plus the preservation was happening almost by accident, because the books tended to live so long. Is the answer to get the first 5-10 years out of the way? Hold on to things for “long enough”, so you can decide later on what is worthwhile. “Show me the value”. Now for the supply-side.

Tricia: at the administrative level. Can we try scare stories; what is the cost of not preserving? Can tools and services help people to exploit their data better, in ways that bolster reputation? Also, think of data more as publications; this could be a real incentive to researchers, but needs a data citation mechanism (eg the DOI-for-data movement; personally I’m a little unconvinced).

Martha: for cultural and public policy records, what are the barriers? Mass and complexity are barriers: the problem is so major, it’s overwhelming. The strategy is to share the work, share the burdens; this needs a public policy framework. Two aspects: one is change in copyright legislation (eg the section 108 report) to give more powers for other libraries to preserve. The second is tax incentives etc for individuals, corporations etc to pay attention to preservation, offsetting costs and enhancing motivations to donate.

Paul: what better incentives are there? The design of mechanisms is often difficult. The general principle is that demand creates supply. So if we can articulate demand, then supply will follow, but well-articulated demand means money attached! NSF conditions on data management plans are OK, but Paul is yet to meet the person who does not get the next grant because of doing a bad job previously, so this is perhaps not yet working. But it’s hard to get current money for many potential future uses, especially scholarly ones. An intermediate case is the notion of handoffs: those with current interest and those with future interest are different people, so can we put mechanisms in place (eg libraries) to deal with that?

Questions from the floor: what’s the economic argument for preserving open access journals, since there’s no financial interest? The only way those journals will become part of the formal record of scholarship, is if we do preserve them. [Mind you, OA journals are surely low-hanging fruit?]

How can economies of scale come into the argument? >500K libraries across the world; does this help or hinder? Too many dispersed, closed efforts. Preserving for everyone introduces the free rider problem, which reduces incentives to pay to preserve if you can live off the activities (and costs) of others. Can we build up networks to coordinate better? [Mind you, uncoordinated is good; avoids coordinated failure…]

Focus on 5 years is no use if the record set is going to be closed for 30 years? Perhaps there is a middle way here; ways to study closed portions of collections in privileged ways. Handoffs might help here (with anonymisation in place), but handoffs represent possible single points of failure.

Does winning the argument tip us over an economic cliff? Claim that costs of digital preservation are orders of magnitude higher than physical preservation [which I dispute!]. Paul: keeping print is much more expensive because of the huge space implications, and that’s a real cliff! Archives might be different in that respect.

To end, Abby asks that each organisation that can answer yes to the 5 implied questions come see us afterwards!

iPres 2009 Keynote: David Kirsch

Keynote from David Kirsch: Public Interest in Private Digital Records. The Corporation is an extremely powerful institution in society; we don’t take enough advantage of it. Would it be enough to save personal communications? He doesn’t think it’s enough… Public interest: the 17th-century jurist Matthew Hale: “When private property is affected with a public interest, it ceases to be juris privati only”. Harvard is the longest continuously incorporated institution in the US! Corporations are now legal persons. Now people want to be corporations! Where is this going?

Aha! There may be a public interest in their private records, but there is a problem in accessing them without infringing private rights. Should corporations have a right to be forgotten (apparently part of the EU charter of human rights)? Challenge: the digital record of business is at risk. The power of legal discovery means corporations don’t want to create records (something similar happened in the UK when Freedom of Information came in: shredders were furiously active). IT Knowledge Management makes corporate records more valuable, but lawyers want them destroyed ASAP.

Could corporations see their own self-interest in preserving their records? Can collective action help? Eg Chemical Industry Institute for Toxicology set up to research health impacts of formaldehyde… Possible National Venture Archive?

Possible “stroke of a pen” approach: create a public interest in the private records. Make a national register of Historical Documents. Escrow institutions, make the records “beyond discovery”? Technical redaction or selective invalidation? US taxpayers now own big companies like GM; for $50B, shouldn’t we at least get the records?

3rd possible mechanism: abandoned interest: failed companies lose power to dispose of records? Would need to revise the social contract of corporations. Working with a Silicon Valley venture capital liquidator. Trying to turn a records warehouse to an archive: which boxes do they want? (Looks like they should have hired an archivist!)

Otherwise try elsewhere, eg Canada, Finland etc. Finally, exploit the general statutes for incorporation, eg use of the term NewCo; we need a NewCo for preservation.

Do Something, but Do No Harm…