Digital Curation Blog: September 2008

Tuesday 30 September 2008

iPres 2008 Foundations session and closing remarks

David Rosenthal from Stanford on bit preservation; is it a solved problem? Would like to keep a very large amount of data (say 1 PB) for a very long time (say a century) with a high probability of survival (say 50%). Mean time to data loss predictions are optimistic because they make assumptions (often not published) such as uncorrelated failure (whereas failure is often correlated… how many pictures did we see of arrays of disks in the same room, candidates for correlated failures from several possible causes). Practical tests with large disk arrays show much more pessimistic results. David estimates that meeting his target of a PB for a century with 50% probability of loss would require “9 9s” reliability, which is unachievable. I find his logic unassailable and his conclusions unacceptable, but I guess that doesn’t make them wrong! I suppose I wonder what the conclusions might have been about keeping a Terabyte for 10 years, if we had made the calculation 20 years ago based on best available data, versus doing those calculations now.

(Later: perhaps I’m thinking of this wrong? If David is saying that keeping a PB for a century with a 50% or greater chance of ZERO loss is extraordinarily difficult, even requiring 9 9s reliability, then that’s much, much more plausible, in fact downright sensible! So maybe the practical certainty of loss forces us to look for ways of mitigating that loss, such as Richard Wright’s plea for formats that fail more gracefully.)
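To get a feel for the "zero loss" reading, here is a rough back-of-envelope sketch in Python. It is my own illustration, not David's calculation: it assumes every bit decays independently with the same half-life (exactly the uncorrelated-failure assumption he warns is optimistic), it takes 1 PB as 10^15 bytes, and the function name is made up for the example. It does not try to reproduce the "9 9s" figure, since these notes do not record what unit that was measured in.

```python
import math


def required_bit_half_life_years(n_bits: int, years: float, p_no_loss: float) -> float:
    """Half-life each bit needs so that ALL n_bits survive `years` with probability p_no_loss.

    Assumes independent bits with a common half-life H, so
    P(one bit survives)  = 2 ** (-years / H)
    P(all bits survive)  = 2 ** (-n_bits * years / H)
    Setting the second expression equal to p_no_loss and solving for H gives the result.
    """
    if not 0.0 < p_no_loss < 1.0:
        raise ValueError("p_no_loss must be strictly between 0 and 1")
    return n_bits * years / -math.log2(p_no_loss)


PETABYTE_BITS = 8 * 10**15  # assumption: 1 PB = 10**15 bytes

half_life = required_bit_half_life_years(PETABYTE_BITS, years=100, p_no_loss=0.5)
print(f"required bit half-life: {half_life:.1e} years")
```

Under these assumptions the required half-life is simply bits times years, about 8 × 10^17 years, which is vastly longer than anyone could measure in practice. That seems to me the heart of the problem: even if such reliability existed, we could not demonstrate it.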

Yan Han from Arizona looked at modelling reliable systems, once you have established data for the components. Some Markov chain models that rapidly got very complex, but I think he was suggesting that a 4-component model might approach the reliability requirements put forward by RLG (0.001%? where did that number come from?). It was not clear how this work would be affected by all the other error factors that David was discussing.

Christian Keitel from Ludwigsburg quoted one Chris Rusbridge as saying that digital preservation was “easy and cheap”, and then comprehensively showed that it was at least complicated and difficult. He did suggest that sensible bringing together of digital preservation and non-digital archival approaches (in both directions) could suggest some improvements in practice that do simplify things somewhat. So after the gloom of the first two papers, this was maybe a limited bit of good news.

Finally Michael Hartle of Darmstadt spoke on a logic rule-based approach to format specification, for binary formats. This was quite a complex paper, which I found quite hard to assimilate. A question on whether the number of rules required was a measure of format complexity got a positive response. For a while I thought Michael might be on the road to a formal specification of Representation Information from OAIS, but subsequently I was less confident of this; Michael comes from a different community, and was not familiar with OAIS, so he couldn’t answer himself.

In discussion, we felt that the point of David’s remarks was that we should understand that perfection was not achievable in the digital world, as it never was in the analogue world. We have never managed to keep all we wanted to keep (or to have kept) for as long as we wanted to keep it, without any damage.

At the end of the session, I felt perhaps I should either ask the web archivists to withdraw my quoted remarks, or perhaps we should amend them to “digital preservation is easy, cheap, and not very good”! (In self defence I should point out that the missing context for the first quote was: in comparison with analogue preservation, and we only needed to look out of our window at the massive edifice of the British Library, think of its companions across the globe, and compare their staff and budgets for analogue preservation with those applied to digital preservation, to realise there is still some truth in the remark!)

In the final plenary, we had some prizes (for the DPE Digital Preservation Challenge), some thanks for the organisers, and then some closing remarks from Steve Knight of NZNL. He picked up the “digital preservation deprecated” meme that appeared a few times, and suggested “permanent access” as an alternative. He also demonstrated how hard it is to kick old habits (I’ve used the deprecated phrase myself so many times over the past few days). Steve, I think it’s fine to talk about the process amongst ourselves, but we have to talk about the outcomes (like permanent access) when we’re addressing the community that provides our sustainability.

Where I did part company with Steve (or perhaps, massively misunderstand his intent), was when he apparently suggested that “good enough”… wasn’t good enough! I tried to argue him out of this position afterwards (briefly), but he was sure of his ground. He wanted better than good enough; he felt that posterity demanded it (harks back to some of Lynne Brindley’s opening comments about today’s great interest being yesterday’s ephemera, for instance). He thought if we aimed high we would fail high. Fresh from my previous session, I am worried that if we aim too high we will spend too high on unachievable perfection, and as a result keep too little, and ultimately fail low. But Steve knows his business way better than I do, so I’ll leave the last word with him. Digital Preservation: Joined Up and Working (the conference sub-title) was perhaps over-ambitious as a description of the current state, he suggested, but it was a great call to arms!

iPres 2008 Web archiving

Thorsteinn Hallgrimsson from Iceland, talking about the IIPC. Established to foster development and use of common tools, techniques and standards, and to encourage national libraries to collect web materials. Now have 38 members, heavily concentrated in Europe and US. Currently in Phase 2 and planning Phase 3; Phase 2 has 4 working groups, on access, harvesting, preservation and standards. Achievements include enhancements to the Heritrix crawler, tools like Web Curator from BL/NZNL, and Netarchive Curator Tool Suite from Denmark, plus access tools including NutchWAX for indexing, and an open source version of the Wayback Machine. Also a WARC standard near the end of the ISO process. Three main approaches: bulk, selective based on criteria, eg publicly funded stuff, and event-based such as around an election, disaster, etc. Iceland with 300,000 population already has 400 million documents, 8 TB data, so larger domains soon get into billions of documents, and many, many terabytes!

Helen Hockx-Yu from the BL talking about web archiving in the UK. Started with UKWAC in 2004, which does permission-based selective archiving. Covers 3,700 unique web sites and >11,000 instances with about 2 TB data. Building an Ongoing Web Archiving Programme for the UK domain. Biggest problem really is that legal deposit legislation is not yet fully implemented, and not expected until 2010! Failing the legislative mandate, permission-based archiving is slow and typical success rate is 25%. So suspect valuable web sites are already being lost.

Birgit Henriksen from Denmark: they have been archiving some parts of the web since 1998, and following new legislation, have been doing full Danish domains since 2005. In 2008 2.2 billion objects, 71 TB data, 5 FTE staff. Access is only for research and statistical purposes, which is hard to implement. They do bulk harvests quarterly, selective sites more frequently (sometimes daily), and event-based. Events are things that create debate or are expected to be of importance to Danish history. Preservation involved from the start: redundancy in geography, hardware architecture and vendor, storage media and software, and do active bit preservation based on checksum comparisons.
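As an aside, "active bit preservation based on checksum comparisons" can be sketched very simply. What follows is a minimal illustration of the general idea and not the Danish implementation, which was not described in that detail: the replica names, the use of SHA-256 and the majority-vote rule are all assumptions of mine.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Checksum of one stored copy."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(replicas: dict[str, bytes]) -> list[str]:
    """Return the names of replicas whose checksum disagrees with the majority.

    Raises ValueError when no single checksum is held by more than half of the
    copies, because then there is no trustworthy copy to repair from.
    """
    if not replicas:
        raise ValueError("no replicas to audit")
    digests = {name: sha256_hex(blob) for name, blob in replicas.items()}
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no majority checksum; manual investigation needed")
    return sorted(name for name, digest in digests.items() if digest != majority_digest)


# Hypothetical copies held on different media at different sites.
copies = {
    "site_a_disk": b"harvested page content",
    "site_b_tape": b"harvested page content",
    "site_c_disk": b"harvested page c0ntent",  # simulated silent corruption
}
suspect = audit_replicas(copies)
print("copies needing repair:", suspect)
```

The point of the redundancy in geography, hardware, media and software is to make it unlikely that a majority of copies go wrong in the same way at the same time; the checksum comparison only finds the odd one out.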

Gildas Illien from BNF in France; they have had a legal deposit law since 2006. Now have 12 billion URIs, 120 TB, 7 FTE, 100 curators and partners involved. Partnered with Internet Archive. Provided in-house access since April 2008. Challenge of change: digital curators not a new job, need to change librarianship. Legal Deposit team as “honest broker” between collections/curators and the IT development and operations.

Colin Webb from NLA in Australia: been doing selective web archiving since 1 April, 1996, with negotiated permissions, quality control and access (however, get about 98% permission success). Also full domain harvests each year since 2005 (1 billion files this year). No legal deposit law (yet, despite lobbying), so have not yet provided access to the full domain crawls. Challenges are interconnected: what we want to collect, what we’re allowed to collect, what we’re able to collect (?) and what we can afford to collect.

Colin then talking about the preservation working group of IIPC, set up in 2007 after the initial concerns to get capture organised had abated. Brief: to identify preservation standards and practices applicable to web archives. Raises questions about whether web preservation differs from “ordinary” digital preservation. What about the scale effects? Work plan includes annual survey of web technical environment (browsers needed, etc). Comment suggested how important this was, hoped they would publish results and continue for many years to do so.

Question: can people whose web site is at risk get it preserved? This has started to happen. Also starting to see requests from people who have had disasters, wanting (and getting) help in recovering their web sites! Nice French story about political parties who have re-done their web sites, and now don’t remember what they promised a few years ago!

iPres 2008 Building Trust

Tyler Walters of GIT comparing trust in the banking system with trust in repositories. The upheavals in banking currently show how easily trust can be upset, and how hard it is to regain once lost. Further desk study on trust; 3 major sources (see paper for references). Useful idea of trust antecedents: what comes before an event that helps you trust, subjectively. Trust that endures is more likely to continue if there’s a breach of trust. Trust is a relationship between people, more than institutions. Trust development: engage, listen, frame issues, envision options, commit. Looks like a paper that’s well worth reading; it’s always seemed to me that trust is about much more than certifying against checklists.

Susanne Dobratz from Humboldt University on quality management and trust, work done in the nestor programme. Should we have separate standards and certification mechanisms (as in the CCSDS Repository Audit and Certification work), or build on existing security and quality management standards? Working through a project to assess applicability of these latter standards (ISO 9000, ISO 27000 etc). Did an analysis and a survey with a questionnaire to 53 persons, got responses from 18 (most responded but were not really able to answer the questions). DINI guidelines the most popular, formal standards little used. Standards mostly used as guidelines rather than formally (makes sense to me, having used ISO security standards that way). 94% knew their designated community, but only 53% have analysed the needs of their designated community. 61% were unsure whether a separate standard for trusted repositories was important; similar proportion unsure whether they would go through any certification process. As a result, I think they are interested in creating a smaller, more informal quality standard along the lines of DINI etc.

Sarah Jones on developing a toolkit for a Data Audit Framework: to help institutions to determine what data are held, where they are located and how they are being managed. 4 pilot sites including Edinburgh and Imperial already under way, UCL and KCL in planning. Detailed workflow has been developed as a self-audit tool. Four phases in the audit process, the second being identifying and classifying the assets; looks like major work. Turns out the pilots are related to department level rather than institutions, which makes sense knowing academic attitudes to “the Centre”! I did hear from one institution that it was difficult getting responses in some cases. Simple online tool provided. DAF to be launched tomorrow (1 October) at the British Academy, together with DRAMBORA toolkit.

Henk Harmsen from DANS on the Data Seal of Approval. Since 2005, DANS has been looking after data in the social sciences and humanities, in the Netherlands. Aim was to make archives DANS-proof, ie to follow acceptable standards; initially not sure how to implement. Looked at existing documents, aim to be complementary to all these requirements. Go for guidelines and overview rather than certifying and many details, trust rather than control. So Seal of Approval almost more a Seal of Awareness. Funding agencies beginning to require the Seal. Data producer responsible for quality of research data, repository responsible for storage, management and availability of data, and the data consumer responsible for quality of use. 17 guidelines, 3 for producers, 11 for repositories, 3 for consumers! Requirement 13 says technical infrastructure must explicitly support tasks and functions from OAIS… not necessarily implemented the same way, but using that vocabulary. How to control? Certification takes a long time, major standards, elaborate agencies, high cost. Trust does need some support, though. First step: Data Seal of Approval Assessment, then other instruments like DRAMBORA etc. Looks like a nice, low impact, light touch first step towards building trust.

iPres 2008 National approaches

A bit late this morning (ahem!), so I missed some of the start of this. A very interesting and thoughtful exposition from Steve Knight of NZNL. They have a recently revised act, and have really taken digital preservation to heart: quote from their CEO “no job un-changed”! It’s not the platform or software, it’s the business approach, and management of provenance. They gave up on “pointless” discussions on build vs buy, commercial vs open source: what counts is the requirements. In the end they opted for a commercial solution from Ex Libris that is being implemented now (phase 1 to go live 30 October). Need joined up and working together tools; most are being implemented in isolation.

Plenty of discussion about collaboration. Vicky Reich suggested that perhaps collaboration might be more focused with specific issues to solve. For example, Steve Knight had suggested there was a need to decide on registry approaches (PRONOM, GDFR, RRORI). Maybe integrating tools might be another issue. Martha Anderson pointed out that successful partnerships centred around content on the one hand and mutual local benefit on the other.

Comment that the Digital Preservation Coalition in the UK and nestor in Germany and neighbouring German-speaking countries are both very valuable; is there a role for similar long-lived (non-project) pan-national organisations? Similar organisations suggested in Netherlands and Denmark, and also the Alliance for Permanent Access to the Records of Science.

Question from Steve Knight about how we move to a position where there is a market for digital preservation solutions?

Monday 29 September 2008

iPres 2008- past, present and future

Heike Neuroth talking about the start of iPres, conceived at IFLA in 2003. Starting in Beijing, then Germany, next the US at Cornell, then back to Beijing, before coming to the UK this year. She thinks this child is growing up unstoppably now. Adam Farquhar commenting on the increasingly practical nature of reports: “doing it”. Tools being improved through experience in use, at scale. More tools in the toolbox, and we can put them together more flexibly. This year at iPres we have full papers, which should allow more people to get involved in the discussion. Patricia Cruse on iPres 2009 to be hosted by CDL in San Francisco, October 5-6, brand new conference facility at UCSF, downtown (immediately after a large, free blue-grass music festival in Golden Gate Park). Theme is “moving into the mainstream”. Some continuity of programme committee. Question on how to better integrate research with iPres, for example the NSF Datanet initiative, which explicitly bridges the science-library divide. Seek to coordinate with the IDCC and Sun PASIG conferences, so we don’t get to hear the same thing each time.

iPres 2008: Risks and Costs

Paul Wheatley of the BL talking about LIFE2, a refinement of the earlier LIFE project. The LIFE model is a map of the digital life cycle from the point of view of the preserving organisation. The LIFE V2.0 model is slightly reorganised and refined, for example including sub-elements. It also has a methodology for application. In addition, there is the beginning of a Generic Preservation Model (GPM), derived from some desk study. Many issues still remain with the revised GPM, but further work will be carried out with the associated experts group, in anticipation of a possible LIFE3 project. Paul commented on how expensive it was to obtain these costs. It’s clear that much still needs to be done before these costs can be used as predictors.

Rory McLeod of the BL talking about a risk-based approach to preservation. The 8 Business Change Managers act through the risk assessment to identify the assets, the risks to those assets, and possible reactions to those risks (ie “save” those assets). 23 separate risks were identified, and were aggregated into 6 direct and 2 indirect groups of risks. Physical media deterioration was the major risk, together with technical obsolescence for hand-held media. These two groups were complemented by format obsolescence and 3 further software-related risk groups. The two indirect risks are related to policy. In risk assessment, they are using the AS/NZS 4360:2004 risk standard, plus the DCC/DPE DRAMBORA toolkit.

Richard Wright talking about storage and the “cost of risk”. In early days dropping a storage device meant losing a few kilobytes, now it could be GBytes and years of work. Storage costs declining and capacity increasing exponentially roughly related to Moore’s law (doubling every 18 months). Usage is going up, too, and risk is proportionate to usage, so risk is going up too. Risk proportional to number of devices and to size and to use… plus the more commonly discussed format obsolescence, IT infrastructure obsolescence etc. So if storage gets really cheap, it gets really risky! Control of loss gets most attention: increase MTBF, make copies, use storage management layers, introduce virtual storage, using digital library technology, etc. Mitigation of loss gets much less attention. Simple files can be read despite errors; more complex compressed files can be extremely fragile. Files with independent units have good properties. Reports work from Manfred Thaller of Köln: one bad byte affects only that byte of a TIFF (does this depend on selected compression?), 2% of a JPEG, and 17% of a JPEG2000. Demonstrated 5 errors on a PNG and a BMP: former illegible, latter has a few dots scattered about. Text files the best: one byte corruption affects only that byte! Risk can always be reduced by adding money: more copies, more devices, more reliable devices, less data per data manager. However, you actually have a finite budget, so the trade-offs are important. Can’t emphasise enough how important this is: one of the most worrying preconceptions in digital preservation is that the bit preservation element is a solved problem. It isn’t!
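The fragility point is easy to try for yourself. The sketch below is my own toy experiment, not Richard’s or Manfred Thaller’s: it uses zlib from the Python standard library as a stand-in for "a compressed format", and the sample text and helper names are made up. It damages one byte in a plain text buffer and one byte in the compressed version of the same text, then looks at what can be recovered from each.

```python
import zlib


def flip_byte(data: bytes, index: int) -> bytes:
    """Return a copy of data with every bit of one byte inverted."""
    damaged = bytearray(data)
    damaged[index] ^= 0xFF
    return bytes(damaged)


def count_differing_bytes(a: bytes, b: bytes) -> int:
    """Number of positions where two buffers differ, plus any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def try_decompress(data: bytes):
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None


original = ("The quick brown fox jumps over the lazy dog. " * 40).encode("ascii")
compressed = zlib.compress(original)

# Plain text: corrupt one byte in the middle and count the damage.
plain_damage = count_differing_bytes(original, flip_byte(original, len(original) // 2))

# Compressed: corrupt one byte in the middle of the stream, then try to read it back.
recovered = try_decompress(flip_byte(compressed, len(compressed) // 2))

print("plain text bytes damaged:", plain_damage)
if recovered is None:
    print("compressed copy: unreadable after a single bad byte")
else:
    print("compressed copy bytes damaged:", count_differing_bytes(original, recovered))
```

The plain text loses exactly the byte that was hit. With the compressed copy I would expect zlib to refuse the stream outright, which is the "fails badly" end of the spectrum; a format built from independent units would lose only the unit containing the error.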

Bill Lefurgy of Library of Congress setting the scene for a report on the impact of international copyright laws on digital preservation. Adrienne Muir summarising the 4 country reports. All had preservation-related exceptions to copyright and related laws, however none of them are fully adequate to allow preservation managers to do what they need to do. All inadequate in different ways; UK maybe the strictest (as ever!). Differences in scope (which libraries), purpose and timing of copying, and material types within scope. Different legal and voluntary deposit arrangements; none comprehensive, they are out of date and don’t reflect the digital, online world. Orphan works (those whose copyright owners cannot be identified) a problem. Technical protection measures are a real preservation problem: current laws often prohibit reverse engineering or other circumvention. In the UK, contracts can trump copyright! Access to preserved content is an issue. Rights holders worry about the effects on the market of any legal changes, and aim to prevent such change. Recommendations include: there should be laws and policies to encourage digital preservation.

Questions: should we just ignore the copyright problems like Internet Archive and Google? We don’t have their resources in case there are serious problems, and there are serious potential impacts if we stamp all over these rights. Maybe for high value stuff, rights would be more vigorously pursued by rights holders, but they may also have motivation to preserve themselves. So maybe we should be looking for the stuff that’s most at risk of loss, some of which is low risk from the rights holders.

iPres 2008 Preservation planning session

Starting with Dirk von Suchodoletz from Freiburg, talking about emulation. I’ve always had a problem with emulation; perhaps I’ve too long a memory of those early days of MS-DOS, when emulators were quite good at running well-behaved programs, but were rubbish at many common programs, which broke the rule-book and went straight into the interrupt vectors to get performance. OK, if you’re not that old, maybe emulation does work better these days, and is even getting trendy under the new name of virtualisation. My other problem is also a virtue of emulation: you will be presented with the object’s “original” interface, or look and feel. This sounds good, but in practice the world has moved on and most people don’t like old interfaces, even if historians may want them. I guess emulation can work well for objects which are “viewed”, in some sense; it’s not clear to me that one can easily interwork an emulated object with a current object. Dirk does point out that emulation does require running the original software, which itself may create licensing problems.

Gareth Knight talking about significant properties. We’ve discussed these before; conspicuously absent (is that an oxymoron???) from OAIS, an earlier speaker suggested the properties were less about the object than about the aims of the organisation and its community. But Gareth is pursuing the properties of the object, as part of the JISC-funded InSPECT project (http://www.significantproperties.org.uk/). They have a model of significant properties, and are developing a data dictionary for SPs, which (after consultation) they expect to turn into an XML schema. Their model has an object with components with properties, and also agents that link to all of these. Each SP has an identifier, a title, a description and a function. It sounds a lot of work; not clear yet how much can be shared; perhaps most objects in one repository can share a few catalogued SPs. It seems unlikely that most repositories could share them, as repositories would have different views on what is significant.
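To make the shape of that model a little more concrete, here is a toy sketch of how I picture it. This is emphatically not the InSPECT data dictionary or schema: the class names, the example properties and the helper function are all invented for illustration, and I have left out the agents entirely. Only the four fields on each SP (identifier, title, description, function) come from the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SignificantProperty:
    identifier: str
    title: str
    description: str
    function: str


@dataclass
class Component:
    name: str
    properties: list[SignificantProperty] = field(default_factory=list)


@dataclass
class DigitalObject:
    name: str
    components: list[Component] = field(default_factory=list)

    def property_ids(self) -> set[str]:
        """Identifiers of every SP attached to any component of this object."""
        return {p.identifier for c in self.components for p in c.properties}


def shared_property_ids(objects: list[DigitalObject]) -> set[str]:
    """SP identifiers that every object in the list has in common."""
    if not objects:
        return set()
    return set.intersection(*(o.property_ids() for o in objects))


# Hypothetical catalogue entries, made up for this example.
width = SignificantProperty("sp-001", "Image width", "Width in pixels", "rendering")
charset = SignificantProperty("sp-002", "Character set", "Text encoding", "interpretation")

report = DigitalObject("report", [Component("cover image", [width]),
                                  Component("body text", [charset])])
photo = DigitalObject("photo", [Component("image", [width])])

print(shared_property_ids([report, photo]))
```

Cataloguing SPs once and attaching them to many objects by identifier is how I imagine the sharing within one repository might work; whether identifiers could sensibly be shared between repositories is the harder question.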

Alex Ball is talking about the problems in curating engineering and CAD data. In what appears to be a lose-lose strategy for all of us, engineering is an area with extremely long time requirements for preserving the data, but increasing problems in doing so given the multiple strangleholds that IPR has: on the data themselves, on the encodings and formats tied up in specific tightly controlled versions of high cost CAD software, coupled with “engineering as a service” approaches, which might encourage organisations to continue to tightly hold this IPR. An approach here is looking for light-weight formats (he didn’t say desiccated but I will) that data can be reduced to. They have a solution called LiMMA for this. Another approach is linking preservation planning approaches with Product Lifecycle Management. In this area they are developing a Registry/Repository of Representation Information for Engineering (RRoRIfE). Interesting comment that for marketing purposes the significant properties would include approximate geometry and no tolerances, but for manufacturing you would want exact geometry and detailed tolerances.

Finally for this session, Mark Guttenbrunner from TUV on evaluating approaches to preserving console video games. These systems started in the 1970s, and new generations and models are being introduced frequently in this very competitive area. It might sound trivial, but Lynne Brindley had earlier pointed out that one generation’s ephemera can be another generation’s important resource. In fact, there is already huge public interest in historical computer games. They used the PLANETS Preservation Planning approach to evaluate 3 strategies: one simple video preservation, and 2 emulation approaches. It was clear that IPR could become a real issue, as some games manufacturers are particularly aggressive in protecting their IPR against reverse engineering.

In the Q & A session, I asked the panellists whether they thought the current revision process for OAIS should include the concept of significant properties, currently absent. A couple of panellists felt that it should, and one thought that the concept of representation information should be cleaned up first! Session Chair Kevin Ashley asked whether anyone present was involved in the revision of this critical standard, and no-one would admit to it; he pointed out how worrying this was.

iPres 2008 session 2

Oya Rieger has been speaking about their large scale book digitisation processes. They first entered an agreement with Microsoft, and later with Google; they were naturally very disappointed when Microsoft pulled out, although this did give them unrestricted access to the books digitised under that programme. On the down-side, they suddenly found they needed 40 TB of storage to manage these resources, and it took a year or so before they could achieve this. Oya related their work to the OAIS preservation reference model, and it was interesting to see not only that infamous diagram, but also a mapping of actual tools to the elements of the process model. It’s worth looking at her paper to see this; I noted that they were using aDORe for the archival storage layer, but there were several other tools that I did not manage to note down.

The next speaker, Kam Woods from Indiana, has been talking about problems in providing access to, and preserving, large CD collections published by the Government Printing Office in the US. With time, the old method of issuing CDs to individual readers and having them mount and view these on library workstations has become increasingly impractical, so they needed to make these available online and preserve them. Notable here was the extent of errors in an apparently well-standardised medium. Because of the wide variety of environments, their solution was a virtualisation approach; they have now made over 4000 CDs available online, see www.cs.indiana.edu/svp/.

Brian Kelly spoke very rapidly (40 slides in 15 minutes, but the slides are on Slideshare and a video of the talk will be posted as well) on the JISC PoWR project (Preservation of Web Resources, http://jiscpowr.jiscinvolve.org/). This is a very entertaining talk, and well worth looking up. Brian is not a preservationist, but is a full-blown technogeek discussing the roles of the latest Web 2.0 technologies on his blog, in his role as UK Web Focus. His initial finding at Bath, slightly tongue-in-cheek, was that neither the archivist nor the web-master were initially interested in the preservation of their web presence, although by the end of the project both were much more convinced! This project achieved a strong level of interaction through its several workshops, and will shortly release a guide resource.

Adrian Brown of TNA spoke for his colleague Amanda Spencer, unable to attend, about trying to improve web resource continuity through web archiving. He quoted, I think, an initial finding that 60% of web links in Hansard (the record of parliamentary debate) were broken!

Steve Abrams closed the session with an interesting talk on JHOVE2, currently in development. Most notable was a change from a 1-to-1 match between digital object (file) and format, recognising that sometimes one object has multiple components with different formats. There is an interesting change of terminology here: source units and reportable units, with often a tree relationship between them. Their wiki is at http://confluence.ucop.edu/display/JHOVE2Info/Home .

iPres 2008, session 1

Here are a few notes that I took during the initial iPres 08 session at the British Library this week. The opening keynote was from Dame Lynne Brindley of the BL, describing some of her institution’s approaches and projects. She also noted the increase in public awareness of the need for preservation, including the wide coverage of a project with a name like Email Month (although I don’t seem able to find this!). One reflection at the end of her talk was that maybe digital preservation as a term was not the most useful at this stage of development (as we have commented on this blog a while ago). Lynne suggested “preservation for access” as an alternative.

Neil Beagrie and Najla Rettberg spoke under the ironic title “Digital Preservation: a subject of no importance?”. Of course, they did recognise the importance (their point related to Lynne’s message), and found echoes of that importance in institutional policies at various levels; however, they were implicit rather than explicit. As an outcome of their study, they have apparently devised a model preservation policy, with 8 generic clauses, although we have not seen this model yet.

Angela Dappert spoke about some fascinating PLANETS work on relating preservation policy both to institutional goals and to a detailed model of the preservation approach, and particularly its risks. It is refreshing to see the emphasis on risks at this meeting: risks, rather than absolutes, imply choices on courses of action dependent on the probability and impact of the danger, and on the resources available. The presentation included some interesting modelling of preservation in context; too detailed to understand fully as she spoke, but worth another look in the proceedings, I think. One interesting observation (to me) was about the idea of “significant properties”: if I understood her point, she suggested that they were less about the digital object itself, and more about the institution’s goals and the intellectual entity the digital object represents.

The next two presentations were attempts at modelling long term preservation in the context of organisational goals, in the context first of the Netherlands KB, and secondly of the Bavarian State Library. These are, I think, signs of the increasing maturity of the integration of long term preservation.

Wednesday 24 September 2008

Data as major component of national research collaboration

This is perhaps the last of my posts resulting from conversations and presentations at the UK e-Science All Hands meeting in Edinburgh. This one relates to Andrew Treloar’s presentation on the Australian National Data Service (ANDS), and its over-arching programme, Platforms for Collaboration, part of the National Collaborative Research Infrastructure Strategy.

There are strong historical reasons why collaboration over a distance is very important for Universities and researchers in Australia. Now the Government seems to have really got the message, with this major investment programme. Andrew described how a basis of improving the basic infrastructure (network and access management) supports 3 programmes: high performance computing, collaboration services, and the Data Commons. The first two of these are equally important, but from my vantage point I’m particularly interested in the Data Commons.

The latter is provided (or perhaps supported; it’s highly distributed) by ANDS. The first business plan for ANDS is available at http://ands.org.au/andsinterimbusinessplan-final.pdf and will run until July 2009, with ANDS itself expected to run to 2011. The vision for ANDS is in the document “Towards the Australian Data Commons”.

The vision document:

"...identifies a number of longer term objectives for data management:
A. A national data management environment exists in which Australia’s research data reside in a cohesive network of research repositories within an Australian ‘data commons’.
B. Australian researchers and research data managers are ‘best of breed’ in creating, managing, and sharing research data under well formed and maintained data management policies.
C. Significantly more Australian research data is routinely deposited into stable, accessible and sustainable data management and preservation environments.
D. Significantly more people have relevant expertise in data management across research communities and research managing institutions.
E. Researchers can find and access any relevant data in the Australian ‘data commons’.
F. Australian researchers are able to discover, exchange, reuse and combine data from other researchers and other domains within their own research in new ways.
G. Australia is able to share data easily and seamlessly to support international and nationally distributed multidisciplinary research teams. (p. 6) "

Andrew writes:

“ANDS has been structured as four inter-related and co-ordinated service delivery programs:
Developing Frameworks [Monash]
Providing Utilities [ANU]
Seeding the Commons [Monash]
Building Capabilities” [ANU]

Andrew also mentioned the Science and Research Strategic Roadmap Review, published just this August, which seems to centre on NCRIS. This includes the notion of a “national data fabric”, based on institutional nodes.

ARCHER, mentioned earlier, provides candidate technology for an institution participating in ANDS. As Andrew pointed out in a comment responding to my confusion in the last post: “the CCLRC [metadata] schema is the internal schema being use by ARCHER to manage all of the metadata associated with the experimental data. ISO2146 is the schema being used by ANDS to develop its discovery service”. No doubt there are many ways institutional nodes can and will be stitched together, and it will be interesting to see how this develops.

Australians sometimes bemoan the lack of an Australian equivalent of JISC. However, in this case they appear to have put together something with significant coherence in multiple dimensions. On the face of it, this is more significant than any European or US programme I have seen so far. A lot depends on the execution, but with good luck and a following wind (and a fairly strong dose of “suspension of disbelief” from researchers), this could well turn out to be a world-beating data infrastructure programme.

For comparison, I’ll try to take a look at the emerging UK Research Data Service proposals, shortly. Perhaps that has the opportunity to be even better?

Friday 19 September 2008

ARCHER: a component of Australian e-Research infrastructure?

At the e-Science All Hands meeting, David Groenewegen from Monash spoke (PPT, also from Nick Nicholas and Anthony Beitz) about the outputs of the ARCHER project, almost finished, intended to provide tools for e-Research infrastructure. They see these e-Research challenges:

"Acquiring data from instruments
Storing and managing large quantities of data
Processing large quantities of data
Sharing research resources and work spaces between institutions
Publishing large datasets and related research artifacts
Searching and discovering"

The tools they have nearly completed (the project finishes this month) include:

"ARCHER Research Repository - for storing large datasets, based on SRB
Distributed Integrated Multi-Sensor & Instrument Middleware – concurrent data capture and analysis
Scientific Dataset Manager (Web) - for managing datasets
Metadata Editing Tool
Scientific Dataset Manager (Desktop) – for managing datasets
Analysis Workflow Automation Tool - streamlining analysis
Collaborative Workspace Development Tool - bringing researchers together"

The research repository is based around a simplified version of what they call "CCLRC Scientific metadata", which they said was the only appropriate set of metadata they could find at the time. Here's a simplified diagram:

This does look to me like the metadata for a Current Research Information System (CRIS); I've spoken of these before. Maybe the CERIF metadata might do a similar job? Later we heard that the ANDS programme was to use yet another metadata set or standard (ISO2146); OK guys, I got a bit confused at this point!

Clearly this is a generic system; they also have some customisations coming for crystallographers, including the wonderfully-named TARDIS (The Australian Repositories for Diffraction Images). It's not yet clear to me how domain science curation and domain-specific metadata fit into this model, but I hope to find out more next month.

So where next for ARCHER? David wrote:

"Currently testing the tools for release by late September
Some tools already out in the wild and in use
Expecting that the partners will continue to develop the tools they created
New enhanced versions already being worked on
Looking at how these tools might be used within ANDS (Australian National Data Service) & ARCS (Australian Research Collaborative Service) and beyond!"

This does look like good stuff!

(I think ARCHER is a collaboration of ANU, James Cook, Monash and Queensland Universities.)

A national data mandate? Australian Code for the Responsible Conduct of Research

Andrew Treloar pointed to this Code in a presentation at the e-Science All Hands meeting. All Australian Universities have signed up to the Code, which turns out to have a whole chapter on the management of research data and primary materials. It deals with the responsibilities of both institutions and researchers. Mostly it’s in the style “Each institution must have a policy on…”, but it then gets quite prescriptive on what those policies must cover. Here are some quotes, institutional responsibilities first:

“Each institution must have a policy on the retention of materials and research data.”

“In general, the minimum recommended period for retention of research data is 5 years from the date of publication.”

“Institutions must provide facilities for the safe and secure storage of research data and for maintaining records of where research data are stored.”

“Wherever possible and appropriate, research data should be held in the researcher’s department or other appropriate institutional repository, although researchers should be permitted to hold copies of the research data for their own use.”

On researcher responsibilities:

“Researchers should retain research data and primary materials for sufficient time to allow reference to them by other researchers and interested parties. For published research data, this may be for as long as interest and discussion persist following publication.”

“Research data should be made available for use by other researchers unless this is prevented by ethical, privacy or confidentiality matters.”

“Retain research data, including electronic data, in a durable, indexed and retrievable form.”

To discharge these obligations requires training in good curation practice, significant care, and appropriate infrastructure. Maybe these are regarded as yet more "un-funded mandates", to be treated on a risk assessment basis (will we be found out?). Maybe this does represent "the dead hand of compliance", as a senior colleague once phrased it. But if taken in the spirit in which it was written, it represents a significant mandate for data curation!

I don’t know of an equivalent Code elsewhere that is so specific. The nearest US equivalent may be the Introduction to the Responsible Conduct of Research, from the Office of Research Integrity, Dept of Health & Human Services. It tends to be a rather bland assembly of good advice, rather than anything prescriptive.

In the UK, the Research Integrity Office says it is “developing a code”, individual Research Councils have Ethics and related policies, while Research Councils UK is consulting in this area; their consultation closes on 24 October, 2008. It does have a section on Management and preservation of data and primary materials:

“… ensure that relevant primary data and research evidence are preserved and accessible to others for reasonable periods after the completion of the research. This is a shared responsibility between researcher and the research organisation, but individual researchers should always ensure that primary material is available to be checked. … Data should normally be preserved and accessible for not less than 10 years for any projects, and for projects of clinical or major social, environmental or heritage importance, the data should be retained for up to 20 years, and preferably permanently within a national collection, or as required by the funder’s data policy.”

The “normal” period is longer, but otherwise it is still not as specific, and therefore not as strong, as the Australian Code!
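To make the difference in "normal" periods concrete, here is a minimal sketch in Python of how the quoted minimum retention periods translate into dates. The rule names and the function names are my own inventions for illustration, not anything defined by either document. Note that the two clocks start at different points: the Australian Code counts from the date of publication, while the RCUK consultation text counts from the completion of the research.

```python
from datetime import date

# Hypothetical rule table: the keys are illustrative names, not official ones.
# The periods are taken from the passages quoted above.
RETENTION_YEARS = {
    "australian_code": 5,     # minimum recommended, from date of publication
    "rcuk_draft_normal": 10,  # "not less than 10 years", after completion
    "rcuk_draft_major": 20,   # "up to 20 years" for projects of major importance
}


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later (29 Feb falls back to 28 Feb)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # Only reached when start is 29 Feb and the target year is not a leap year.
        return start.replace(year=start.year + years, day=28)


def earliest_review_date(start: date, rule: str) -> date:
    """Earliest date at which disposal could even be considered under `rule`.

    `start` is the publication date for the Australian Code, and the
    completion date of the research for the RCUK draft (an assumption
    about how the quoted wording would be applied in practice).
    """
    return add_years(start, RETENTION_YEARS[rule])


if __name__ == "__main__":
    published = date(2008, 9, 19)
    for rule in RETENTION_YEARS:
        print(rule, earliest_review_date(published, rule).isoformat())
```

These are floors rather than targets: both documents leave room for much longer retention (the Australian Code speaks of "as long as interest and discussion persist", the RCUK text of "preferably permanently"), so an institution's own policy would sit on top of dates like these.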

Monday 15 September 2008

David de Roure on "the new e-Science"

I was at the eScience All Hands meeting last week, and unfortunately missed a presentation by David de Roure on the New e-Science, an update on a talk he gave 10 months ago. The slides are available on Slideshare, but David has agreed I can share his summary:

"1. Increasing scale and diversity of participation
Decreasing cost of entry into digital research means more people, data, tools and methods. Anyone can participate: researchers in labs, archaeologists in digs or schoolchildren designing antimalarial drugs. Citizen science! Improved capabilities of digital research (e.g. increasing automation, ease of collaboration) incentivises this participation. "You're letting the oiks in!" people cry, but peer review benefits from scale of participation too. "Long Tail Science"

2. Increasing scale and diversity of data
Deluge due to new experimental methods (microarrays, combinatorial chemistry, sensor networks, earth observation, ...) and also (1). Increasing scale, diversity and complexity of digital material, processed separately and in combination. New digital artefacts like workflows, provenance, ontologies and lab books. Context and provenance essential for re-use, quality and trust. Digital Curation challenge!

3. Sharing
Anyone can play and they can play together. Anyone can be a publisher as well as a consumer - everyone's a first class citizen. Science has always been a social process, but now we're using new social tools for it. Evidenced by use of wikis, blogs, instant messaging. The lifecycle goes faster, we accelerate research and reduce time-to-experiment.

4. Collective Intelligence
Increasing participation means network effects through community intelligence: tagging, reviewing, discussion. Recommendation based on usage. This is in fact the only significant breakthrough in distributed systems in the last 30 years. Community curation: combat workflow decay!

5. Open Research
Publicly available data but also the open services and software tools of open science. Increasing adoption of Science Commons, open access journals, open data and linked data (formerly known as Semantic Web), PLoS, ... Open notebook science

6. Sharing Methods
Scripts, workflows, experimental plans, statistical models, ... Makes research repeatable, reproducible and reusable. Propagates expertise. Builds reputation. See Usefulchem, myExperiment.

7. Empowering researchers
Increasing facility with new tools puts the researchers in control - of their software/data apparatus and their experiments. Empowerment enables creativity and creation of new, sharable methods. Tools that take away autonomy will be resisted. Beware accidental disempowerment! Ultimately automation frees the researcher to do what they're best at, but can also be disempowering.

8. Better not perfect
Researchers will choose tools that are better than what they had before but not necessarily perfect. This force encourages bottom-up innovation in the practice of research. It opposes the adoption of over-engineered computer science solutions to problems researchers don't know they have and perhaps never will.

9. Pervasive deployment
Increasingly rich intersection between the physical and digital worlds through devices and instruments. Web-based interfaces not software downloads. Shift towards devices and the cloud. REST architecture coupling components that transcend their application.

10. Standing on the shoulders of giants
e-Science is now enabling researchers to do some completely new stuff! As the pieces become easy to use, researchers can bring them together in new ways and ask new questions. Boundaries are shifting, practice is changing. Ease of assembly and automation is essential."

The presentation is well worth looking at as well for the extra material David includes. I thought the more open and inclusive approach to e-Science (or Cyber-infrastructure) was well worth including here. The word "heroic" appears on his slides in relation to the Grid, which sums up my concerns, I think!

Thursday 11 September 2008

Many citations flow from data...

I've been at the UK e-Science All Hands Meeting in Edinburgh over the past few days (easy, since it's being held in the building in which I work!). Lots of interesting presentations; far too many to go to, let alone blog about. But I can't resist mentioning one short presentation (PPT), from Prof Michael Wilson of STFC. His pitch was simple: publishing data is good for your career, especially now. And he has evidence to back up his claims!

Michael and colleagues put together a Psycholinguistic database many years ago, with funding from the MRC. At first (1981) it was available via postal request, then (1988) via ftp from the Oxford Text Archive, now via web access from STFC and from UWA (since 1994). Not much change over the years, little effort, free data, no promotion.

The database is now publicly available, eg at the link above. You may see that users are requested to cite the relevant paper:

Wilson, M.D. (1988) The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavioural Research Methods, Instruments and Computers, 20(1), 6-11.

The vital piece of evidence was a plot of citations over the years (data extracted from Thomson ISI, I hope they don't mind my re-using it):

You'll notice that citations flowed very slowly in the early days, picked up a little after ftp access was available (it takes a few years to get the research done and published), and then really started to climb after web access was provided. Now Michael and his colleagues are getting around 80 citations per year!

To ram home his point, Michael did some quick investigations, and found "At least 7 of the top 20 UK-based Authors of High-Impact Papers, 2003-07 ranked by citations ... published data sets".

There was some questioning on whether citing data through papers (rather than the dataset itself) was the right approach. Michael is clear where he is on this question: paper citations get counted in the critical organs by which he and others will be measured, so citations should be of papers.

Summary of the pitch: data publishing is easy, it's cheap, and it will boost your career when the new Research Evaluation Framework comes into effect.

Wednesday 10 September 2008

Curation in University mission!

The new Strategic Plan for the University of Edinburgh declares:

"The mission of our University is the creation, dissemination and curation of knowledge."

Result!

Monday 8 September 2008

OAIS revision moving forward?

Just over a year ago, in late August 2007, I wondered what was happening with the required review of the Open Archival Information System standard, which was announced in June 2006, and for which comments closed in October 2006. Well, there is at last some movement. Just recently, the DCC and the Digital Preservation Coalition received notice of the "proposed dispositions, with rationale, to the suggestions which your organisation sent in response to the request for recommendations for updates to the OAIS Reference Model (ISO 14721)". According to the email from John Garrett, Chair of the CCSDS Data Archiving and Ingest Working Group,

"If you have feedback on these proposed dispositions please email them as soon as possible, and by 30th November 2008 at the latest, [...]

A revised draft of the full OAIS Reference Model is expected to be available on the Web in January 2009. There will then be a period for further comment before submission to ISO for full review."

Although a fair few of the dispositions are "No changes are planned", a large number of changes are also proposed relating to the comments made by DCC/DPC. We have not yet had a chance to review them in detail, nor even to decide yet what the mechanism for this will be. But I am very much encouraged that progress is at last being made, and that more opportunities to interact with the development of this important standard will be available, even if it has not proved possible to find out the venue where the proposed changes have been discussed!
