Digital Curation Blog: August 2008

Thursday, 28 August 2008

Libraries - data centre session at December San Francisco conference!

Rajendra Bose, an erstwhile colleague from the Edinburgh Database group then working on database annotation as part of the DCC research activity, and now at Columbia, brought this opportunity to my attention, and it seems well worth while making more widely known. He writes:

Please consider submitting an abstract to a groundbreaking session involving library - data center collaboration at the 2008 American Geophysical Union (AGU) meeting during 15-19 December in San Francisco (http://www.agu.org/meetings/fm08/): The Library - Data Center Alliance in Earth and Space Sciences

Abstracts to this Session U08 (see below) must be submitted by 10 September 2008 at: http://submissions3.agu.org/submission/entrance.asp

Guest presenters tentatively include:
Christopher Fox, Director, National Geophysical Data Center, USA (confirmed)
Carlos Morais-Pires, European Commission eScience initiative (confirmed)
James Neal, Vice President for Information Services and University Librarian, Columbia University, USA (confirmed)
Lucy Nowell, National Science Foundation Office of Cyberinfrastructure, USA (invited)

Librarians, please note that AGU has just clarified that registration for the annual meeting for librarians will be the same as that for high school teachers. For the upcoming annual meeting in San Francisco December 15-19, 2008, that rate will be $40 for one day and $85 for 2 or more days.

We hope you can join us for this novel session; please contact co-conveners Mark Parsons (parsonsm@nsidc.org) or Rajendra Bose (rbose@columbia.edu) with any questions.

Session U08: The Library - Data Center Alliance in Earth and Space Sciences

Conveners: Mark A. Parsons, National Snow and Ice Data Center, and
Rajendra Bose, Columbia University Center for Digital Research and Scholarship

Description: Preserving, sharing, and understanding the diverse and growing collection of Earth and space science data and information require sustained commitment and diverse expertise. Recent reports from national and international scientific organizations increasingly emphasize professional and collaborative approaches to managing data and information, especially supporting interdisciplinary science. The electronic Geophysical Year (eGY) promotes this professional development and collaboration. In particular, eGY recognizes the conceptual alliance between today's research libraries and scientific data centers, and promotes partnerships, collaboration and even hybrids of these two types of enterprises to meet the Earth science informatics challenge.

Research libraries have a long, sustained, and respected role as curators of Earth science information and knowledge. Yet, in recent decades, scientific data centers have also played an increasingly important role in stewarding Earth science data and information. Libraries seek to extend their expertise to manage new forms of digital publication, including data. Data centers seek to develop sustained, long-term archival systems. It is apparent; the two communities should collaborate to achieve their complementary objectives.

This session aims to bring together members of both the research libraries and the data center communities to survey and compare approaches, philosophies, and long-term strategies for dealing with the problems of managing digital scientific data collections, and invites submissions regarding issues and approaches for archiving, serving, and curating such collections. An emphasis on support of interdisciplinary science is encouraged.

Thursday, 21 August 2008

How to make repositories a killer app for scientists

Cameron Neylon wrote a nice post indirectly addressing the question of how Nature might make Connotea more useful. It's well worth reading for its own merits, but I was so taken by his questions, I thought they might be re-purposed to apply to repositories. As Cameron says "These are framed around the idea of reference management but the principles I think are sufficiently general to apply to most web services". The text below is Cameron's, except with chunks taken out (shown by ellipses ...) and some of my text added [in brackets] (added text may replace some of Cameron's text).

"Any tool must fit within my existing workflows. Once adopted I may be persuaded to modify or improve my workflow but to be adopted it has to fit to start with. ...
Any new tool must clearly outperform all the existing tools that it will replace in the relevant workflows without the requirement for network or social effects. Its got to be absolutely clear on first use that I am going to want to use this instead of ...[some other tool, like a lab web site]. ... And this has to be absolutely clear the first time I use the system, before I have created any local social network and before you have a large enough user base for these to be effective.
It must be near 100% reliable with near 100% uptime. ... Addendum - make sure people can easily ... download their stuff in a form that will be useful even if your service disappears. Obviously they’ll never need to but it will make them feel better (and don’t scrimp on this because they will check if it works).
Provide at least one (but not too many) really exciting new feature that makes people’s life better. This is related to #2 but is taking it a step further. Beyond just doing what I already do better I need a quick fix of something new and exciting. ...
Prepopulate. Build in publically available information before the users arrive. ..."

I think what Cameron's saying is what we all should know by now of the needs of potential repository depositors: make it easier, make it better, make it worth my while.

It's the "how" that's tricky. I don't think this is impossible, but these targets are hard to hit. It's what the "negative click" Research Repository System attempts have all been about. They do need a re-think, however, if they are not to fall foul of Cameron's rule #1! So what other ideas are there out there, folks?

Wednesday, 20 August 2008

Adding value to data: eScience conference session

If you nearly had a paper ready for the International Digital Curation Conference in December (the closing date was the end of July), you may still have time to get one in to the special session "Adding value to data – Digital Repositories in the e-Science world" at the 4th IEEE International Conference on eScience, whose deadline has just been extended to 31 August. From the call for papers on the DReSNet blog:

"There is a great, untapped potential for synergies between grid/e-science technologies and a cluster of related systems addressing the management of digital assets in digital libraries and repositories. The digital material generated from and used by academic and other research is to an increasing extent being held in formal data management systems; these systems are variously categorized as digital repositories, libraries or archives, although the distinction between them relates more to the sort of data that they contain and the use to which the data is put, rather than to any major difference in functionality. In many cases, these systems are used currently to hold relatively simple objects, for example an institution’s pre-prints and publications, or e-theses. However, some institutions are beginning to use them to manage research data in a variety of disciplines, including physical sciences, social sciences, and the arts and humanities, as well as the output from various digitisation programmes."

This does look very interesting. They suggest several research challenges, which I think tie in pretty much with our own agenda, although the last is novel to me in the way it's framed:

Digital preservation and curation in research infrastructures
Interoperability
Security
Provenance
Metadata Extraction
Workflow Integration
Architecture of Participation

There's a good programme committee, so the session could be extremely useful. Go to their entry for the submission details...

Tuesday, 19 August 2008

Credit from citing datasets?

Cameron Neylon has a thought-provoking post on his Science in the Open blog, arising from a discussion he had at Scifoo with Michael Eisen of Berkeley:

"Michael [...] felt people got too much credit for datasets already and that making them more widely citeable would actually devalue the contribution. The example he cited was genome sequences. This is a case where, for historical reasons, the publication of a dataset as a paper in a high ranking journal is considered appropriate.

In a sense I agree with this case. The problem here is that for this specific case it is allowable to push a dataset sized peg into a paper sized hole. This has arguably led to an over valuing of the sequence data itself and an undervaluing of the science it enables. Small molecule crystallography is similar in some regards with the publication of crystal structures in paper form bulking out the publication lists of many scientists. There is a real sense in which having a publication stream for data, making the data itself directly citeable, would lead to a devaluation of these contributions. On the other hand it would lead to a situation where you would cite what you used, rather than the paper in which it was, perhaps peripherally described. I think more broadly that the publication of data will lead to greater efficiency in research generally and more diversity in the streams to which people can contribute."

This seems a strong argument to me. A paper describing a research dataset isn't really a research paper, surely? So it is a round peg in a square hole. But if the alternative is that the dataset can only be cited via a research paper (on some other, probably related topic) that mentions it in passing, then this is likely to be a rather poor proxy. The dataset creator may get little credit, and the research paper authors rather more credit, than either was due.

However, if we move to the situation where more datasets are cited directly, then not only do the dataset creators or providers get the credit that's due to them, but also the credit is the right kind. That is, the citation is recognisably for creating or providing a dataset and not for a specific research contribution. I suppose that papers in the Nucleic Acids Research special issue on datasets are also recognisable as creation/provision papers, not research papers, but few other disciplines have such easily recognisable distinctions. So overall, simply using data citations looks like a better bet. As Cameron says:

"So to come back around to the original point, the value of different forms of contribution is not due to the fact that they are non-traditional or because of the medium per se, it is because they are different."

Amen to that!

Friday, 15 August 2008

Influence the PoWR

I've been very interested in the JISC Preservation of Web Resources project, which is being undertaken jointly by UKOLN at Bath University and the ULCC Digital Archives department. Some good discussions have taken place on their blog. Now I hear from Marieke Guy that the 3rd PoWR workshop is to take place in Manchester on 12 September, 2008:

"The JISC-sponsored Preservation of Web Resources project (JISC-PoWR) will be running its third and final workshop in Manchester. The series of workshops is aimed at the UK HE/FE records management and Web management communities.

The workshop, entitled 'Embedding Web Preservation Strategies Within Your Institution', will be held from 10.30 am - 4pm on Friday 12th September 2008 at the Flexible Learning Space, University of Manchester. It is free to attend and open to all members of HE/FE Institutions and related HE and FE agencies although we may need to restrict the numbers per institution if we are over-subscribed.

The aim of the workshop is to gain and share feedback from institutional Web, information and records managers on the proposed JISC-PoWR handbook, to address ways of embedding the proposed recommendations into institutional working practices and to solicit ideas for further work which may be needed.

Further details and the booking form are available from the JISC-PoWR blog:
http://jiscpowr.jiscinvolve.org/workshops/workshop-3/

The booking deadline is Friday 5th September 2008 and places are limited."

That should be a must-attend for anyone interested in preservation and the web!

Monday, 11 August 2008

New issue of International Journal of Digital Curation

I am very pleased to announce the publication of Volume 3, Issue 1 of the International Journal of Digital Curation, at http://www.ijdc.net/

This is the largest issue so far, including 9 peer-reviewed papers and 8 articles. My thanks to all the contributors, and to Richard Waller for his excellent editorial work.

Papers (Peer-reviewed)

Evolving a Network of Networks: The Experience of Partnerships in the National Digital Information Infrastructure and Preservation Program. Martha Anderson
Toward Distributed Infrastructures for Digital Preservation: The Roles of Collaboration and Trust. Michael Day
Dataset Preservation for the Long Term: Results of the DareLux Project. Eugène Dürr, Kees van der Meer, Wim Luxemburg, Ronald Dekker
Curation of Laboratory Experimental Data as Part of the Overall Data Lifecycle. Jeremy Frey
Towards a Theory of Digital Preservation. Reagan Moore
Challenges and Issues Relating to the Use of Representation Information for the Digital Curation of Crystallography and Engineering Data. Manjula Patel, Alexander Ball
Defining File Format Obsolescence: A Risky Journey. David Pearson, Colin Webb
Data Documentation Initiative: Toward a Standard for the Social Sciences. Mary Vardigan, Pascal Heus, Wendy Thomas
Moving Archival Practices Upstream: An Exploration of the Life Cycle of Ecological Sensing Data in Collaborative Field Research. Jillian C. Wallis, Christine L. Borgman, Matthew S. Mayernik, Alberto Pepe

Articles

The DCC / Regional eScience Collaborative Workshop. Martin Donnelly
The DCC Curation Lifecycle Model. Sarah Higgins
What to Preserve?: Significant Properties of Digital Objects. Helen Hockx-Yu, Gareth Knight
Recycling Information: Science Through Data Mining. Michael Lesk
The Fit Between the UK Environmental Information Regulations and the Freedom of Information Act. Colin Pelton, Mark Thorley
Review: Scholarship in the Digital Age. Chris Rusbridge
Meeting Curation Challenges in a Neuroimaging Group. Angus Whyte, Dominic Job, Stephen Giles, Stephen Lawrie

Thursday, 7 August 2008

Repositories and the CRIS

As I mentioned in the previous post, there has been some discussion in the JISC Repositories task force about the relationship between repositories and Current Research Information Systems (CRIS). Stuart Lewis asserted, for example, that “Examples of well-populated repositories such as TCD (Dublin) and Imperial College are backed by CRISs.” So it seems worth while to look at the CRIS with repositories in mind.

There has been quite a bit of European funding of the CRIS concept; not surprising perhaps, as research funders would be significant beneficiaries of the standardisation of information that could result. An organisation (EuroCRIS) has been created, and has generated several versions of a data model and interchange standard (which it describes as the EC-recommended standard CERIF: Common European Research Information Format), of which the current public version is known as CERIF 2006 V1.1. CERIF 2008 is under development, and a model summary is accessible; no doubt many details are different, but it does not appear to have radically changed, so perhaps CERIF is stabilising. Parts of the model are openly available at the EuroCRIS web site, but other parts require membership of EuroCRIS before they are made available. This blog post is only based on part of the publicly accessible information.

I decided to have a look at the CERIF 2006 V1.1 Full Data Model (EuroCRIS, 2007) document, with a general aim of seeing how helpful it might be for repositories. Note this is not in any sense a cross-walk between the CERIF standards and those applicable to repository metadata.

Quoting from a summary description of the model:

“The core CERIF entities are Person, OrganisationUnit, ResultPublication and Project. Figure 1 shows the core entities and their recursive and linking relationships in abstract view. Each core entity recursively links to itself and moreover has relationships with other core entities. The core CERIF entities represent scientific actors and their main research activities.
Figure 1: CERIF Core Entities in Abstract View”

The loops on each entity in the diagram above represent the recursive relationships, so a publication – publication relationship might represent a citation, or one of a series, or a revision, etc. Entity identifiers all start with cf, which might help make the following extract more understandable:

“The core CERIF entities represent scientific actors (Persons and Organisations) and their main research activities (Projects and Publications): Scientists collaborate (cfPers_Pers), are involved in projects (cfProj_Pers), are affiliated with organisations (cfOrgUnit_Pers) and publish papers (cfPers_ResPubl). Projects involve people (cfProj_Pers) and organisations (cfProj_OrgUnit). Scientific publications are published by organisations (cfOrgUnit_ResPubl) and refer to projects (cfProj_ResPubl), publications involve people (cfPers_ResPubl), organisations store publications (cfResPubl), support or participate in projects (cfProj_OrgUnit), and employ people (cfPers_OrgUnit). To manage type and role definitions, references to classification schemes (cfClassId; cfClassSchemeId) are employed that will be explained separately…”
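
To make the shape of that extract a little more concrete, here is a minimal Python sketch of four core entity kinds joined by link entities, including a recursive publication-to-publication link. It is an illustration only, not the CERIF schema: the class names, the `role` field (standing in loosely for the classification references) and all the example values are my own assumptions.

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-ins for the four CERIF core entities.
CORE_ENTITIES = {"Pers", "OrgUnit", "ResPubl", "Proj"}


@dataclass(frozen=True)
class Entity:
    kind: str   # one of CORE_ENTITIES
    ident: str  # local identifier (illustrative)
    label: str


@dataclass(frozen=True)
class Link:
    source: Entity
    target: Entity
    role: str   # illustrative role label, e.g. "author" or "cites"

    @property
    def name(self) -> str:
        # Follows the naming pattern seen in the extract, e.g. cfProj_Pers
        return f"cf{self.source.kind}_{self.target.kind}"


class ResearchInfo:
    """A tiny in-memory store of link entities."""

    def __init__(self):
        self.links = []

    def link(self, source, target, role):
        for entity in (source, target):
            if entity.kind not in CORE_ENTITIES:
                raise ValueError(f"unknown entity kind: {entity.kind}")
        new_link = Link(source, target, role)
        self.links.append(new_link)
        return new_link

    def related(self, entity, kind):
        """Entities of the given kind linked to `entity`, in either direction."""
        found = []
        for l in self.links:
            if l.source == entity and l.target.kind == kind:
                found.append(l.target)
            elif l.target == entity and l.source.kind == kind:
                found.append(l.source)
        return found


# Made-up example data
alice = Entity("Pers", "p1", "A. Researcher")
paper = Entity("ResPubl", "r1", "A paper")
earlier = Entity("ResPubl", "r0", "An earlier paper")
project = Entity("Proj", "j1", "A project")

info = ResearchInfo()
info.link(alice, paper, "author")
info.link(project, paper, "output")
cite = info.link(paper, earlier, "cites")  # recursive ResPubl-ResPubl link
```

The point of the sketch is that the interesting information lives in the links rather than in the entities themselves, which is what lets a publication pick up its people and projects without anyone re-keying them.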

The model adds other “second-level” entities:

Figure 5: CERIF 2nd Level Entities in Abstract View

As Ian Stuart wrote in the repositories discussion:

“The current repository architecture we have is a full deposit for each item (though some Repositories jump round this with some clever footwork) - which runs straight into the "keystroke" problem various people have identified.

With a CRIS-like architecture, the user enters small amounts of [meta-]data, relevant to the particular thing - be it a research grant; a presentation at a conference; a work-in-progress; a finished article; whatever.... and links them together. It is these links that then allows fuller metadata records to be assembled.”

It should be clear that having this much information available should make populating a repository a relatively trivial task. It's not entirely clear how easily the metadata will transfer. It does appear that Dublin Core metadata has been used in the model, but whether this is mandatory or optional is not clear from this document. I suppose it’s also not clear how much the CERIF standard is used in actual implementations.
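
As a rough illustration of Ian's point about links allowing fuller records to be assembled, the Python sketch below builds a Dublin Core-flavoured record for an article from small, separately entered items. Everything in it is an assumption for the sake of the example: the item types, the field names, and in particular the mapping of a grant to the "relation" and "contributor" elements, which is my guess and not something taken from the CERIF document.

```python
from collections import defaultdict

# Small, separately entered items (all values are made up for illustration).
items = {
    "person:1": {"type": "person", "name": "A. Researcher"},
    "grant:7": {"type": "grant", "title": "Example grant", "funder": "Example funder"},
    "article:3": {"type": "article", "title": "Example article", "date": "2008"},
}

# Each link simply joins two item identifiers.
links = [("article:3", "person:1"), ("article:3", "grant:7")]


def assemble_record(item_id, items, links):
    """Build a Dublin Core-flavoured record by following links from one item."""
    item = items[item_id]
    record = defaultdict(list)
    record["title"].append(item["title"])
    if "date" in item:
        record["date"].append(item["date"])
    for a, b in links:
        if item_id not in (a, b):
            continue  # this link does not touch the item we are describing
        other = items[b if a == item_id else a]
        if other["type"] == "person":
            record["creator"].append(other["name"])
        elif other["type"] == "grant":
            # Assumed mapping: grant title as a relation, funder as a contributor
            record["relation"].append(other["title"])
            record["contributor"].append(other["funder"])
    return dict(record)


record = assemble_record("article:3", items, links)
```

The depositor in this picture only ever typed a title and a date for the article; the creator and grant details arrive through the links.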

I’ll leave the penultimate word to Stuart Lewis, who wrote:

“[The CRIS] doesn't replace the repository, but offers a management tool for research. Inputs to the CRIS can include grant record systems, student record systems, MIS systems etc, and outputs can include publication lists and repositories… Now we just need an open source CRIS platform... E.g.:

http://www.symplectic.co.uk/products/publications.html
http://www.atira.dk/en/pure/
http://www.unicris.com/lenya/uniCRIS/live/rims.html”

Note, I have not checked out any of the above products in any detail, although the 3rd one appears to be a non-responsive web site; use at your own risk! And it should be obvious that anything that links into MIS systems in any institution is likely to require a major implementation effort. It may also suffer from institutional silo effects (links to my institution’s management information, but not my collaborators’ information).

Reference:

EuroCRIS. (2007). CERIF2006-1.1 Full Data Model (FDM) Model Introduction and Specification (Recommended standard). http://www.dfki.de/%7Ebrigitte/CERIF/CERIF2006_1.1FDM/CERIF2006_FDM_1.1.pdf

Tuesday, 5 August 2008

Comments on Negative Click Research Repository System

You may remember that I wrote a series of posts about a Research Repository System, aiming to improve deposits by getting repositories to do more that’s useful to the researcher. I had suggested it should contain these elements:

web orientation
researcher identity management
authoring support
object disclosure control
data management support
persistent storage
full preservation archive, and
spinoffs

There has been some interesting commentary on this idea. Some of it was in the comments to this blog, but most came through the JISC Repository and Preservation Advisory Group’s use of Ideascale to discuss issues related to a possible revision of the Repositories Roadmap. I decided to post some of the ideas from the RRS posts to the Ideascale forum, and they attracted comments, both positive and negative, and votes, and I will try to summarise key points here. Note, I have only picked out posts and comments that I consider relevant to this approach; there were many other interesting ideas and comments that you might want to read if you care about repositories.

BTW if you haven’t tried it, I was very impressed by the combination of technologies that Ideascale have put together, and the simplicity and value of its approach, akin to brainstorming but slower! Worth giving a whirl…

Overall, the thrust of the whole conversation seemed to support my feeling that repositories must serve their depositors rather than down-stream users (the institution, the department, colleagues, the public). This links strongly to the theme behind the RRS set of ideas.

The most popular “idea” on the JISC Ideascale site was Andy McGregor’s, not mine, but was related to this theme: “Define repository as part of the user’s (author /researcher/ learner) workflow” (21 voted for, 2 against, net 19). While there was general support, there was some concern as well: ojd20 wrote “I agree with this idea in the sense of ‘take account of things users need to do’, but not in the sense that we can reduce the myriad functions of an HE institution to a small set of flow charts and design repositories to fit those… ‘Familiar processes and tools’ is a much better way of expressing it [than workflow]” The other warning, from Owen Stephens, was “I'm also wary of comments elsewhere from the likes of Peter Murray-Rust on this. He warns (my interpretation) that we need to be careful of doing this as if it complicates the workflow, it just won't happen”. A. Dunning echoed a common theme in saying “Disagree - the repository is a back end data provider that should not be part of the researchers' / learners' workflow - but the service which sits on top of the repository should be part of the workflow.”

Rachel Heery showed another common concern, about possible over-complexity:

“I think the RRS you envisage sounds fantastic and would be a 'good thing', what worries me is the 'function creep' taking us a few miles on from some of the more basic, simpler 'few keystrokes' approach to repositories. The RRS sounds to have many features of a Virtual Research Environment, albeit perhaps a less data centric VRE…”

This echoed a comment from Bill Roberts on one of the original blog posts:

“Seriously, though - the system you've sketched out is quite complicated, not so much in the individual parts of it, (though several of those parts are quite substantial) but in the design vision required to connect up all those features in a way that would be easy and natural to use. Can you identify a subset of parts that would make a useful version 1?”

And Steve Hitchcock also commented on complexity to the same post:

“The irony of the negative click philosophy is that it has led you to produce a visionary but significantly more complex system.”

The surprising (to me) winner among the RRS ideas was “Allow the user fine-grained disclosure/access control to repository objects”. This gained 12 positive and 2 negative votes, for a net 10 votes, and no negative comments. I don’t have a good explanation for why this was so popular, noting that the voters were mainly repository managers or “experts”, especially since some of the corollary ideas were much less popular.

Another idea posted by Rachel Heery, not from RRS but worth taking note of (and getting 11 votes, none against) was “We shouldn't be thinking of repositories as a place.” In particular this meant “'repositories' are best viewed as a 'type' of data store supporting a variety of services, embedded in various workflows.”

I’ll include the Researcher Identity Management idea next, not for its own votes (4 in favour, 2 against), nor for the comments (of which there were none), but because of the linkages I had made between researcher identity and Current Research Information Systems (CRIS). The idea associated with CRIS was from Ian Stuart, and got 13 positive and 4 negative votes: “Repositories are dead, long live repositories” (yes, ideas are not always best titled!). He wrote

“The current repository technology is library/cataloger centric: items are uploaded (usually by a cataloger, not the author), and most of the meta-data is added by a subject specialist. In this model, the author-as-depositor is (at best) just an initiator for a deposit process.

A better solution would be to move towards a Combined [sic] Research Information System [CRIS], where the academic can organise their areas of interest [AOI]; see the research grants they have (and associate them with their AOI); lodge keep-safe copies of work-in-progress, data-sets, talks, ideas for future work, posters, etc (and associate them with grants or AOIs).

From this corpus of data, the academic can indicate what is visible locally (within the research group/department/organisation) and what is available globally... and from that "globally available" pool, an "Institutional Repository" can be assembled.”

The idea that the repository should be more web-oriented got 6 votes. There were no negative comments on the idea itself, but I drew a comparison (echoing Andy Powell) with Slideshare, and Paul Walk reacted against this: “More like the Web - yes! Like Slideshare - no!” He expands on this comment later (see below).

The idea that the repository should be more Web 2.0-like drew 6 positive but 5 negative votes, net 1 vote, possibly the most divided idea of all! I guess it’s one I was much less sure of, but included it because of comments to my earlier posts on this theme. A. Dunning wrote: “It's not so much the repository itself that needs Web2.0 but the range of the services which exploit that data.” And Paul Walk wrote:

“The repository should be able to participate in an interactive Web. It should be entirely possible for someone else to build a remote Web 2.0 service around resource exposed in my repository. This does not preclude me building such a service - if I think I have sufficient mass of interested users etc. but if I *need* to do this because only I can, then I have probably just built another silo.

This is what I meant by my assertion that, in general, Technorati offers us a better model than Slideshare. Someone can build a better, more focussed, domain specific etc. version of Technorati if there is demand for this, without needing to move or copy the resources in their 'source repositories' (blogs, in this case).”

[Paul, I have to say I’m unconvinced so far. A repository has to have the stuff, so that makes it more like Slideshare. And for various reasons, Technorati sucks (searches don’t work properly, lots of things fall off or go wrong). If you’d said Google instead, I might have got it. But searching by reference rather than by inclusion seems like going back to Z39.50 rather than search engine harvesting approaches (or even OAI-PMH). But maybe I’m resolutely getting the wrong end of this stick!]

Richard Davis wanted a foot in both camps:

“IMO, ideally a repo would be susceptible to all sorts of Flickrish RSS, Web API type manipulation - SWORD-like and who knows what else - leading to total personalisation potential of the user experience.

OTOH, providing implementers and users with an acceptable out-of-the-box UI, that might include basic implementations of some voguish features, and widgety ways to configure them - cf. Flickr or Wordpress - is no bad thing either.”

The idea of providing persistent storage got 6 positive and 2 negative votes, net 4 votes. Paul Walk was generally happy:

“I agree that persistence, more or less as you define it here, should be a core property of what I think of as a 'source repository'. In terms of ease of use: read, or 'get' access should be simple and should be provided through HTTP unless there is a very good reason not to. We should aspire to ease of use in terms of ingest and administrative activities.

In terms of synchronisation with offline storage (unless I have misunderstood your point): this is an interesting area, but I'm not sure it is a fundamental or universal function of 'the repository'. I think we are going to see a strange race between the development of mechanisms to do this (e.g. GoogleGears) and the progress in us all becoming so ubiquitously and permanently connected that we don't need this any more…”

The idea of data management support attracted 4 votes, none negative, and no comments. I’m not sure how to interpret this… but it was not the core interest of the group, I guess. I continue to think it an important missing service that falls into repository-space.

I split authoring support into two ideas: publishing and authoring. Of these, “The repository/library should provide support in the publishing process” got 6 votes, none negative. There were some concerns, e.g. Owen Stephens again: “This needs restating from a user perspective.” He allowed that “a system to help streamline the editorial/publication process is fine (if we can persuade academics to use it), but I'm not sure I'd start by building it on the repository - maybe you could, but there are other approaches as well.” In particular, he wrote: “I see the point about the benefits, but if we are solving problems our end users don't feel need solving then we have a much more difficult job on our hands (and perhaps not substantially different from our current problems with getting researchers to put things into our repository).”

More direct support for authoring was more divisive (5 votes for and 3 against, net 2 votes). In support, Owen Stephens wrote “The availability of research via Open Access would increase if the same systems that provide Open Access also provided, or were integrated with, tools which support the authoring process.” There was a bit of so-what from Ian Stuart: “This is a variation of what Peter Murray-Rust proposed back in '07: google-docs mixed with a CRIS.” [Yes, Ian, explicitly this idea came from some posts by PMR.] And one outright warning from forkel: “my experience is, that feature requests of this sort are exactly the ones which end up in the "users didn't know what they wanted bin" (I'm a developer).”

Finally (almost) I posted the idea that “The repository should be a full OAIS preservation system”. This turned out to be extremely unpopular, with only 2 votes for and 6 against, net -4! I was very disappointed (repositories are surely not for transient stuff that might be here today, gone tomorrow), so re-phrased it as “Repository should aspire to make contents accessible and usable over the medium term”, when again it attracted 2 votes for and this time none against. This was very late in the day, so maybe few people were coming back to vote on new ideas by then. It was clear that part of the problem was the term OAIS; my mistake for assuming in that context that people would have understood the term.

So where does this leave the Research Repository System idea? Not fully intact, I fear, but not shredded, either. Reduce complexity, think services operating on a store, take care to ensure these are services people really want, etc… It’s going to take me a while to work this through, but I will get back to you!

Saturday, 2 August 2008

Morning at the Repository Fringe

I only managed the first couple of hours of the Edinburgh Repository Fringe event [two days ago... but Google went bonkers and decided this was a spam blog meanwhile!] but it was already great fun, and I can see that those who stay the course will have a great time, and learn heaps. Note to self: more conferences should be like this!

Caveat Auditor (she can’t really be Caveat Lector in the flesh) opened proceedings with plenty of controversy: the IR is dead, we did it! But Vive le IR, maybe it has signs of life, maybe it can provide real services to users (depositors) after all, if we all listen hard, think hard, and work together.

The second hour was divided into three 20-minute mini-sessions, each with 8 separate soapbox “performances” in little alcoves in the wonderful Playfair Library in Old College, Edinburgh University, where the event was being held. I went to 3: Les Carr from Southampton wondering if, and if so how, we might get the best out of the 3 or more different software platforms available (DSporaPrints?), rather than being locked in to one; how we might end the “platform arms race”. No clear answer but components and services seemed like useful ideas.

Richard Wright from the BBC (who have a particular problem with volume: soon they will be generating a Petabyte every few days... this is serious volume!) was putting together some of the recent work on error rates in large disk farms to warn us we must abandon that comfortable reliance on the underlying storage layer integrity (be afraid), together with the effects of those errors on different file types, particularly newer formats with high degrees of lossless compression. He showed and played to us some examples of un-damaged and damaged files in several different formats. I forget the exact figures, but a half percent error rate in a BMP file shows a smattering of black pixels, whereas in a GIF file there were serious artefacts and visible damage introduced. The same error rate on a WAV file produces a barely audible rustle effect, while on an MP3 file the sound is seriously distorted. The same error rate on a DOC or PDF file, and you get “File damaged, cannot open”. Be very afraid! Should we take many, many copies of compressed files, or fewer copies of uncompressed files?
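
For anyone who wants to get a feel for this effect at home, here is a small Python sketch. It is not Richard's experiment: it assumes errors are independent single-bit flips in about half a percent of the bytes, and it uses zlib compression as a stand-in for a compressed file format, applied to some made-up sample data.

```python
import random
import zlib


def corrupt(data: bytes, rate: float, seed: int = 0) -> bytes:
    """Flip one bit in roughly `rate` of the bytes (always at least one byte)."""
    rng = random.Random(seed)
    buf = bytearray(data)
    count = max(1, round(len(buf) * rate))
    for i in rng.sample(range(len(buf)), count):
        buf[i] ^= 1 << rng.randrange(8)
    return bytes(buf)


def surviving_fraction(original: bytes, damaged: bytes) -> float:
    """Fraction of byte positions that are still identical."""
    same = sum(a == b for a, b in zip(original, damaged))
    return same / len(original)


# Made-up "uncompressed file": a few thousand short text lines.
original = b"".join(b"sample %d\n" % i for i in range(5000))

# Damage the uncompressed bytes directly: most of the content survives.
raw_damaged = corrupt(original, 0.005)
raw_survival = surviving_fraction(original, raw_damaged)

# Damage the compressed bytes at the same rate, then try to read them back.
packed_damaged = corrupt(zlib.compress(original), 0.005)
try:
    zlib.decompress(packed_damaged)
    packed_readable = True
except zlib.error:
    packed_readable = False

print(f"uncompressed: {raw_survival:.1%} of bytes intact")
print(f"compressed copy readable: {packed_readable}")
```

In the uncompressed case roughly 99.5% of the bytes come through untouched, the equivalent of the smattering of black pixels. In the compressed case the decompressor gives up, which is the "File damaged, cannot open" experience in miniature.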

Dorothea Salo had a soapbox on data and e-science. Maybe the right paradigm here, she asserted, is the archivist rather than the librarian. As a stalking horse, the paradigm was the retiring Prof passing the university archivist’s office, saying “here are the keys to my office, anything there you want you can have”. The archivist may throw lots away, but it’s according to a well-established protocol. The horse falls down (my metaphor may be mixed here) because in the case of data it’s clearly much more important to capture the stuff while the Prof is still around (far too much tacit knowledge and un-documented context). Plus the whole thing about curation needing the domain scientist, not the generalist (so that’s about partnership, right?). Plus the whole rights and ownership and ethics and “I believe in Open Access - for everyone else” thing is re-doubled in spades for data. Now, I can’t be saying (I’m definitely not saying, honest) “Don’t waste your time with data”. I am saying, again I think, listen hard, think hard, and work together.

Well I had a great time and so did everyone else that I could see. Clearly you can’t hold a Repository Fringe anywhere else, so I hope I will see you in Edinburgh next year!

BTW... this is the 100th post on the Digital Curation Blog!
