
Monday, 9 March 2009

Repository preservation revisited

Are institutional repositories set up and resourced to preserve their contents over the long term? Potentially contradictory evidence has emerged from my various questions related to this topic.

You may remember that on the Digital Curation Blog and the JISC-Repositories JISCmail list on 23 February 2009, I referred to some feedback from two Ideas (here and here) on the JISC Ideascale site last year, and asked 3 further questions relating to repository managers’ views of the intentions of their repositories. Given a low rate of response to the original posting (which asked for votes on the original Ideascale site), I followed this up on the JISC-Repositories list (but through oversight, not on the blog), offering the same 3 questions in a Doodle poll. The results of the several different votes appear contradictory, although I hope we can glean something useful from them.

I should emphasise that this is definitely not methodologically sound research; in fact, there are methodological holes here large enough to drive a Mack truck through! Nevertheless, we may be able to glean something useful. To recap, here are the various questions I asked, with a brief description of their audience, plus the outcomes:
a) Audience: a JISC-selected “expert” group of developers, repository managers and assorted luminaries. The second Idea was put to the same audience, a little later.
  • Idea: “The repository should be a full OAIS [CCSDS 2002] preservation system.” Result: 3 votes in favour, 16 votes against, net -13 votes.
  • Idea: “Repository should aspire to make contents accessible and usable over the medium term.” Result: 13 votes in favour, 1 vote against, net +12 votes.
b) Audience JISC-Repositories list and Digital Curation Blog readership. Three Ideas on Ideascale, with the results shown (note, respondents did not need to identify themselves):
  • My repository does not aim for accessibility and/or usability of its contents beyond the short term (say 3 years). Result: 2 votes in favour, none against.
  • My repository aims for accessibility and/or usability of its contents for the medium term (say 4 to 10 years). Result: 5 votes in favour, none against.
  • My repository aims for accessibility and/or usability of its contents for the long term (say greater than 10 years). Result: 8 votes in favour, 1 vote against, net +7 votes.
A further comment was left on the Digital Curation Blog, to the effect that since most repository managers were mainly seeing deposit of PDFs, they felt (perhaps naively) sufficiently confident to assume these would be useable for 10 years.

c) Audience: JISC-Repositories list. Three exclusive options on a Doodle poll, with the exact wording as in (b) and no option to vote against, with the results shown below (note, Doodle asks respondents to provide a name and most did, with affiliation, although there is no validation of the name supplied):
  • My repository does not aim for accessibility and/or usability of its contents beyond the short term (say 3 years). Result: 1 vote in favour.
  • My repository aims for accessibility and/or usability of its contents for the medium term (say 4 to 10 years). Result: 0 votes in favour.
  • My repository aims for accessibility and/or usability of its contents for the long term (say greater than 10 years). Result: 22 votes in favour.
I guess the first thing to notice is the differences between the 3 sets of results. The first would imply that long term is definitely off the agenda, and medium term is reasonable. The second is split roughly 50-50 between long term and the short/medium term combination. The third is overwhelmingly in favour of long term (as defined).

By now you can also see at least some of the methodological problems, including differing audiences, differing anonymity, and differing wording (firstly in relation to the use of the term “OAIS”, and secondly in relation to the timescales attached to short, medium and long term). So, you can draw your own conclusions, including that none can be drawn from the available data!

Note, I would not draw any conclusions from the actual numerical votes on their own, but perhaps we can from the relative values within each group. However, ever hasty if not foolhardy, here are my own tentative interpretations:
  • First, even “experts” are alarmed at the potential implications of the term “OAIS”.
  • Second, repository managers don’t believe that keeping resources accessible and/or usable for 10 years (in the context of the types of material they currently manage in repositories) will give them major problems.
  • Third, repository managers don’t identify “accessibility and/or usability of its contents for the long term” as implying the mechanisms of an OAIS (this is perhaps rather a stretch given my second conclusion).
So, where to next? I’m thinking of asking some further questions, again of the JISC-Repositories list and the audience of the Digital Curation Blog. However, this time I’m asking for feedback on the questions before setting up the Doodle poll. My draft texts are:
  • My repository is resourced and is intended to keep its contents accessible and usable for the long term, through potential technology and community changes, implying at least some of the requirements of an OAIS.
  • My repository is resourced and is intended to keep its contents accessible and usable unless there are significant changes in technology or community, ie it does not aim to be an OAIS.
  • Some other choice, please explain in free text…
Are those reasonable questions? Or perhaps, please help me improve them!

This post is made both to the Digital Curation Blog and to the JISC-repositories list...

OAIS: CCSDS. (2002). Reference Model for an Open Archival Information System (OAIS). Retrieved from http://public.ccsds.org/publications/archive/650x0b1.pdf.

Monday, 23 February 2009

Repositories and preservation

I have a question about how repository managers view their role in relation to long term preservation.

I’m a member of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (hereafter BRTF). At our monthly teleconference last week, we were talking about preservation scenarios, and I suggested the Institutional Repository system, adding that my investigations had shown that repository managers did not (generally) feel they had long term preservation in their brief. There was some consternation at this, and a question as to whether this was based on UK repositories, as there was an expressed feeling that US repositories generally would have preservation as an aim.

My comment was based on a number of ad hoc observations and discussions over the years. But more recently, in an analysis of commentary on my Research Repository System ideas, I reported on discussions that had taken place on Ideascale last year, during preparatory work for a revision of the JISC Repositories Roadmap.

In this Ideascale discussion, I put forward an Idea relating to Long Term preservation: “The repository should be a full OAIS preservation system”, with the text:
“We should at least have this on the table. I think repositories are good for preservation, but the question here is whether they should go much further than they currently do in attempting to invest now to combat the effects of later technology and designated community knowledge base change...”
See http://jiscrepository.ideascale.com/akira/dtd/2276-784. This Idea turned out to be the most unpopular Idea in the entire discussion, now having gathered only 3 votes for and 16 votes against (net -13).

Rather shocked at this, I formulated another Idea, see http://jiscrepository.ideascale.com/akira/dtd/2643-784: “Repository should aspire to make contents accessible and usable over the medium term”, with the text:
“A repository should be for content which is required and expected to be useful over a significant period. It may host more transient content, but by and large the point of a repository is persistence. While suggesting a repository should be a "full OAIS" has not proved acceptable to this group so far, investment in a repository and this need for persistence suggest that repository managers should aim to make their content both accessible and usable over the medium (rather than short) term. For the purposes of this exercise, let's suggest factors of around 3: short term 3 years, medium term around 10 years, long term around 30 years plus. Ten years is a reasonable period to aspire to; it justifies investment, but is unlikely to cover too many major content migrations.

“To achieve this, I think repository management should assess their repository and its policies. Using OAIS at a high level as a yard stick would be appropriate. Full compliance would not be required, but thought to each major concept and element would be good practice.”
This Idea was much more successful, with 13 votes for and only one vote against, for a net positive 12 votes. (For comparison, the most popular Idea, “Define repository as part of the user’s (author/researcher/learner) workflow” received 31 votes for and 3 against, net 28.)

Now it may be that the way the first Idea was phrased was the cause of its unpopularity. It appears that the 4 letters OAIS turn a lot of people off!

So, here are 3 possible statements:

1) My repository does not aim for accessibility and/or usability of its contents beyond the short term (say 3 years)
(http://jiscrepository.ideascale.com/akira/dtd/14100-784 )

2) My repository aims for accessibility and/or usability of its contents for the medium term (say 4 to 10 years)
(http://jiscrepository.ideascale.com/akira/dtd/14101-784 )

3) My repository aims for accessibility and/or usability of its contents for the long term (say greater than 10 years).
(http://jiscrepository.ideascale.com/akira/dtd/14102-784 )

Could repository managers tell me which they feel is the appropriate answer for them? Just click on the appropriate URI and vote it up (you may have to register, I’m not sure).

(ermmm, I hope JISC doesn’t mind my using the site like that… I think it’s within the original spirit!)

(This was also a post to the JISC-Repositories list)

Friday, 13 February 2009

Repositories and the web

There has been an interesting series of posts on repository architectures, implementations and implications, by Andy Powell on the eFoundations blog. He started by wondering if multiple deposit diminished Googlejuice, and when I used an article of mine in comments to illustrate that Google did cope better than expected (in some cases, at least, as we were to discover), he had a look at the relevant DSpace-based repository and reported his horror at what he saw. A later post looking at an ePrints-based repository showed continuing concern, although slightly less so. Both posts attracted multiple comments, including an interesting extempore sketch by Herbert Van de Sompel of an OAI-ORE-based solution to one identified problem.

The discussion then kicked across into a closed JISC repository discussion list. It's not fair to report that conversation here, but I can at least quote from one of my comments:
"Part of Andy's original point (he's being saying this a lot), was that the repository platforms we use (DSpace, ePrints et al) are not good web-natives. He did a second post looking at [an] ePrints-based site [...], which showed some improvement over DSpace, but some related issues.

[...] Let's remember the list of problems that Andy spotted (in only a cursory examination; there may be more). He's at the DSpace splash page for my item:
  • a) PDF instead of (or not as well as) HTML for the deposited item
  • b) confusion over 4 different URIs
  • c) wrong Title, so Delicious etc won't work well
  • d) no embedded metadata in the HTML
  • e) no cross-linking of keywords (cf Flickr tags)
  • f) ditto author, publisher
  • g) unhelpful link text for PDF and additional material etc.
I think if those responsible for these repositories were building them today as web sites, they would not have constructed them with (all of) those deficiencies. (Some are, I think, matters of design choice, or at least in the vanguard of web thinking, eg (e).) But the repository platform brings (or is intended to bring) advantages of workflow, scalability, additional features such as OAI-PMH if it is actually useful, etc. It's supposed to save the repository manager the effort of building a purpose-built web site, and to provide some consistency across different repositories, as opposed to the general confusion of University web sites.

So one approach, as I suggested in an earlier email, is simply to invest to improve the repository platforms, so that they are better web natives, using up to date technologies, and so present a more appropriate web presence. Many of those issues could be solved with some software effort, benefiting everyone who uses DSpace (and similar effort could improve ePrints etc). Just add money and time.

But the multi-URI thing is a bit more of a problem; it sounds like the repository model we use is conceptually broken here. Herbert's solution sounds like a major design change for the repository world. Move from depositing items in a FRBR sense (I know Andy has some issues with this) to depositing something much more akin to works, instantiated as Resource Maps. [A sketch of what such a Resource Map might look like follows at the end of this post.] I think this is sufficiently different that there would be some serious debate about it. It certainly sounds like it would make sense, although I'm not sure how much good it will do in Googlejuice terms (Google dropped the OAI-PMH maps, so there's no huge reason to believe they will be interested in OAI-ORE Resource Maps per se, although this may not be the issue).

It would be very difficult for repository managers to craft those Resource Maps by hand in today's world (although clearly Oxford is doing SOMETHING with them). However, it does sound like the sort of thing that repository platforms could do fairly automatically for you. But again there would be significant development work involved. So in this case, we first have an effort to refine Herbert's sketch to a shared view across the repository world, followed by an effort to implement in major platforms. So add money twice, and more time.

Of course, Herbert's probably built it by now!

So my proposal to JISC would be:
  • invest in upgrading the common UK repository platform software for better web-nativeness (!)
  • invest in an effort along the lines of previous SWAP/CRIG approaches to get consensus in the Resource Map approach
  • invest in upgrading the common UK repository platform software to support this approach."
I'm pretty sure some of these issues are also part of what reduces the usefulness of these platforms for data sharing.
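
To make the Resource Map idea slightly more concrete, here is a minimal sketch of my own (not Herbert's design; all URIs are hypothetical), using Python's rdflib, of a map that gathers an item's several URIs into a single aggregation:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

# Hypothetical URIs for one repository item.
rem = URIRef("http://repo.example.ac.uk/rem/123")          # the Resource Map document
agg = URIRef("http://repo.example.ac.uk/aggregation/123")  # the abstract "work"
splash = URIRef("http://repo.example.ac.uk/handle/123")    # the splash page
pdf = URIRef("http://repo.example.ac.uk/bitstream/123/article.pdf")

g = Graph()
g.bind("ore", ORE)
g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
g.add((agg, DC.title, Literal("Example article")))
g.add((agg, ORE.aggregates, splash))
g.add((agg, ORE.aggregates, pdf))

print(g.serialize(format="turtle"))

The point of the sketch is simply that the splash page and the PDF become parts of one named aggregation, which is the thing you would cite and crawl, rather than four competing URIs.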

Thursday, 5 February 2009

A National Research Data Infrastructure?

Two weeks without a post? My apologies! How about this piece of speculation that has been brewing at the back of my mind for some time...

It is clear that a national research data infrastructure is needed, but there are problems with all of the approaches taken so far to address this. Subject data centres provide subject domain curation expertise, but there are scalability issues across the domain spectrum: it appears unlikely that research funders will extend their funding to a much larger set of these data centres (indeed the AHDS experience might suggest a concern to cut back). Institutional data repositories are being explored, but while disclosing institutional data outputs might provide sustainability incentives, and such data repositories might be managed at a storage level by developments from existing institutional library/archive and IT support services, it is difficult to see how domain expertise can be brought to bear from so many domains across so many disciplines. Meanwhile, various of the studies done by UKOLN/DCC with Southampton University suggest the value of laboratory or project repositories in assisting with curation in a more localised context.

To square this circle, perhaps we have to realise that the storage infrastructure and the curation expertise are orthogonal issues. It is reasonable to suggest that institutions, faculties, departments, laboratories or projects should manage data repositories, databases etc with varying degrees of persistence. But in terms of the curation of the data objects (ie aspects of appraisal, selection, retention, transformation, combination, description, annotation, quality etc), somehow expertise from each domain as a whole has to be brought to bear. It is tempting to think of this as a parallel with the "editorial board" function of a "virtual data journal". This would clearly only be scalable if it were managed across the sector, rather than individually for each data repository.

So we might suggest a federation of repositories on the one hand, and a collective organisation (or set of differing collective organisations) of curation expertise in different disciplines or domains on the other hand; the latter is referred to below as national curation mechanisms.

In such a system, we might see roles for some of the main stakeholders as follows:
  1. Research funders define their policies and mandates, and their compliance mechanisms; Research Information Network to participate and assist?
  2. Publishers likewise define their own policies and mandates with regard to data supporting publication; JISC and RIN could assist coordination here.
  3. Research Institutions define their own policies, establish local research data infrastructure, and encourage appropriate researchers to participate in national curation mechanisms (could participation in these become an element of researcher prestige as membership of editorial boards currently is?).
  4. Researchers are responsible for their own good practice in managing and curating their data (possibly in laboratory or project data repositories or databases), and where appropriate for participating in national curation mechanisms.
  5. Subject/discipline domain data centres are responsible for managing and curating data in their domain, as required by their funders. They also assist in defining good practice in their domains and more widely, undertake community proxy roles (NSB, 2005), and participate in national curation mechanisms.
  6. A number of bodies such as the Digital Curation Centre and the proposed UK Research Data Service could undertake some coordination roles in this scenario, and could also undertake good practice dissemination and skills development through knowledge exchange activities.
(NSB. (2005). Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century. Retrieved from http://www.nsf.gov/pubs/2005/nsb0540/)

What do you think? Is that at all plausible?

Tuesday, 30 December 2008

Gibbons, next generation academics, and ir-plus

Merrilee on the hangingtogether blog, with a catchup post about the fall CNI meeting, drew our attention to a presentation by Susan Gibbons of Rochester, on studying next generation academics (ie graduate students), as preparation for enhancements to their repository system, ir-plus.

I shared a platform with Susan only last year in Sydney, and I was very impressed with their approach of using anthropological techniques to study and learn from their users' behaviour. In fact, her Sydney presentation probably had a big influence on the ideas for the Research Repository System that I wrote about earlier (which I should have acknowledged at the time).

Now their latest IMLS-funded research project takes this even further, as preparation for collaborative authoring and other tools integrated with the repository. If you're interested in making your repository more relevant to your users, the report on the user studies they have undertaken so far is a Must-Read! I don't know whether the code at http://code.google.com/p/irplus/ is yet complete...

Monday, 30 June 2008

Repository Fringe

Robin Rice passed on this announcement about Edinburgh's newest festival, the Repository Fringe. She writes: "the event is being jointly planned by Edinburgh and Southampton to coincide with the 'preview week' before the opening of the Edinburgh Festival Fringe. In the spirit of the Fringe, it's a kind of an 'unconference' and we're encouraging people to sign up to participate in various ways on the wiki, e.g. by leading a 'performance' of some sort - a soapbox, a group improv, an Audience With, or as part of the Poster Bazaar." She also tells me that Dorothea Salo from Caveat Lector is giving the keynote. Here is the original email...
"Enjoy fresh debate and face-to-face discussion at Edinburgh's newest and entirely unofficial festival - the Repository Fringe! We would like to invite all repository researchers, managers, administrators & developers to come to Edinburgh for a 2-day workshop on 31st July - 1st August.

The event is free to attend - you just need to register and find your own way up to Edinburgh.

The event will be an informal opportunity to:
  • discuss new and emerging repository issues
  • generate new ideas and perspectives
  • show off some new project results
  • meet up with like-minded repository managers, administrators, librarians and developers
Research data, social networking, bibliographic services or desktop integration? Whatever your interest and problems, come and give a presentation, kickstart some activity or work on a demonstration.

Put this event in your diary and sign-up online to book your place. Visit the wiki to see the latest programme or register your interest.

Online Registration - www.regonline.co.uk/63349_631697J
Accommodation booking - http://www.repositoryfringe.org/accommodation.html
Website http://www.repositoryfringe.org/
Wiki - http://wiki.repositoryfringe.org/

Hope to see you in Edinburgh!

Theo Andrew
Digital Library Section
Library & Collections
The University of Edinburgh
www.era.lib.ed.ac.uk"

Sunday, 15 June 2008

Reaction to Negative Click Repositories

I was very pleased with the reactions to the Negative Click Repositories post. I’ll come back to the idea itself in a later post, but this just attempts to gather some of the comments together (the first few are comments on the original post, but I know my blog reader doesn’t expose those easily).

Owen from Imperial writes
“We clearly need to look at ways of adding value (for the researcher) to anything deposited, as well as minimising the barriers to deposit (perhaps following up on the suggestion of integrating 'repositories' into the authoring process).”
Chris Clouser reported they had been struggling with similar ideas:
“We tried to come up with some way to identify what it was, and eventually settled on the phrase "repository-as-scholarly-workbench." It's not perfect, but it worked for us.”
Gavin Baker wrote:
“I'm very interested in ways to integrate deposit with the normal workflow of authors. One thought would be to set up a virtual drive and mount it throughout campus (and give authors instructions on how to mount it from home). Then, when the author is done composing the article, they could simply save a copy to their folder on the virtual drive. Whatever was saved in these folders would be queued for deposit in the IR. (This is already how many people use FTP/SSH to upload files to their Web space.)

"Thinking ahead, could you set up autosave in MS Word, OpenOffice, etc. to save each revision to the virtual drive? This would completely integrate deposit in the workflow, requiring no additional steps for deposit, but would need the author to approve the "final" draft for public access.”
Moving to blog reactions, Pete Sefton reported on the OR08 work on “zero click deposit” (which I should have given some credit to). He suggests (perhaps tongue in cheek) that I’m unfair in asking for more. He also linked back to earlier thoughts on Workflow 2.0:
“But in a Workflow 2.0 system the IR might not need to have a web ingest. Instead there could be an ingest rule that would fire when a certain context was detected, such as a document attaining a certain state of development, which would be reflected in its metadata. That's the 'data as the driving force'... In this case it's metadata that's doing the driving.”
Next Pete talked about his participation in a JISC project (with Cambridge and some others):
“In TheOREM we’re going to set up ICE as a ‘Thesis Management System’ where a candidate can work on a thesis which is a true mashup of data and document, aka a datument… When it’s done and the candidate is rubber-stamped with a big PhD, the Thesis Management System will flag that, and the thesis will flow off to the relevant IR and subject repositories, as a fully-fledged part of the semantic web, thanks to embedded semantics and links to data.”

“One more idea: consider repository ingest via Zotero or EndNote. In writing this post I just went to the OR08 repository and used Zotero to grab our paper from there so I could cite it here. It would be cool to push it to the USQ ePrints with a single click.”
Then under the heading of “Net-benefit repository workflows” he cautions
“I think it is a mistake to focus on the Negative Click repository as a goal and assume that it’s going to happen without changing user behavior…
If we can seed the change by empowering some researchers to perform much better – to more easily write up their research, to begin to embed data and visualizations, to create semantic-web-ready documents – then their colleagues will notice and make the change too.”
Les Carr had (as usual) some sensible remarks:
“I think I would rather talk about value or profit - the final outcome when you take the costs and benefits into consideration. Do you run a positive value repository? Is it frankly worth the effort? Are your users in scholarly profit, or are you a burden on their already overtaxed resources?”
Les was a bit worried I had taken the Ulysses Acqua scenario too literally. Don't worry, Les, I do get it as fiction, just a particularly apposite one in many cases, if not for your repository. Les concludes
“Anyway, I think that I am in violent agreement with Chris, so to show solidarity I will do what he asked and list some positive value generators: publicity and profile (CVs, Web pages, displays, adverts for MScs/PhDs/staff), community discovery, laptop backup and asset management.”
Thanks, Les.

Meanwhile Chris on the Logical Operator has a long, thoughtful post. Go read it! Of the “negative click repository” idea, he writes:
“It’s a laudable goal, but I’m not sure how one would, in practice, implement such a thing, since even the most transparent approach has a direct impact on workflow (the exact thing that most potential contributors appear not to want to mess with - they don’t want to change the way they work). The challenge is to create something that:

* Integrates with any workflow
* Involves no (or essentially no) extra activities on the part of the scholar
* Uses the data structure of the material to create the interrelations critical to really opening access
* Provides a visible and obvious value-add that can be explained without looking under the hood
* Is not just a wrapper on some existing tool”
Chris also likes the idea of the “scholarly workbench” approach, and thinks this might include:
“Storage: the system (or aggregate) would need to store copies of the relevant digital object in such a way that it could be reliably retrieved by any user. Most any system - Writeboard, SlideShare, even wiki software have this capability.

Interoperation: the system would need to allow other systems to locate said material by hooks into the system. Many W2.0 applications allow access directly to their content via syndication.

Metadata: tagging is nearly universal across W2.0 applications - it’s how people find things, the classic “tag-based folksonomy” approach.

Sharing: again, this is the entire point of most W2.0 applications. It’s not fun if you can’t share stuff. Community is the goal of most of the common web applications.

Ease of use: typically, Web 2.0 applications are designed to get you up and running quickly. You do very little to establish your account, and there are no complex permissions or technical fiddling to be done.”

Last but one (for this post, anyway), Nico Adams has some more thoughts on Staudinger’s Molecules. Again, the whole post is worth a read; I have not yet looked at the piece of technology he is playing with:
“I have been wondering for a while now, whether the darling of the semantic web community, namely Radar Networks’ Twine, could not be a good model for at least some of the functionality that an institutional repo should/could have. It takes a little effort to explain what Twine is, but if you were to press me for the elevator pitch, then it would probably be fair to say that Twine is to interests and content, what Facebook is to social relationships and LinkedIn to professional relationships.”
There’s a lot more about Twine, including some screen shots, so go have a read.

Finally, a post that was not a reaction, but might have been. I should have given credit to Andy Powell, who has been questioning the current implementations of repositories (despite being joint author of the JISC Repositories Roadmap [.doc]). Andy wants repositories to be more consistent with the web architecture. He spoke at a Talis workshop recently; his slides are here (on Slideshare, one of his models for a repository). Cameron Neylon on Science in the Open was also at the event, and blogged about Andy’s talk. He also wrote:
“But the key thing is that all of this [work to deposit] should be done automatically and must not require intervention by the author. Nothing drives me up the wall more than having to put the same set of data into two subtly different systems more than once. And as far as I can see there is no need to do so. Aggregate my content automatically, wrap it up and put it in the repository, but I don’t want to have to deal with it. Even in the case of peer reviewed papers it ought to be feasible to pull down the vast majority of the metadata required. Indeed, even for toll access publishers, everything except the appropriate version of the paper. Send me a polite automated email and ask me to attach that and reply. Job done.

For this to really work we need to take an extra step in the tools available. We need to move beyond files that are simply ‘born digital’ because these files are in many ways still born. This current blog post, written in Word on the train, is a good example. The laptop doesn’t really know who I am, it probably doesn’t know where I am, and it has no context for the particular Word document I’m working on. When I plug this into the Wordpress interface at OpenWetWare all of this changes. The system knows who I am (and could do that through OpenID). It knows what I am doing (writing a blog post) and the Zemanta Firefox plug-in does much better than that, suggesting tags, links, pictures and keywords.

Plugins and online authoring tools really have the potential to automatically generate those last pieces of metadata that aren’t already there. When the semantics comes baked in then the semantic web will fly and the metadata that everyone knows they want, but can’t be bothered putting in, will be available and re-useable, along with the content. When documents are not only born digital but born on and for the web, the repositories will probably still need to trawl and aggregate. But they won’t have to worry me about it. And then I will be a happy depositor.”
Well Cameron, we want your content! So overall, there’s something here. I like the idea of positive value better than negative clicks, but I haven’t got a neat handle on it yet. Scholar’s workbench doesn’t really thrill me; I see what they mean, but the term is too broad, and scientists don’t always respond to the scholar term. I’ll keep thinking!

Tuesday, 10 June 2008

Negative click repositories

I wanted to write a bit more about this idea of a negative click repository (negative cost was a bad name, as there is a real positive $ and £ cost, borne by the repository rather than the depositor). First some ancient history...

When I joined the University of Glasgow in 2000, the Archives and Business Records Centre, with other collaborators within the University, was near the end of a short project on Effective Records Management (ERM, http://www.gla.ac.uk/infostrat/ERM/). During the course of that project, they surveyed committee clerks (who create many authoritative institutional records) on how much effort they were willing to put in, how many clicks they were willing to invest, to create records that would be easily maintainable in the digital era. The answer was: zero, none, nada! Rather than give up at this point, the team went on to create CDocS (http://www.gla.ac.uk/infostrat/ERM/Docs/ERM-Appendix2.pdf), an instrumented addition to MS Word that allowed the committee clerks to create their documents in university standard forms with agreed metadata; the documents and metadata were automatically converted into XML for preservation and to HTML for display and sharing. ICE (see below) might be a contemporary system of a related kind, in a slightly different area. Thanks to James Currall for updating me on ERM and CDocS.

In April 2007, Peter Murray-Rust had an epiphany thinking about repositories on the road to Colorado, realising that SourceForge was a shared repository that he had been using for years, and speculating that it might be used for writing an article. The tool for managing versions and sharing in SourceForge is SVN… Peter wrote about the complex workflow in writing a collaborative article, but then wrote:
“BUT using SVN it’s trivial - assuming there is a repository. So we do not speak of an Institutional Repository, but an authoring support environment (ASE or any other meaningless acronym. ) A starts a project in institutional SVN. B joins, so do C, D, E, etc. They all edit the m/s. Everyone sees the latest version. The version sent to the publisher is annotated as such (this is trivial). All subsequent stuff is tracked automatically. When the paper is published, the institution simply liberates the authorised version - the authors don’t even need to be involved. The attractive point of this - over simple deposition - is that the repository supports the whole authoring process.”
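For concreteness, here is a rough sketch of my own of that flow in plain SVN commands, driven from Python purely for illustration; the repository URL and paths are hypothetical:

import subprocess

REPO = "https://svn.example.ac.uk/ir/article-42"

def svn(*args):
    subprocess.run(["svn", *args], check=True)

# A starts the project; B, C, D, E just check out and edit.
svn("checkout", f"{REPO}/trunk", "article-42")

# Annotate the version sent to the publisher by tagging it.
svn("copy", f"{REPO}/trunk", f"{REPO}/tags/submitted",
    "-m", "Version sent to publisher")

# On publication, the institution liberates the authorised version.
svn("export", f"{REPO}/tags/submitted", "published-copy")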
Many of those who left comments disagreed that the technology would work directly as suggested, for various reasons. Google Docs was mentioned as an alternative (still flawed). Peter Sefton mentioned ICE (and Murray-Rust subsequently visited USQ to work briefly with the ICE team).

You may also remember Caveat Lector’s series of personae representing stakeholders in the repository game at the fictional Achaea University, which I reported on before. Ulysses Acqua was her repository manager, and here’s a quote from his attempts to explain the advantages of his repository to faculty; they ask:
“Can it produce CVs or departmental-activity reports automatically? No. Can it be tweaked so that the Basketology collection looks like the Basketology website? No. (The software can do that, in fact, but Ulysses can’t.) Can it talk to campus IT’s file-storage-cum-website servers? No. Can it harvest faculty articles from disciplinary repositories? No. Can it deliver records straight to Achaea Library’s catalogue? No. Can it have access controls per item, such that items are shared with specific people only, with the list controlled by the depositor? No. Can it embargo items, for a certain length of time or indefinitely? No. Can it read a citation, check rights on the journal, and go fetch the paper if rights are cleared? Dream on. Can it restrict items for campus-only access by IP address? No. Does it talk to RefWorks and Zotero and similar bibliographic managers? No. Does it do version control? No.”
The problem here is that the repository adds work; it doesn’t take it away (there are other examples of this in some of the other personae). And overloaded people don’t accept extra work. They may promise to, but they (mostly) don’t do it.

Finally, I posted earlier on Nico Adams’s comments on repositories for scientists. He got stuck, he said: "I had to explain what a repository is from the point of view of functionality - when talking to non-specialist managers, it is the only way one can sell and explain these things…they do not care about the technological perspective…the stuff that’s under the hood. I found it impossible to explain what the value proposal of a repository is and how it differentiates itself in terms of its functionality from, say, the document management systems in their respective companies." It’s worth repeating a couple of extracts from his conclusions:
"Now today it struck me: a repository should be a place where information is (a) collected, preserved and disseminated (b) semantically enriched, (c) on the basis of the semantic enrichment put into a relationship with other repository content to enable knowledge discovery, collaboration and, ultimately, the creation of knowledge spaces. "
And further:
"Now all of the technologies for this are, in principle, in place: we have DSpace, for example, for storage and dissemination, natural language processing systems such as OSCAR3 or parts of speech taggers for entity recognition, RDF to hold the data, OWL and SWRL for reasoning. And, although the example here, was chemistry specific, the same thing should be doable for any other discipline. … Yes it needs to be done in a subject specific manner. Yes, every department should have an embedded informatician to take care of the data structures that are specific for a particular discipline. But the important thing is to just make the damn data do work!"
So we need to develop repositories that make the data work in order to take human work away: negative click repositories. Or maybe not… (sorry about this). I’m just a bit concerned that we might get a set of monumental edifices built on DSpace or ePrints.org foundations, resulting in “take it or leave it” decisions. Institutional information environments are highly tailored (not always carefully), and at the department level even more eclectic, and things have to fit together. Maybe, as Nico was suggesting, what we need is an array of tools, connected together by technologies like Atom/RSS and/or OAI-ORE, that can be configured to link the components into an information management system which works to reduce the publishing effort on campus, and captures the intellectual product on the way.
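
To illustrate that loosely-coupled approach, here is a minimal sketch of my own using the Python feedparser library: a tool that polls a researcher's feeds and queues anything new for deposit. The feed URLs are made up, and the "seen" set stands in for real persistent state:

import feedparser

FEEDS = [
    "http://blog.example.ac.uk/jsmith/atom.xml",
    "http://data.example.ac.uk/lab42/recent.rss",
]

seen = set()  # a real tool would persist this between runs

def poll():
    for url in FEEDS:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            key = entry.get("id") or entry.get("link")
            if key and key not in seen:
                seen.add(key)
                print(f"New item for deposit: {entry.get('title', key)}")

poll()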

People appear to have lots of ideas on what negative click repositories might do. We could have tools for supporting the scientist in their information sharing (web sites, bibliographies and CVs). Tools for shared data management. Tools for shared writing. Tools even for the library to support faculty in dealing with publishers. And of course tools to help management count their beans. I’d like to begin to collect and order these and other suggestions if possible, so please leave comments or tag your blogs with “negative click”…

Wednesday, 4 June 2008

The negative cost repository, and other archive services

I've been at a meeting of research libraries here in Philadelphia these past two days; a topic that came up a bit was the sorts of services that libraries might offer individuals and research groups in managing their research collections. I was reminded of my post last year about internal Edinburgh proposals for an archive service. Subsequently it struck me that there is quite a range of services that could be offered by some combination of Library and IT services; I mentioned some of these, and there seemed to be some resonance. There could well be more, but my list included:
  • a managed current storage system with "guaranteed" backup, possibly related to the unit or department rather than individual
  • a "bit bucket" archive for selected data files, to be kept in some sense as a record (perhaps representing some critical project phase) for extended periods, probably with mainly internal access (but possibly including access by external partners, ie "semi-internal"). Might conflate to...
  • a data repository, which I would see as containing all or most data in support of publication. This would need to be static (the data supports the publication and should represent it), but might need to have some kind of managed external access. This might extend to...
  • a full-blown digital preservation system, ie with some commitment to OAIS-type capabilities, keeping the data usable. As well as that we have the now customary (if not very full)...
  • publications repository, or perhaps this might grow to be...
  • a managed publications system providing support for joint development of papers and support for publication submission, and including retention & exposure of drafts or final versions as appropriate.
I really like the latter idea, which I have seen various references to. Perhaps we could persuade people to deposit if the cost of deposit was LESS than the cost of non-deposit. The negative-cost repository, I like that!

Monday, 21 April 2008

Institutional Repository Checklist for Serving Institutional Management

DCC News (http://www.dcc.ac.uk/, news item visible on 21 April 2008) draws our attention to this interesting paper:
Comments are requested on a draft document from the presenters of the "Research Assessment Experience" session at the EPrints User Group meeting at OR08. The "Institutional Repository Checklist for Serving Institutional Management" lists 13 success criteria for repository managers to be aware of when making plans for their repositories to provide any kind of official research reporting role for their institutional management.
Find out more (Note there are at least 3 versions, so comments have already been incorporated).

I liked this: "The numeric risk factors in the second column reflect the potential consequences of failure to deliver according to an informal nautical metaphor: (a) already dead in the water, (b) quickly shipwrecked and (c) may eventually become becalmed."

This kind of work is important; repositories have to be better at being useful tools for all kinds of purposes before they will become part of the researcher's workflow...

Wednesday, 16 April 2008

Thoughts on conversion issues in an Institutional Repository

A few people from a commercial repository vendor visited UKOLN last week to talk about their product and the services they offer. This was a useful opportunity to explore the issues around using commercial repository solutions rather than developing a system in-house, which is where most of my experience with institutional repositories has lain to date.

One thing that particularly caught my interest was the fact that their service accepts deposits from authors in the native file format. These are then converted to an alternative format for dissemination; in the case of text documents, for example, this format is PDF. The source file is still retained, but it’s not accessible to everyday users. The system therefore stores two copies of the file – the native source and a dissemination (in this case PDF) representation.
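
As I understand it, the ingest step would look something like the following minimal sketch (my own reconstruction, not the vendor's code). It assumes LibreOffice is available to do the text-to-PDF conversion, and all the paths are hypothetical:

import shutil
import subprocess
from pathlib import Path

def ingest(deposit: Path, store: Path) -> None:
    item = store / deposit.stem
    (item / "source").mkdir(parents=True, exist_ok=True)
    (item / "dissemination").mkdir(parents=True, exist_ok=True)

    # 1. Preserve the deposited source file untouched.
    shutil.copy2(deposit, item / "source" / deposit.name)

    # 2. Derive the access (dissemination) copy; failures should be
    #    flagged for human checking rather than silently ignored.
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(item / "dissemination"), str(deposit)],
        check=True,
    )

ingest(Path("/tmp/deposits/article.doc"), Path("/var/ir/store"))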

This is a pretty interesting feature and one that I haven’t come across much so far in other institutional repository systems, particularly in-house developments. But, as with every approach, there are pros and cons. So what are the pros? Well, as most ‘preservationists’ would agree, storage of the source file is widely considered to be A Good Thing. We don’t know what we’re going to be capable of in the future, so storing the source file enables us to be flexible about the preservation strategy implemented, particularly for future emulations. Furthermore, it can also be the most reliable file from which to carry out migrations: if each iterative migration results in a small element of loss, then the cumulative effects of migrations can turn small loss into big loss (a 1% loss at each of ten successive migrations compounds to roughly a 10% loss overall); the most reliable file from which to start a migration, and so minimise the effect of loss, is therefore the source file.

However, there are obvious problems (or cons) in this as well. The biggest contender is the simple problem of technological obsolescence that we’re trying to combat in the first place. Given that our ability to reliably access file contents (particularly those in proprietary formats) is under threat with the passage of time, there’s no guarantee that we’ll be able to carry out future migrations from the source file if we don’t also ensure we have the right metadata and retain ongoing access to appropriate and usable software (for example). And knowing which software is needed can be a job in itself – just because a file has a *.doc suffix doesn’t mean it was created using Word, and even if it was, it’s not necessarily easy to figure out a) which version created the file and b) what unexpected features the author has included in the file that may impact on the results of a migration. This latter point is an issue not just in the future, but now.
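
That identification problem is easy to demonstrate. Here is a minimal sketch using the python-magic library (the path is hypothetical): the extension tells you one thing, content-based detection may tell you another, and a mismatch is exactly the case to flag before attempting any migration:

from pathlib import Path

import magic  # the python-magic library

path = Path("/tmp/deposits/article.doc")
claimed = path.suffix                             # what the name suggests
detected = magic.from_file(str(path), mime=True)  # what the bytes suggest

print(f"extension says {claimed}, content says {detected}")
if detected != "application/msword":
    print("flag for manual inspection before migration")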

Thinking about this led me to consider the issue of responsibility. It’s not unreasonable to think that by accepting source formats and carrying out immediate conversions for access purposes, the repository (or rather the institution) is therefore assuming responsibility for checking and validating conversion outcomes. If it goes wrong and unnoticed errors creep in from such a conversion, is the provider (commercial or institutional) to blame for potentially misrepresenting academic or scientific works? Insofar as immediate delivery of objects from an institutional repository goes, this should - at the very least - be addressed in the IR policies.

It’s impossible to do justice to this issue in a single blog post. But these are interesting issues – not just responsibility but also formats and conversion – that I suspect we’ll hear more and more about as our experience with IRs grows.


Friday, 4 April 2008

Adding Value through SNEEPing

I'm really quite taken by the SNEEP plug-in for ePrints that was showcased at OR08. It enables users to add their comments/annotations to an item stored in an eprints repository. These annotations are then made public and items can have numerous annotations added by different users. I don't know if there's a limit...? This is a great example of one way that value can be added to data collections in the context of digital curation - though admittedly the value of the added comments and annotations will be debatable! SNEEP can be downloaded from the SNEEP eprints installation at ULCC.

Friday, 20 July 2007

Subject "versus" institutional repositories

There's a concept in maths called "closed and unbounded". I'm not sure it's exactly to the point (I hope that's a pun), but "subjects" seem a bit like that. You can be pretty sure about most of the stuff that's not in a subject (or "domain"), and most of the stuff that is in it, but you can be very puzzled about some of the edges, and can find yourself in some extremely surprising discussions at times about parts of subjects that challenge most of the ideas you had. So subjects turn out to be very un-bounded. (They also tend to fracture, productively.) Perhaps not surprisingly, subjects don't tend to have assets, bank balances, etc. You might say, in those senses, subjects don't exist! They do nevertheless have very real approaches, common standards, ontologies, methods, vocabularies, literatures... and passionate adherents spread across institutions.

Institutions on the other hand, or at least universities, tend to be very material. They do have assets, bank balances, policies, libraries, employees, continuity on a significant scale, even (in the US at least) endowments. They have temporal stability and mass. They collect scholars and scientists in various domains... even if the scientists give their loyalty to their subjects, and are held together only by salaries and a common loathing of the university car parking policy!

Institutions have continuity, and they have libraries, and archives, which in serious ways express that continuity. Libraries are not about print. Libraries are now squarely about knowledge and information expressed in data, whether they know it or not. And the continuity of valuable data is an important reason for libraries to be involved.

But institutions are generic, and libraries are generic, even in more focused institutions like MIT. The library, the archive, the IR, in different ways, are about collecting elements of the scholarly discourse that contribute both globally and locally. So institutional repositories are about generic continuity of data, as libraries are about continuity of collections. IRs create value for the institution, even if it is only a small piece of value (like most other individual "collections" in an institution). If you don't play, you aren't in the game. You know data has value, just not which bits. You need to disclose your scholarly assets, across the spectrum; you can feel proud of doing so, and make a case for local benefit at the same time. You are an institution taking part in a global system; the value may be in the network, but you are part of that network.

But the way an IR treats data is necessarily generic; if you get data from chemistry, engineering, social sciences and performing arts into this under-funded but potentially valuable repository, you will do your best but it will necessarily be variants of generic practice, at best.

So back to the "subject"; if there is a data repository here, it is likely staffed by "domain experts", capable of taking on a "community proxy" role. They know their stuff. They will treat their data in domain-specific ways; they will know where to seek out data to complement their collection, they will know how to make connections between different parts. They can describe it appropriately, they can develop standards with their colleagues. They will know how to help their colleague scientists extract maximum value. Some subject repository managers are seriously concerned about the problems for disciplines if institutional repositories expand into the data "space".

What subject repositories don't usually have is what institutions have: substantial assets, endowments, bank balances, tenured staff. They are usually built around multiple project grants; 5-year core funding is a prized goal, won at the price of cheese-paring budgets and mid-term reviews every second year. Subject repositories don't have assured continuity, temporal mass.

The NSB LLDDC (Long-lived digital data collections) report, and now the NSF CyberInfrastructure strategy, are aimed at this area; they have spotted the fragility of these subject data collections. In the UK we have possibly even more of a patchwork of funding mechanisms than was observed in the LLDDC report. JISC used to be a significant funder of subject repositories, but in recent years has been retrenching from them, while building up massive funding in IRs. AHRC, as we have seen, is pulling back from funding the AHDS.

So what would make this better? I'd like to see a substantive discussion about the roles and funding mechanisms of subject and institutional repositories. In the UK, this would have to involve at least the Research Councils, Wellcome Trust and JISC. (Perhaps looks less likely than it did when I first wrote this.)

Secondly, I'd like to see JISC in the final tranche of its capital funding (here's the recently closed circular) explore the bounds of what's possible with the data provider/service provider combination (maybe OAI-ORE will address this a little? Maybe not!). And what if curation is detached from the repository? What if data continuity/preservation is separated from the curation service? Do these questions even make sense?

Maybe a system or federation of sustainable IRs, internally divided into sets on subject lines (and hence externally aggregatable along those lines), with subject-oriented curation activities picking up on "invisible college" volunteerism, might work? Splitting curation into generic and domain elements... Or other notions, pushing the skills out into the network, the federation, but retaining the data where the assets and continuity lie?

[This posting is based on an email I sent to a closed JISC Repositories advisory group some time ago; it seems even more relevant today...]

Wednesday, 18 July 2007

Arts and Humanities Data Service... next steps?

In an earlier post, I mentioned the decision by the AHRC (and later JISC) to cease funding the AHDS from March 2008. Since then the AHRC have re-affirmed their decision. On 28 June, the Future Histories of the Moving Image Research Network made public an open letter to the AHRC, to no avail it would appear.

In her response to the announcements, Sheila Anderson (Director of AHDS) wrote:
"In the meantime, and at least until 31st March 2008, the AHDS will continue to give advice and guidance on all matters relating to the creation of digital content arising from or supporting research, teaching and learning across the arts and humanities, including technical and metadata standards and project management. If you have a data creation project, please do not hesitate to contact us for advice.

The AHDS will continue to work with those creating important digital resources to advise on the best methods for keeping these valuable resources available and accessible for the long-term in a form that encourages their further use for answering new research questions, and their use in teaching and learning. This advice will include exploring with content creators and owners suitable repositories in which they might deposit their materials for long term curation and preservation, and how to ensure that their materials can continue to be discovered and used by the wider community. If you are currently in negotiation with the AHDS to deposit your digital collection, please continue to work with us to ensure the future sustainability and accessibility of your resource.

The AHDS will continue to make available its rich collection of digital content for use in research, teaching and learning, and to preserve those collections in its care. The AHDS intends to discuss with the JISC and the AHRC the long term future of these collections beyond April 2008 with the intention of securing their continued preservation and availability."
This is a great expression of commitment, and deserves our support. However, the lack of long-term funding must raise questions of sustainability.

Explicit in the AHRC's decision was the view that the community is mature enough to manage its own resources. There is doubt in many people's minds about this, but we are effectively stuck with it. So what are the implications? There are implications both for existing collections and for future arts and humanities resources. I would like to spend a few paragraphs thinking about the existing collections.

AHDS is not monolithic; it comprises several separate services (I suspect in what follows I may be using historical rather than current names). We already know that AHRC is privileging the Archaeology Data Service (ADS), which will continue to receive some funding (and which has a diverse funding base), so their resources are presumably safe. The History Data Service (HDS) resources are embedded within the UK Data Archive (which has recently received an additional 5 years' funding from ESRC); it would presumably cost more to de-accession those resources than to keep preserving them and making them available, so even if HDS can take no more resources, the existing ones should be safe. Literature, Language and Linguistics is closely related to the Oxford Text Archive; I imagine the same kinds of arguments would apply there.

I have heard suggestions that Kings College London might continue to support the AHDS Executive for a period, and it appears there are some discussions with JISC about some kind of support "to ensure the expertise and achievements of the AHDS are not lost to the community".

That leaves Performing Arts and Visual Arts; I can't even surmise what their future might be, since I don't know enough about their funding and local environment.

I appreciate that it's still early days, and no doubt crucial discussions are going on behind the scenes. But if any part of AHDS resources are in danger of loss, the resource owners need to consider plans to deal with those resources in the future. This will take some time, particularly since for more complex resources, it is clear that existing repositories are generally NOT yet adequate for purpose. I guess the picture itself will be complex; I can think of at least these categories:
  • Some resources still exist outside of AHDS, and no action may be needed.
  • Some resources will not be felt worth re-homing.
  • Some resources can be re-homed in the time and funding institutions have available.
  • Some resources should be re-homed, provided that time and funding are provided by some external source (this might be for developments on an institutional repository; it might be for work on the resource to fit a new non-AHDS environment).
I tried to find out what resources were deposited by the University of Edinburgh. It's not an easy search, but ignoring a bunch of stuff where the University was only peripherally mentioned, I came up with:
  1. HOTBED (Handing on Tradition By Electronic Dissemination) pa-1028-1
  2. Lemba Archaeological Project arch-279-1
  3. Gateway to the Archives of Scottish Higher Education (GASHE) exec-1003-1
  4. Avant-Garde/Neo-Avant-Garde Bibliographic Research Database lll-2503-1
  5. Survey of Scottish Witchcraft, 1563-1736 hist-4667-1
  6. National Sample from the 1851 Census of Great Britain hist-1316-1
On further research, HOTBED appears to be an RSAMD project that ended in 2004; however, a Google search produced a web site (http://www.hotbed.ac.uk/) that did not appear to be functional this afternoon. GASHE still exists as a working resource, so far as I can tell, and in any case is run from Glasgow. Lemba is archaeology, and so maybe safe. The Avant-Garde bibliography still exists separately. The two history resources seem safe.

Are we OK? Is there more? Who knows! I think we need much better tools to tell what is "at risk", so that plans can start being made. Of course, this could be happening, maybe I'm just not "in the loop".
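Even something crude would be a start. Here is a minimal sketch of an availability checker, assuming nothing more than a hand-made list of resource URLs (the list and the function names are my own illustration; a real survey would start from the AHDS catalogue records for one's own institution):

```python
# A minimal "at risk" checker -- a sketch only, assuming a hand-made list of
# resource URLs. Reachability is a crude proxy: a 200 response says nothing
# about whether a resource is still curated, and a failure may be temporary.
import requests

RESOURCES = {
    "HOTBED (pa-1028-1)": "http://www.hotbed.ac.uk/",
    # ...further entries from the AHDS catalogue would go here
}

def check(url: str, timeout: float = 10.0) -> str:
    """Return a rough availability status for one resource URL."""
    try:
        response = requests.head(url, timeout=timeout, allow_redirects=True)
        if response.status_code < 400:
            return f"reachable (HTTP {response.status_code})"
        return f"AT RISK? (HTTP {response.status_code})"
    except requests.RequestException as exc:
        return f"AT RISK? ({type(exc).__name__})"

for name, url in RESOURCES.items():
    print(f"{name}: {check(url)}")
```

Of course, a checker like this only flags candidates for investigation; deciding which resources are genuinely at risk still needs the catalogue knowledge that the AHDS itself holds.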

Will AHRC consider bids for funding transitional work? I certainly hope so, although I don't know how this might be done. JISC is (I believe) planning one last round of its Capital Programme. Will they include provisions to enhance repositories so as to take these more complex resources? I certainly hope so!

European e-Science Digital Repository Consultation

Philip Lord wrote to tell me that he and Alison Macdonald are conducting a study for the European Commission, “Towards a European e-Infrastructure for e-Science Digital Repositories” (e-SciDR) – see www.e-SciDR.eu. This is a short study to summarise the situation regarding repositories in Europe and to propose policies to the Commission for repository development in Europe. As part of the study process the Commission is hosting a public consultation through a questionnaire... The letter inviting participation follows:
"Dear Sir, Dear Madam,

May I invite you, as key stakeholders, to contribute to the development of a knowledge society and digital infrastructure in Europe, by taking part in the Commission’s online public consultation on e-Science Digital Repositories which is available at http://ec.europa.eu/yourvoice/ipm/forms/dispatch?form=eSciDR."
[NB Safari on the Mac appears not to work with this questionnaire, but Firefox does.]

"This consultation forms a key part of the e-SciDR study funded by the Commission into repositories holding digital data and publications for use in the sciences (in the widest sense encompassing disciplines from the humanities and social sciences to the life sciences).

Your answers will help identify needs, priorities and opportunities which the European Union, through the Commission, can help address and drive forward in the FP7 Capacity Programme and will provide an important input to developing future policy initiatives.

I would be grateful if you could respond to the consultation by no later than 30 July 2007.

All answers will be strictly confidential and anonymised.

If you would like to receive a summary of the consultation results, please tick the corresponding box on the questionnaire.

Best regards,

Mário Campolargo

Head of Unit GÉANT & e-Infrastructure"

Wednesday, 6 June 2007

Arts and Humanities Data Service decision

On 14 May 2007, the UK Arts and Humanities Research Council (AHRC) issued a rather strange press release, announcing that “AHRC Council has decided to cease funding the Arts and Humanities Data Service (AHDS) from March 2008” and at the same time announcing a change to its conditions of grant relating to deposit, removing the condition that material be offered for deposit in AHDS, although “Grant holders must make materials they had planned to deposit with the AHDS available in an accessible depository for at least three years after the end of their grant”.

This is a strange decision from a number of points of view. The first is the way it has been announced: the short period of notice (a call for proposals was open at the time, closing in June), and the sudden-death cutoff of funds. Yes, there was a funding crisis, but the speed of the decision makes any sensible exit strategy for the AHDS difficult or impossible to work through… and it doesn’t appear that the AHRC is going to help create one.

Secondly it is strange because the Arts and Humanities (I’ll shorten to “the Arts” for simplicity) represent discipline areas that use and rely on a wide variety of resource types. Unlike in many of the sciences, journals do not represent the most significant part of output or input in the Arts and Humanities. The monograph has always been important as a research output, but is now under threat through the pressure on the library cost base from the journals pricing crisis, and must change (and the discipline with it) or die. One way it is changing is to become virtual, and in the process to become richer… just the sort of complex resource the AHDS was set up to capture and curate. At the same time, inputs have come from a wide variety of resources, extending well beyond libraries to include museums, galleries, archives, theatres, dance halls, street corners… wherever the expressive power of human creativity is apparent. Many of the resources studied have been very poorly represented in traditional monographs, and can be much better represented for critical discussion in a rich digital form.

Thirdly it’s strange in its assumption that other “accessible depositories” (a term few of us have heard before; we think they might mean institutional repositories) can take up the missing role. One of the problems for the AHDS has always been the result of what is described in my previous paragraph: the resources offered are sometimes very complex web sites, based on (sometimes arbitrary) use of a wide variety of technologies, and the capture and curation of these has been a challenge which has required the AHDS to build up a considerable skill base. This skill base is just not available to the operators of UK institutional repositories, which so far have struggled to ingest mostly simple text objects (eprints). They will not easily fill this gap; they simply cannot ingest the sort of resources the AHDS has been dealing with, given their current and likely future resources. Only 5 of them even claim to include datasets, and few of those actually do; a slightly larger number claim to support “special” items, but again nothing complex is apparent on inspection. So the alternatives available to Arts researchers are few (or none).

Oddly, the AHRC has not yet (I believe) dropped the condition that the AHDS must help applicants with a “Technical Annex”, intended to ensure that created resources are easier for the AHDS to manage once they become available. Will the AHRC expect the AHDS to continue to do this once it is no longer funded by them? If not, who else will? Is there any conceivable way that institutional repositories could take up this role? Not obviously to me!

Fourthly, it is strange in its new assumption that three years post-grant is a sufficient requirement: this in a discipline area that uses resources from hundreds of years ago!

Finally (for now), this decision is worrying because of the precedent it sets. Subject-based data centres have one great advantage: they can be staffed to provide the domain knowledge that is essential for managing and preserving that subject’s or discipline’s data resources. Institutional repositories by contrast are necessarily staffed by generalists, often librarians, who will struggle to deal with the wide variety of data resources they might be required to manage in widely differing ways. IRs have always had (to my mind) a sustainability advantage: they represent the intellectual output of an institution (good promotional and internal value), and they sit within the institutional knowledge infrastructure. Even if one institution decides to drop its IR, the academic knowledge base as a whole is not fatally damaged in any particular discipline. Subject data centres by contrast have always suffered from the disadvantage that their future was uncertain; their sustainability model has always been weak. Now that disadvantage has moved from potential to stark reality in the Arts and Humanities, one of the discipline areas most requiring access to a wide variety of outputs from the past.

If this decision worries you, there is a petition for UK citizens at http://petitions.pm.gov.uk/AHDSfunding/, which I would urge you to sign. We also need to think hard about ways in which we can deal with the effects of the decision. It is difficult to challenge a major funder, but there are times when it needs to be done. We have to get this right!

[I should have added that Seamus Ross, who is responsible for a small part of the AHDS at Glasgow, is a colleague on the DCC Management Team. However, this posting is made in a personal capacity; while I believe my views may be shared, I have not attempted to get agreement from colleagues on the sentiments expressed.]

JISC Repositories conference day 1

Yesterday and today I am at the JISC Repositories Conference in Manchester. This turns out to be (at least in my small section, and the two plenaries so far) a much more interesting event than I expected. There has been a useful focus on the fringes of the repository movement, such as the role of data, as well as an interesting re-exploration of what repositories are, and what they are for.

The keynote was from Andy Powell of Eduserv (his slides are available on Slideshare), talking about some work that he and Rachel Heery did for JISC on a repositories roadmap. Drawing an interesting parallel with today’s GPS systems, which will give you a new route if you go wrong, he wanted to look at the recommendations they had made, one year later, and see if they still stood up. By and large, the answer seemed to be that they did; however, he still had criticisms. Partly this related to an under-estimation of the role of technology in general and the web in particular, against an over-estimation of the role of policy and advocacy. The repository movement is much less successful than Web 2.0 style “equivalents” (sort of) like Flickr, Slideshare, Scribd etc. Why? He felt it was because the technology of the latter was so much better; it afforded (not his word) what people wanted to do in a much better way than typical repositories do. So in response he wanted us to re-think repositories in terms of the web architecture, and to re-think scholarly communication as a social activity (in Web 2.0 terms). So we must build services that users choose to use, not try to make them use services even though they don’t work well for them!

BTW I had trouble with my presentation, done in PowerPoint on a Mac; I couldn’t connect the Mac to the conference centre projector, nor could I get either of the Windows machines to read the file on my memory stick. I mention this because, inspired by Andy, I registered with Slideshare and in a matter of minutes had managed to upload the presentation and have it converted into the Slideshare system. Not until after my (PowerPoint-free) presentation, however!

The second keynote was from Keith Jeffery of STFC (CCLRC until March 2007). He said many sensible things about the relationship between eprint repositories and data repositories (keep them separate, they are trying to do different things), and the relationship of both to the CRIS (Current Research Information System). The latter is needed for research management and evaluation, the RAE and so on. The interesting thing is that it contains a large amount of contextual information which, if captured automatically, can make it much easier to create useful metadata cheaply. Building metadata capture into the end-to-end workflow is one of my interest areas, because the high cost of metadata is a significant barrier to the re-use of data, so I was pleased to hear this stated so coherently. Moreover, he was implying that by using the CERIF data model (which originated in EU-funded work), you can tie all these entities together in a way that makes harvesting and re-using them (à la Semantic Web) very much easier.
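To make that concrete, here is a much-simplified sketch of the pattern as I understand it; the entity and field names are my own invention for illustration, not CERIF's actual definitions. The point is that base entities are tied together by typed link records, and deposit metadata can then be generated from context almost for free:

```python
# A much-simplified sketch of the CERIF pattern: base entities for research
# information, tied together by typed link records rather than nested inside
# one another. All names here are illustrative, not CERIF's own.
from dataclasses import dataclass

@dataclass
class Person:
    id: str
    name: str

@dataclass
class Project:
    id: str
    title: str
    funder: str

@dataclass
class Output:
    id: str
    title: str
    kind: str  # e.g. "eprint" or "dataset"

@dataclass
class Link:
    source_id: str
    target_id: str
    role: str   # e.g. "principal_investigator", "output_of"
    start: str  # links in CERIF proper carry a time span

# Contextual information already held for research management...
alice = Person("p1", "A. N. Example")
project = Project("g42", "Coastal Erosion Survey", funder="XYZ Council")
dataset = Output("d7", "Erosion measurements 2006", kind="dataset")
links = [
    Link("p1", "g42", "principal_investigator", "2005-01"),
    Link("d7", "g42", "output_of", "2006-06"),
]

# ...can be walked to generate descriptive metadata for a repository
# deposit at near-zero marginal cost.
def metadata_for(output: Output) -> dict:
    record = {"title": output.title, "type": output.kind}
    if any(l.source_id == output.id and l.role == "output_of" for l in links):
        record["project"] = project.title
        record["funder"] = project.funder
    return record

print(metadata_for(dataset))
```

The design choice worth noticing is that relationships live in the link records, not inside the entities themselves, which is what makes the whole graph harvestable and re-usable across systems.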

Two other things I’ll briefly mention. Liz Lyon spoke about her consultancy study on Rights, Roles, Responsibilities and Relationships in dealing with data, which will be available as a report in a month or so. I won’t attempt to summarise it, but it looks an interesting piece of work, and one well worth watching out for (truth in advertising: Liz is a DCC colleague, based at Bath, where she is also, and much more importantly, Director of UKOLN).

The other particularly interesting paper, one that really caught me (on about the 3rd hearing, I guess), was Simon Coles talking about the R4L (Repository for the Laboratory) project. There are so many interesting features of this project:
  • the use of a private repository to keep intermediate results;
  • the use of health and safety plans to provide useful metadata;
  • the (possibly generic) probity service, where you can register your claim to a discovery (see the sketch below);
  • the use of both a human blogging service (so interim results can be easily discussed within the group) and a machine blogging service (so autonomous machines continue to blog their results, unattended, for perhaps days);
  • the ability to annotate results with scribbles, sketches and other annotations;
  • the idea of a structured data report based on a template, from which data for publication can be extracted (and which could form the basis of a citable data object);
  • and the various sustainability options they have in place.
This struck me again as one of the most interesting projects (or perhaps family of projects, along with Smart Tea and the eBank group) around.
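For what it’s worth, here is how I imagine the core of a probity service might work; this is purely my own sketch, not R4L’s design. The idea is to register a cryptographic hash of a result file together with a timestamp, so a discovery claim can later be verified without disclosing the result itself:

```python
# A purely illustrative sketch of a hash-based probity register -- my own
# guess at the core idea, not R4L's actual service. Registering only the
# hash lets you prove later that you held the result at registration time,
# without disclosing the result itself.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def register_claim(result_file: Path, register: Path) -> dict:
    """Append a hash-plus-timestamp record for result_file to the register."""
    digest = hashlib.sha256(result_file.read_bytes()).hexdigest()
    record = {
        "file": result_file.name,
        "sha256": digest,
        "registered": datetime.now(timezone.utc).isoformat(),
    }
    with register.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Later, anyone holding the original file can recompute its SHA-256 and
# compare it with the registered record to confirm the claim's date.
```

In a real service the register itself would of course need to be run by a trusted third party, or otherwise made tamper-evident, for the timestamp to carry any weight.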