Monday 30 June 2008

Repository Fringe

Robin Rice passed on this announcement about Edinburgh's newest festival, the Repository Fringe. She writes: "the event is being jointly planned by Edinburgh and Southampton to coincide with the 'preview week' before the opening of the Edinburgh Festival Fringe. In the spirit of the Fringe, it's a kind of 'unconference' and we're encouraging people to sign up to participate in various ways on the wiki, e.g. by leading a 'performance' of some sort - a soapbox, a group improv, an Audience With, or as part of the Poster Bazaar." She also tells me that Dorothea Salo from Caveat Lector is giving the keynote. Here is the original email...
"Enjoy fresh debate and face-to-face discussion at Edinburgh's newest and entirely unofficial festival - the Repository Fringe! We would like to invite all repository researchers, managers, administrators & developers to come to Edinburgh for a 2-day workshop on 31st July - 1st August.

The event is free to attend - you just need to register and find your own way up to Edinburgh.

The event will be an informal opportunity to:
  • discuss new and emerging repository issues
  • generate new ideas and perspectives
  • show off some new project results
  • meet up with like-minded repository managers, administrators, librarians and developers
Research data, social networking, bibliographic services or desktop integration? Whatever your interest and problems, come and give a presentation, kickstart some activity or work on a demonstration.

Put this event in your diary and sign-up online to book your place. Visit the wiki to see the latest programme or register your interest.

Online Registration - www.regonline.co.uk/63349_631697J
Accommodation booking - http://www.repositoryfringe.org/accommodation.html
Website http://www.repositoryfringe.org/
Wiki - http://wiki.repositoryfringe.org/

Hope to see you in Edinburgh!

Theo Andrew
Digital Library Section
Library & Collections
The University of Edinburgh
www.era.lib.ed.ac.uk"

Blogs, forums, email lists, interactions

Perhaps slightly off topic for this blog, but germane to the Digital Curation Centre, this post records the rather different levels of interaction we see on this blog, on the DCC Forum (which we are thinking of closing), and on the semi-closed DCC-Associates email list.

On 20 May 2008, Dave Thompson of the Wellcome Trust posted to the DCC Forum, asking about the relative merits of archiving RAW versus TIFF images. His post got only one response, from our own web editor, Graeme Pow. On 28 May, Maureen Pennock (one of this blog's co-authors) posted Dave's query to this blog. Someone linked to that post, though only to suggest that her own readers might have an answer. Maureen had added some remarks, and the post drew one further comment, prompted by the linking post (so, two comments via the blog in all).

On 25 June, Graeme Pow sent an email with a link to the original DCC Forum article, this time to the DCC-Associates email list. This is a closed list, but open to anyone who registers (currently free) as an Associate of the DCC (which brings some advantages for DCC events etc). In the few days since then, this email has received 19 responses. As usual there have been some divergences; one branch has been discussing the cost of storage, for instance, which is clearly relevant to such decisions but slightly off the original topic. In the past few days, 8 people have asked to be added to the list, and two people have asked to be taken off it because of "high traffic" (although this is the first item to generate anything like this traffic for quite a while).

We originally set up the Forum to deal with this "high traffic" problem. In some respects it seems to have done that rather too well; people don't "go and check" for interesting articles, and I'm not sure the RSS feed gets used much. This blog is an attempt to reach out to a wider audience and to link into the "blogosphere". It achieves both, but at a cost: DCC Associates can no longer post directly (although I'm quite happy to post queries for them).

Personally, I think the success of this particular DCC-Associates post was due at least in part to its connection with an area of common experience, digital photographic images. But I have found the amount and rate of response very interesting. Of course, most readers of this blog cannot see that discussion, so I plan to ask respondents if I can quote them (selectively) in a summary blog post later.

Some background information: the DCC-Associates list has 572 members. The same people can post entries or responses to the Forum. The highest number of responses we have ever had for a Forum entry is 21; however, most entries get few or no responses (out of 48 posts to one section, 17 had no responses). The highest number of responses I can find for the Digital Curation Blog is 6 plus one link, although the Negative Click Repository post got 8 links as well as 2 comments. The "Technorati Authority" of this blog is currently 26 (meaning, I believe, that 26 different blogs have linked to this one in main text rather than comments in the past 6 months). I would guess the actual readership is smaller than the DCC-Associates list membership, but it is more open, e.g. to posts being found via an Internet search.

Wednesday 25 June 2008

The End of Science?

A provocative cover story in this month's Wired magazine...

"The End of Science: The quest for knowledge used to begin with grand theories. Now it begins with massive amounts of data. Welcome to the Petabyte Age."

In his editorial, Wired editor in chief Chris Anderson sketches a near-future "where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves."

This can't go unchallenged, and indeed the bulk of reader comments so far do challenge (and refute) it. My tuppenceworth is that Anderson's somewhat one-dimensional take does away with concepts of analysis, explication, extrapolation and hermeneutics, and replaces them with... nothing.

He also assumes that artificial intelligence is the equal of human intelligence (it isn't), and fails to acknowledge that things don't always turn out as planned/modelled: horse-racing springs to mind...

So what does the Digital Curation community make of this debate?

4th International Digital Curation Conference - Radical Sharing: Transforming Science?

This is a reminder that the closing date for submission of full papers, posters and demos for the conference (details below) is 25 July 2008. We invite submissions from individuals, organisations and institutions across all disciplines and domains engaged in the creation, use and management of digital data, especially those involved in the challenge of curating data in e-Science and e-Research. Templates for submissions and other details are available at http://www.dcc.ac.uk/events/dcc-2008/ .

The 4th International Digital Curation Conference will be held on 1-3 December 2008 at the Hilton Grosvenor Hotel, Edinburgh, Scotland, UK. The conference will be held in partnership with the National e-Science Centre and supported by the Coalition for Networked Information (CNI).

We are sure there are plenty of you out there with great ideas to share with colleagues; there's still time to get writing!

Wednesday 18 June 2008

Do we really want repositories to be more Web 2.0-like?

I’m spending some time thinking through what a negative click, positive value repository system should be like. Thanks to everyone for their comments on this idea. Various people suggested we should be more Web 2.0-like. Good idea; that’s following success. Or is it?

Just the other day I read on First Author this post about an Oxford startup, and it started me thinking.
“Founded by two Oxford students GroupSpaces.com arose from frustration at the multitude of different websites which clubs and societies at Oxford University were using to organise themselves online. GroupSpaces CEO David Langer said: ‘As a former president of two University societies I became increasingly annoyed with the mash-up of disconnected tools groups were using to manage themselves online – mailing lists on Yahoo! Groups, spreadsheets in Excel, events on Facebook, ancient websites – people were spending a disproportionate amount of time organising their groups across multiple platforms. There was a clear need to connect everything up and that’s what inspired us to create GroupSpaces.’”
Hang on: the answer to difficulties caused partly by Web 2.0 companies is to build another Web 2.0 company?

Now, I too get increasingly annoyed with my ventures into the Web 2.0 space. Far from seamless mashups (they used to be a good thing), I find myself managing identities… or more to the point, failing to manage the identities that let me use Blogger, Flickr, Slideshare, Connotea, CiteULike, etc. One of them wants my email address, another wants a user ID (maybe better make it a new one in case they expose it), this one is happy with any old password, that one insists on a digit in it (OK, that makes some sense), but blow me down, this one wants at least one alpha and two digits. And the blogosphere has been rocking with data portability grumbles in recent months. So a lot of these sites, supposedly “at the network level” to quote Lorcan, are actually quite closed, proprietary, winner-takes-all, naked-capitalist commercial ventures!

Maybe my problems will all be solved by OpenID or something like it, and I shall emerge into the Web 2.0 sunshine, but somehow I doubt it.

OK, I’m sure these aren’t the aspects of Web 2.0 that people had in mind. But folks, when you talk about repositories being more compatible with the web architecture, more like Web 2.0, more oriented to the semantic web, can you be more specific please? What would you like to see happen? What do you mean, exactly?

Answers by comment, blog post linked to this one, email to me, or even by postcard, please!

Sunday 15 June 2008

Reaction to Negative Click Repositories

I was very pleased with the reactions to the Negative Click Repositories post. I’ll come back to the idea itself in a later post, but this one just attempts to gather some of the comments together (the first few are comments on the original post, which I know my blog reader doesn’t expose easily).

Owen from Imperial writes
“We clearly need to look at ways of adding value (for the researcher) to anything deposited, as well as minimising the barriers to deposit (perhaps following up on the suggestion of integrating 'repositories' into the authoring process).”
Chris Clouser reported they had been struggling with similar ideas:
“We tried to come up with some way to identify what it was, and eventually settled on the phrase "repository-as-scholarly-workbench." It's not perfect, but it worked for us.”
Gavin Baker wrote:
“I'm very interested in ways to integrate deposit with the normal workflow of authors. One thought would be to set up a virtual drive and mount it throughout campus (and give authors instructions on how to mount it from home). Then, when the author is done composing the article, they could simply save a copy to their folder on the virtual drive. Whatever was saved in these folders would be queued for deposit in the IR. (This is already how many people use FTP/SSH to upload files to their Web space.)

"Thinking ahead, could you set up autosave in MS Word, OpenOffice, etc. to save each revision to the virtual drive? This would completely integrate deposit in the workflow, requiring no additional steps for deposit, but would need the author to approve the "final" draft for public access.”
Moving to blog reactions, Pete Sefton reported on the OR08 work on “zero click deposit” (to which I should have given some credit). He suggests (perhaps tongue in cheek) that I’m unfair in asking for more. He also linked back to earlier thoughts on Workflow 2.0:
“But in a Workflow 2.0 system the IR might not need to have a web ingest. Instead there could be an ingest rule that would fire when a certain context was detected, such as a document attaining a certain state of development, which would be reflected in its metadata. That's the 'data as the driving force'... In this case it's metadata that's doing the driving.”
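Pete's "metadata doing the driving" can be pictured as a rule that inspects each document's state and fires the ingest when the right context is detected. A tiny sketch, with the field names invented for illustration:

```python
def should_ingest(metadata: dict) -> bool:
    # Hypothetical rule: fire when the document reaches a 'final' state
    # of development and has not already been deposited.
    return metadata.get("status") == "final" and not metadata.get("deposited")

doc = {"title": "Workflow 2.0", "status": "final", "deposited": False}
if should_ingest(doc):
    print(f"Queueing '{doc['title']}' for repository ingest")
    doc["deposited"] = True
```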
Next Pete talked about his participation in a JISC project (with Cambridge and some others):
“In TheOREM we’re going to set up ICE as a ‘Thesis Management System’ where a candidate can work on a thesis which is a true mashup of data and document aka a datument… When it’s done and the candidate is rubber-stamped with a big PhD, the Thesis management system will flag that, and the thesis will flow off to the relevant IR and subject repositories, as a fully-fledged part of the semantic web, thanks to embedded semantics and links to data.”

“One more idea: consider repository ingest via Zotero or EndNote. In writing this post I just went to the OR08 repository and used Zotero to grab our paper from there so I could cite it here. It would be cool to push it to the USQ ePrints with a single click.”
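A single-click push of that kind is roughly what the emerging SWORD deposit profile of AtomPub is meant to enable. The sketch below shows the general shape of such a deposit; the endpoint URL and credentials are invented, and a real SWORD client would send a properly packaged item rather than a bare Atom entry:

```python
import base64
import urllib.request

# Hypothetical SWORD/AtomPub collection URI and credentials
ENDPOINT = "https://eprints.example.edu/sword/deposit/articles"
USER, PASSWORD = "author", "secret"

entry = b"""<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Our OR08 paper</title>
  <summary>One-click deposit from a reference manager.</summary>
</entry>"""

request = urllib.request.Request(ENDPOINT, data=entry, method="POST")
request.add_header("Content-Type", "application/atom+xml;type=entry")
auth = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
request.add_header("Authorization", f"Basic {auth}")

with urllib.request.urlopen(request) as response:
    # A successful AtomPub deposit returns the URI of the created item
    print(response.status, response.headers.get("Location"))
```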
Then under the heading of “Net-benefit repository workflows” he cautions
“I think it is a mistake to focus on the Negative Click repository as a goal and assume that it’s going to happen without changing user behavior…
If we can seed the change by empowering some researchers to perform much better – to more easily write up their research, to begin to embed data and visualizations, to create semantic-web-ready documents – then their colleagues will notice and make the change too.”
Les Carr had (as usual) some sensible remarks:
“I think I would rather talk about value or profit - the final outcome when you take the costs and benefits into consideration. Do you run a positive value repository? Is it frankly worth the effort? Are your users in scholarly profit, or are you a burden on their already overtaxed resources?”
Les was a bit worried I had taken the Ulysses Acqua scenario too literally. Don't worry, Les, I do get it as fiction, just a particularly apposite one in many cases, if not for your repository. Les concludes
“Anyway, I think that I am in violent agreement with Chris, so to show solidarity I will do what he asked and list some positive value generators: publicity and profile (CVs, Web pages, displays, adverts for MScs/PhDs/staff), community discovery, laptop backup and asset management.”
Thanks, Les.

Meanwhile Chris on the Logical Operator has a long, thoughtful post. Go read it! Of the “negative click repository” idea, he writes:
“It’s a laudable goal, but I’m not sure how one would, in practice, implement such a thing, since even the most transparent approach has a direct impact on workflow (the exact thing that most potential contributors appear not to want to mess with - they don’t want to change the way they work). The challenge is to create something that:

* Integrates with any workflow
* Involves no (or essentially no) extra activities on the part of the scholar
* Uses the data structure of the material to create the interrelations critical to really opening access
* Provides a visible and obvious value-add that can be explained without looking under the hood
* Is not just a wrapper on some existing tool”
Chris also likes the idea of the “scholarly workbench” approach, and thinks this might include:
“Storage: the system (or aggregate) would need to store copies of the relevant digital object in such a way that it could be reliably retrieved by any user. Most any system - Writeboard, SlideShare, even wiki software have this capability.

Interoperation: the system would need to allow other systems to locate said material by hooks into the system. Many W2.0 applications allow access directly to their content via syndication.

Metadata: tagging is nearly universal across W2.0 applications - it’s how people find things, the classic “tag-based folksonomy” approach.

Sharing: again, this is the entire point of most W2.0 applications. It’s not fun if you can’t share stuff. Community is the goal of most of the common web applications.

Ease of use: typically, Web 2.0 applications are designed to get you up and running quickly. You do very little to establish your account, and there are no complex permissions or technical fiddling to be done.”

Last but one (for this post, anyway), Nico Adams has some more thoughts on Staudinger’s Molecules. Again, the whole post is worth a read; I have not yet looked at the piece of technology he is playing with:
“I have been wondering for a while now, whether the darling of the semantic web community, namely Radar Network’s Twine, could not be a good model for at least some of the functionality that an institutional repo should/could have. It takes a little effort to explain what Twine is, but if you were to press me for the elevator pitch, then it would probably be fair to say that Twine is to interests and content, what Facebook is to social relationships and LinkedIn to professional relationships.”
There’s a lot more about Twine, including some screen shots, so go have a read.

Finally, a post that was not a reaction, but might have been. I should have given credit to Andy Powell, who has been questioning the current implementations of repositories (despite being joint author of the JISC Repositories Roadmap [.doc]). Andy wants repositories to be more consistent with the web architecture. He spoke at a Talis workshop recently; his slides are here (on Slideshare, one of his models for a repository). Cameron Neylon on Science in the Open was also at the event, and blogged about Andy’s talk. He also wrote:
“But the key thing is that all of this [work to deposit] should be done automatically and must not require intervention by the author. Nothing drives me up the wall more than having to put the same set of data into two subtly different systems more than once. And as far as I can see there is no need to do so. Aggregate my content automatically, wrap it up and put it in the repository, but I don’t want to have to deal with it. Even in the case of peer reviewed papers it ought to be feasible to pull down the vast majority of the metadata required. Indeed, even for toll access publishers, everything except the appropriate version of the paper. Send me a polite automated email and ask me to attach that and reply. Job done.

For this to really work we need to take an extra step in the tools available. We need to move beyond files that are simply ‘born digital’ because these files are in many ways still born. This current blog post, written in Word on the train, is a good example. The laptop doesn’t really know who I am, it probably doesn’t know where I am, and it has no context for the particular word document I’m working on. When I plug this into the Wordpress interface at OpenWetWare all of this changes. The system knows who I am (and could do that through OpenID). It knows what I am doing (writing a Blog post) and the Zemanta Firefox plug in does much better than that, suggesting tags, links, pictures and keywords.

Plugins and online authoring tools really have the potential to automatically generate those last pieces of metadata that aren’t already there. When the semantics comes baked in then the semantic web will fly and the metadata that everyone knows they want, but can’t be bothered putting in, will be available and re-useable, along with the content. When documents are not only born digital but born on and for the web then the repositories will probably still need to trawl and aggregate. But they won’t have to worry me about it. And then I will be a happy depositor.”
Well Cameron, we want your content! So overall, there’s something here. I like the idea of positive value better than negative clicks, but I haven’t got a neat handle on it yet. Scholar’s workbench doesn’t really thrill me; I see what they mean, but the term is too broad, and scientists don’t always respond to the “scholar” label. I’ll keep thinking!
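As a footnote to Cameron's "aggregate my content automatically" plea: the harvesting half is already easy, since most of the tools he mentions expose Atom or RSS feeds. A rough sketch of the trawl, with the feed URLs invented and the actual deposit left as a stub:

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
# Hypothetical feeds of one researcher's blog posts, slides, preprints...
FEEDS = ["https://example.org/blog/atom.xml", "https://example.org/slides/atom.xml"]

def harvest(feed_url: str) -> list[dict]:
    """Pull title, link and date for each entry, ready to wrap for deposit."""
    with urllib.request.urlopen(feed_url) as response:
        root = ET.parse(response).getroot()
    items = []
    for entry in root.findall(f"{ATOM}entry"):
        link = entry.find(f"{ATOM}link")
        items.append({
            "title": entry.findtext(f"{ATOM}title"),
            "link": link.get("href") if link is not None else None,
            "updated": entry.findtext(f"{ATOM}updated"),
        })
    return items

for url in FEEDS:
    for item in harvest(url):
        print("would deposit:", item["title"], item["link"])  # stub for the IR ingest
```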

Tuesday 10 June 2008

Negative click repositories

I wanted to write a bit more about this idea of a negative click repository (negative cost was a bad name, as there is a real positive $ and £ cost, borne by the repository rather than the depositor). First some ancient history...

When I joined the University of Glasgow in 2000, the Archives and Business Records Centre with other collaborators within the University were near the end of a short project on Effective Records Management (ERM, http://www.gla.ac.uk/infostrat/ERM/). During the course of that project, they surveyed committee clerks (who create many authoritative institutional records) on how much effort they were willing to put in, how many clicks they were willing to invest, to create records that would be easily maintainable in the digital era. The answer was: zero, none, nada! Rather than give up at this point, the team went on to create CDocS (http://www.gla.ac.uk/infostrat/ERM/Docs/ERM-Appendix2.pdf), an instrumented addition to MS Word that allowed the committee clerks to create their documents in university standard forms, with agreed metadata, with the documents and metadata automatically converted into XML for preservation and to HTML for display and sharing. ICE (see below) might be a contemporary system of a related kind, in a slightly different area. Thanks to James Currall for updating me on ERM and CDocS.
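To make the CDocS approach concrete: capture the agreed metadata once at authoring time, then generate both the XML preservation copy and the HTML display copy from that single source. A toy illustration follows; the field names are invented, and the real CDocS worked as an instrumented add-in inside MS Word:

```python
import xml.etree.ElementTree as ET

# Hypothetical 'agreed metadata' captured once at authoring time
record = {
    "committee": "Library Committee",
    "date": "2000-11-07",
    "title": "Minutes of the November meeting",
    "body": "The committee approved the records management proposal.",
}

# XML for preservation
root = ET.Element("minutes")
for field, value in record.items():
    ET.SubElement(root, field).text = value
ET.ElementTree(root).write("minutes.xml", encoding="utf-8", xml_declaration=True)

# HTML for display and sharing, generated from the same single source
html = (
    f"<html><head><title>{record['title']}</title></head>"
    f"<body><h1>{record['title']}</h1>"
    f"<p>{record['committee']}, {record['date']}</p>"
    f"<p>{record['body']}</p></body></html>"
)
with open("minutes.html", "w", encoding="utf-8") as out:
    out.write(html)
```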

In April 2007, Peter Murray-Rust had an epiphany thinking about repositories on the road to Colorado, realising that SourceForge was a shared repository he had been using for years, and speculating that it might be used for writing an article. The tool for managing versions and sharing in SourceForge is SVN… Peter wrote about the complex workflow in writing a collaborative article, but then wrote:
“BUT using SVN it’s trivial - assuming there is a repository. So we do not speak of an Institutional Repository, but an authoring support environment (ASE, or any other meaningless acronym.) A starts a project in institutional SVN. B joins, so do C, D, E, etc. They all edit the m/s. Everyone sees the latest version. The version sent to the publisher is annotated as such (this is trivial). All subsequent stuff is tracked automatically. When the paper is published, the institution simply liberates the authorised version - the authors don’t even need to be involved. The attractive point of this - over simple deposition - is that the repository supports the whole authoring process.”
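Peter's "trivial" annotation step really is trivial in SVN: tagging is just a cheap server-side copy. A sketch driving the command-line client from Python, with the repository URL and tag name invented for illustration:

```python
import subprocess

REPO = "https://svn.example.edu/papers/our-article"  # hypothetical institutional SVN

# Everyone edits trunk; a tag records the exact version sent to the publisher.
subprocess.run(
    ["svn", "copy", f"{REPO}/trunk", f"{REPO}/tags/submitted-to-publisher",
     "-m", "Version sent to the publisher"],
    check=True,
)

# On publication, the institution can 'liberate' that tagged version unattended.
subprocess.run(
    ["svn", "export", f"{REPO}/tags/submitted-to-publisher", "./to-deposit"],
    check=True,
)
```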
Many of those who left comments disagreed that the technology would work directly as suggested, for various reasons. Google Docs was mentioned as an alternative (still flawed). Peter Sefton mentioned ICE (and Murray-Rust subsequently visited USQ to work briefly with the ICE team).

You may also remember Caveat Lector’s series of personae representing stakeholders in the repository game at the fictional Achaea University, which I reported on before. Ulysses Acqua was her repository manager, and here’s a quote from his attempts to explain the advantages of his repository to faculty; they ask:
“Can it produce CVs or departmental-activity reports automatically? No. Can it be tweaked so that the Basketology collection looks like the Basketology website? No. (The software can do that, in fact, but Ulysses can’t.) Can it talk to campus IT’s file-storage-cum-website servers? No. Can it harvest faculty articles from disciplinary repositories? No. Can it deliver records straight to Achaea Library’s catalogue? No. Can it have access controls per item, such that items are shared with specific people only, with the list controlled by the depositor? No. Can it embargo items, for a certain length of time or indefinitely? No. Can it read a citation, check rights on the journal, and go fetch the paper if rights are cleared? Dream on. Can it restrict items for campus-only access by IP address? No. Does it talk to RefWorks and Zotero and similar bibliographic managers? No. Does it do version control? No.”
The problem here is that the repository adds work; it doesn’t take it away (there are other examples of this among the other personae). And overloaded people don’t accept extra work. They may promise to, but they (mostly) don’t do it.

Finally, I posted earlier on Nico Adams’s comments on repositories for scientists. He got stuck, he said: "I had to explain what a repository is from the point of view of functionality - when talking to non-specialist managers, it is the only way one can sell and explain these things…they do not care about the technological perspective…the stuff that’s under the hood. I found it impossible to explain what the value proposal of a repository is and how it differentiates itself in terms of its functionality from, say, the document management systems in their respective companies." It’s worth repeating a couple of extracts from his conclusions:
"Now today it struck me: a repository should be a place where information is (a) collected, preserved and disseminated (b) semantically enriched, (c) on the basis of the semantic enrichment put into a relationship with other repository content to enable knowledge discovery, collaboration and, ultimately, the creation of knowledge spaces. "
And further:
"Now all of the technologies for this are, in principle, in place: we have DSpace, for example, for storage and dissemination, natural language processing systems such as OSCAR3 or parts of speech taggers for entity recognition, RDF to hold the data, OWL and SWRL for reasoning. And, although the example here, was chemistry specific, the same thing should be doable for any other discipline. … Yes it needs to be done in a subject specific manner. Yes, every department should have an embedded informatician to take care of the data structures that are specific for a particular discipline. But the important thing is to just make the damn data do work!"
So we need to develop repositories that make the data do the work, taking human work away: negative click repositories. Or maybe not… (sorry about this). I’m just a bit concerned that we might get a set of monumental edifices built on DSpace or ePrints.org foundations, resulting in “take it or leave it” decisions. Institutional information environments are highly tailored (not always carefully), and at the department level even more eclectic, and things have to fit together. Maybe, as Nico was suggesting, what we need is an array of tools, connected by technologies like Atom/RSS and/or OAI-ORE, that can be configured to link the components into an information management system that reduces the publishing effort on campus, and captures the intellectual product on the way.

People appear to have lots of ideas on what negative click repositories might do. We could have tools for supporting the scientist in their information sharing (web sites, bibliographies and CVs). Tools for shared data management. Tools for shared writing. Tools even for the library to support faculty in dealing with publishers. And of course tools to help management count their beans. I’d like to begin to collect and order these and other suggestions if possible, so please leave comments or tag your blogs with “negative click”…

Wednesday 4 June 2008

The negative cost repository, and other archive services

I've been at a meeting of research libraries here in Philadelphia these past two days; a topic that came up a few times was the sorts of services that libraries might offer individuals and research groups in managing their research collections. I was reminded of my post last year about internal Edinburgh proposals for an archive service. Subsequently it struck me that there is quite a range of services that could be offered by some combination of Library and IT services; I mentioned some of these, and there seemed to be some resonance. There could well be more, but my list included:
  • a managed current storage system with "guaranteed" backup, possibly related to the unit or department rather than individual
  • a "bit bucket" archive for selected data files, to be kept in some sense as a record (perhaps representing some critical project phase) for extended periods, probably with mainly internal access (but possibly including access by external partners, ie "semi-internal"). Might conflate to...
  • a data repository, which I would see as containing all or most data in support of publication. This would need to be static (the data supports the publication and should represent it), but might need to have some kind of managed external access. This might extend to...
  • a full-blown digital preservation system, ie with some commitment to OAIS-type capabilities, keeping the data usable. As well as that we have the now customary (if not very full)...
  • publications repository, or perhaps this might grow to be...
  • a managed publications system providing support for joint development of papers and support for publication submission, and including retention & exposure of drafts or final versions as appropriate.
I really like the last idea, to which I have seen various references. Perhaps we could persuade people to deposit if the cost of deposit was LESS than the cost of non-deposit. The negative-cost repository, I like that!