Digital Curation Blog: Research Repository System


Tuesday, 30 December 2008

Gibbons, next generation academics, and ir-plus

Merrilee on the hangingtogether blog, with a catchup post about the fall CNI meeting, drew our attention to a presentation by Susan Gibbons of Rochester, on studying next generation academics (ie graduate students), as preparation for enhancements to their repository system, ir-plus.

I shared a platform with Susan only last year in Sydney, and I was very impressed with their approach of using anthropological techniques to study and learn from their user behaviour. In fact, her Sydney presentation probably had a big influence on the ideas for the Research Repository System that I wrote about earlier (which I should have acknowledged at the time).

Now their latest IMLS-funded research project takes this even further, as preparation for collaborative authoring and other tools integrated with the repository. If you're interested in making your repository more relevant to your users, the report on the user studies they have undertaken so far is a Must-Read! I don't know whether the code at http://code.google.com/p/irplus/ is yet complete...

Thursday, 21 August 2008

How to make repositories a killer app for scientists

Cameron Neylon wrote a nice post indirectly addressing the question of how Nature might make Connotea more useful. It's well worth reading for its own merits, but I was so taken by his questions, I thought they might be re-purposed to apply to repositories. As Cameron says "These are framed around the idea of reference management but the principles I think are sufficiently general to apply to most web services". The text below is Cameron's, except with chunks taken out (shown by ellipses ...) and some of my text added [in brackets] (added text may replace some of Cameron's text).

"1. Any tool must fit within my existing workflows. Once adopted I may be persuaded to modify or improve my workflow but to be adopted it has to fit to start with. ...
2. Any new tool must clearly outperform all the existing tools that it will replace in the relevant workflows without the requirement for network or social effects. Its got to be absolutely clear on first use that I am going to want to use this instead of ...[some other tool, like a lab web site]. ... And this has to be absolutely clear the first time I use the system, before I have created any local social network and before you have a large enough user base for these to be effective.
3. It must be near 100% reliable with near 100% uptime. ... Addendum - make sure people can easily ... download their stuff in a form that will be useful even if your service disappears. Obviously they’ll never need to but it will make them feel better (and don’t scrimp on this because they will check if it works).
4. Provide at least one (but not too many) really exciting new feature that makes people’s life better. This is related to #2 but is taking it a step further. Beyond just doing what I already do better I need a quick fix of something new and exciting. ...
5. Prepopulate. Build in publically available information before the users arrive. ..."

I think what Cameron's saying is what we all should know by now of the needs of potential repository depositors: make it easier, make it better, make it worth my while.

It's the "how" that's tricky. I don't think this is impossible, but these targets are hard to hit. It's what the "negative click" Research Repository System attempts have all been about. They do need a re-think, however, if they are not to fall foul of Cameron's rule #1! So what other ideas are there out there, folks?

Tuesday, 5 August 2008

Comments on Negative Click Research Repository System

You may remember that I wrote a series of posts about a Research Repository System, aiming to improve deposits by getting repositories to do more that’s useful to the researcher. I had suggested it should contain these elements:

web orientation
researcher identity management
authoring support
object disclosure control
data management support
persistent storage
full preservation archive, and
spinoffs

There has been some interesting commentary on this idea. Some of it was in the comments to this blog, but most came through the JISC Repository and Preservation Advisory Group’s use of Ideascale to discuss issues related to a possible revision of the Repositories Roadmap. I decided to post some of the ideas from the RRS posts to the Ideascale forum, and they attracted comments, both positive and negative, and votes, and I will try to summarise key points here. Note, I have only picked out posts and comments that I consider relevant to this approach; there were many other interesting ideas and comments that you might want to read if you care about repositories.

BTW if you haven’t tried it, I was very impressed by the combination of technologies that Ideascale have put together, and the simplicity and value of its approach, akin to brainstorming but slower! Worth giving a whirl…

Overall, the thrust of the whole conversation seemed to support my feeling that repositories must serve their depositors rather than down-stream users (the institution, the department, colleagues, the public). This links strongly to the theme behind the RRS set of ideas.

The most popular “idea” on the JISC Ideascale site was Andy McGregor’s, not mine, but was related to this theme: “Define repository as part of the user’s (author /researcher/ learner) workflow” (21 voted for, 2 against, net 19). While there was general support, there was some concern as well: ojd20 wrote “I agree with this idea in the sense of ‘take account of things users need to do’, but not in the sense that we can reduce the myriad functions of an HE institution to a small set of flow charts and design repositories to fit those… ‘Familiar processes and tools’ is a much better way of expressing it [than workflow]” The other warning, from Owen Stephens, was “I'm also wary of comments elsewhere from the likes of Peter Murray-Rust on this. He warns (my interpretation) that we need to be careful of doing this as if it complicates the workflow, it just won't happen”. A. Dunning echoed a common theme in saying “Disagree - the repository is a back end data provider that should not be part of the researchers' / learners' workflow - but the service which sits on top of the repository should be part of the workflow.”

Rachel Heery showed another common concern, about possible over-complexity:

“I think the RRS you envisage sounds fantastic and would be a 'good thing', what worries me is the 'function creep' taking us a few miles on from some of the more basic, simpler 'few keystrokes' approach to repositories. The RRS sounds to have many features of a Virtual Research Environment, albeit perhaps a less data centric VRE…”

This echoed a comment from Bill Roberts on one of the original blog posts:

“Seriously, though - the system you've sketched out is quite complicated, not so much in the individual parts of it, (though several of those parts are quite substantial) but in the design vision required to connect up all those features in a way that would be easy and natural to use. Can you identify a subset of parts that would make a useful version 1?”

And Steve Hitchcock also commented on complexity to the same post:

“The irony of the negative click philosophy is that it has led you to produce a visionary but significantly more complex system.”

The surprising (to me) winner among the RRS ideas was “Allow the user fine-grained disclosure/access control to repository objects”. This gained 12 positive and 2 negative votes, for a net 10 votes, and no negative comments. I don’t have a good explanation for why this was so popular, noting that the voters were mainly repository managers or “experts”, especially since some of the corollary ideas were much less popular.

Another idea posted by Rachel Heery, not from RRS but worth taking note of (and getting 11 votes, none against) was “We shouldn't be thinking of repositories as a place.” In particular this meant “'repositories' are best viewed as a 'type' of data store supporting a variety of services, embedded in various workflows.”

I’ll include the Researcher Identity Management idea next, not for its own votes (4 in favour, 2 against), nor for the comments (of which there were none), but because of the linkages I had made between researcher identity and Current Research Information Systems (CRIS). The idea associated with CRIS was from Ian Stuart, and got 13 positive and 4 negative votes: “Repositories are dead, long live repositories” (yes, ideas are not always best titled!). He wrote

“The current repository technology is library/cataloger centric: items are uploaded (usually by a cataloger, not the author), and most of the meta-data is added by a subject specialist. In this model, the author-as-depositor is (at best) just an initiator for a deposit process.

A better solution would be to move towards a Combined [sic] Research Information System [CRIS], where the academic can organise their areas of interest [AOI]; see the research grants they have (and associate them with their AOI); lodge keep-safe copies of work-in-progress, data-sets, talks, ideas for future work, posters, etc (and associate them with grants or AOIs).

From this corpus of data, the academic can indicate what is visible locally (within the research group/department/organisation) and what is available globally... and from that "globally available" pool, an "Institutional Repository" can be assembled.”

The idea that the repository should be more web-oriented got 6 votes. There were no negative comments on the idea itself, but I drew a comparison (echoing Andy Powell) with Slideshare, and Paul Walk reacted against this: “More like the Web - yes! Like Slideshare - no!” He expands on this comment later (see below).

The idea that the repository should be more Web 2.0-like drew 6 positive but 5 negative votes, net 1 vote, possibly the most divided idea of all! I guess it’s one I was much less sure of, but included it because of comments to my earlier posts on this theme. A. Dunning wrote: “It's not so much the repository itself that needs Web2.0 but the range of the services which exploit that data.” And Paul Walk wrote:

“The repository should be able to participate in an interactive Web. It should be entirely possible for someone else to build a remote Web 2.0 service around resource exposed in my repository. This does not preclude me building such a service - if I think I have sufficient mass of interested users etc. but if I *need* to do this because only I can, then I have probably just built another silo.

This is what I meant by my assertion that, in general, Technorati offers us a better model than Slideshare. Someone can build a better, more focussed, domain specific etc. version of Technorati if there is demand for this, without needing to move or copy the resources in their 'source repositories' (blogs, in this case).”

[Paul, I have to say I’m unconvinced so far. A repository has to have the stuff, so that makes it more like Slideshare. And for various reasons, Technorati sucks (searches don’t work properly, lots of things fall off or go wrong). If you’d said Google instead, I might have got it. But searching by reference rather than by inclusion seems like going back to Z39.50 rather than search engine harvesting approaches (or even OAI-PMH). But maybe I’m resolutely getting the wrong end of this stick!]

Richard Davis wanted a foot in both camps:

“IMO, ideally a repo would be susceptible to all sorts of Flickrish RSS, Web API type manipulation - SWORD-like and who knows what else - leading to total personalisation potential of the user experience.

OTOH, providing implementers and users with an acceptable out-of-the-box UI, that might include basic implementations of some voguish features, and widgety ways to configure them - cf. Flickr or Wordpress - is no bad thing either.”

The idea of providing persistent storage got 6 positive and 2 negative votes, net 4 votes. Paul Walk was generally happy:

“I agree that persistence, more or less as you define it here, should be a core property of what I think of as a 'source repository'. In terms of ease of use: read, or 'get' access should be simple and should be provided through HTTP unless there is a very good reason not to. We should aspire to ease of use in terms of ingest and administrative activities.

In terms of synchronisation with offline storage (unless I have misunderstood your point): this is an interesting area, but I'm not sure it is a fundamental or universal function of 'the repository'. I think we are going to see a strange race between the development of mechanisms to do this (e.g. GoogleGears) and the progress in us all becoming so ubiquitously and permanently connected that we don't need this any more…”

The idea of data management support attracted 4 votes, none negative, and no comments. I’m not sure how to interpret this… but it was not the core interest of the group, I guess. I continue to think it an important missing service that falls into repository-space.

I split authoring support into two ideas: publishing and authoring. Of these, “The repository/library should provide support in the publishing process” got 6 votes, none negative. There were some concerns, eg Owen Stephens again: “This needs restating from a user perspective.” He thought that “a system to help streamline the editorial/publication process is fine (if we can persuade academics to use it), but I'm not sure I'd start by building it on the repository - maybe you could, but there are other approaches as well.” In particular, he wrote “I see the point about the benefits, but if we are solving problems our end users don't feel need solving then we have a much more difficult job on our hands (and perhaps not substantially different from our current problems with getting researchers to put things into our repository).”

More direct support for authoring was more divisive (5 votes for and 3 against, net 2 votes). In support, Owen Stephens wrote “The availability of research via Open Access would increase if the same systems that provide Open Access also provided, or were integrated with, tools which support the authoring process.” There was a bit of so-what from Ian Stuart: “This is a variation of what Peter Murray-Rust proposed back in '07: google-docs mixed with a CRIS.” [Yes, Ian, explicitly this idea came from some posts by PMR.] And one outright warning from forkel: “my experience is, that feature requests of this sort are exactly the ones which end up in the "users didn't know what they wanted bin" (I'm a developer).”

Finally (almost) I posted the idea that “The repository should be a full OAIS preservation system”. This turned out to be extremely unpopular, with only 2 votes for and 6 against, net -4! I was very disappointed (repositories are surely not for transient stuff that might be here today, gone tomorrow), so re-phrased it as “Repository should aspire to make contents accessible and usable over the medium term”, when again it attracted 2 votes for and this time none against. This was very late in the day, so maybe few people were coming back to vote on new ideas by then. It was clear that part of the problem was the term OAIS; my mistake for assuming in that context that people would have understood the term.

So where does this leave the Research Repository System idea? Not fully intact, I fear, but not shredded, either. Reduce complexity, think services operating on a store, take care to ensure these are services people really want, etc… It’s going to take me a while to work this through, but I will get back to you!

Wednesday, 2 July 2008

Research Repository System persistent storage

This is the seventh and last of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System. I've suggested a set of elements it should contain; this post covers persistent storage, plus the last two shorter parts.

At a very basic level, the RRS should provide a Persistent Storage service. Completely agnostic as to objects, Persistent Storage would provide a personal, or group-oriented (ie within the institution) or project-oriented (ie beyond the institution) storage service that is properly backed up. There’s no claim that Persistent Storage would last for ever, but it must last beyond the next power spike, virus infection or laptop loss! It has to be easy to use, as simple as mounting a virtual drive (but has to work equally easily for researchers using all 3 common OS environments). Conversely (and this isn’t easy), there must be reliable ways of taking parts of it with you when away from base, so synchronisation with laptops or remote computers is essential. It should support anything: data, documents, ancillary objects, databases, whatever you need. It’s possible that “cloud computing” eg Amazon S3, the Carmen Cloud or other GRID services might be appropriate.

[I'll include the last two shorter parts here.

The RRS might include a full-blown OAIS digital preservation archive. Not many institutions run these at this time, so although I think it should, I hesitate to suggest it must!

Some spinoffs you should get from your RRS would include persistent elements for your personal, department, group or project web pages (even the pages themselves). It should provide support for your CV, eg elements of your bibliography, project history, etc. It will provide you and your group with persistent end-points to link to. And your institution will benefit, first from supporting its researchers in curating their data and supporting verifiability of publication, and second from the research disclosure aspects.]

I guess that just about wraps it up. So who's going to build one?

Research Repository System data management

This is the sixth of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System. I've suggested a set of elements it should contain; this post covers data management support.

Data management support is where this starts to link more strongly back to digital curation. Bear in mind here, this is a Research Repository System; not all of these functions, or the next group, need to be supported by anything that looks like one of the current repository implementations! I’m not quite clear on all of the features you might need here, but we are beginning to talk about a Data Repository.

It is essential that the Data Management elements support current, dynamic data, not just static data. You may need to capture data from instruments, process it through workflow pipelines, or simply sit and edit objects, eg correcting database entries. Data Management also needs to support the opposite: persistent data that you want to keep un-changed (or perhaps append other data to while keeping the first elements un-changed).

One important element could be the ability to check-point dynamic, changing or appending objects at various points in time (eg corresponding to an article). In support of an article, you might have a particular subset available as supplementary data, and other smaller subsets to link to graphs and tables. These checkpoints might be permanent (maybe not always), and would require careful disclosure control (for example, unknown reviewers might need access to check your results, prior to publication).
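As a rough illustration of the check-pointing idea, here is a minimal Python sketch. Everything in it (the class names `DynamicDataset` and `Checkpoint`, the `subset` helper, the sample records) is hypothetical and invented for this example; it is not taken from any existing repository software. It only tries to show the principle: the live data keeps changing, while a labelled checkpoint holds a frozen copy that later edits cannot touch.

```python
import copy
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Checkpoint:
    """Frozen snapshot of a dataset at one moment (hypothetical structure)."""
    label: str
    taken_at: datetime
    records: tuple  # deep copy of the records as they were at checkpoint time


@dataclass
class DynamicDataset:
    """Live, changing data plus any number of labelled checkpoints."""
    records: list = field(default_factory=list)
    checkpoints: dict = field(default_factory=dict)

    def append(self, record):
        self.records.append(record)

    def checkpoint(self, label):
        """Freeze the current state under a label, eg the name of an article."""
        if label in self.checkpoints:
            raise ValueError(f"checkpoint {label!r} already exists")
        # Deep copy, so that correcting a live record later leaves the snapshot alone.
        frozen = tuple(copy.deepcopy(self.records))
        snap = Checkpoint(label, datetime.now(timezone.utc), frozen)
        self.checkpoints[label] = snap
        return snap

    def subset(self, label, predicate):
        """Pick out part of a checkpoint, eg the rows behind one graph or table."""
        return [r for r in self.checkpoints[label].records if predicate(r)]


data = DynamicDataset()
data.append({"sample": 1, "value": 0.42})
data.append({"sample": 2, "value": 0.57})
data.checkpoint("article-draft")

data.append({"sample": 3, "value": 0.61})  # arrives after the checkpoint
data.records[0]["value"] = 0.40            # later correction to the live data

figure_rows = data.subset("article-draft", lambda r: r["value"] > 0.5)
```

A real system would keep snapshots far more economically than by copying everything, and would attach disclosure rules to each checkpoint; the sketch ignores both.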

Some parts of Data Management might support laboratory notebook capabilities, keeping records with time-stamps on what you are doing, and automatically providing contextual metadata for some of the captured datasets. Some of these elements might also provide some Health and Safety support (who was doing what, where, when, with whom and for how long).

Research Repository System object disclosure control

This is the fifth of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System. I've suggested a set of elements it should contain; this post covers object disclosure control.

Object disclosure control is crucial to this system working. Many digital objects in the system would be inaccessible to the general public (unless you are working in an Open Science or Open Notebook way). You need to be able to keep some objects private to you, some objects private to your project or group (not restricted to your institution, however), and some objects public. There should probably be some kind of embargo support for the latter, perhaps time-based, and/or requiring confirmation from you before release. And since some digital objects here are very likely to be databases, there are some granularity issues, where varying disclosure rules might apply to different subsets of the database. Perhaps this is getting a bit tough!
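To make the rules above a little more concrete, here is a small Python sketch of how a disclosure check might look. It is only an assumption about one possible design: the names `Level`, `DisclosureRule` and `can_view`, and the user ids, are all made up for illustration. It covers the three levels (private, group, public) and both kinds of embargo (time-based and owner confirmation); it does not attempt the harder database-subset granularity.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Level(Enum):
    PRIVATE = "private"  # visible to the owner only
    GROUP = "group"      # visible to a named project or group
    PUBLIC = "public"    # visible to anyone, once any embargo is lifted


@dataclass
class DisclosureRule:
    """Hypothetical per-object rule; one could equally attach it to a subset."""
    owner: str
    level: Level
    group: frozenset = frozenset()        # member ids, may span institutions
    embargo_until: Optional[date] = None  # time-based embargo
    needs_confirmation: bool = False      # owner must confirm before release
    confirmed: bool = False


def can_view(rule, user, today):
    """Return True if `user` (None for an anonymous visitor) may see the object."""
    if user is not None and user == rule.owner:
        return True
    if rule.level is Level.PRIVATE:
        return False
    if user in rule.group:
        return True
    if rule.level is Level.GROUP:
        return False
    # PUBLIC: released only when the embargo has passed and any confirmation is given.
    if rule.embargo_until is not None and today < rule.embargo_until:
        return False
    if rule.needs_confirmation and not rule.confirmed:
        return False
    return True


rule = DisclosureRule(
    owner="sam",
    level=Level.PUBLIC,
    group=frozenset({"alex@partner.example"}),
    embargo_until=date(2009, 1, 1),
)
before = date(2008, 12, 1)
after = date(2009, 1, 2)
```

With this rule, an anonymous visitor is refused before the embargo date and admitted after it, while the owner and the group member can see the object throughout.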

Research Repository System authoring support

This is the fourth of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System. I've suggested a set of elements it should contain; this post covers authoring support.

Authoring support should include version control, collaboration, possibly publisher liaison, and be integrated with the repository deposit process. It does need object disclosure control, see below. Version control would support ideas, working drafts, pre-prints, working papers, submitted drafts undergoing editorial changes, and refereed and published versions. Collaboration support would need to include support for multiple authors contributing document parts, and assembly of these into larger parts and eventually “complete” drafts. It should also include some kind of multiple author checkout system for updates, something like CVS or SVN, maybe a bit WIKI-like. It must support a wide choice of document editor, eg Word, OpenOffice.org, LaTeX etc (I don’t know how to combine this with the previous requirement!).
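One way to picture the multiple-author checkout idea is the Python sketch below. It is deliberately naive and entirely hypothetical (the class `SharedDocument` and its methods are invented here): each document part can be held by one author at a time, every check-in is kept as a version, and the latest versions are assembled into a draft. Note that this is a lock-per-part scheme, which is simpler than the merge-based way CVS and SVN normally work, and it says nothing about the different editors people use.

```python
class PartCheckedOut(Exception):
    """Raised when a part is held by someone else, or not held at all."""


class SharedDocument:
    def __init__(self, part_names):
        self._order = list(part_names)
        self._versions = {name: [] for name in self._order}  # full history per part
        self._holder = {}  # part name -> author currently editing it

    def checkout(self, part, author):
        """Lock a part for one author and return its latest text."""
        holder = self._holder.get(part)
        if holder is not None and holder != author:
            raise PartCheckedOut(f"{part} is held by {holder}")
        self._holder[part] = author
        history = self._versions[part]
        return history[-1][1] if history else ""

    def checkin(self, part, author, text):
        """Record a new version of the part and release the lock."""
        if self._holder.get(part) != author:
            raise PartCheckedOut(f"{author} has not checked out {part}")
        self._versions[part].append((author, text))
        del self._holder[part]

    def history(self, part):
        """All versions of a part, oldest first, as (author, text) pairs."""
        return list(self._versions[part])

    def assemble(self):
        """Join the latest version of every part that has one, in document order."""
        latest = [self._versions[n][-1][1] for n in self._order if self._versions[n]]
        return "\n\n".join(latest)


doc = SharedDocument(["introduction", "methods"])
doc.checkout("introduction", "sam")
doc.checkin("introduction", "sam", "Why this matters.")
doc.checkout("methods", "alex")
doc.checkin("methods", "alex", "What we did.")
draft = doc.assemble()
```

The stages mentioned above (working draft, pre-print, submitted draft, published version) could be added as a label on each check-in.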

Publisher liaison is maybe controversial. But why shouldn’t the RRS staff (or your library) support you in dealing with publishers? The RRS wants your articles and your data, and should help you negotiate and reserve the rights so that they can get them. So publisher liaison would include rights negotiation, submission to the publisher on your behalf of a specific version, support through the editorial revision process, and recovery of metadata from the published version for the RRS records and your own bibliography, web page and CV. Naturally, deposit in the repository would be integrated in this workflow; you only have to authorise opening to the public, or perhaps a more restricted audience.

Research Repository System identity management

This is the third of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System (RRS). I've suggested a set of elements it should contain; this post covers researcher identity management.

I didn’t want to talk about identity management so early in the piece, but it turned up in so many parts, I had to take it out and talk about it first. In a good RRS, this should Just Work, but at the moment it probably wouldn’t do what I want! I may see the RRS as a special case of an Institutional Repository (IR), but many if not most research collaborations are cross-institutional. This means that if there is to be support for cross-institutional authoring, there has to be support for members of other institutions to log in to your RRS. And this has to be seamless and easy, ie done without having to acquire new identities.

In addition, Researcher Identity should provide name control, that is, it knows who you are and will fill in a standardised version of your name in appropriate places. It should know your affiliation (institution, department/school, group, project and/or possibly work package). It might know some default tags for your work (eg Chris is normally talking about "digital curation"). However, this naming support must extend beyond your institution, so that collaborators and co-authors can be first-class users of other features. And it should relate to your (and their) standard institutional username and credentials; nothing extra to remember. This implies (I think) something like Shibboleth support.
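A tiny Python sketch may help show what name control could mean in practice. It is an assumption-laden illustration only: `Researcher`, `IdentityRegistry`, `prefill` and the example ids are all invented, and the sketch does not talk to Shibboleth or any real identity service. It simply assumes that each person arrives with an institutional login of the form user@institution, and shows the system filling in a standardised name, affiliation and default tags for a deposit, for local users and outside collaborators alike.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Researcher:
    scoped_id: str           # institutional login, eg "chris@home.example"
    display_name: str        # the standardised form of the name
    affiliation: tuple       # (institution, department, group, ...)
    default_tags: tuple = ()


class IdentityRegistry:
    """Hypothetical lookup from an institutional login to known details."""

    def __init__(self):
        self._people = {}

    def register(self, researcher):
        self._people[researcher.scoped_id] = researcher

    def prefill(self, scoped_id, extra_tags=()):
        """Return deposit metadata filled in from what is known about the person."""
        person = self._people[scoped_id]
        # dict.fromkeys removes duplicate tags while keeping their order.
        tags = list(dict.fromkeys(person.default_tags + tuple(extra_tags)))
        return {
            "creator": person.display_name,
            "affiliation": " / ".join(person.affiliation),
            "tags": tags,
        }


registry = IdentityRegistry()
registry.register(Researcher(
    scoped_id="chris@home.example",
    display_name="Researcher, C.",
    affiliation=("Home University", "School of Informatics"),
    default_tags=("digital curation",),
))
# A collaborator from another institution is registered in exactly the same way.
registry.register(Researcher(
    scoped_id="alex@partner.example",
    display_name="Colleague, A.",
    affiliation=("Partner University", "Data Lab"),
))

metadata = registry.prefill("chris@home.example", extra_tags=("repositories", "digital curation"))
```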

This is getting kind of complicated, and verging towards another complex realm of Current Research[er] Information Systems (CRIS). These worthy systems also aim to make your life easier by knowing all about you, and linking your identity and work together. But they are complex, have their own major projects and standards, and have been going for years without much impact that I can see, except in a few cases. The RRS should take account of EuroCRIS and CERIF (see Wikipedia page) as far as they might apply.

Research Repository System web orientation

This is one of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System. I've suggested a set of elements it should contain; this post covers web orientation.

By web orientation, I mean that the RRS should be Web 2.0-like, being user-centred in that it knows who you are, and exploits that knowledge, and is very easy to use. It should be oriented towards sharing, but (as you’ll see) in a very controlled way; that is, items should be sharable with varying groups of people from close colleagues, through unknown reviewers, to the general public. I think simple tagging rather than complex metadata should be supported. Maybe some kind of syndication/publishing support like RSS or Atom. The RRS should have elements with first class Semantic Web capabilities, supporting RDF. And because research and education environments are so very varied, and so highly tailored to individual circumstances (even if this is only based on the personal preferences of the PhD student who left a few years ago!), the RRS should be highly configurable, based on substitutable components, and able to integrate with any workflow.

Although the idea is linked in my mind to institutional or laboratory repositories, there are aspects that seem to me to require a service at above the institutional level (where researchers from different institutions work together, for example). It could be that some genuine Web 2.0 entrepreneurship might provide this!

Negative Click, Positive Value Research Repository Systems

I promised to be more specific about what I would like to see in repositories that presented more value for less work overall, by offering facilities that allow it to become part of the researcher’s workflow. I’m going to refer to this as “the Research Repository System (RRS)” for convenience.

At the top of this post is a mind map illustrating the RRS. A more complete mind map (in PDF form) is accessible here.

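The main elements that I think the RRS should support are (not in any particular order):

web orientation
researcher identity management
authoring support
object disclosure control
data management support
persistent storage
full preservation archive, and
spinoffs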

Here’s a quick scenario to illustrate some of this. Sam works in a highly cross-disciplinary laboratory, supported by a Research Repository System. Some data comes from instruments in the lab, some from surveys that can be answered in both paper and web form, some from reading current and older publications. All project files are kept in the Persistent Storage system, after the disaster last year when both the PIs lost their laptops from a car overseas, and much precious but un-backed-up data were lost. The data are managed through the RRS Data Management element, and Sam has requested a checkpoint of data in the system because the group is near finalising an article, and they want to make sure that the data that support the article remain available, and are not over-written by later data.

Sam is the principal author, and has contributed a significant chunk of the article, along with a colleague from their partner group in Australia; colleagues from this partner group have the same access as members of Sam’s group. Everyone on the joint author list has access to the article and contributes small sections or changes; the author management and version control system does a pretty good job of ensuring that changes don’t conflict. The article is just about to be submitted to the publisher, after the RRS staff have negotiated the rights appropriately, and Sam is checking out a version to do final edits on the plane to a conference in Chile.

None of the data are public yet, but they are expecting the publisher to request anonymous access to the data for the reviewers they assign. Disclosure control will make selected check-pointed data public once the article is published. Some of the data are primed to flow through to their designated Subject Repository at the same time.

One last synchronisation of her laptop with the Persistent Storage system, and Sam is off to get her taxi downstairs…

This blog post is really too big if I include everything, so I [have released] separate blog posts for each Research Repository System element, linking them all back to this post... and then come back here and link each element above to the corresponding detailed bits.

OK I’m sure there’s more, although I’m not sure a Research Repository System of this kind can be built for general use. Want one? Nothing up my sleeve!
