Digital Curation Blog: Workflows

Monday, 15 September 2008

David de Roure on "the new e-Science"

I was at the eScience All Hands meeting last week, and unfortunately missed a presentation by David de Roure on the New e-Science, an update on a talk he gave 10 months ago. The slides are available on Slideshare, but David has agreed I can share his summary:

"1. Increasing scale and diversity of participation
Decreasing cost of entry into digital research means more people, data, tools and methods. Anyone can participate: researchers in labs, archaeologists in digs or schoolchildren designing antimalarial drugs. Citizen science! Improved capabilities of digital research (e.g. increasing automation, ease of collaboration) incentivises this participation. "You're letting the oiks in!" people cry, but peer review benefits from scale of participation too. "Long Tail Science"

2. Increasing scale and diversity of data
Deluge due to new experimental methods (microarrays, combinatorial chemistry, sensor networks, earth observation, ...) and also (1). Increasing scale, diversity and complexity of digital material, processed separately and in combination. New digital artefacts like workflows, provenance, ontologies and lab books. Context and provenance essential for re-use, quality and trust. Digital Curation challenge!

3. Sharing
Anyone can play and they can play together. Anyone can be a publisher as well as a consumer - everyone's a first class citizen. Science has always been a social process, but now we're using new social tools for it. Evidenced by use of wikis, blogs, instant messaging. The lifecycle goes faster, we accelerate research and reduce time-to-experiment.

4. Collective Intelligence
Increasing participation means network effects through community intelligence: tagging, reviewing, discussion. Recommendation based on usage. This is in fact the only significant breakthrough in distributed systems in the last 30 years. Community curation: combat workflow decay!

5. Open Research
Publicly available data but also the open services and software tools of open science. Increasing adoption of Science Commons, open access journals, open data and linked data (formerly known as Semantic Web), PLoS, ... Open notebook science

6. Sharing Methods
Scripts, workflows, experimental plans, statistical models, ... Makes research repeatable, reproducible and reusable. Propagates expertise. Builds reputation. See Usefulchem, myExperiment.

7. Empowering researchers
Increasing facility with new tools puts the researchers in control - of their software/data apparatus and their experiments. Empowerment enables creativity and creation of new, sharable methods. Tools that take away autonomy will be resisted. Beware accidental disempowerment! Ultimately automation frees the researcher to do what they're best at, but can also be disempowering.

8. Better not perfect
Researchers will choose tools that are better than what they had before but not necessarily perfect. This force encourages bottom-up innovation in the practice of research. It opposes the adoption of over-engineered computer science solutions to problems researchers don't know they have and perhaps never will.

9. Pervasive deployment
Increasingly rich intersection between the physical and digital worlds through devices and instruments. Web-based interfaces not software downloads. Shift towards devices and the cloud. REST architecture coupling components that transcend their application.

10. Standing on the shoulders of giants
e-Science is now enabling researchers to do some completely new stuff! As the pieces become easy to use, researchers can bring them together in new ways and ask new questions. Boundaries are shifting, practice is changing. Ease of assembly and automation is essential."

The presentation is well worth looking at as well for the extra material David includes. I thought the more open and inclusive approach to e-Science (or Cyber-infrastructure) was well worth including here. The word "heroic" appears on his slides in relation to the Grid, which sums up my concerns, I think!

Wednesday, 2 July 2008

Research Repository System data management

This is the sixth of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System. I've suggested it should contain these elements:

Data management support is where this starts to link more strongly back to digital curation. Bear in mind here, this is a Research Repository System; not all of these functions, or the next group, need to be supported by anything that looks like one of the current repository implementations! I’m not quite clear on all of the features you might need here, but we are beginning to talk about a Data Repository.

It is essential that the Data Management elements support current, dynamic data, not just static data. You may need to capture data from instruments, process it through workflow pipelines, or simply sit and edit objects, eg correcting database entries. Data Management also needs to support the opposite: persistent data that you want to keep un-changed (or perhaps append other data to while keeping the first elements un-changed).

One important element could be the ability to check-point dynamic, changing or appending objects at various points in time (eg corresponding to an article). In support of an article, you might have a particular subset available as supplementary data, and other smaller subsets to link to graphs and tables. These checkpoints might be permanent (maybe not always), and would require careful disclosure control (for example, unknown reviewers might need access to check your results, prior to publication).
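To make the check-pointing idea a little more concrete, here is a minimal Python sketch. It assumes the simplest possible case: a dataset that is an append-only list of records held in memory. The class and method names (`Dataset`, `checkpoint`, `view`) and the "article-submission" label are hypothetical, invented for illustration, and not taken from any real repository system.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical append-only dataset with named checkpoints."""
    records: list = field(default_factory=list)
    checkpoints: dict = field(default_factory=dict)  # label -> record count at that time

    def append(self, record):
        self.records.append(record)

    def checkpoint(self, label):
        # Remember how many records existed when the checkpoint was taken.
        if label in self.checkpoints:
            raise ValueError(f"checkpoint {label!r} already exists")
        self.checkpoints[label] = len(self.records)

    def view(self, label):
        # Return the records as they stood when the checkpoint was taken.
        return list(self.records[: self.checkpoints[label]])


data = Dataset()
data.append({"sample": 1, "value": 0.42})
data.append({"sample": 2, "value": 0.57})
data.checkpoint("article-submission")
data.append({"sample": 3, "value": 0.61})  # later data does not disturb the checkpoint

print(len(data.view("article-submission")))
print(len(data.records))
```

This only works because records are appended and never edited. Objects that are corrected in place would need copies or proper versioning behind each checkpoint, and disclosure control would sit on top of whatever `view` returns.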

Some parts of Data Management might support laboratory notebook capabilities, keeping records with time-stamps on what you are doing, and automatically providing contextual metadata for some of the captured datasets. Some of these elements might also provide some Health and Safety support (who was doing what, where, when, with whom and for how long).

Negative Click, Positive Value Research Repository Systems

I promised to be more specific about what I would like to see in repositories that presented more value for less work overall, by offering facilities that allow it to become part of the researcher’s workflow. I’m going to refer to this as “the Research Repository System (RRS)” for convenience.

At the top of this post is a mind map illustrating the RRS. A more complete mind map (in PDF form) is accessible here.

The main elements that I think the RRS should support are (not in any particular order):

Here’s a quick scenario to illustrate some of this. Sam works in a highly cross-disciplinary laboratory, supported by a Research Repository System. Some data comes from instruments in the lab, some from surveys that can be answered in both paper and web form, some from reading current and older publications. All project files are kept in the Persistent Storage system, after the disaster last year when both the PIs lost their laptops from a car overseas, and much precious but un-backed-up data were lost. The data are managed through the RRS Data Management element, and Sam has requested a checkpoint of data in the system because the group is near finalising an article, and they want to make sure that the data that support the article remain available, and are not over-written by later data.

Sam is the principal author, and has contributed a significant chunk of the article, along with a colleague from their partner group in Australia; colleagues from this partner group have the same access as members of Sam’s group. Everyone on the joint author list has access to the article and contributes small sections or changes; the author management and version control system does a pretty good job of ensuring that changes don’t conflict. The article is just about to be submitted to the publisher, after the RRS staff have negotiated the rights appropriately, and Sam is checking out a version to do final edits on the plane to a conference in Chile.

None of the data are public yet, but they are expecting the publisher to request anonymous access to the data for the reviewers they assign. Disclosure control will make selected check-pointed data public once the article is published. Some of the data are primed to flow through to their designated Subject Repository at the same time.

One last synchronisation of her laptop with the Persistent Storage system, and Sam is off to get her taxi downstairs…

This blog post would really be too big if I included everything, so I [have released] separate blog posts for each Research Repository System element, linking them all back to this post, and will then come back here to link each element above to the corresponding detailed bits.

OK I’m sure there’s more, although I’m not sure a Research Repository System of this kind can be built for general use. Want one? Nothing up my sleeve!

Sunday, 15 June 2008

Reaction to Negative Click Repositories

I was very pleased with reactions to the Negative Click Repositories post. I’ll come back to the idea itself in a later post, but this just attempts to gather some of the comments together (the first few are comments to the original post, but I know my blogreader doesn’t expose those easily).

Owen from Imperial writes

“We clearly need to look at ways of adding value (for the researcher) to anything deposited, as well as minimising the barriers to deposit (perhaps following up on the suggestion of integrating 'repositories' into the authoring process).”

Chris Clouser reported they had been struggling with similar ideas:

“We tried to come up with some way to identify what it was, and eventually settled on the phrase "repository-as-scholarly-workbench." It's not perfect, but it worked for us.”

Gavin Baker wrote:

“I'm very interested in ways to integrate deposit with the normal workflow of authors. One thought would be to set up a virtual drive and mount it throughout campus (and give authors instructions on how to mount it from home). Then, when the author is done composing the article, they could simply save a copy to their folder on the virtual drive. Whatever was saved in these folders would be queued for deposit in the IR. (This is already how many people use FTP/SSH to upload files to their Web space.)

“Thinking ahead, could you set up autosave in MS Word, OpenOffice, etc. to save each revision to the virtual drive? This would completely integrate deposit in the workflow, requiring no additional steps for deposit, but would need the author to approve the "final" draft for public access.”
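Gavin's "save to a folder and it gets queued" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a temporary directory stands in for the mounted virtual drive, the function name `queue_new_files` and the file names are made up, and "queued" means nothing more than being added to an in-memory set. No real repository software is involved.

```python
import tempfile
from pathlib import Path


def queue_new_files(folder, already_queued):
    """Return names of files in `folder` not yet queued, and mark them as queued."""
    new_files = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.name not in already_queued:
            already_queued.add(path.name)
            new_files.append(path.name)
    return new_files


# A temporary directory stands in for the author's folder on the virtual drive.
with tempfile.TemporaryDirectory() as drive:
    queued = set()
    Path(drive, "article-v1.doc").write_text("first draft")
    first_pass = queue_new_files(drive, queued)
    Path(drive, "article-final.doc").write_text("final draft")
    second_pass = queue_new_files(drive, queued)

print(first_pass)
print(second_pass)
```

Queuing is deliberately separate from publishing here. As Gavin notes, the author would still need to approve the "final" draft before anything became public.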

Moving to blog reactions, Pete Sefton reported on the OR08 work on “zero click deposit” (which I should have given some credit to). He suggests (perhaps tongue in cheek) that I’m unfair in asking for more. He also linked back to earlier thoughts on Workflow 2.0:

“But in a Workflow 2.0 system the IR might not need to have a web ingest. Instead there could be an ingest rule that would fire when a certain context was detected, such as a document attaining a certain state of development, which would be reflected in its metadata. That's the 'data as the driving force'... In this case it's metadata that's doing the driving.”
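A metadata-driven ingest rule of the kind Pete describes might look something like the following Python sketch. The metadata fields (`status`, `author`) and the rule itself are assumptions made up for illustration; a real system would have its own vocabulary for a document's state of development.

```python
def should_ingest(metadata):
    """Hypothetical rule: fire once a document is marked final and has an author."""
    return metadata.get("status") == "final" and bool(metadata.get("author"))


documents = [
    {"title": "Draft chapter", "author": "A. Candidate", "status": "draft"},
    {"title": "Accepted paper", "author": "A. Candidate", "status": "final"},
]

# Only documents whose metadata satisfies the rule flow on to the repository.
to_ingest = [doc["title"] for doc in documents if should_ingest(doc)]
print(to_ingest)
```

The point of the sketch is that nobody clicks "deposit". The change in metadata is the trigger.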

Next Pete talked about his participation in a JISC project (with Cambridge and some others):

“In TheOREM we’re going to set up ICE as a ‘Thesis Management System’ where a candidate can work on a thesis which is a true mashup of data and document aka a datument... When it’s done and the candidate is rubber-stamped with a big PhD, the Thesis management system will flag that, and the thesis will flow off to the relevant IR and subject repositories, as a fully-fledged part of the semantic web, thanks to embedded semantics and links to data.”

“One more idea: consider repository ingest via Zotero or EndNote. In writing this post I just went to the OR08 repository and used Zotero to grab our paper from there so I could cite it here. It would be cool to push it to the USQ ePrints with a single click.”

Then under the heading of “Net-benefit repository workflows” he cautions

“I think it is a mistake to focus on the Negative Click repository as a goal and assume that it’s going to happen without changing user behavior…
If we can seed the change by empowering some researchers to perform much better – to more easily write up their research, to begin to embed data and visualizations, to create semantic-web-ready documents – then their colleagues will notice and make the change too.”

Les Carr had (as usual) some sensible remarks:

“I think I would rather talk about value or profit - the final outcome when you take the costs and benefits into consideration. Do you run a positive value repository? Is it frankly worth the effort? Are your users in scholarly profit, or are you a burden on their already overtaxed resources?”

Les was a bit worried I had taken the Ulysses Acqua scenario too literally. Don't worry, Les, I do get it as fiction, just a particularly apposite one in many cases, if not for your repository. Les concludes

“Anyway, I think that I am in violent agreement with Chris, so to show solidarity I will do what he asked and list some positive value generators: publicity and profile (CVs, Web pages, displays, adverts for MScs/PhDs/staff), community discovery, laptop backup and asset management.”

Thanks, Les.

Meanwhile Chris on the Logical Operator has a long, thoughtful post. Go read it! Of the “negative click repository” idea, he writes:

“It’s a laudable goal, but I’m not sure how one would, in practice, implement such a thing, since even the most transparent approach has a direct impact on workflow (the exact thing that most potential contributors appear not to want to mess with - they don’t want to change the way they work). The challenge is to create something that:

* Integrates with any workflow
* Involves no (or essentially no) extra activities on the part of the scholar
* Uses the data structure of the material to create the interrelations critical to really opening access
* Provides a visible and obvious value-add that can be explained without looking under the hood
* Is not just a wrapper on some existing tool”

Chris also likes the idea of the “scholarly workbench” approach, and thinks this might include:

“Storage: the system (or aggregate) would need to store copies of the relevant digital object in such a way that it could be reliably retrieved by any user. Most any system - Writeboard, SlideShare, even wiki software have this capability.

Interoperation: the system would need to allow other systems to locate said material by hooks into the system. Many W2.0 applications allow access directly to their content via syndication.

Metadata: tagging is nearly universal across W2.0 applications - it’s how people find things, the classic “tag-based folksonomy” approach.

Sharing: again, this is the entire point of most W2.0 applications. It’s not fun if you can’t share stuff. Community is the goal of most of the common web applications.

Ease of use: typically, Web 2.0 applications are designed to get you up and running quickly. You do very little to establish your account, and there are no complex permissions or technical fiddling to be done.”

Last but one (for this post, anyway), Nico Adams has some more thoughts on Staudinger’s Molecules. Again, the whole post is worth a read; I have not yet looked at the piece of technology he is playing with:

“I have been wondering for a while now, whether the darling of the semantic web community, namely Radar Network’s Twine, could not be a good model for at least some of the functionality that an institutional repo should/could have. It takes a little effort to explain what Twine is, but if you were to press me for the elevator pitch, then it would probably be fair to say that Twine is to interests and content, what Facebook is to social relationships and LinkedIn to professional relationships.”

There’s a lot more about Twine, including some screen shots, so go have a read.

Finally, a post that was not a reaction, but might have been. I should have given credit to Andy Powell, who has been questioning the current implementations of repositories (despite being joint author of the JISC Repositories Roadmap [.doc]). Andy wants repositories to be more consistent with the web architecture. He spoke at a Talis workshop recently; his slides are here (on Slideshare, one of his models for a repository). Cameron Neylon on Science in the Open was also at the event, and blogged about Andy’s talk. He also wrote:

“But the key thing is that all of this [work to deposit] should be done automatically and must not require intervention by the author. Nothing drives me up the wall more than having to put the same set of data into two subtly different systems more than once. And as far as I can see there is no need to do so. Aggregate my content automatically, wrap it up and put it in the repository, but I don’t want to have to deal with it. Even in the case of peer reviewed papers it ought to be feasible to pull down the vast majority of the metadata required. Indeed, even for toll access publishers, everything except the appropriate version of the paper. Send me a polite automated email and ask me to attach that and reply. Job done.

For this to really work we need to take an extra step in the tools available. We need to move beyond files that are simply ‘born digital’ because these files are in many ways still born. This current blog post, written in Word on the train is a good example. The laptop doesn’t really know who I am, it probably doesn’t know where I am, and it has no context for the particular word document I’m working on. When I plug this into the Wordpress interface at OpenWetWare all of this changes. The system knows who I am (and could do that through OpenID). It knows what I am doing (writing a Blog post) and the Zemanta Firefox plug in does much better than that, suggesting tags, links, pictures and keywords.

Plugins and online authoring tools really have the potential to automatically generate those last pieces of metadata that aren’t already there. When the semantics comes baked in then the semantic web will fly and the metadata that everyone knows they want, but can’t be bothered putting in, will be available and re-useable, along with the content. When documents are not only born digital but born on and for the web then the repositories will probably still need to trawl and aggregate. But they won’t have to worry me about it. And then I will be a happy depositor.”

Well Cameron, we want your content! So overall, there’s something here. I like the idea of positive value better than negative clicks, but I haven’t got a neat handle on it yet. Scholar’s workbench doesn’t really thrill me; I see what they mean, but the term is too broad, and scientists don’t always respond to the scholar term. I’ll keep thinking!

Tuesday, 10 June 2008

Negative click repositories

I wanted to write a bit more about this idea of a negative click repository (negative cost was a bad name, as there is a real positive $ and £ cost to the repository, rather than the depositor). First some ancient history...

When I joined the University of Glasgow in 2000, the Archives and Business Records Centre with other collaborators within the University were near the end of a short project on Effective Records Management (ERM, http://www.gla.ac.uk/infostrat/ERM/). During the course of that project, they surveyed committee clerks (who create many authoritative institutional records) on how much effort they were willing to put in, how many clicks they were willing to invest, to create records that would be easily maintainable in the digital era. The answer was: zero, none, nada! Rather than give up at this point, the team went on to create CDocS (http://www.gla.ac.uk/infostrat/ERM/Docs/ERM-Appendix2.pdf), an instrumented addition to MS Word that allowed the committee clerks to create their documents in university standard forms, with agreed metadata, with the documents and metadata automatically converted into XML for preservation and to HTML for display and sharing. ICE (see below) might be a contemporary system of a related kind, in a slightly different area. Thanks to James Currall for updating me on ERM and CDocS.

In April 2007, Peter Murray-Rust had an epiphany thinking about repositories on the road to Colorado, realising that SourceForge was a shared repository that he had been using for years, and speculating that it might be used for writing an article. The tool for managing versions and sharing in SourceForge is SVN… Peter wrote about the complex workflow in writing a collaborative article, but then wrote:

“BUT using SVN it’s trivial - assuming there is a repository. So we do not speak of an Institutional Repository, but an authoring support environment (ASE or any other meaningless acronym.) A starts a project in institutional SVN. B joins, so do C, D, E, etc. They all edit the m/s. Everyone sees the latest version. The version sent to the publisher is annotated as such (this is trivial). All subsequent stuff is tracked automatically. When the paper is published, the institution simply liberates the authorised version - the authors don’t even need to be involved. The attractive point of this - over simple deposition - is that the repository supports the whole authoring process.”

Many of those who left comments disagreed that the technology would work directly as suggested, for various reasons. Google Docs was mentioned as an alternative (still flawed). Peter Sefton mentioned ICE (and Murray-Rust subsequently visited USQ to work briefly with the ICE team).

You may also remember Caveat Lector’s series of personae representing stakeholders in the repository game at fictional Achaea University, that I reported on before. Ulysses Acqua was her repository manager, and here’s a quote from his attempts to explain the advantages of his repository to faculty; they ask:

“Can it produce CVs or departmental-activity reports automatically? No. Can it be tweaked so that the Basketology collection looks like the Basketology website? No. (The software can do that, in fact, but Ulysses can’t.) Can it talk to campus IT’s file-storage-cum-website servers? No. Can it harvest faculty articles from disciplinary repositories? No. Can it deliver records straight to Achaea Library’s catalogue? No. Can it have access controls per item, such that items are shared with specific people only, with the list controlled by the depositor? No. Can it embargo items, for a certain length of time or indefinitely? No. Can it read a citation, check rights on the journal, and go fetch the paper if rights are cleared? Dream on. Can it restrict items for campus-only access by IP address? No. Does it talk to RefWorks and Zotero and similar bibliographic managers? No. Does it do version control? No.”

The problem here is that the repository adds work, it doesn’t take it away (there are other examples of this in some of the other personae). And overloaded people don’t accept extra work. They may promise to, but they (mostly) don’t do it.

Finally, I posted earlier on Nico Adams’s comments on repositories for scientists. He got stuck, he said: "I had to explain what a repository is from the point of view of functionality - when talking to non-specialist managers, it is the only way one can sell and explain these things…they do not care about the technological perspective…the stuff that’s under the hood. I found it impossible to explain what the value proposal of a repository is and how it differentiates itself in terms of its functionality from, say, the document management systems in their respective companies." It’s worth repeating a couple of extracts from his conclusions:

"Now today it struck me: a repository should be a place where information is (a) collected, preserved and disseminated, (b) semantically enriched, (c) on the basis of the semantic enrichment put into a relationship with other repository content to enable knowledge discovery, collaboration and, ultimately, the creation of knowledge spaces."

And further:

"Now all of the technologies for this are, in principle, in place: we have DSpace, for example, for storage and dissemination, natural language processing systems such as OSCAR3 or parts of speech taggers for entity recognition, RDF to hold the data, OWL and SWRL for reasoning. And, although the example here, was chemistry specific, the same thing should be doable for any other discipline. … Yes it needs to be done in a subject specific manner. Yes, every department should have an embedded informatician to take care of the data structures that are specific for a particular discipline. But the important thing is to just make the damn data do work!"

So we need to develop repositories that make the data work to take human work away: negative click repositories. Or maybe not… (sorry about this). I’m just a bit concerned that we might get a set of monumental edifices built on DSpace or ePrints.org foundations, resulting in “take it or leave it” decisions. Institutional information environments are highly tailored (not always carefully) and at the department level even more eclectic, and things have to fit together. Maybe as Nico was suggesting, what we need is an array of tools, connected together by technologies like Atom/RSS and/or OAI-ORE that can be configured so as to link the components into an information management system that works to reduce the publishing effort on campus, and captures the intellectual product on the way.
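As a small illustration of what "connected together by Atom" could mean in practice, here is a Python sketch in which one component publishes an Atom feed and another reads it. The feed content, the `urn:example:` identifiers and the function name `list_entries` are all invented for the example, and it covers only plain Atom, not OAI-ORE.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # the Atom XML namespace

# A made-up feed, standing in for what a departmental tool might publish.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example departmental outputs</title>
  <entry>
    <title>Working paper on basket weaving</title>
    <id>urn:example:paper-1</id>
    <updated>2008-06-10T09:00:00Z</updated>
  </entry>
  <entry>
    <title>Dataset supporting the working paper</title>
    <id>urn:example:data-1</id>
    <updated>2008-06-10T09:05:00Z</updated>
  </entry>
</feed>
"""


def list_entries(feed_xml):
    """Return (id, title) pairs for each entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    return [
        (entry.findtext(ATOM + "id"), entry.findtext(ATOM + "title"))
        for entry in root.findall(ATOM + "entry")
    ]


for entry_id, title in list_entries(FEED):
    print(entry_id, title)
```

The attraction is that the publishing tool and the harvesting tool need to agree only on the feed format, not on each other's internals, which is what makes a loosely coupled array of tools plausible.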

People appear to have lots of ideas on what negative click repositories might do. We could have tools for supporting the scientist in their information sharing (web sites, bibliographies and CVs). Tools for shared data management. Tools for shared writing. Tools even for the library to support faculty in dealing with publishers. And of course tools to help management count their beans. I’d like to begin to collect and order these and other suggestions if possible, so please leave comments or tag your blogs with “negative click”…

Monday, 21 April 2008

Institutional Repository Checklist for Serving Institutional Management

DCC News (http://www.dcc.ac.uk/, news item visible on 21 April 2008) draws our attention to this interesting paper:

Comments are requested on a draft document from the presenters of the "Research Assessment Experience" session at the EPrints User Group meeting at OR08. The "Institutional Repository Checklist for Serving Institutional Management" lists 13 success criteria for repository managers to be aware of when making plans for their repositories to provide any kind of official research reporting role for their institutional management.

Find out more (Note there are at least 3 versions, so comments have already been incorporated).

I liked this: "The numeric risk factors in the second column reflect the potential consequences of failure to deliver according to an informal nautical metaphor: (a) already dead in the water, (b) quickly shipwrecked and (c) may eventually become becalmed."

This kind of work is important; repositories have to be better at being useful tools for all kinds of purposes before they will become part of the researcher's workflow...

Friday, 14 December 2007

Murray-Rust on Digital Curation Conference day 2

Peter Murray-Rust has written two blog posts (here and here; I'm not sure if those are permanent URLs...) about day 2 of the International Digital Curation Conference in Washington DC. Thanks, Peter.

In the first post, he began:

"There is definitely an air of optimism in the conference - we know the tasks are hard and very very diverse but it’s clear that many of them are understood."

He then picked up on Carole Goble's presentation on workflows. Here are a few random extracts from Peter's random jottings (his description):

"The great thing about Carole is she’s honest. Workflows are HARD. They are expensive. There are lots of them. None of them does exactly what you want. And so on. [PMR: We did a lot of work - by our standards - on Taverna but found it wasn’t cost-effective at that stage. Currently we script things and use Java. Someday we shall return.]"

"myExperiment.org. A collaborative site for workflows. You can go there and find what you want (maybe) and find people to talk to. “- bazaar for workflows, encapsulated objects (EMO) single WFs or collections, chemistry data with blogged log book, encapsulated experimental objects Open Linked Data linked initiative…"

"Scientists do not collaborate - scientists would rather share a toothbrush [...] than gene names (Mike Ashburner)
who gets the credit? - who is allowed to update? Changing metadata rather than data. Versioning. Have to get credit and reputation managed. Scientists are driven by money, fame, reputation, fear of being left behind"

"Annotations are first class citizens"

His second post covers Jane Hunter and Kwok Cheung's presentation on compound document objects (CDOs):

"Increasing pressure to share and publish data while maintaining competitiveness.
Main problem lack of simple tools for recording, publishing, standards.
What is the incentive to curate and deposit? What granularity? concern for IP and ownership"

"Current problems with traditional systems - little semantic relationship, little provenance, little selectivity, interactivity, flexibility and often fixed rendering and interfaces. No multilevel access. either all open or all restricted
usually hardwired presentation"

"Capture scientific provenance through RDF (and can capture events in physical and digital domain)
Compound Digital Objects - variable semantics, media, etc.
Typed relationships within the CDOs. (this is critical)"

"SCOPE [the tool Jane & Kwok have developed is] a simplified tool for authoring these objects. Can create provenance graphs. Infer types as much as possible. RSS notification. Comes with a graphical provenance explorer."

Thanks again, Peter!
