Digital Curation Blog: November 2008

Sunday, 23 November 2008

Edinburgh IT futures workshop

I spent Friday morning at a workshop entitled “Research, wRiting and Reputation” at Edinburgh University. Pleasingly, data form quite a large part of the programme (although I’ll miss the afternoon talks, including Simon Coles from Southampton). First speaker is Prof Simon Tett of Geosciences talking about data curation and climate change. Problem is that climate change is slow, relative to changes in the observation systems, not to mention changes in computation and data storage. Baselines tend to be non-digital, eg 19th century observations from East India company shipping: such data need to be digitised to be available for computation. By comparison, almost all US Navy 2nd World War meteorological log books were destroyed (what a loss!). He’s making the point that paper records are robust (if cared for in archives) but digital data are fragile (I have problems with some of these ideas, as you may know: both documents and digital data have different kinds of fragility and robustness… the easy replicability of data compared with paper being a major advantage). Data should be looked after by organisations with a culture of long term curatorship: not researchers but libraries or similar organisations. The models themselves need to be maintained to be usable, although if continually used they can continue to be useful (but not for so long as the data). Not just the data, but the models need curation and provenance.

Paul Anderson of Informatics talking about software packages as research outputs. His main subject is a large scale UNIX configuration system, submitted by Edinburgh as part of the RAE. It’s become an Open Source system, so there are contributions from outside people in it as well. Now around 23K lines of code. Interesting to be reminded how long-lived, how large such projects can be, and how magically sustainable some of the underlying infrastructure appears to be. What would happen if SourceForge or equivalents failed? However, it does strike me that there are lessons to be learned by the data curation community from the software community, particularly from these large scale, long lived open source projects. Quite what these lessons are, I’m not sure yet!

Four speakers followed, three from creative arts (music composition and dance), and the fourth from humanities. The creative arts folks partly had problems with the recognition (or lack of recognition) of their creative outputs as research by the RAE panels. Their problems were compounded by a paucity of peer-reviewed journals, and the static, paper-oriented nature even of eJournals. The composer had established a specialist music publisher (Sumtone), since existing publishers are reluctant to handle electronic music. Interestingly, they offer their music with a Creative Commons licence.

The dance researchers had different problems. The complexity relates to the temporality of dance. It’s also a young discipline, only 25 years or so, whereas dance is a very ancient form of art (although of course documentation of that is static). Dance notation is very specialised; many choreographers and many dancers cannot notate! Only a very small set of journals are interested, few books etc. Lots of online resources, however. Intellectual property and performance rights are major issues for them.

The final researcher in this group was interested in integrating text with visual data and multimodal research objects. Her problems seemed to boil down to limitations of traditional journals in terms of the numbers and nature (colour, size, location etc) of images allowed; these restrictions themselves affected the nature of the argument she was able to make.

OK, these last few are at least partly concerned at the inappropriateness of static journals for their disciplines. Even 99% of eJournals are simply paper pages transported in PDF over the Internet. Why this still should be, 12 years after Internet Archaeology started publishing, as a peer-reviewed eJournal specifically designed to exploit the power of the Internet to enhance the scholarly content, beats me! Surely, just after the last ever RAE is exactly the time to create multimedia peer-reviewed journals to serve these disciplines. They’ll take time to get established, and to build up their citations and impact-factor (or equivalent). Does the new REF really militate so strongly against this? What else have I missed?

Thursday, 20 November 2008

Curation services based on the DCC Curation Lifecycle Model

I’ve had a go at exploring some curation services that might be appropriate at different stages of a research project. I thought it might also be worth trying to explore curation services suggested by the DCC Curation Lifecycle Model, which I also mentioned in a blog post a few weeks ago.

The model claims that it “can be used to plan activities within an organisation or consortium to ensure that all necessary stages are undertaken, each in the correct sequence… enables granular functionality to be mapped against it; to define roles and responsibilities, and build a framework of standards and technologies to implement”. So it does look like an appropriate place to start.

At the core of the model are the data of interest. These should absolutely be determined by the research need, but there are encodings and content standards that improve the re-usability and longevity of your data (an example might be PDF/A, which strips out some of the riskier parts of PDF and which should make the data more usable for longer).

Curation service: collected information on standards for content, encoding etc in support of better data management and curation.
Curation service: advice, consultancy and discussion forums on appropriate formats, encodings, standards etc.

The next few rings of the model include actions appropriate across the full lifecycle. “Description and Representation Information” is maybe a bit of a mouthful, but it does cover an important step, and one that can easily be missed. I do worry that participants in a project in some sense “know too much”. Everyone in my project knows that the sprogometer raw output data is always kept in a directory named after the date it was captured, on the computer named Coral-sea. It’s so obvious we hardly need to record it anywhere. And we all know what format sprogometer data are in, although we may have forgotten that the manufacturer upgraded the firmware last July and all the data from before was encoded slightly differently. But then 2 RAs leave and the PI is laid up for a while, and someone does something different. You get the picture! The action here is to ensure you have good documentation, and to explicitly collect and manage all the necessary contextual and other information (“metadata” in library-speak, “representation information” when thinking of the information needed to understand your data long term).
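
One way to stop that kind of knowledge living only in people's heads is to save a small machine-readable record next to each day's data. The following is only a sketch under assumptions: the field names, the JSON layout and the placeholder values are all hypothetical, not a DCC or community standard, and a real project would follow whatever metadata conventions its domain already has.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import date


@dataclass
class CaptureContext:
    """Hypothetical sidecar record for one day's raw instrument output."""
    instrument: str        # e.g. "sprogometer"
    host: str              # machine holding the raw data
    captured_on: str       # ISO date, matching the directory name
    data_format: str       # free-text description of the encoding
    firmware_version: str  # recorded because encodings can change with firmware
    notes: list = field(default_factory=list)


def write_context(record: CaptureContext) -> str:
    """Serialise the record as JSON text, ready to save beside the data."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# All values below are illustrative placeholders.
example = CaptureContext(
    instrument="sprogometer",
    host="Coral-sea",
    captured_on=date(2008, 11, 20).isoformat(),
    data_format="raw output, manufacturer's encoding (placeholder description)",
    firmware_version="placeholder",
    notes=["Firmware upgraded last July; earlier data encoded slightly differently."],
)
print(write_context(example))
```

The point is less the format than the habit: if the record is written at capture time, the firmware change is documented before anyone has had the chance to forget it.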

Curation service: sharable information on formats and encodings (including registries and repositories), plus the standards information above.
Curation service: Guidance on good practice in managing contextual information, metadata, representation information, etc.

Preservation (and curation) planning are necessary throughout the lifecycle. Increasingly you will be required by your research funder to provide a data management plan.

Curation service: Data Management Plan templates
Curation service: Data Management Plan online wizards?
Curation service: consultancy to help you write your Data Management Plan?

Community Watch and Participation is potentially a key function for an organisation standing slightly outside the research activity itself. The NSB Long-lived data report suggested a “community proxy” role, which would certainly be appropriate for domain curation services. There are related activities that can apply in a generic service.

Curation service: Community proxy role. Participate in and support standards and tools development.
Curation service: Technology watch role. Keep an eye on new developments and, critically, on obsolescence that might affect preservation and curation.

Then the Lifecycle model moves outwards to “sequential actions”, in the lifecycle of specific data objects or datasets, I guess. As noted before, curation starts before creation, at the stage called here “conceptualise: Conceive and plan the creation of data, including capture method and storage options”. Issues here perhaps covered by earlier services?

We then move on to “doing it”, called here “Create and Receive”. Slightly different issues depending on whether you are in a project creating the data, or a repository or data service (or project) getting data from elsewhere. In the former case, we’re back with good practice on contextual information and metadata (see service suggested above); in the latter case, there’s all of these, plus conformance to appropriate collecting policies.

Curation service: Guidance on collection policies?

Appraisal is one of those areas that is less well-understood outside the archival world (not least by me!). You’ve maybe heard the stories that archivists (in the traditional, physical world) throw out up to 97% of what they receive (and there’s no evidence they were wrong, ho ho). But we tend to ignore the fact that this is based on objective, documented and well-established criteria. Now we wouldn’t necessarily expect the same rejection fraction to apply to digital data; it might for the emails and ordinary document files for a project, but might not for the datasets. Nevertheless, on reasonable, objective criteria your lovingly-created dataset may not be judged appropriate for archiving, or selected for re-use. My guess is that this would often be for reasons that might have been avoided if different actions had been taken earlier on. Normally in the science world, such decisions are taken on the basis of some sort of peer review.

Curation service: Guidance on approaches to appraisal and selection.

At this point, the Lifecycle model begins to take on a view as from a Repository/Data Centre. So whereas in the Project Life Course view, I suggested a service as “Guidance on data deposit”, here we have Ingest: the transfer of data in.

Curation service: documented guidance, policies or legal requirements.

Any repository or data centre, or any long-lived project, has to undertake actions from time to time to ensure their data remain understandable and usable as technology moves forward. Cherished tools become neglected, obsolescent, obsolete and finally stop working or become unavailable. These actions are called preservation actions. We’ve all done them from time to time (think of the “Save As” menu item as a simple example). We may test afterwards that there has been no obvious corruption (and will often miss something that turns up later). But we’ve usually not done much to ensure that the data remain verifiably authentic. This can be as simple as keeping records of the changes that were made (part of the data’s provenance).
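
As a rough illustration of "keeping records of the changes that were made", here is a minimal sketch that checksums the data before and after a preservation action and builds a provenance entry. The record fields, the tool name and the toy line-ending migration are my assumptions for illustration; only the standard-library hashlib, json and datetime calls are real.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def provenance_entry(action: str, before: bytes, after: bytes, tool: str) -> dict:
    """Describe one preservation action (hypothetical record layout)."""
    return {
        "action": action,
        "tool": tool,
        "when": datetime.now(timezone.utc).isoformat(),
        "sha256_before": sha256_of(before),
        "sha256_after": sha256_of(after),
    }


# Toy "migration": normalise Windows line endings in a small text dataset.
original = b"temp,12.5\r\ntemp,13.1\r\n"
migrated = original.replace(b"\r\n", b"\n")

entry = provenance_entry(
    action="normalise line endings",
    before=original,
    after=migrated,
    tool="hypothetical-script v0.1",
)
print(json.dumps(entry, indent=2))
```

The checksums do not prove the migration was harmless; they only let someone later confirm which bytes went in and which came out, which is the starting point for any argument about authenticity.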

Curation service: Guidance on preservation actions
Curation service: Guidance on maintaining authenticity
Curation service: Guidance on provenance

I noted previously that storing data securely is not trivial. It certainly deserves thoughtful attention. At worst, failure here will destroy your project, and perhaps your career!

Curation service: Briefing paper on keeping your data safe?

Making your data available for re-use by others isn’t the only point of curation (re-use by yourself at a later date is perhaps more significant). You may need to take account of requirements from your funder or publisher or community on providing access. You will of course have to remain aware of legal and ethical restrictions on making data accessible. The Lifecycle Model calls these issues “Access, Use and Reuse”.

Curation service: Guidance on funder, publisher and community expectations and norms for making data available
Curation service: Guidance on legal, ethical and other limitations on making data accessible
Curation service: Suggestions on possible approaches to making data accessible, including data publishing, databases, repositories, data centres, and robust access controls or secure data services for controlled access
Curation service: provision of data publishing, data centre, repository or other services to accept, curate and make available research data.

Combining and transforming data can be key to extracting new knowledge from them. This is perhaps situation-specific, rather than curation-generic?

The Lifecycle model concludes with 3 “occasional actions”. The first of these is the downside of selection: disposal. The key point is “in accordance with documented policies, guidance or legal requirements”. In some cases disposal may be to a repository, or another project. In some cases, the data are, or perhaps must be destroyed. This can be extremely hard to do, especially if you have been managing your backup procedures without this in mind. Just imagine tracking down every last backup tape, CD-ROM, shared drive, laptop copy, home computer version etc! Sometimes, especially secure destruction techniques are required; there are standards and approaches for this. I’m not sure that using my 4lb hammer on the hard drives followed by dumping in the furnace counts! (But it sure could be satisfying.)

Curation Service: Guidance on data disposal and destruction

We also have re-appraisal: “Return data which fails validation procedures for further appraisal and reselection”. While this mostly just points back to appraisal and selection again, it does remind me that there’s nothing in this post so far about validation. I’m not sure whether it’s too specific, but…

Curation service: Guidance on validation

And finally in the Lifecycle model, we have migration. To be honest, this looks to me as if it should simply be seen as one of the preservation actions already referred to.

So that’s a second look at possible services in support of curation. I’ve also got a “service-provider’s view” that I’ll share later (although it was thought of first).

[PS what's going on with those colours? Our web editor is going to kill me... again! They are fine in the JPEG on my Mac, but horrible when uploaded. Grump.]

Comments on an obsolescence scale, please

Many people in this business know the famous Jeff Rothenberg quote: “It is only slightly facetious to say that digital information lasts forever--or five years, whichever comes first” (Rothenberg, 1995). Having just re-read the article, it’s clear that at that point he was particularly concerned about the fragility of storage media, something that doesn’t concern me for this post. What does concern me is the subject of much of the rest of his article: obsolescence of the data. There’s another very nice quote from the article, accepted gospel now but put so nicely: “A file is not a document in its own right--it merely describes a document that comes into existence when the file is interpreted by the program that produced it. Without this program (or equivalent software), the document is a cryptic hostage of its own encoding” (ibid).

Now the first Rothenberg quote is often assumed to apply in the second case, but there’s clear evidence emerging that data are readable well beyond 5 years. Most files on my hard disk that have been brought forward from previous computers (with very different operating systems) remain readable even after 15 years. Most, but not all; in my case, the exceptions are my early PowerPoint files, created with PowerPoint 4 on a Mac in the 1990s, unreadable with today’s Mac PowerPoint version. However, these files are not completely inaccessible to me, as a colleague with a Windows machine has a different version of PowerPoint that can read these files and save them in a format that I can handle. It’s a comparatively simple, if tedious, migration.

I have spent some time and asked in various fora for evidence of genuinely obsolete, that is completely inaccessible data (Rusbridge, 2006). There are some apocryphal stories and anecdotes, but little hard evidence so far. But perhaps complete inaccessibility isn’t the point? Perhaps the issue is more about risk to content, on the one hand, or extent of information loss on the other? Maybe this isn’t a binary issue (inaccessible or not), but a graded obsolescence scale? Is there such a thing? I couldn’t find one (there’s a hint of one in an NTIS report abstract (Slebodnick, Anderson, Fagan, & Lisez, 1998), but the text isn’t online, and it appears to be about obsolescence of people)!

So here are a couple of attempts, for comment.

Here’s what might be called a Category approach. Users would be asked to select one only from the following categories (it worries me that they are neither complete nor non-overlapping!):

A Completely usable, not deemed at risk
    Open definition with no significant proprietary elements
    Multiple implementations, at least one Open Source
    High quality treatment in secondary implementations
    Widespread use
B Currently usable, at risk
    Closed definition, or
    Open definition, but with significant proprietary elements
    More than one implementation
    Imperfect treatment in secondary implementations
C Currently usable, at significant risk
    Proprietary/closed definition
    Single proprietary implementation
    Not widespread use
D Currently inaccessible, migration path known
    Proprietary tools don't run on current environment
    Potentially imperfect migration
E Completely inaccessible
    No known method of extraction or interpretation
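
To see how the categories might behave in practice, the list above can be sketched as a small decision function. This is one possible reading, not a definitive scheme: the `FormatFacts` fields and the order in which the rules are tried are my assumptions, and the first-match ordering is exactly what papers over the overlap and incompleteness worried about above.

```python
from dataclasses import dataclass
from enum import Enum


class Obsolescence(Enum):
    A = "Completely usable, not deemed at risk"
    B = "Currently usable, at risk"
    C = "Currently usable, at significant risk"
    D = "Currently inaccessible, migration path known"
    E = "Completely inaccessible"


@dataclass
class FormatFacts:
    """Hypothetical facts a user might supply about a file format."""
    usable_now: bool
    migration_path_known: bool
    open_definition: bool
    proprietary_elements: bool
    implementations: int
    has_open_source_impl: bool
    widespread_use: bool


def categorise(f: FormatFacts) -> Obsolescence:
    """Pick one category; the first matching rule wins (an assumed ordering)."""
    if not f.usable_now:
        return Obsolescence.D if f.migration_path_known else Obsolescence.E
    if (f.open_definition and not f.proprietary_elements
            and f.implementations > 1 and f.has_open_source_impl
            and f.widespread_use):
        return Obsolescence.A
    if f.implementations > 1:
        return Obsolescence.B
    return Obsolescence.C


# Hypothetical facts for an old presentation file that no longer opens
# locally but can be converted on another machine.
old_slides = FormatFacts(
    usable_now=False, migration_path_known=True, open_definition=False,
    proprietary_elements=True, implementations=2,
    has_open_source_impl=False, widespread_use=True,
)
print(categorise(old_slides).name, "-", categorise(old_slides).value)
```

Writing the rules down this way makes the gaps visible: an open format with a single implementation, for instance, falls through to C here only because of the rule order, not because the category description fits it.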

Here’s a more subjective, but perhaps more complete scale. Users are asked to select a number from 1 to 5 representing where they assess their data lies, on a scale whose ends are defined as

1 Completely usable, not deemed at risk, and
5 Completely inaccessible

And here’s a more Likert-like scale. Users are asked to nominate their agreement with a statement such as “My data are completely inaccessible”, using the Likert scale:

1 Strongly disagree
2 Disagree
3 Neither agree nor disagree
4 Agree
5 Strongly Agree

Hmmm. How could you neither agree nor disagree with that statement? “Excuse me, I haven’t a clue”?

Comments on this blog, please; but if you’re commenting elsewhere, can you use the tag “Obsolescence scale”, and maybe drop me a line? Thanks!

Rothenberg, J. (1995). Ensuring the longevity of digital documents. Scientific American, 272(1), 42. (Accessed through Ebscohost)
Rusbridge, C. (2006). Excuse Me... Some Digital Preservation Fallacies? [Electronic Version]. Ariadne from http://www.ariadne.ac.uk/issue46/rusbridge/.
Slebodnick, E. B., Anderson, C. D., Fagan, P., & Lisez, L. (1998). Development of Alternative Continuing Educational Systems for Preventing the Technological Obsolescence of Air Force Scientists and Engineers. Volume I. Basic Study.

Wednesday, 19 November 2008

Research Data Management email list created

At JISC's request, we have created a new list to support discussion amongst those interested in management of research data. The list is Research-Dataman@jiscmail.ac.uk, and can be joined at http://www.jiscmail.ac.uk/lists/RESEARCH-DATAMAN.html. The list description is:

"List to discuss the data management issues arising in and from research projects in UK Higher Education and its partners in the UK research community and internationally, established by the Digital Curation Centre on behalf of the JISC."

I was specifically asked not to include the word "curation" in the description, in case it frightened people off! And yes, it's already been hinted to me that the list-name is sexist. Honest, it really was just short for research data management. But you knew that (;-)!

Not much traffic yet, but this is the first time it has been publicised. Please spread the word!

Monday, 17 November 2008

Keeping the records of science accessible: can we afford it?

There's an excellent summary available of the 2008 conference of the Alliance for Permanent Access to the Records of Science, with the above title. It was in Budapest on 4 November, 2008. I wasn't able to go, unfortunately, but it looks like it might have been pretty interesting.

Just sharing

Like several others in the blogosphere, I really enjoyed Scott Leslie’s post on EdTechBlog: Planning to Share versus Just Sharing. It is in the learning domain but I was glad that Andy Powell picked up its relevance to repositories in his follow-up on eFoundations.

Perhaps the key points that Andy picks up are

“The institutional approach, in my experience, is driven by people who will end up not being the ones doing the actual sharing nor producing what is to be shared. They might have the need, but they are acting on behalf of some larger entity.” and

“because typically the needs for the platform have been defined by the collective’s/collaboration’s needs, and not each of the individual users/institutions, what results is a central “bucket” that people are reluctant to contribute to, that is secondary to their ‘normal’ workflow…”

I’ve written before about fitting into the work flow, and endorse this concern.

I would also point to this quote:

“the whole reason many of us are so attracted to blogs, microblogs, social media, etc., in the first place is that they are SIMPLE to use and don’t require a lot of training.”

I suppose the equivalent for the repository world’s sphere of interest (usually, the scholarly article) is the academic’s web page. Somehow, for many (not all) academics, putting papers up on their web page does seem to fit in their normal work flow. They don’t worry about permissions, about the fine print of agreements they didn’t even read, they just put up their stuff to share. Is that a Good Thing? You betcha! Is it entirely a Good Thing? No, there are problems, for example those that arise in the medium term, when the academic moves, retires, or dies.

Looking at it another way, however, why doesn’t the collective repository infrastructure get credit for “Just Sharing”?

What about data? Well, getting data to be available for re-use is a Good Thing, and if there’s a simple infrastructure (eg the academic’s web page), that’s certainly better than leaving it to languish and fade away on some abandoned hard drive or (worse) CD-ROM. My feelings of concern are certainly elevated for data, however; don’t repositories, data services etc act as domain proxies in encouraging appropriate standardization? Well, I suppose they might, but maybe Leslie is right that the intra-domain pressures that would arise from Just Sharing would be even stronger. “It would be much easier for me to re-use your data (and give you a citation for it) if it were consistent with XXXX” is a stronger motivator than “as repository managers, we believe these data should be encoded with XXXX”.

Data repositories (and they can be at the lab or even personal level) should, of course, encourage the collection of contextual and descriptive information alongside the actual data. And again, their potential longevity brings advantage later, as technology change begins to affect usability of the data.

Now I guess the cheap shot here is to point to UKRDS as an example that is spending years Planning to Share, rather than Just Sharing. But what is the Just Sharing alternative to UKRDS?

Project data life course

This blog post is an attempt to explore the “life course” of an arbitrary small to medium research project with respect to data resources involved in the project. (I want to avoid the term life cycle, since we use this in relation to the actual data.)

This seems like a useful exercise to attempt, even if it has to be an over-generalisation, as understanding these interactions may help to define services that support projects in managing and curating their data better, and a number of such services are suggested here. This is about “Small Science”; James M. Caruthers, a professor of chemical engineering at Purdue University, has claimed ‘Small Science will produce 2-3 times more data than Big Science, but is much more at risk’ (Carlson, 2006). Large and very large projects tend to have very particular approaches, and probably represent significant expertise; I’ll not presume to cover them.

Is this generalisation way off the mark? I’d really like to know how to improve it, accepting that it is a generalisation!

The overall life course of a project (I suggest) is roughly:

pre-proposal stage (when ideas are discussed amongst colleagues, before a general outline of a proposal is settled on), followed by
the proposal stage (constrained by funder’s requirements, and by the need to create a workable project balancing expected science outcomes, resources available, and papers and other reputational advantages to be secured). If the project is funded, there follows
an establishment phase, when the necessary resources are acquired, configured and tested, (which overlaps and merges with)
an execution phase when the research is actually done, and
a termination phase as the project draws to a close.

Somewhere during this process, one or more papers will be produced based on the research understandings gained, linked to the data collected and analysed.

My message here is that curation is most impacted by decisions in the first 3 phases, even though it requires attention throughout!

Pre-proposal stage
At the pre-proposal stage, you and your potential partners will mostly be concerned with working out if your research idea is likely to be feasible, how and with roughly what resources. During these conversations, you will need to identify any external or already existing data resources that will be required to enable the project to succeed. You should also start to identify the data that the project would create, or perhaps update, and how these data will support the research conclusions. In doing so, attention must be paid to resource constraints, not necessarily financial, but also legal and ethical issues affecting use of the data. However, discussions are more likely to be generic rather than focused! Thus begins the impact on curation…

Curation service: 5-minute curation introduction, with pointers to more information…
Curation service: Help Desk service. Ask a Curator?

Proposal stage
Apart from the obvious focus on the detailed plan for the research itself, this stage is particularly constrained by funder requirements. Increasingly, these requirements include a data management plan. This is unlikely to be a problem for large projects, which will already have spent some time (often lots of time) considering data management issues, but may well be a new requirement for smaller projects. However, forcing attention to be paid to data management is definitely a Good Thing! I’ll have a closer look at data management plans in a later blog post.

Curation service: Data Management Plan templates
Curation service: Data Management Plan online wizards?
Curation service: consultancy to help you write your Data Management Plan?

Key issues that the project should address, and which should be reflected in the data management plan, include the standards and encodings for analysing, managing and storing data. While these must be determined by a combination of local requirements, such as the availability of licences, skills and experience, and the local IT environment, they must take account of community norms and standards. While in some cases maverick approaches can lead to breakthrough science, they can also lead to silo data, inaccessible and not re-usable (sometimes not re-usable even by their creators, after a short while). This stage begins to have a significant effect on curation, as the decisions that are fore-shadowed here will have major effects later on. Repeat after me: curation begins before creation!

Curation service: Guidance on managing data for re-use in the medium to long term
Curation service: registries and repositories to support re-use, sharing and preservation
Curation service: pointers to relevant standards, vocabularies etc.

[Note, we have concerns about the scalability of the latter, which is why the DCC DIFFUSE service is being re-modelled to be more participative.]

Establishment stage
It’s during the establishment phase that the real work is in sight. In the case of a well-established laboratory or research centre, this may merely be “yet another project” to be fitted into well-known procedures, perhaps with some adaptations. But from my observation and anecdotally, there is often a high degree of individual idiosyncratic decision-making taking place here; the focus is often on “getting down to the research”. The consequences of this for curation may be very significant, but are unlikely to be apparent until much later.

The key curation aim is to set the data management plan into operation. This requires ensuring access to external data required by the project (an activity that should have been started, at least in principle, ahead of project approval). While you’re thinking of access, rights and ownership, negotiating and documenting data ownership and access agreements for the data created in the project would be a smart idea (too often ignored, or left until problems begin to arise). Of course, it’s not just rights that are at issue, but also responsibilities, particularly where data have privacy and ethical implications.

Curation service: Briefing papers on legal, ethical and licence issues.

Meanwhile a critical step is establishing the IT infrastructure. While the focus is likely to be on acquiring and testing the data acquisition and analysis equipment, software, capabilities and workflows, in practice the apparently background activities of ensuring adequate data storage, sensible disaster recovery procedures (or at the very least, proper data backup and recovery), and thoughtful data curation processes will be critical for later re-use.

Curation service: Briefing paper on Keeping your data safe?
Curation service: tools for various tasks, eg data audit, risk assessment

It’s worth remembering that there are different kinds of data storage. While you will often be most concerned with temporary or project-lifetime storage, you may need to think about storage with greater persistence. There may be a laboratory or institutional or subject repository, external database or data service that will be critical to your project; negotiating with the curation service providers associated with these services should be an important early step. In effect, right at the start you are planning for your exit! But more on this another time…

Curation service: Guidance on data repositories for projects?
Curation service: thought-stimulating discussion papers, blog posts and case studies, journal, conference, discussion fora…

Project execution
Through the lifetime of your project, your focus will rightly be on the research itself. But you should also take time out to test your disaster recovery procedures, and check on your quality assurance. You care about the quality of your materials and syntheses (or your algorithms, or your documentary sources); you need to care as much about the quality of your data, both as-observed and as-processed. You’ll be collecting a lot of data; if some support your hypothesis and some contradict it, you’ll need to ensure you are collecting (perhaps in laboratory notebooks or equivalent) sufficient provenance and context to justify the selection of the former rather than the latter.

Not sure if there is a curation service here, or if these are good practice research guides for your particular domain… but I guess the same could be said for standards and vocabularies. Question is whether domains HAVE good practice guides for data quality?

As the project goes forward, issues may arise in your original data management plan, and you should take the time to refine it (and document the changes), and your associated curation processes. If you did take idiosyncratic decisions in haste at project establishment, at some point you may need to review them. It’s a warning sign when a staffing change is followed by proposals for major architectural changes!

Finally, at some stage you should test any external deposit procedures built into your plan.

Curation service: Guidance on data deposit…

I don’t think I’m claiming that this is the limit of curation issues during project execution; rather that here the course of projects is so varied that generalisations are even less justifiable than elsewhere. Nevertheless, the analysis here should be strengthened, so input would be helpful!

Project termination stage
As the project comes towards an end, you enter another risk phase. Firstly, most management focus is likely to be on the next project (or the one after that!). Secondly, if employment is fixed term, tied to the particular project, then research staff will have been looking for other posts as the end of the project draws near, and the project may start to look denuded (and the data management and informatics folk may be the most marketable, leaving you exposed). And of course, the final reports etc cannot be written until the project is finished, by which time the project’s staffing resources will have disappeared. It’s the PI’s job to ensure these final project tasks are properly completed, but it’s not surprising that there are problems sometimes!

Curation service: Guidance on (or assistance with) evaluation

During this final phase, you must identify any data resources of continuing value (this should have occurred earlier, but does need re-visiting), and carry out the plans to deposit them in an appropriate location, together with the necessary contextual and provenance information. Easy, hey? Perhaps not: but failure here could have a negative impact on future grants, if research funders do their compliance checking properly.

Curation service: Guidance on appraisal, Guidance on data deposit.

Finally, given that data management plans are currently in their infancy, it’s worth commenting on the plan outcomes in your project final report!

Writing papers
This is not really a project phase as such, as often several papers will be produced at different stages of the project. The key issues here are:

Include supplementary data with your paper where possible
Ensure data embedded in the text are machine readable
Cite your data sources
Ensure supportive data are well-curated and kept available (preferably accessible; this is where a repository service may come in useful).
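As a rough illustration of the second and third points in the list above, a table that appears in a paper could also be shipped as a plain CSV file with a small sidecar description naming its columns, units and source. This is a sketch only: the field names in the description ("title", "columns", "source" and so on) are my own invention, not an established metadata standard.

```python
import csv
import io
import json


def table_to_csv(columns, rows):
    """Serialise a table as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def describe_table(title, columns, units, source):
    """Build a JSON sidecar; the field names are hypothetical."""
    if len(columns) != len(units):
        raise ValueError("need exactly one unit per column")
    description = {
        "title": title,
        "columns": [
            {"name": name, "unit": unit}
            for name, unit in zip(columns, units)
        ],
        # Free-text citation of where the data came from.
        "source": source,
    }
    return json.dumps(description, indent=2)
```

The point of the sidecar is that a reader (or a program) gets the units and the citation alongside the numbers, rather than having to recover them from the prose of the paper.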

I’ll write more on this in a separate post.

Curation service: Proposals for data citation
Curation service: Suggestions for the Well-supported article

So: the message is that curation should be a constant theme, albeit as background during most of the project execution. But the decisions taken during pre-proposal, proposal and establishment phases will have a big effect on your ability to curate your data, which may affect your research results, and will certainly affect the quality and re-usability of the data you deposit.

Carlson, S. (2006, June 23). Lost in a Sea of Science Data: Librarians are called in to archive huge amounts of information, but cultural and financial barriers stand in the way. The Chronicle of Higher Education, 52, 35.

Friday, 14 November 2008

DCC White Rose day

Fresh from a fascinating day in Sheffield, organised by the DCC and the White Rose e-Science folk. Objectives for the day included building closer relationships between White Rose and DCC, helping us learn more about their approach to e-Science, identify data issues, and influence the DCC agenda!

After introductory remarks, the day started with Martin Lewis, Sheffield Librarian, talking about the UK Research Data Service (UKRDS) feasibility study. Martin was keen to emphasise the last two words; it’s important to manage expectations here, and we’re a long way from an operational service. There’ll be a conference to discuss the detailed proposals in February 2009, after which they hope to get funding for several Pathfinder projects. Martin and I argued ferociously, as we always do, but I should be clear that this has the potential to be an extremely valuable development. The really difficult conundrum is getting the balance of sustainability, scalability, persistence and domain science involvement in the curation. Oh, plus an appealing pitch to the potential funders. Not at all an easy balance!

Then we heard from Dr Darren Treanor from Leeds about their virtual slide library for histopathology; he claims it as one of the largest in the world, with over 40K slides digitised, even if only a thousand or so are fully described and publicly available. At 10 Gb or more a slide, these are not small; not just storage but bandwidth becomes a very real issue when accessing them. We had a brief discussion about technologies that optimise bandwidth for viewing very large or high resolution images; think Google Earth as one example, or the amazing Seadragon, created by Blaise Aguera y Arcas and now owned by Microsoft (the technology appears to be incorporated into Photosynth, but I can’t explore it, as it doesn’t work in my OS!).
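The general idea behind viewers of this kind, as I understand it, is a tiled image pyramid: the image is stored at successively halved resolutions, each cut into fixed-size tiles, and the viewer fetches only the tiles that cover the current screen at the current zoom. The arithmetic below is a sketch under assumptions of my own (256-pixel tiles, no tile overlap, a hypothetical slide size); it is not a description of the actual Seadragon or Leeds formats.

```python
import math


def pyramid_levels(width, height):
    """Levels needed if each level halves the previous one down to 1x1."""
    return (max(width, height) - 1).bit_length() + 1


def tiles_at_full_resolution(width, height, tile=256):
    """Number of tiles needed to cover the whole image at full resolution."""
    return math.ceil(width / tile) * math.ceil(height / tile)


def tiles_for_viewport(view_width, view_height, tile=256):
    """Upper bound on tiles touched by a viewport at any one zoom level.

    The extra +1 in each direction allows for the viewport straddling
    tile boundaries.
    """
    return (math.ceil(view_width / tile) + 1) * (math.ceil(view_height / tile) + 1)
```

For a hypothetical slide of 100,000 by 80,000 pixels, the full-resolution level has over a hundred thousand tiles, whereas a 1920 by 1080 screen never needs more than a few dozen at a time; that difference is where the bandwidth saving comes from.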

Sarah Jones from the DCC and HATII at Glasgow spoke about the Data Audit Framework, now being piloted at 4 sites. This is a tool aimed at helping institutions to discover their research data resources, and begin to understand the extent to which they are at risk. More about that one later, I think!

Then we heard about 3 more very different projects, although Virtual Vellum probably had more in common with the virtual slide library than the presenter would have previously expected. This project shows digitised and cleaned up medieval manuscripts, with their TEI-based transcription in parallel. We heard an entertaining rant against SRB (wrong tool for the job, I think), but more worryingly that these projects were strongly constrained by IPR problems (yes, the digitised versions of half-millennium-old manuscripts do acquire new copyright status, and can also be ringed about with contractual restrictions). So much for open scholarship!

We also heard about the CARMEN project, aiming to support sharing and collaboration in neuroscience. Graham Pryor will be releasing a report from his involvement with CARMEN before too long, so I won’t dwell on it, other than to say it is a really good project that has thought a lot of issues through very carefully. It also has one decent and one cringeworthy acronym: MINI, for the Minimum Information for a Neuroscience Investigation, and MIAOWS, for Minimum Information About a Web Service… but maybe Frank Gibson was pulling our legs, since I didn’t find it on a Google search, nor is it in the MIBBI (Minimum Information for Biological and Biomedical Investigations) Portal!

Finally we heard about the University of York Digital Library developments; this is not so much a project as the development phases of a service. They've chosen Fedora plus Muradora as their technology platform, but it was clear again that this was a thoughtful project, recognising that there's much more to it than "build it and they will come".

This was a really interesting day; I hope we gave good value; I certainly learned a lot. Thanks everyone, especially Dr Joanna Schmidt, who organised it with Martin Donnelly.

Thursday, 6 November 2008

Data publishing and the fully-supported paper

Cameron Neylon’s Science in the Open blog is always good value. He’s been posting installments of a paper on aspects of open science, and there’s lots of good stuff there. Of course, Cameron’s focus is indeed on open science rather than data, but data form a large part of that vision. In part 3:

“Making data available faces similar challenges but here they are more profound. At least when publishing in an open access journal it can be counted as a paper. Because there is no culture of citing primary data, but rather of citing the papers they are reported in, there is no reward for making data available. If careers are measured in papers published then making data available does not contribute to career development. Data availability to date has generally been driven by strong community norms, usually backed up by journal submission requirements. Again this links data publication to paper publication without necessarily encouraging the release of data that is not explicitly linked to a peer reviewed paper.”

“In other fields where data is more heterogeneous and particular where competition to publish is fierce, the idea of data availability raises many fears. The primary one is of being ’scooped’ or data theft where others publish a paper before the data collector has had the ability to fully analyse the data.”

In part 4, Cameron discusses the idea of the “fully supported paper”.

“This is a concept that is simple on the surface but very complex to implement in practice. In essence it is the idea that the claims made in a peer reviewed paper in the conventional literature should be fully supported by a publically accessible record of all the background data, methodology, and data analysis procedures that contribute to those claims.” He suggests that “the principle of availability of background data has been accepted by a broad range of publishers. It is therefore reasonable to consider the possibility of making the public posting of data as a requirement for submission.”

Cameron does remind us

“However the data itself, except in specific cases, is not enough to be useful to other researchers. The detail of how that data was collected and how it was processed are critical for making a proper analysis of whether the claims made in a paper to be properly judged.”

This is the content, syntax and semantics idea in another guise. Of course, Cameron wants to go further, and deal with “the problem of unpublished or unsuccessful studies that may never find a home in a traditional peer reviewed paper.” That’s part of the benefit of the open science idea.

Wednesday, 5 November 2008

Some interesting posts elsewhere

I’m sorry for the gap in posting; I’ve been taking a couple of weeks of leave at the end of my trip to Australia. Since return I’ve been catching up on my blog reading, and there are some interesting posts around.

A couple of people (Robin Rice and Jim Downing in particular) have mentioned the post Modelling and storing a phonetics database inside a store, from the Less Talk, More Code blog (Ben O'Steen). This is a practical report on the steps Ben took to put a database into a Fedora Commons-based repository. He details the analysis he went through, the mappings he made, the approaches to capturing representation information, to making the data citable at different levels of granularity, and an interesting approach that he calls “curation by addition”, which appears to be a way of curating the data incrementally, capturing provenance information of all the changes made. It’s a great report, and I look forward to more practical reports of this nature.
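My reading of “curation by addition” is that nothing is overwritten: each change is appended as a new version, with a note of who made it, when and why. The sketch below illustrates that reading only. It is not Ben's implementation and makes no use of Fedora; the class names and the "#v" citation suffix are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Addition:
    """One immutable version of an item, with its provenance."""
    agent: str
    note: str
    content: str
    timestamp: str


@dataclass
class CuratedItem:
    """An item whose history only ever grows."""
    identifier: str
    additions: list = field(default_factory=list)

    def add(self, agent, note, content, timestamp=None):
        """Append a new version and return its 1-based version number."""
        stamp = timestamp or datetime.now(timezone.utc).isoformat()
        self.additions.append(Addition(agent, note, content, stamp))
        return len(self.additions)

    def version(self, number):
        """Fetch a specific version; earlier versions stay retrievable."""
        if not 1 <= number <= len(self.additions):
            raise IndexError(f"no version {number}")
        return self.additions[number - 1]

    def cite(self, number):
        """A hypothetical version-level citation string."""
        self.version(number)  # raises if the version does not exist
        return f"{self.identifier}#v{number}"

    def provenance(self):
        """List (version, agent, timestamp, note) for every change."""
        return [
            (i + 1, a.agent, a.timestamp, a.note)
            for i, a in enumerate(self.additions)
        ]
```

Because old versions are never replaced, a citation to version 1 still resolves to what was cited even after later corrections, and the provenance list doubles as the change history.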

Quite a different post on peanutbutter (whose author might be Frank Gibson): The Triumvirate of Scientific Data discusses ideas that he suggests relate to the significant properties of science data. His triumvirate comprises

"content, syntax, and semantics, or more simply put -What do we want to say? How do we say it? What does it all mean?"

Oddly, the discussion associated with this blog post is on Friendfeed rather than associated with the blog itself. Very interesting to see the discussion recorded like that, and in the process see at least one sceptic become more convinced!

To me, there seemed to be strong resonances between his argument and some of the OAIS concepts, particularly Representation Information. However, content, syntax and semantics might be a more approachable set of labels than RepInfo!
