Monday, 17 November 2008

Project data life course

This blog post is an attempt to explore the “life course” of an arbitrary small to medium research project with respect to data resources involved in the project. (I want to avoid the term life cycle, since we use this in relation to the actual data.)

This seems like a useful exercise to attempt, even if it has to be an over-generalisation, as understanding these interactions may help to define services that support projects in managing and curating their data better, and a number of such services are suggested here. This is about “Small Science”; James M. Caruthers, a professor of chemical engineering at Purdue University, has claimed ‘Small Science will produce 2-3 times more data than Big Science, but is much more at risk’ (Carlson, 2006). Large and very large projects tend to have very particular approaches, and probably represent significant expertise; I’ll not presume to cover them.

Is this generalisation way off the mark? I’d really like to know how to improve it, accepting that it is a generalisation!

The overall life course of a project (I suggest) is roughly:
  • pre-proposal stage (when ideas are discussed amongst colleagues, before a general outline of a proposal is settled on), followed by
  • the proposal stage (constrained by funder’s requirements, and by the need to create a workable project balancing expected science outcomes, resources available, and papers and other reputational advantages to be secured). If the project is funded, there follows
  • an establishment phase, when the necessary resources are acquired, configured and tested (which overlaps and merges with)
  • an execution phase when the research is actually done, and
  • a termination phase as the project draws to a close.
  • Somewhere during this process, one or more papers will be produced based on the research understandings gained, linked to the data collected and analysed.
My message here is that curation is most affected by decisions in the first three phases, even though it requires attention throughout!

Pre-proposal stage
At the pre-proposal stage, you and your potential partners will mostly be concerned with working out whether your research idea is likely to be feasible, how, and with roughly what resources. During these conversations, you will need to identify any external or already existing data resources that will be required to enable the project to succeed. You should also start to identify the data that the project would create, or perhaps update, and how these data will support the research conclusions. In doing so, attention must be paid to constraints on these data resources: not only financial ones, but also legal and ethical issues affecting use of the data. However, discussions are more likely to be generic than focused! Thus begins the impact on curation…

Curation service: 5-minute curation introduction, with pointers to more information…
Curation service: Help Desk service. Ask a Curator?

Proposal stage
Apart from the obvious focus on the detailed plan for the research itself, this stage is particularly constrained by funder requirements. Increasingly, these requirements include a data management plan. This is unlikely to be a problem for large projects, which will already have spent some time (often lots of time) considering data management issues, but may well be a new requirement for smaller projects. However, forcing attention to be paid to data management is definitely a Good Thing! I’ll have a closer look at data management plans in a later blog post.

Curation service: Data Management Plan templates
Curation service: Data Management Plan online wizards?
Curation service: consultancy to help you write your Data Management Plan?

Key issues that the project should address, and which should be reflected in the data management plan, include the standards and encodings for analysing, managing and storing data. While these must be determined by a combination of local requirements, such as the availability of licences, skills and experience, and the local IT environment, they must take account of community norms and standards. While in some cases maverick approaches can lead to breakthrough science, they can also lead to silo data, inaccessible and not re-usable (sometimes not re-usable even by their creators, after a short while). This stage begins to have a significant effect on curation, as the decisions that are foreshadowed here will have major effects later on. Repeat after me: curation begins before creation!

Curation service: Guidance on managing data for re-use in the medium to long term
Curation service: registries and repositories to support re-use, sharing and preservation
Curation service: pointers to relevant standards, vocabularies etc.

[Note, we have concerns about the scalability of the latter, which is why the DCC DIFFUSE service is being re-modelled to be more participative.]

Establishment stage
It’s during the establishment phase that the real work is in sight. In the case of a well-established laboratory or research centre, this may merely be “yet another project” to be fitted into well-known procedures, perhaps with some adaptations. But from my observation and anecdotally, there is often a high degree of individual idiosyncratic decision-making taking place here; the focus is often on “getting down to the research”. The consequences of this for curation may be very significant, but are unlikely to be apparent until much later.

The key curation aim is to set the data management plan into operation. This requires ensuring access to external data required by the project (an activity that should have been started, at least in principle, ahead of project approval). While you’re thinking of access, rights and ownership, negotiating and documenting data ownership and access agreements for the data created in the project would be a smart idea (too often ignored, or left until problems begin to arise). Of course, it’s not just rights that are at issue, but also responsibilities, particularly where data have privacy and ethical implications.

Curation service: Briefing papers on legal, ethical and licence issues.

Meanwhile a critical step is establishing the IT infrastructure. While the focus is likely to be on acquiring and testing the data acquisition and analysis equipment, software, capabilities and workflows, in practice the apparently background activities of ensuring adequate data storage, sensible disaster recovery procedures (or at the very least, proper data backup and recovery), and thoughtful data curation processes will be critical for later re-use.
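One way to make "proper data backup and recovery" checkable is to keep a checksum manifest alongside the data and compare a restored copy against it. What follows is only a minimal sketch under my own assumptions, not a recommended tool: the function names (sha256_of, build_manifest, compare_to_manifest) are hypothetical, and it assumes project data live as ordinary files under a single directory.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its checksum."""
    return {
        path.relative_to(data_dir).as_posix(): sha256_of(path)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }


def compare_to_manifest(data_dir: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Report files that are missing, changed or unexpected relative to a manifest."""
    current = build_manifest(data_dir)
    shared = manifest.keys() & current.keys()
    return {
        "missing": sorted(manifest.keys() - current.keys()),
        "changed": sorted(name for name in shared if manifest[name] != current[name]),
        "unexpected": sorted(current.keys() - manifest.keys()),
    }
```

The idea would be to build the manifest from the working data when a backup is taken, then run compare_to_manifest against a test restore; three empty lists suggest the restore matches what was backed up.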

Curation service: Briefing paper on Keeping your data safe?
Curation service: tools for various tasks, eg data audit, risk assessment

It’s worth remembering that there are different kinds of data storage. While you will often be most concerned with temporary or project-lifetime storage, you may need to think about storage with greater persistence. There may be a laboratory or institutional or subject repository, external database or data service that will be critical to your project; negotiating with the curation service providers associated with these services should be an important early step. In effect, right at the start you are planning for your exit! But more on this another time…

Curation service: Guidance on data repositories for projects?
Curation service: thought-stimulating discussion papers, blog posts and case studies, journal, conference, discussion fora…

Project execution
Through the lifetime of your project, your focus will rightly be on the research itself. But you should also take time out to test your disaster recovery procedures, and check on your quality assurance. You care about the quality of your materials and syntheses (or your algorithms, or your documentary sources); you need to care as much about the quality of your data, both as-observed and as-processed. You’ll be collecting a lot of data; if some support your hypothesis and some contradict it, you’ll need to ensure you are collecting (perhaps in laboratory notebooks or equivalent) sufficient provenance and context to justify the selection of the former rather than the latter.
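Where the laboratory notebook is electronic, the provenance and context mentioned above could be captured as a small structured record stored next to each data file. This is purely an illustrative sketch: the ProvenanceRecord class, its field names and the to_json helper are my own assumptions rather than any community standard, and a real project should prefer whatever vocabulary its domain already uses.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

ALLOWED_STATUS = ("as-observed", "as-processed")


@dataclass
class ProvenanceRecord:
    """Hypothetical sidecar record describing one data file."""

    data_file: str
    status: str  # one of ALLOWED_STATUS
    derived_from: list[str] = field(default_factory=list)  # source files, if processed
    method: str = ""  # instrument, script or procedure used
    included_in_analysis: bool = True
    reason: str = ""  # justification, required when data are excluded
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        if self.status not in ALLOWED_STATUS:
            raise ValueError(f"status must be one of {ALLOWED_STATUS}")
        # Excluding data without saying why is exactly the gap to avoid.
        if not self.included_in_analysis and not self.reason:
            raise ValueError("excluded data need a recorded reason")


def to_json(record: ProvenanceRecord) -> str:
    """Serialise a record as JSON suitable for a sidecar file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

For example, ProvenanceRecord(data_file="run_042.csv", status="as-observed", included_in_analysis=False, reason="sensor saturated") records why a (made-up) run was set aside, while omitting the reason raises an error at the time the decision is made rather than leaving a gap to be discovered later.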

Not sure if there is a curation service here, or if these are good practice research guides for your particular domain… but I guess the same could be said for standards and vocabularies. Question is whether domains HAVE good practice guides for data quality?

As the project goes forward, issues may arise in your original data management plan, and you should take the time to refine it (and document the changes), and your associated curation processes. If you did take idiosyncratic decisions in haste at project establishment, at some point you may need to review them. It’s a warning sign when a staffing change is followed by proposals for major architectural changes!

Finally, at some stage you should test any external deposit procedures built into your plan.

Curation service: Guidance on data deposit…

I don’t think I’m claiming that this is the limit of curation issues during project execution; rather that here the course of projects is so varied that generalisations are even less justifiable than elsewhere. Nevertheless, the analysis here should be strengthened, so input would be helpful!

Project termination stage
As the project comes towards an end, you enter another risk phase. Firstly, most management focus is likely to be on the next project (or the one after that!). Secondly, if employment is fixed term, tied to the particular project, then research staff will have been looking for other posts as the end of the project draws near, and the project may start to look denuded (and the data management and informatics folk may be the most marketable, leaving you exposed). And of course, the final reports etc cannot be written until the project is finished, by which time the project’s staffing resources will have disappeared. It’s the PI’s job to ensure these final project tasks are properly completed, but it’s not surprising that there are problems sometimes!

Curation service: Guidance on (or assistance with) evaluation

During this final phase, you must identify any data resources of continuing value (this should have occurred earlier, but does need re-visiting), and carry out the plans to deposit them in an appropriate location, together with the necessary contextual and provenance information. Easy, hey? Perhaps not: but failure here could have a negative impact on future grants, if research funders do their compliance checking properly.

Curation service: Guidance on appraisal, Guidance on data deposit.

Finally, given that data management plans are currently in their infancy, it’s worth commenting on the plan outcomes in your project final report!

Writing papers
This is not really a project phase as such, as often several papers will be produced at different stages of the project. The key issues here are:
  • Include supplementary data with your paper where possible
  • Ensure data embedded in the text are machine readable
  • Cite your data sources
  • Ensure supportive data are well-curated and kept available (preferably accessible; this is where a repository service may come in useful).
I’ll write more on this in a separate post.

Curation service: Proposals for data citation
Curation service: Suggestions for the Well-supported article

So: the message is that curation should be a constant theme, albeit as background during most of the project execution. But the decisions taken during pre-proposal, proposal and establishment phases will have a big effect on your ability to curate your data, which may affect your research results, and will certainly affect the quality and re-usability of the data you deposit.

Carlson, S. (2006, June 23). Lost in a Sea of Science Data: Librarians are called in to archive huge amounts of information, but cultural and financial barriers stand in the way. The Chronicle of Higher Education, 52, 35.


