Digital Curation Blog: data curation

Showing posts with label data curation. Show all posts

Tuesday, 8 December 2009

Leadership opportunities

Those interested in leadership in Digital and Data Curation should keep an eye on the relevant UK press and lists over the next week or so for anything of interest...

Wednesday, 18 November 2009

Workshops prior to the International Digital Curation Conference

Pre-conference workshops can be very useful and interesting; they can be a good part of the justification for attending a conference, giving an extended opportunity to focus on a single topic, followed by a broader (but shallower) look at many topics, at the conference itself. This time it is quite frustrating, as I would very much like to go to all the workshops! There is still time to register for your choice, and for the IDCC conference itself.

Disciplinary Dimensions of Digital Curation: New Perspectives on Research Data 

Our SCARP Project case studies have explored data curation practice across a variety of clinical, life, social, humanities, physical and engineering research communities. This workshop is the final event in SCARP, and will present the reports and synthesis.

See the full programme [PDF]

Digital Curation 101 Lite Training 

Research councils and funding bodies are increasingly requiring evidence of adequate and appropriate provisions for data management and curation in new grant funding applications. This one-day training workshop is aimed at researchers and those who support researchers and want to learn more about how to develop sound data management and curation plans.

See the full programme [PDF]

Citability of Research Data 

Goal: Handling research datasets as unique, independent, citable research objects offers a wide variety of opportunities.

The goal of the new DataCite cooperation is to establish a not-for-profit agency that enables organisations to register research datasets and assign persistent identifiers to them.

Citable datasets are accessible and can be integrated into existing catalogues and infrastructures. A citable datasets furthermore rewards scientists for their extra-work in storage and quality control of data by granting scientific reputation through cite-counts. The workshop will examine the different methods for the enabling of citable datasets and discuss common best practices and challenges for the future.

▪ See the full programme [PDF]

Repository Preservation Infrastructure (REPRISE) (co-organised by the OGF Repositories Group, OGF-Europe, D-Grid/WissGrid) 

Following on from the successful Repository Curation Service Environments (RECURSE) Workshop at IDCC 2008, this workshop discusses digital repositories and their specific requirements for/as preservation infrastructure, as well as their role within a preservation environment.

Wednesday, 5 August 2009

Curated databases and data curation

I've been working on an article with a colleague, and came across something that's interesting, and that you might be able to help us with. There does appear to be a distinction between the way curation is used in the bio-sciences, and elsewhere. In particular, the term "curated database" tends to mean a manually constructed database that links literature to data, curated by experts who provide authority (eg see the Wikipedia definition of Biocurator). The earliest mention of the term "curated database" I can find is in the abstract (and only in the abstract) of Larsen et al (1993).

However, the terms "digital curation" and "data curation" tend to mean something different. We in the DCC say "Digital curation is maintaining and adding value to a trusted body of digital information for current and future use; specifically, we mean the active management and appraisal of data over the life-cycle of scholarly and scientific materials". This has a lot more elements of good management about it, and less of the idea of specific curators making judgements based on the literature. It has also been rather conflated with digital preservation.

The earliest reference to digital curation I can find is a report of an invitational meeting held in October 2001, oddly titled "The Digital Curation: digital archives, libraries and e-science seminar" (Beagrie and Pothen, 2001). In the meeting there was some discussion about data curation. The earliest more formal reference to data curation I can find is a technical report from Gray, Szalay et al (2002).

So my challenge is this: are there earlier references to digital (or data) curation, of the second kind?

Beagrie, N., & Pothen, P. (2001). The Digital Curation: digital archives, libraries and e-science seminar. Ariadne. http://www.ariadne.ac.uk/issue30/digital-curation/.

Gray, J., Szalay, A. S., Thakar, A. R., Stoughton, C., & vandenBerg, J. (2002). Online Scientific Data Curation, Publication, and Archiving. Redmond: Microsoft Research. http://arxiv.org/abs/cs.DL/0208012.

Larsen, N., Olsen, G. J., Maidak, B. L., McCaughey, M. J., Overbeek, R., Macke, T. J., et al. (1993). The ribosomal database project. Nucl. Acids Res., 21(13), 3021-3023. http://nar.oxfordjournals.org/cgi/content/abstract/21/13/3021

Monday, 4 May 2009

New nature corresponding author policy on data

Thanks to Michael Jubb for pointing me to the editorial outlining the new Nature policy:

"Accordingly, we have modified the Nature journal policy on authorship, which is detailed on our website (http://tinyurl.com/dkgbf8). For papers submitted by collaborations, we now delineate the responsibilities of the senior members of each collaboration group on the paper. Before submitting the paper, at least one senior member from each collaborating group must take responsibility for their group's contribution. Three major responsibilities are covered: preservation of the original data on which the paper is based, verification that the figures and conclusions accurately reflect the data collected and that manipulations to images are in accordance with Nature journal guidelines (http://tinyurl.com/cmmrp7), and minimization of obstacles to sharing materials, data and algorithms through appropriate planning."

This sort of policy from respected journals is seriously good for data curation!

[Updated to enable the Nature tinyurls.]

Monday, 2 March 2009

Report on Data Preservation in High Energy Physics

There's a really interesting (if somewhat telegraphic) report by Richard Mount of SLAC on the workshop on data preservation in high energy physics, published in the January 2009 issue of Ariadne. The workshop was held at DESY (Deutsche Elektronen-Synchrotron), Hamburg, Germany, on 26-28 January 2008.

"The workshop heard from HEP experiments long past (‘it’s hopeless to try now’), recent or almost past (‘we really must do something’) and included representatives form experiments just starting (‘interesting issue, but we’re really very busy right now’). We were told how luck and industry had succeeded in obtaining new results from 20-year-old data from the JADE experiment, and how the astronomy community apparently shames HEP by taking a formalised approach to preserving data in an intelligible format. Technical issues including preserving the bits and preserving the ability to run ancient software on long-dead operating systems were also addressed. The final input to the workshop was a somewhat asymmetric picture of the funding agency interests from the two sides of the Atlantic."

There's a great deal to digest in this report. I'd agree with its author on one section:

"Experience from Re-analysis of PETRA (and LEP) Data, Siegfried Bethke (Max-Planck-Institut für Physik)

For [Richard], this was the most fascinating talk of the workshop. It described ‘the only example of reviving and still using 25-30 year old data & software in HEP.’ JADE was an e+e- experiment at DESY’s PETRA collider. The PETRA (and SLAC’s PEP) data are unlikely to be superseded, and improved theoretical understanding of QCD (Quantum ChromoDynamics) now allows valuable new physics results to be obtained if it is possible to analyse the old data. Only JADE has succeeded in this, and that by a combination of industry and luck. A sample luck and industry anecdote:

‘The file containing the recorded luminosities of each run and fill, was stored on a private account and therefore lost when [the] DESY archive was cleaned up. Jan Olsson, when cleaning up his office in ~1997, found an old ASCII-printout of the luminosity file. Unfortunately, it was printed on green recycling paper - not suitable for scanning and OCR-ing. A secretary at Aachen re-typed it within 4 weeks. A checksum routine found (and recovered) only 4 typos.’

The key conclusion of the talk was: ‘archiving & re-use of data & software must be planned while [an] experiment is still in running mode!’ The fact that the talk documented how to succeed when no such planning had been done only served to strengthen the conclusion."

I had heard of this story from a separate source (Ken Peach, then at CCLRC), so it's good to see it confirmed. I think the article that eventuated is

Bethke, S. (2000). Determination of the QCD coupling α_s J. Phys. G: Nucl. Part. Phys., 26.

One particularly sad remark from Amber Boehnlein (US Department of Energy (DOE))

"Amber was clear about the DoE/HEP policy on data preservation: ‘there isn’t one.’"

The DCC got a mention from David Corney of STFC, who runs the Atlas Petabyte Data Store, however I can confirm that we don't have 80 staff, or anywhere near that number (just under 13 FTE, if you're interested!). The reporter may have mixed us up with David's group, which I suspect is much larger.

In the closing sessions, we have Homer Neal,

"who set out a plan for work leading up to the next workshop. In his words:
‘establish the clear justification for Data Preservation & Long Term Analysis
establish the means (and the feasibility of these means) by which this will be achieved
give guidance to the past, present and future experiments
a draft document by the next meeting @SLAC.’"

Well worth a read!

Thursday, 26 February 2009

UKRDS session 4

At last the big bucks: Paul Hubbard from HEFCE. Policy case for a national service: if it needs to be done, can be done, and they have the resources, then they will do it. National policy & funding picture: have had a very good ten years (CR: sounds ominous) for a world-class national research base. Note HEFCE now a minority funder of research; comes about 3rd in the ranking in UK. Can’t say HEFCE will fund this but can say HEFCE will work with others to fund it. Low note: recession ☹. HEFCE role: hands out money for research but doesn’t say what it’s for, celebrate that. Research base is competitive, dynamic & responsive, and efficient.

How? Excellent strong & efficient research information resources. Print services, collaborative collection management, national research libraries, BL Document Supply centre, UK Research Reserve (launched last week with £10M) building on it. Online resources: licensing EJs, JISC Infrastructure services and DCC as key player (hey, he said that!!!).

Making the case: HEFCE hopes to see a good, strong clear, specific case. Case made in general but not yet in detail. Drawing on experience with UK Research Reserve, need to show it meets a proven need in the research base, that data is being and will be produced that can be (and is) re-used, and that researchers will collaborate in ensuring data stored and demand access to it later. Need a consortium of funders, notably RCs, Health Services and major charities, for sustainability and efficiency. Need standards for what is kept, for how long and in what form, cataloguing and metadata, security, and scale. Key point, need to improve efficiency in research process, so need sound business case. Evidence for levels of current and foreseeable demand, evidence for savings from storing data that would otherwise have to be re-created later. Also the capital and recurrent costs of doing all this.

Need strong arrangements for management and future sustainability: ownership of storage, of the system, of the data. Who determines standards & procedures. Arrangements need to be designed with a good chance for lasting many years, and that the implementation is thoroughly planned. So good in principle, needs more detail, so, over to you (us!).

Malcolm Read, head of JISC. JISC already funds a lot of work in this area. Three areas:

technical advice and standards: DCC (he says he would not want to move the DCC “at this stage” into the fledgling UKRDS, thank goodness);
practitioner advice: UKRDS;
Repositories split two ways, subject repositories from RCs, but won’t be universal in UK, institutional repositories from HEIs maybe a SHED shared service option. (CR: what is this? Another shared data centre proposal? Maybe it’s about sharing the storage?)

UKRDS to give advice on

- What to save
- How to save it
- When to discard it, and
- Subject specialist.

Need some sort of horizontal vision across the infrastructure (CR, my orthogonal curation layer???) So more work to do:

- ownership/IPR (don’t care who but does care it’s unambiguous)
- how much to save
- capital (startup) costs - funding service
- recurrent costs – long term sustainability
- coherence across repositories (wants more assurance and comfort on this)
- scope for creating a data management profession (career opportunities, hierarchy, training & skills etc; cf learning technologists & digital librarians).

Governance options: depends on ownership: HEIs or FCs; responsibilities etc. Key is advisory rather than regulatory. Recommends: HEI owned, possibly in receipt of FC/JISC grant; initially a formal partnership of HEIs under an MOU; eventually a separate legal entity by subscription.

Q&A: CR on UK equivalent of Australian Code for Responsible conduct of research, UK equivalent much weaker, revision been in consultation but perhaps not much changed. Malcolm Read also impressed by the Australian code, and also by the German statement last year. Astrid Wissenburg from ESRC, says it’s more than just the Code. But also welcomes the collaborative approach, but wonders at the HEI-focused approach. Paul H expects links to RC-funded centres to be key. Sheila Cannell from Edinburgh on the suggested profession: does it emerge from researchers or from existing professions such as librarians, archivists, IT etc. Malcolm says: yes!

Simon Courage (?) Sunderland, from aspect of researcher recognition: will citations of data be counted in next REF. Paul Hubbard says short answer is no, as no clear way of doing it. Not willing to act as incentiviser of everything; should be done because it’s right for the science and required by the institution, and only enforced if nothing else works. Schurer: of 70 staff, only 2 from library backgrounds, so doesn't agree with Malcolm. Read, yes lots of people think that, not all of them are right (???). Mark Thorley from NERC: really need to make use of the knowledge and expertise in existing data work. (CR: library key because of organisational centrality/locus?).

Closing speech by David Ingram from UCL: summing up and moving forward. Science being transformed. Bio-informatics now core discipline of biology (this very room, 2005). Research & practice increasingly information intensive. Multiple legacy information systems in use; problems in supporting and linking health care, research and industry. Government creating pervasive new ICT infrastructure & core services. Other national/international initiatives creating relevant infrastructures & services. Talks about the MRC Data Sharing Service. Some nice Escher pics: order and chaos: use & re-use of data needs to be careful and context aware. Got to be very sensitive to the nature & provenance of the data.

Information explosion in health care: shows growth in SNOMED terminology to around 200K terms, and this is getting away from us. Too many ways of saying the same thing and too hard to work out that it’ so. Comment on importance of stopping people doing (unuseful) things.

Data explosion. Transferred at 100 MB/sec (??), a Petabyte takes 116 days, so need a van for your backup.

The circle of knowledge: are there logical relationship, epistemology about linking things across domains (CR: I may have missed the point there).

Dimensions of challenge

Diversity
Discipline and standards
Scale
Evolution over time
Willingness and capacity to engage (potential losers as well as winners)
Education and capacity at all levels
Implementation: be completely realistic

Dilemmas

Shareable vs shared; to do with motivation and practice improvement
Commonality vs diversity; requirements
Confidential/restricted vs public; about access management and control
Competition vs cooperation; modus operandi
Global vs local; implementation, capacity, organisation and resources
Standardised for general applicability vs optimised for particular purposes; what approach will work

Direction of travel: problem is urgent; research requirements should drive data standards and shared services; research communities must value and own the outcomes; practical experience in implementation (CR: read, existing data centres etc?) should guide policy, strategy & investment; and rigorously control scope, sclae and timing.

Zimmerli: IT as a horizontal technology that allows us to do previously impossible things.

Finishes with Escher picture Ascending and descending, to suggest a need for a constructive balance between top-down and bottom-up. Needs an experimental approach; will need to evolve. Times in 1834: unlikely the medical profession will ever use stethoscopes, as their use requires substantial training and involves great difficulties. UKRDS in this sort of area?

Martin Lewis closing: keep this on the agenda, can’t get everything right, so we need to start.

UKRDS session 3

Ross Wilkinson, Director of Australian National Data Service (ANDS),live via satellite (saving carbon, if not sleep) has been building on lot of work particularly in the UK. Role of data more etc; now answer questions unrelated to why the data were collected originally. Who cares and why varies a lot. Pleased that Australian Govt cares to extent of $24M. Institutions driven by the Code for responsible conduct of research (CR: blogged about this before): have to take data seriously. Not clear if disciplines care at the national level (some have cared at an international level). Researchers SHOULD care, but though he has been a researcher he hasn’t cared enough. So why should they? First the Code, second the role of data citation, 3rd changing nature of research.

The Code has obligations on the funders, the institutions AND the researchers. Probably a bit of a sleeper (researchers largely unaware yet), but funders are going to ratchet this up.

Fame: data citation. Sharing detailed research data associated with increased citation rate. This is about stories: Human Genome, Hubble telescope.

ANDS has 4 coordinated inter-related service delivery programs: frameworks, utilities, seeding the commons, and capability development. Plus service development funded through NeAT in partnership with ARCS (foreign acronym soup is so much less clear than one’s own!)

Frameworks on influencing national policies, eg count data citation rates just like publication citation rates; proposals should have slot for data space. Common understanding of data management issues, eg what is best practice. Don’t see much best practice; what is decent practice?

Utilities: discovery service (you come to us, we come to you flavours), persistent identifiers for deposited data, and collections registry.

Seeding the commons: improve & standardise institutional repositories; need routine deposit in sustainable infrastructures, etc (missed some).

Capability development: assist researchers to align data management practice, to improve quantity & quality of data, to do the right thing.

Not too much theory: build stuff. National collections registry service ready, work on persistent identifier and discovery services. Exemplar discovery spaces (water? Crystallography, law, biosecurity?). Advice not storage or management for data: this is done locally but networked. Layer on top of data.

Shows exemplar page on Dugongs, aquatic mammal found off the Australian coast. Might build linkages from data and researchers to allow browsing of the space. Using ISO 2146 to drive this: collections, parties, activities, services etc all need to be described, and the relationships matter. Not discovery but browsing service? Harvest info from a wide variety of sources beyond the data.

Talks about See Also service: other info to pop up (cf Amazon; other data come to you). Maybe could share such a service with the UK; if you’re interested in this then you might also be interested in this other stuff happening in the UK…

What do researchers get? Local managed data repository (from code); persistent identified data, and data discovery. Mission is more researchers sharing/re-using more data more often. Lower costs and raise benefits.

Q&A (so Ross can go home to bed): Robin Rice from Edinburgh: the Repository word, mny think about it in terms of publications, how does it differ? Problem word, with overloaded usage. Not necessarily a big issue; may use different underlying technology, but that’s not that important. Different tools and indexes maybe, but conceptually not that different.

Kevin Schurer again: Likes talk but fundamentally disagrees with one thing: quantity over quality? Our funders would prefer to ingest 10 datasets a year with huge use and relevance than 4,000 that don’t. It’s not really about quantity or quality but impact. Could be a data car-boot sale rather than data Sotheby’s. Ross likes an argument but does disagree. Eg Monash group with carefully crafted crystallography datasets, but want to get people through the door and then move up in quality, driven by discipline and community.

John Wood, what if institution doesn’t have capability to hold the data. Ross, many looking for funding to do so, but they could if they wanted enough. ANDS not the source of such funding.

Peter Burnhill again: mentioned Geospatial data, as in See also, could you say more about spatial infrastructure work going in and around ANDS. Ross: two areas: marine and earth sciences, with slightly different take on the problem. ANDS doesn’t mind which way wins, want to make sure that it’s engaged with an international perspective, particularly data that’s not intrinsically geo-referenced but geo-ref adds value, eg Atlas of Living Australia.

Mark Thorley from NERC again: any work to add value to data; good work on discovery and persistence, but they find researchers don’t want data in the form generated, but in a way more convenient for re-use. Who does this? Ross: mainly outside; have some funding for such developments but oriented to communities that want to use the data rather than providers. Two areas of struggle: data mining, how useful is that (versus model-based), secondly data visualisation. Brokering conversations but favouring community rather than technology.

Now closer to home, Carlos Morais Pires from EU perspective: capacity-building; domains of infrastructures, and then renewed forward strategy.

Capacities programme versus research programme: a new thing in FP7. What can e-Infrastructures (not ICTs) do for science? Parsimony: sharing resources; increase the user base; but… science is excellence. E-Science double face: e-infrastructures change science, also scientists change e-infrastructures. Shows a 5-layer model, e-Science on top of comms, on theory, on experiment, on ???

Domains: innovating science process: global virtual research communities; accessing knowledge (science data); sharing resources (grid); linking (Geant); designing future facilities (novel e-infrastructures). Aha the programme we heard of before: DEISA and petascale machine.

Information continuums most interesting for today’s conference: between past and present, between raw data and results, between science disciplines, between institutions, between research & education. Nice picture of an integrated view showing LHC, astronomy, biology and health/clinical data on same GRID (CR: be afraid?).

Looking ahead: continue dialogue with horizon of 2020+. Early calls in FP7 heavily over-subscribed by factors of 5 or 8, means very interesting proposals not funded. Some work still for next 2 years. New resolution hope will be adopted next week, and hope EU funding will reflect importance.

Stefan Winkler-Nees from DFG, German approaches and policies. All sitting in the same boat, facing the same problems. 2003, DFG signed OA Berlin Declaration. 2006, DFG position paper on funding priorities through 2015. Last year new national priority initiative on research data. Principle is that DFG tries to support scientists, bringing them together with information experts from libraries, archives and IT (didn’t say Computing Science). Criteria; substantial added value for supra-regional infrastructures, contributions to advancement of scientific infrastructure, science relevance, etc. Subtext: complex funding infrastructure; key is the alliance of German science organisations. June 22, 2008: national Priority Initiative “Digital Information”. Don’t forget legal frameworks.

Have working group on primary research data, drafting policy to promote the need for action and demonstrate usefulness.

Prime objective: transparency, traceability and follow-up use. Honour principles of safeguarding good scientific practice. Can publication of data become a self-evident part of the science recognition system?

What’s on the agenda: disciplines, working across sectors, encouraging publication of datasets, and exploring the international perspective.

Q&A again: Stephen Pinfield from Nottingham, interested in incentivising. Stick approach from funders, maybe other incentives, eg improved citations (CR, see http://digitalcuration.blogspot.com/2008/09/many-citations-flow-from-data.htm, quote "At least 7 of the top 20 UK-based Authors of High-Impact Papers, 2003-07 ranked by citations ... published data sets"). Schurer, what about the JISC Data Oscars; Malcolm Read maybe the data ostriches would be easier?

Burnhill again: reward system key, data must get some sort of recognition; lots of bad data just as bad papers in journals.

UKRDS Conference 2

John Coggins, VP for Life Sciences from Glasgow on researcher’s perspective. Wants to prompt us to think, about “culture change” and resources.

Volume & complexity of digital is rapidly increasing. In many fields data are rarely used outside lab/dept. Data management skills under-developed, research data often unstructured & poorly documented & inaccessible. Represents huge un-tapped resource. Never even fully analysed by researchers who generate it. Could be much added value from data mining & combining datasets.(CR from break conversation with Jmes Currall: need lots of additional tacit context knowledge to understand properly.) Most data stored locally and often lost. Most researchers say they are willing to share data, usually through informal peer exchange networks (CR: important distinction!). Some of those who deposit data regularly do so because journals won’t publish unless they do (CR: Bermuda agreement in bio-sciences; Alma Swan says compliance lower than expected).

Funders want to protect & enhance their investment by making data widely available,. Must be agreement among researcher on quality and format. Slow to achieve, cf story about Protein Data Bank. Story about surplus computers from particle physicists in the 1970s, playing with 7 protein structures; lots of subsequent arguments about the best way to store this stuff in those primitive computers; now 36 years on, 44,000 structures available, huge investment since then (eg early stuff re-examined in light of later advances). Essential to invest in data management, curation & storage as well as for easy access. Needs international collaboration & substantial funding for permanent delivery, eg EBI.

Researchers want confidence their data will be permanently stored & accessible; tht costs will be met; be able to access other people’s data, preferably on world-wide basis; and training.

Complications: commercially sensitive data, personal data, eg patient data, etc.

All Universities should have access to a UK-wide service. FCs and RCs likely to provide funding (goodwill but not yet much in £ notes). What about private sector? Charities? Other government departments like DEFRA (no-one from DEFRA in the room). These departments not celebrated for willingness to commit resources to research sector.

Way forward: research community would welcome UK-wide approach. Need more consistency, will take time. Researchers need to improve their skills for managing & using data. Significant building blocks in place. International dimension important. Culture change and mindset critical and must be gradually changed, needs investment in training & infrastructure, & spreading best practice. This is the way forward but won’t be easy.

David Lynn from Wellcome Trust (independent charity interested in biomedical research) for a funder’s perspective. (CR: is bio-science TOO good an example; poster child from which wrong inferences can be drawn?) Talking about size & rate of data, about integrating large datasets to gain insights into complex systems (CR: via Google Earth!!! Reminds me of NERC meeting last week again!) Now on immense potential to link research papers and data (CR: yeah!). Also growth in traffic of researchers accessing data from others (via web traffic on Sanger Institute site).

Meeting challenges: infrastructure: key data resources need coordinated and long-term sustainable funding. ELIXIR programme aiming to build sustainable infrastructure across Europe. Technical and cultural challenges: coordination & advocacy from key communities (funders, institutions, publishers etc). Data security challenge, important for some parts of bio-medical community. Recent high profile incidents in UK. Recent reports from Academy of Medical Sciences, Council of Sci & tech, Thomas/Walport review, US Inst of Medicine. Calls for safe havens & codes of practice. So management and governance of such data will be a key concern to retain public confidence. Mentions Bermuda Principles (1996) & Fort Lauderdale principles (2003); need to involve researchers in all such discussions,

Agrees with coordinated approach to preserving key research data & ensure long-term value; must meet needs of researchers & funders. Devil is in the detail; must develop in way that truly adds value; must link effectively with existing activity & expertise; will need buy-in of ALL major funders & stakeholders; will need to accommodate differences in approach between funders & disciplines; & must appropriately resource initiative.

Would like clarification of detail on Pathfinder study, but like to see it go forward; full specification, but with careful assessment of funding implications.

Q&A session. Malcolm Atkinson again. PDB took place initially at national lab at Brookhaven; perhaps the researchers had a bit of time. Those who put together the data for re-use don’t get an immediate reward. Those who get the immediate reward don’t pay attention to the long term but very focused on immediate need. PhD skills means PhD courses take longer. Where does the extra resource come from to invest in this change in researchers, and how to invest in incentive (promotion, RAE etc) to make sure this happens? John Coggins: it’s going to be hard. Was working at Brookhaven at the time of PDB, did manage to publish wee paper, but true, none of those involved are now household names. Focus on VFM from government doesn’t help as too short term. David Lynn: 3 ways to recognise work: institutions themselves think about their own promotion criteria, FCs think about REF and consider metrics for such activities other than direct research; research funders now requiring data management plans.

Michael Jubb from RIN: Tension between disciplinary difference versus one size fits all. Coggins: consistency means common attitude that it’s worth sharing data and accessible; fierce arguments on data structures etc; seems to be blaming arts & humanities for being poorly funded and behind the digital times (CR: largely not fair!). Lynn: engage with researchers not in the abstract.

Bryan Lawrence again: no new money just old money re-purposed. Always tension between research, teaching & data, and only 1 pot for all of it, so where decision point lands will differ.

Peter Burnhill, EDINA: statement makers versus artisans? Observing what’s happening, can assuming the internet is always on. Need to think about what it means to publish data which includes archival issues. Not just take away, but software access. Variety out there, will be drivers, researchers are driven. What constitutes-well-published data including some notion of designated community to which you are publishing. Lynn, funders are increasing expectation on what researchers do in this respect, but can only drive this so far. Wellcome want researchers to tell us in DM plans what their approach will be to curating the data, making it available (including when); this DM plan is then peer-reviewed (larger grants, population studies etc) as part of the selection process. Coggins: well run labs with careful research systems means data better documented. Burnhill, provenance an issue, yes. Need to understand the drivers for someone wanting to re-analyse someone else’s data.

Kevin Schurer again: reflection on Qs so far. Making fruits of research available not a new conundrum, RS celebrating 350 years. The ways this is done very segmented: researcher, publisher, copy-editor; very segmented process. Problem with data (varies from disciplines): PI often the data generator, publisher & often distributor as well. UKRDS provides possibility to separate those things out again. Agreement from panel.

UKRDS conference 1

(Andy Powell liveblogging this at http://tinyurl.com/dhfdbn)

Chaired by John Wood, who chairs the UKRDS Steering Committee, talking big numbers: 20 PByte per year from CERN, 200 PByte per year from some German project…

Jean Sykes, Project Director on the Challenge of Data Lifecycle Management.. The challenge includes the data deluge, but really it’s about research data as an untapped resource, due to lack of coherent/consistent policies & standards, not to mention infrastructure. Request to libraries for storage in repositories, no added value. But it’s the management of data through the whole lifecycle from creation through to re-use, with storage as one part, that is needed.

Cliff Lynch, Nature: “digital data are so easily shared and replicated and so recombinable, they present tremendous re-use opportunities…”

Four case studies: Bristol, Leeds, Leicester, Oxford (doing their own study at the same time).

Survey results: data volume 360% in 3 years; 50% of data has useful life of up to 10 years, 26% indefinite retention value. Most data held locally (often precarious). 43% think their research could be improved by access to more data.

DCC Lifecycle model “provides a useful standard for data management”. DRAMBORA gets a mention too. JISC Information Environment & JANET provide the infrastructure. JISC & RIN studies provide context.

So a more coherent UK-wide approach is feasible & desirable.

US model: NSF Datanets $100M over 5 years, second round in process. Australia, centralised top-down (but also distributed??).

UKRDS to address sustainability, not just about storage. Many building blocks in place but gaps to be filled, so embrace rather then replace existing facilities, to leverage research value.

Next we have John Milner on the Pathfinder approach. Aim to build a fully functional service but with only a limited population of stakeholders. Comes up with that boil the ocean metaphor that I’m sorry, I hate! Hope is we can build something that scales; something more than a conventional pilot. So start and continue, rather than start and re-start. Pathfinders will be the case studies plus Cardiff, Glasgow & Edinburgh (gets 3 HE funders on board).

Service requires clear expectation for both client and server on what it offers and what it doesn’t. Hmmm unreadable stakeholder diagram, maybe in the notes. Segmented into early and later adopters. Another diagram: simplified view of research service and research data sharing process, plus UKRDS services and administration. Lots of this is advisory, policy, strategy etc. Not much if anything I can see about storage or curation. I’m not quite getting it yet; no long term without short term!!! Front office/back office division; maybe the back office does the hard stuff?

Diagram emerges showing overlap between UKRDS, DCC and RIN. Either alarming or comforting for us, depends how you read it! Also nominate UKDA to help.

DCC & UKDA can provide infrastructure & tools… RIN with ongoing requirements planning &QA…

Summing up as feasible… but not yet addressed where the CURATION is done!

Now Q&A… John Wood on EU 5th freedom: freedom of knowledge!

Malcolm Atkinson, e-Science envoy: glad to see integrated approach, but bit worried maybe arbitrary boundary. Value of data limited if can’t compute on it. Need to think carefully about computational access at same time as data management; hoped to see linkage with e-Science such as NGS; integrated approach necessary. John Wood agrees; EU programme called PREIS??? Always 1 world-leading computer in EU though may move around.

Kevin Schurer, UKDA still confused about process; second slide on “demand-led”. Something missing: the supply; have we over-estimated relationship between demand and supply. 5 Veg a day example; people say yes but don’t do it. Similar for research data: agree they should share but don’t do the necessary: documentation, metadata, etc, so getting useful data out of researchers is hard (this is bitter experience talking!). Do local institutions have enough skill to support this? JM answers: demand is for the service, reservations are right, but focus is on gaps. Primary requirement is for researchers to do coherent data management plans so that’s where we start. Execution of DM plans another thing? JS: we don’t know until we try it. (CR surely part of the answer is building it into the research process through automation?) Kevin’s response: fear is 4 institutions not the most typical in the country, so can it scale up to the wider group? Leicester as the non-Russell Group uni.

Mark Thorley, NERC, and OECD WG on data management guidelines. Solution fundamentally flawed? NERC spends more in 1 year than UKRDS on 5 years, but still only scratching the surface. Naïve on what can be done. Need to talk now in DETAIL on people with real experience of difficulties of getting researchers engaged in process. Somewhat fudgy answer: have to start somewhere.

Bryan Lawrence, Director of Environmental archival: miss Mark’s point: worth doing but still hopelessly naïve, and frightened by the lack of relaisation that first attempt is bound to be wrong and may need to be thrown away and re-done. JM answer: point is that commitments to researchers have to be honoured. BL: make sure the scope of what you’re committing reflects the scope of what you can do.

Digital Curation Blog