Showing posts with label Data management. Show all posts

Thursday, 6 August 2009

JISC Data Management Infrastructure call

I gather from a tweet from Simon Hodson (Programme Manager at JISC for the Data Management Infrastructure call, 07/09) that 34 bids were received for this call. With perhaps 8 able to be funded within the funding envelope, there's a very strong chance of an extremely interesting programme. I guess quite a few of us will be getting a batch of proposals to review in the next few days; despite a small feeling of dread at the workload (mostly evenings), I usually find I quite enjoy this process! There's so much to learn, even if (theoretically) we are supposed to forget it immediately!

Thursday, 23 July 2009

Managing and sharing data

I really like the UK Data Archive publication "Managing and Sharing Data: a best practice guide for researchers". It's been available in print form for a while, although I did hear they had run out and were revising it. Meanwhile, you can get the PDF version from their web site. It is complemented by sections of their web site which reflect, and sometimes expand on, the sections in the Guide. The sections are:
Sharing data - why and how?
Consent, confidentiality and ethics
Copyright
Data documentation and metadata
Data formats and software
Data storage, back-up, and security
This is an excellent resource!

Thursday, 4 June 2009

SNIA "Terminology Bridge" report

Quite a nice report has just been published by the Storage Networking Industry Association's Data Management Forum, called "Building a Terminology Bridge: Guidelines for Digital Information Retention and Preservation Practices in the Datacenter". I don't think I'd agree with everything I saw on a quick skim, but overall it looks like a good set of terminology definitions.

The report identifies "two huge and urgent gaps that need to be solved. First, it is clear that digital information is at risk of being lost as current practices cannot preserve it reliably for the long-term, especially in the datacenter. Second, the explosion of the amount of information and data being kept long-term make the cost and complexity of keeping digital information and periodically migrating it prohibitive." (I'm not sure that I agree with their apocalyptic cost analysis, but it certainly deserves some serious thought!)

However, while still addressing these large problems, they found that what "began as a paper focused at developing a terminology set to improve communication around the long-term preservation of digital information in the datacenter based on ILM[*]-practices, has now evolved more broadly into explaining terminology and supporting practices aimed at stimulating all information owning and managing departments in the enterprise to communicate with each other about these terms as they begin the process of implementing any governance or service management practices or projects related to retention and preservation."

It's worth a read!

(* ILM = Information Lifecycle Management, generally not related to the Curation Lifecycle, but oriented towards management of data on appropriate storage media, eg moving less-used data onto offline tapes, etc.)

Tuesday, 2 June 2009

New JISC Research Data Management Programme

I have not been posting for over a month now, due to pressure of work (bid writing, mainly), and it's curiously hard to get back into it. There are plenty of things to write about; too many, really; too hard to pick the right one. But now something has arrived that absolutely demands to be blogged: the JISC Research Data Programme, Data Management Infrastructure Call for Projects (JISC 07/09) has been released (closing date 6 August 2009, Community Briefing 6 July 2009)!

We are very excited about this. The Programme aims for 6-8 projects in English or Welsh (not Scottish or Irish, grrrr) institutions that will identify requirements to manage data created by researchers, and then will deploy a pilot data management infrastructure to address these requirements. Coupled with some projects already funded under the JISC 12/08 call (CLARION at Cambridge, Bril at Kings, EIDCSR at Oxford, Lifespan RADAR at Royal Holloway, and the Materials Data Centre at Southampton), and some work not eligible for this call, such as Edinburgh DataShare, these projects will really begin to build experience in managing research data at institutional level (or groups of institutions) in the UK.

The DCC has a key role in the Programme. Apart from mentions of the Curation Lifecycle and the Data Audit Framework, paragraph 30 of the Call says
"30. Role of the Digital Curation Centre (DCC): Bidders are invited to consult with the DCC in preparing their bids. The DCC will provide general support for this strand of activities and for the programme more broadly. This will be done by contributions to programme events as well as the current channels of information, and through its principal role as a broker for expertise and advice in the management and curation of data. Projects are encouraged to engage directly with the DCC and its programme of information exchange - for example, by contributing to the Research Data Management Forum (RDMF)."
We are preparing to deliver on this role both in the short term (during the bid-writing phase), and later during the Programme execution. I have been very pleased to work with Simon Hodson, the JISC Programme Manager for this Call. I'm sure Simon does not yet realise how much influence he will be exerting over the future of research data curation in institutions in the UK!

Thursday, 26 February 2009

UKRDS session 4

At last the big bucks: Paul Hubbard from HEFCE. Policy case for a national service: if it needs to be done, can be done, and they have the resources, then they will do it. National policy & funding picture: have had a very good ten years (CR: sounds ominous) for a world-class national research base. Note HEFCE now a minority funder of research; comes about 3rd in the ranking in UK. Can’t say HEFCE will fund this but can say HEFCE will work with others to fund it. Low note: recession ☹. HEFCE role: hands out money for research but doesn’t say what it’s for, celebrate that. Research base is competitive, dynamic & responsive, and efficient.

How? Excellent strong & efficient research information resources. Print services, collaborative collection management, national research libraries, BL Document Supply centre, UK Research Reserve (launched last week with £10M) building on it. Online resources: licensing EJs, JISC Infrastructure services and DCC as key player (hey, he said that!!!).

Making the case: HEFCE hopes to see a good, strong clear, specific case. Case made in general but not yet in detail. Drawing on experience with UK Research Reserve, need to show it meets a proven need in the research base, that data is being and will be produced that can be (and is) re-used, and that researchers will collaborate in ensuring data stored and demand access to it later. Need a consortium of funders, notably RCs, Health Services and major charities, for sustainability and efficiency. Need standards for what is kept, for how long and in what form, cataloguing and metadata, security, and scale. Key point, need to improve efficiency in research process, so need sound business case. Evidence for levels of current and foreseeable demand, evidence for savings from storing data that would otherwise have to be re-created later. Also the capital and recurrent costs of doing all this.

Need strong arrangements for management and future sustainability: ownership of storage, of the system, of the data. Who determines standards & procedures. Arrangements need to be designed with a good chance for lasting many years, and that the implementation is thoroughly planned. So good in principle, needs more detail, so, over to you (us!).

Malcolm Read, head of JISC. JISC already funds a lot of work in this area. Three areas:
  • technical advice and standards: DCC (he says he would not want to move the DCC “at this stage” into the fledgling UKRDS, thank goodness);
  • practitioner advice: UKRDS;
  • Repositories split two ways, subject repositories from RCs, but won’t be universal in UK, institutional repositories from HEIs maybe a SHED shared service option. (CR: what is this? Another shared data centre proposal? Maybe it’s about sharing the storage?)
UKRDS to give advice on
- What to save
- How to save it
- When to discard it, and
- Subject specialist.
Need some sort of horizontal vision across the infrastructure (CR, my orthogonal curation layer???) So more work to do:
- ownership/IPR (don’t care who but does care it’s unambiguous)
- how much to save
- capital (startup) costs - funding service
- recurrent costs – long term sustainability
- coherence across repositories (wants more assurance and comfort on this)
- scope for creating a data management profession (career opportunities, hierarchy, training & skills etc; cf learning technologists & digital librarians).
Governance options: depends on ownership: HEIs or FCs; responsibilities etc. Key is advisory rather than regulatory. Recommends: HEI owned, possibly in receipt of FC/JISC grant; initially a formal partnership of HEIs under an MOU; eventually a separate legal entity by subscription.

Q&A: CR on the UK equivalent of the Australian Code for the Responsible Conduct of Research; the UK equivalent is much weaker, and a revision has been in consultation but perhaps not much changed. Malcolm Read also impressed by the Australian code, and also by the German statement last year. Astrid Wissenburg from ESRC says it's more than just the Code. She also welcomes the collaborative approach, but wonders at the HEI-focused approach. Paul H expects links to RC-funded centres to be key. Sheila Cannell from Edinburgh on the suggested profession: does it emerge from researchers or from existing professions such as librarians, archivists, IT etc? Malcolm says: yes!

Simon Courage (?) from Sunderland, on researcher recognition: will citations of data be counted in the next REF? Paul Hubbard says the short answer is no, as there is no clear way of doing it. Not willing to act as incentiviser of everything; it should be done because it's right for the science and required by the institution, and only enforced if nothing else works. Schurer: of 70 staff, only 2 are from library backgrounds, so he doesn't agree with Malcolm. Read: yes, lots of people think that; not all of them are right (???). Mark Thorley from NERC: really need to make use of the knowledge and expertise in existing data work. (CR: library key because of organisational centrality/locus?)

Closing speech by David Ingram from UCL: summing up and moving forward. Science being transformed. Bio-informatics now core discipline of biology (this very room, 2005). Research & practice increasingly information intensive. Multiple legacy information systems in use; problems in supporting and linking health care, research and industry. Government creating pervasive new ICT infrastructure & core services. Other national/international initiatives creating relevant infrastructures & services. Talks about the MRC Data Sharing Service. Some nice Escher pics: order and chaos: use & re-use of data needs to be careful and context aware. Got to be very sensitive to the nature & provenance of the data.

Information explosion in health care: shows growth in SNOMED terminology to around 200K terms, and this is getting away from us. Too many ways of saying the same thing, and too hard to work out that it's so. Comment on the importance of stopping people doing (unuseful) things.

Data explosion. At 100 MB/sec (??), a petabyte takes 116 days to transfer, so you need a van for your backup.
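(The back-of-envelope arithmetic does check out, assuming decimal units; a minimal sketch:)

```python
# Sanity-check the figure from the talk: moving a petabyte over a
# sustained 100 MB/s link, using decimal units (1 PB = 10^15 bytes).
PETABYTE = 10**15          # bytes
RATE = 100 * 10**6         # bytes per second (100 MB/s)

seconds = PETABYTE / RATE  # 10^7 seconds
days = seconds / 86400     # 86400 seconds per day

print(f"{days:.0f} days")  # about 116 days
```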

The circle of knowledge: are there logical relationships, an epistemology for linking things across domains? (CR: I may have missed the point there.)

Dimensions of challenge
  • Diversity
  • Discipline and standards
  • Scale
  • Evolution over time
  • Willingness and capacity to engage (potential losers as well as winners)
  • Education and capacity at all levels
  • Implementation: be completely realistic
Dilemmas
  • Shareable vs shared; to do with motivation and practice improvement
  • Commonality vs diversity; requirements
  • Confidential/restricted vs public; about access management and control
  • Competition vs cooperation; modus operandi
  • Global vs local; implementation, capacity, organisation and resources
  • Standardised for general applicability vs optimised for particular purposes; what approach will work
Direction of travel: the problem is urgent; research requirements should drive data standards and shared services; research communities must value and own the outcomes; practical experience in implementation (CR: read, existing data centres etc?) should guide policy, strategy & investment; and rigorously control scope, scale and timing.

Zimmerli: IT as a horizontal technology that allows us to do previously impossible things.

Finishes with Escher picture Ascending and descending, to suggest a need for a constructive balance between top-down and bottom-up. Needs an experimental approach; will need to evolve. Times in 1834: unlikely the medical profession will ever use stethoscopes, as their use requires substantial training and involves great difficulties. UKRDS in this sort of area?

Martin Lewis closing: keep this on the agenda, can’t get everything right, so we need to start.

UKRDS session 3

Ross Wilkinson, Director of the Australian National Data Service (ANDS), live via satellite (saving carbon, if not sleep), has been building on a lot of work, particularly in the UK. The role of data is growing; data now answer questions unrelated to why they were collected originally. Who cares, and why, varies a lot. Pleased that the Australian Govt cares to the extent of $24M. Institutions are driven by the Code for the Responsible Conduct of Research (CR: blogged about this before): they have to take data seriously. Not clear if disciplines care at the national level (some have cared at an international level). Researchers SHOULD care, but though he has been a researcher he hasn't cared enough. So why should they? First the Code, second the role of data citation, third the changing nature of research.

The Code has obligations on the funders, the institutions AND the researchers. Probably a bit of a sleeper (researchers largely unaware yet), but funders are going to ratchet this up.

Fame: data citation. Sharing detailed research data associated with increased citation rate. This is about stories: Human Genome, Hubble telescope.

ANDS has 4 coordinated inter-related service delivery programs: frameworks, utilities, seeding the commons, and capability development. Plus service development funded through NeAT in partnership with ARCS (foreign acronym soup is so much less clear than one’s own!)

Frameworks on influencing national policies, eg count data citation rates just like publication citation rates; proposals should have slot for data space. Common understanding of data management issues, eg what is best practice. Don’t see much best practice; what is decent practice?

Utilities: discovery service (you come to us, we come to you flavours), persistent identifiers for deposited data, and collections registry.

Seeding the commons: improve & standardise institutional repositories; need routine deposit in sustainable infrastructures, etc (missed some).

Capability development: assist researchers to align data management practice, to improve quantity & quality of data, to do the right thing.

Not too much theory: build stuff. National collections registry service ready, work on persistent identifier and discovery services. Exemplar discovery spaces (water? Crystallography, law, biosecurity?). Advice not storage or management for data: this is done locally but networked. Layer on top of data.

Shows an exemplar page on dugongs, an aquatic mammal found off the Australian coast. Might build linkages from data and researchers to allow browsing of the space. Using ISO 2146 to drive this: collections, parties, activities, services etc all need to be described, and the relationships matter. Not a discovery but a browsing service? Harvest info from a wide variety of sources beyond the data.

Talks about See Also service: other info to pop up (cf Amazon; other data come to you). Maybe could share such a service with the UK; if you’re interested in this then you might also be interested in this other stuff happening in the UK…

What do researchers get? Local managed data repository (from code); persistent identified data, and data discovery. Mission is more researchers sharing/re-using more data more often. Lower costs and raise benefits.

Q&A (so Ross can go home to bed): Robin Rice from Edinburgh on the Repository word: many think about it in terms of publications, so how does it differ? A problem word, with overloaded usage. Not necessarily a big issue; may use different underlying technology, but that's not that important. Different tools and indexes maybe, but conceptually not that different.

Kevin Schurer again: Likes talk but fundamentally disagrees with one thing: quantity over quality? Our funders would prefer to ingest 10 datasets a year with huge use and relevance than 4,000 that don’t. It’s not really about quantity or quality but impact. Could be a data car-boot sale rather than data Sotheby’s. Ross likes an argument but does disagree. Eg Monash group with carefully crafted crystallography datasets, but want to get people through the door and then move up in quality, driven by discipline and community.

John Wood, what if institution doesn’t have capability to hold the data. Ross, many looking for funding to do so, but they could if they wanted enough. ANDS not the source of such funding.

Peter Burnhill again: mentioned Geospatial data, as in See also, could you say more about spatial infrastructure work going in and around ANDS. Ross: two areas: marine and earth sciences, with slightly different take on the problem. ANDS doesn’t mind which way wins, want to make sure that it’s engaged with an international perspective, particularly data that’s not intrinsically geo-referenced but geo-ref adds value, eg Atlas of Living Australia.

Mark Thorley from NERC again: any work to add value to data; good work on discovery and persistence, but they find researchers don’t want data in the form generated, but in a way more convenient for re-use. Who does this? Ross: mainly outside; have some funding for such developments but oriented to communities that want to use the data rather than providers. Two areas of struggle: data mining, how useful is that (versus model-based), secondly data visualisation. Brokering conversations but favouring community rather than technology.

Now closer to home, Carlos Morais Pires from EU perspective: capacity-building; domains of infrastructures, and then renewed forward strategy.

Capacities programme versus research programme: a new thing in FP7. What can e-Infrastructures (not ICTs) do for science? Parsimony: sharing resources; increase the user base; but… science is excellence. E-Science double face: e-infrastructures change science, also scientists change e-infrastructures. Shows a 5-layer model, e-Science on top of comms, on theory, on experiment, on ???

Domains: innovating science process: global virtual research communities; accessing knowledge (science data); sharing resources (grid); linking (Geant); designing future facilities (novel e-infrastructures). Aha the programme we heard of before: DEISA and petascale machine.

Information continuums most interesting for today’s conference: between past and present, between raw data and results, between science disciplines, between institutions, between research & education. Nice picture of an integrated view showing LHC, astronomy, biology and health/clinical data on same GRID (CR: be afraid?).

Looking ahead: continue dialogue with horizon of 2020+. Early calls in FP7 heavily over-subscribed by factors of 5 or 8, means very interesting proposals not funded. Some work still for next 2 years. New resolution hope will be adopted next week, and hope EU funding will reflect importance.

Stefan Winkler-Nees from DFG, German approaches and policies. All sitting in the same boat, facing the same problems. 2003, DFG signed OA Berlin Declaration. 2006, DFG position paper on funding priorities through 2015. Last year new national priority initiative on research data. Principle is that DFG tries to support scientists, bringing them together with information experts from libraries, archives and IT (didn’t say Computing Science). Criteria; substantial added value for supra-regional infrastructures, contributions to advancement of scientific infrastructure, science relevance, etc. Subtext: complex funding infrastructure; key is the alliance of German science organisations. June 22, 2008: national Priority Initiative “Digital Information”. Don’t forget legal frameworks.

Have working group on primary research data, drafting policy to promote the need for action and demonstrate usefulness.

Prime objective: transparency, traceability and follow-up use. Honour principles of safeguarding good scientific practice. Can publication of data become a self-evident part of the science recognition system?

What’s on the agenda: disciplines, working across sectors, encouraging publication of datasets, and exploring the international perspective.

Q&A again: Stephen Pinfield from Nottingham, interested in incentivising. Stick approach from funders, maybe other incentives, eg improved citations (CR, see http://digitalcuration.blogspot.com/2008/09/many-citations-flow-from-data.htm, quote "At least 7 of the top 20 UK-based Authors of High-Impact Papers, 2003-07 ranked by citations ... published data sets"). Schurer, what about the JISC Data Oscars; Malcolm Read maybe the data ostriches would be easier?

Burnhill again: reward system key, data must get some sort of recognition; lots of bad data just as bad papers in journals.

UKRDS Conference 2

John Coggins, VP for Life Sciences from Glasgow on researcher’s perspective. Wants to prompt us to think, about “culture change” and resources.

Volume & complexity of digital data is rapidly increasing. In many fields data are rarely used outside the lab/dept. Data management skills are under-developed; research data are often unstructured, poorly documented & inaccessible. Represents a huge un-tapped resource, never even fully analysed by the researchers who generate it. Could be much added value from data mining & combining datasets. (CR from break conversation with James Currall: need lots of additional tacit context knowledge to understand properly.) Most data are stored locally and often lost. Most researchers say they are willing to share data, usually through informal peer exchange networks (CR: important distinction!). Some of those who deposit data regularly do so because journals won't publish unless they do (CR: Bermuda agreement in bio-sciences; Alma Swan says compliance is lower than expected).

Funders want to protect & enhance their investment by making data widely available,. Must be agreement among researcher on quality and format. Slow to achieve, cf story about Protein Data Bank. Story about surplus computers from particle physicists in the 1970s, playing with 7 protein structures; lots of subsequent arguments about the best way to store this stuff in those primitive computers; now 36 years on, 44,000 structures available, huge investment since then (eg early stuff re-examined in light of later advances). Essential to invest in data management, curation & storage as well as for easy access. Needs international collaboration & substantial funding for permanent delivery, eg EBI.

Researchers want confidence that their data will be permanently stored & accessible; that costs will be met; to be able to access other people's data, preferably on a world-wide basis; and training.

Complications: commercially sensitive data, personal data, eg patient data, etc.

All Universities should have access to a UK-wide service. FCs and RCs likely to provide funding (goodwill but not yet much in £ notes). What about private sector? Charities? Other government departments like DEFRA (no-one from DEFRA in the room). These departments not celebrated for willingness to commit resources to research sector.

Way forward: research community would welcome UK-wide approach. Need more consistency, will take time. Researchers need to improve their skills for managing & using data. Significant building blocks in place. International dimension important. Culture change and mindset critical and must be gradually changed, needs investment in training & infrastructure, & spreading best practice. This is the way forward but won’t be easy.

David Lynn from Wellcome Trust (independent charity interested in biomedical research) for a funder’s perspective. (CR: is bio-science TOO good an example; poster child from which wrong inferences can be drawn?) Talking about size & rate of data, about integrating large datasets to gain insights into complex systems (CR: via Google Earth!!! Reminds me of NERC meeting last week again!) Now on immense potential to link research papers and data (CR: yeah!). Also growth in traffic of researchers accessing data from others (via web traffic on Sanger Institute site).

Meeting challenges: infrastructure: key data resources need coordinated and long-term sustainable funding. The ELIXIR programme is aiming to build sustainable infrastructure across Europe. Technical and cultural challenges: coordination & advocacy from key communities (funders, institutions, publishers etc). Data security challenge, important for some parts of the bio-medical community; recent high-profile incidents in the UK. Recent reports from the Academy of Medical Sciences, Council of Sci & Tech, Thomas/Walport review, US Inst of Medicine. Calls for safe havens & codes of practice. So management and governance of such data will be a key concern to retain public confidence. Mentions Bermuda Principles (1996) & Fort Lauderdale principles (2003); need to involve researchers in all such discussions.

Agrees with coordinated approach to preserving key research data & ensure long-term value; must meet needs of researchers & funders. Devil is in the detail; must develop in way that truly adds value; must link effectively with existing activity & expertise; will need buy-in of ALL major funders & stakeholders; will need to accommodate differences in approach between funders & disciplines; & must appropriately resource initiative.

Would like clarification of detail on Pathfinder study, but like to see it go forward; full specification, but with careful assessment of funding implications.

Q&A session. Malcolm Atkinson again. PDB took place initially at national lab at Brookhaven; perhaps the researchers had a bit of time. Those who put together the data for re-use don’t get an immediate reward. Those who get the immediate reward don’t pay attention to the long term but very focused on immediate need. PhD skills means PhD courses take longer. Where does the extra resource come from to invest in this change in researchers, and how to invest in incentive (promotion, RAE etc) to make sure this happens? John Coggins: it’s going to be hard. Was working at Brookhaven at the time of PDB, did manage to publish wee paper, but true, none of those involved are now household names. Focus on VFM from government doesn’t help as too short term. David Lynn: 3 ways to recognise work: institutions themselves think about their own promotion criteria, FCs think about REF and consider metrics for such activities other than direct research; research funders now requiring data management plans.

Michael Jubb from RIN: Tension between disciplinary difference versus one size fits all. Coggins: consistency means common attitude that it’s worth sharing data and accessible; fierce arguments on data structures etc; seems to be blaming arts & humanities for being poorly funded and behind the digital times (CR: largely not fair!). Lynn: engage with researchers not in the abstract.

Bryan Lawrence again: no new money just old money re-purposed. Always tension between research, teaching & data, and only 1 pot for all of it, so where decision point lands will differ.

Peter Burnhill, EDINA: statement makers versus artisans? Observing what's happening, we can assume the internet is always on. Need to think about what it means to publish data, which includes archival issues. Not just take away, but software access. Variety out there; there will be drivers, researchers are driven. What constitutes well-published data, including some notion of the designated community to which you are publishing? Lynn: funders are increasing expectations on what researchers do in this respect, but can only drive this so far. Wellcome want researchers to tell us in DM plans what their approach will be to curating the data and making it available (including when); this DM plan is then peer-reviewed (larger grants, population studies etc) as part of the selection process. Coggins: well-run labs with careful research systems mean data are better documented. Burnhill: provenance an issue, yes. Need to understand the drivers for someone wanting to re-analyse someone else's data.

Kevin Schurer again: a reflection on the Qs so far. Making the fruits of research available is not a new conundrum; the RS is celebrating 350 years. The way this is done is very segmented: researcher, publisher, copy-editor. The problem with data (it varies across disciplines): the PI is often the data generator, publisher & often distributor as well. UKRDS provides the possibility to separate those things out again. Agreement from the panel.

Wednesday, 18 February 2009

Notes from NERC Data Management workshop 1

David Bloomer, NERC CIO (and Finance Director) opened the workshop, and talked about data acquisition, data curation, data access, and data exploitation, in the context of developing NERC Information Strategy. Apparently NERC does not currently have an Information Strategy, as the last effort was thrown out in Council. Clearly from his point of view, the issue was about working out whether data management is being done well enough, and how it can be done better within the resources available.

There were some interesting comments about licensing and charging: principles that I summarise as:
  • encouraging re-use and re-purposing,
  • all free for teaching and research,
  • that the licence and its cost depends on the USE and not the USER, and
  • that NERC should support all kinds of data users equally.
Not yet sure of the full implications of this; it clearly doesn't mean that all data is freely accessible to everyone! However, it sounds like a major improvement over recent practices, with some NERC bodies charging high prices for some of their data.

In the second session, after my talk which I have already blogged (although not the 20 minutes of panic when first the PowerPoint transferred from my Mac would not load up on Windows, and second my Mac would not recognise the external video for the projectors!), there were two short talks by Dominic Lowe from BADC on Climate Science Modelling Language, and Tim Duffy from BGS on GeoSciML.

My notes on the former are skeletal in the extreme (ie nonsense!), other than that it is an application of GML, and is based on a data model expressed with UML. However, I picked up a bit more about the latter.

The scope of GeoSciML is the scientific information linked to the geography, except the cartography. It is mainly interpretive or descriptive, and so far includes 25 internationally-agreed vocabularies. Taken 5 years or so to develop to this point. Based around GML, using UML for modelling, using OGC web services to render maps, format, query etc. Provides interoperability and perhaps a “unified” view, but does not require change to local (end) systems, nor to local schemas. Part of the claimed value of the process was exposing that they did not understand their data as well as they thought!

Tuesday, 17 February 2009

Data Curation for Integrated Science: the 4th Rumsfeld Class!

I had to talk today to a workshop of NERC (Natural Environment Research Council) Data Managers, on data curation for integrated science. Integrated science does seem to be a term that comes up a lot in environmental sciences, for good reasons when contemplating global change. However, there doesn’t seem to be a good definition of integrated science on the web, so I had to come up with my own for the purposes of the talk: “The application of multiple scientific disciplines to one or more core scientific challenges”. The point of this was that scientists operating in integrated science MUST be USING unfamiliar data. Then the implications for data managers, particularly environmental data managers, were that they must make their data available for unfamiliar users. What does this imply?

By some strange mental process, and a fortuitous Google search, this led me to the Poetry & Philosophy of Donald H. Rumsfeld, as exposed to the world by Hart Seely, initially in Slate in April 2003 and now published in Seely's book “Pieces of Intelligence”. These poems are well worth a look for their own particular delight, but the one I was looking for, The Unknown, you will probably have heard in various guises:
‘As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.’
—Feb. 12, 2002, Department of Defense news briefing
Now this insightful (!) poem (set to music by Bryant Kong, available at http://www.stuffedpenguin.com/rumsfeld/lyrics.htm) perhaps defines 3 epistemological classes:
  • Known knowns,
  • Known unknowns, and
  • Unknown unknowns.
Logically there should be a 4th Rumsfeld class: the unknown knowns. And I think this class is especially important for data management for unfamiliar users.
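Just for fun (my own illustration, nothing from the talk): the four classes fall out as the 2×2 cross-product of awareness (do we know that we know it?) and knowledge (do we actually know it?), which a throwaway Python sketch makes obvious:

```python
# Enumerate the four "Rumsfeld classes" as the cross-product of
# awareness ("known"/"unknown") and knowledge ("knowns"/"unknowns").
from itertools import product

awareness = ["known", "unknown"]    # are we aware of it?
knowledge = ["knowns", "unknowns"]  # do we actually know it?

classes = [f"{a} {k}" for a, k in product(awareness, knowledge)]
print(classes)
# ['known knowns', 'known unknowns', 'unknown knowns', 'unknown unknowns']
```

The third entry, the unknown knowns, is the one Rumsfeld never mentioned.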

The problem is that many research projects contain too many people who “know too much”; with so much shared knowledge, much goes undocumented. In OAIS terms, we are looking here at a small, tight Designated Community with a shared Knowledge Base, and consequently little need for Representation Information. In integrated science, and particularly the environmental sciences, as the Community broadens and time extends, the need for Representation Information effectively increases. I'm using the terms very broadly here, and RepInfo can be interpreted in many different ways. But the requirement is to make explicit the tacit knowledge, the unknown knowns, implicit in the data creation and acquisition processes.

Interestingly, subsequent speakers seemed to pick up on the idea of making explicit the unknown knowns, so maybe the 4th Rumsfeld Class is here to stay in NERC!

Saturday, 13 December 2008

What makes up data curation?

Following some discussion at the Digital Curation Conference in Edinburgh, how about this:

Data Curation comprises
  • Data management
  • Adding value to data
  • Data sharing for re-use
  • Data preservation for later re-use
Is that a good breakdown for you? Should data creation be in there? I tend to think that data creation belongs to the researcher; once created, the data immediately falls into the data management and adding value categories.

Adding value will have many elements. I’m not sure if collecting and associating contextual metadata is part of adding value, or simply good data management!

Heard at the conference: context is the new metadata (thanks Graeme Pow!).

Wednesday, 19 November 2008

Research Data Management email list created

At JISC's request, we have created a new list to support discussion amongst those interested in management of research data. The list is Research-Dataman@jiscmail.ac.uk, and can be joined at http://www.jiscmail.ac.uk/lists/RESEARCH-DATAMAN.html. The list description is:

"List to discuss the data management issues arising in and from research projects in UK Higher Education and its partners in the UK research community and internationally, established by the Digital Curation Centre on behalf of the JISC."

I was specifically asked not to include the word "curation" in the description, in case it frightened people off! And yes, it's already been hinted to me that the list-name is sexist. Honest, it really was just short for research data management. But you knew that (;-)!

Not much traffic yet, but this is the first time it has been publicised. Please spread the word!