Digital Curation Blog: February 2009

Thursday 26 February 2009

UKRDS session 4

At last the big bucks: Paul Hubbard from HEFCE. Policy case for a national service: if it needs to be done, can be done, and they have the resources, then they will do it. National policy & funding picture: have had a very good ten years (CR: sounds ominous) for a world-class national research base. Note HEFCE now a minority funder of research; comes about 3rd in the ranking in UK. Can’t say HEFCE will fund this but can say HEFCE will work with others to fund it. Low note: recession ☹. HEFCE role: hands out money for research but doesn’t say what it’s for, celebrate that. Research base is competitive, dynamic & responsive, and efficient.

How? Excellent strong & efficient research information resources. Print services, collaborative collection management, national research libraries, BL Document Supply centre, UK Research Reserve (launched last week with £10M) building on it. Online resources: licensing EJs, JISC Infrastructure services and DCC as key player (hey, he said that!!!).

Making the case: HEFCE hopes to see a good, strong clear, specific case. Case made in general but not yet in detail. Drawing on experience with UK Research Reserve, need to show it meets a proven need in the research base, that data is being and will be produced that can be (and is) re-used, and that researchers will collaborate in ensuring data stored and demand access to it later. Need a consortium of funders, notably RCs, Health Services and major charities, for sustainability and efficiency. Need standards for what is kept, for how long and in what form, cataloguing and metadata, security, and scale. Key point, need to improve efficiency in research process, so need sound business case. Evidence for levels of current and foreseeable demand, evidence for savings from storing data that would otherwise have to be re-created later. Also the capital and recurrent costs of doing all this.

Need strong arrangements for management and future sustainability: ownership of storage, of the system, of the data. Who determines standards & procedures. Arrangements need to be designed with a good chance for lasting many years, and that the implementation is thoroughly planned. So good in principle, needs more detail, so, over to you (us!).

Malcolm Read, head of JISC. JISC already funds a lot of work in this area. Three areas:

technical advice and standards: DCC (he says he would not want to move the DCC “at this stage” into the fledgling UKRDS, thank goodness);
practitioner advice: UKRDS;
Repositories split two ways, subject repositories from RCs, but won’t be universal in UK, institutional repositories from HEIs maybe a SHED shared service option. (CR: what is this? Another shared data centre proposal? Maybe it’s about sharing the storage?)

UKRDS to give advice on

- What to save
- How to save it
- When to discard it, and
- Subject specialist.

Need some sort of horizontal vision across the infrastructure (CR, my orthogonal curation layer???) So more work to do:

- ownership/IPR (don’t care who but does care it’s unambiguous)
- how much to save
- capital (startup) costs - funding service
- recurrent costs – long term sustainability
- coherence across repositories (wants more assurance and comfort on this)
- scope for creating a data management profession (career opportunities, hierarchy, training & skills etc; cf learning technologists & digital librarians).

Governance options: depends on ownership: HEIs or FCs; responsibilities etc. Key is advisory rather than regulatory. Recommends: HEI owned, possibly in receipt of FC/JISC grant; initially a formal partnership of HEIs under an MOU; eventually a separate legal entity by subscription.

Q&A: CR on UK equivalent of Australian Code for Responsible conduct of research, UK equivalent much weaker, revision been in consultation but perhaps not much changed. Malcolm Read also impressed by the Australian code, and also by the German statement last year. Astrid Wissenburg from ESRC, says it’s more than just the Code. But also welcomes the collaborative approach, but wonders at the HEI-focused approach. Paul H expects links to RC-funded centres to be key. Sheila Cannell from Edinburgh on the suggested profession: does it emerge from researchers or from existing professions such as librarians, archivists, IT etc. Malcolm says: yes!

Simon Courage (?) Sunderland, from aspect of researcher recognition: will citations of data be counted in next REF. Paul Hubbard says short answer is no, as no clear way of doing it. Not willing to act as incentiviser of everything; should be done because it’s right for the science and required by the institution, and only enforced if nothing else works. Schurer: of 70 staff, only 2 from library backgrounds, so doesn't agree with Malcolm. Read, yes lots of people think that, not all of them are right (???). Mark Thorley from NERC: really need to make use of the knowledge and expertise in existing data work. (CR: library key because of organisational centrality/locus?).

Closing speech by David Ingram from UCL: summing up and moving forward. Science being transformed. Bio-informatics now core discipline of biology (this very room, 2005). Research & practice increasingly information intensive. Multiple legacy information systems in use; problems in supporting and linking health care, research and industry. Government creating pervasive new ICT infrastructure & core services. Other national/international initiatives creating relevant infrastructures & services. Talks about the MRC Data Sharing Service. Some nice Escher pics: order and chaos: use & re-use of data needs to be careful and context aware. Got to be very sensitive to the nature & provenance of the data.

Information explosion in health care: shows growth in SNOMED terminology to around 200K terms, and this is getting away from us. Too many ways of saying the same thing and too hard to work out that it’s so. Comment on importance of stopping people doing (unuseful) things.

Data explosion. Transferred at 100 MB/sec (??), a Petabyte takes 116 days, so need a van for your backup.
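A quick back-of-envelope check of that figure, as a minimal sketch. It assumes decimal units (1 Petabyte = 10^15 bytes, 1 MB = 10^6 bytes) and a sustained rate with no protocol overhead; the names below are illustrative only and not from the talk.

```python
# Back-of-envelope transfer time for a Petabyte at a fixed sustained rate.
# Assumptions (not from the talk): decimal units and zero overhead.

PETABYTE_BYTES = 10**15            # 1 PB in decimal units
RATE_BYTES_PER_SEC = 100 * 10**6   # 100 MB/sec
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, rate_bytes_per_sec):
    """Return the number of days needed to move size_bytes at the given rate."""
    return size_bytes / rate_bytes_per_sec / SECONDS_PER_DAY


days = transfer_days(PETABYTE_BYTES, RATE_BYTES_PER_SEC)
print(f"{days:.1f} days")  # prints "115.7 days"
```

So 116 days is the rounded figure for 100 megabytes per second. If the slide had meant 100 megabits per second instead, the same sum gives about 926 days, and the van looks even better.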

The circle of knowledge: are there logical relationships, epistemology about linking things across domains (CR: I may have missed the point there).

Dimensions of challenge

Diversity
Discipline and standards
Scale
Evolution over time
Willingness and capacity to engage (potential losers as well as winners)
Education and capacity at all levels
Implementation: be completely realistic

Dilemmas

Shareable vs shared; to do with motivation and practice improvement
Commonality vs diversity; requirements
Confidential/restricted vs public; about access management and control
Competition vs cooperation; modus operandi
Global vs local; implementation, capacity, organisation and resources
Standardised for general applicability vs optimised for particular purposes; what approach will work

Direction of travel: problem is urgent; research requirements should drive data standards and shared services; research communities must value and own the outcomes; practical experience in implementation (CR: read, existing data centres etc?) should guide policy, strategy & investment; and rigorously control scope, scale and timing.

Zimmerli: IT as a horizontal technology that allows us to do previously impossible things.

Finishes with Escher picture Ascending and descending, to suggest a need for a constructive balance between top-down and bottom-up. Needs an experimental approach; will need to evolve. Times in 1834: unlikely the medical profession will ever use stethoscopes, as their use requires substantial training and involves great difficulties. UKRDS in this sort of area?

Martin Lewis closing: keep this on the agenda, can’t get everything right, so we need to start.

UKRDS session 3

Ross Wilkinson, Director of Australian National Data Service (ANDS), live via satellite (saving carbon, if not sleep) has been building on a lot of work particularly in the UK. Role of data more etc; now answer questions unrelated to why the data were collected originally. Who cares and why varies a lot. Pleased that Australian Govt cares to extent of $24M. Institutions driven by the Code for responsible conduct of research (CR: blogged about this before): have to take data seriously. Not clear if disciplines care at the national level (some have cared at an international level). Researchers SHOULD care, but though he has been a researcher he hasn’t cared enough. So why should they? First the Code, second the role of data citation, 3rd changing nature of research.

The Code has obligations on the funders, the institutions AND the researchers. Probably a bit of a sleeper (researchers largely unaware yet), but funders are going to ratchet this up.

Fame: data citation. Sharing detailed research data associated with increased citation rate. This is about stories: Human Genome, Hubble telescope.

ANDS has 4 coordinated inter-related service delivery programs: frameworks, utilities, seeding the commons, and capability development. Plus service development funded through NeAT in partnership with ARCS (foreign acronym soup is so much less clear than one’s own!)

Frameworks on influencing national policies, eg count data citation rates just like publication citation rates; proposals should have slot for data space. Common understanding of data management issues, eg what is best practice. Don’t see much best practice; what is decent practice?

Utilities: discovery service (you come to us, we come to you flavours), persistent identifiers for deposited data, and collections registry.

Seeding the commons: improve & standardise institutional repositories; need routine deposit in sustainable infrastructures, etc (missed some).

Capability development: assist researchers to align data management practice, to improve quantity & quality of data, to do the right thing.

Not too much theory: build stuff. National collections registry service ready, work on persistent identifier and discovery services. Exemplar discovery spaces (water? Crystallography, law, biosecurity?). Advice not storage or management for data: this is done locally but networked. Layer on top of data.

Shows exemplar page on Dugongs, aquatic mammal found off the Australian coast. Might build linkages from data and researchers to allow browsing of the space. Using ISO 2146 to drive this: collections, parties, activities, services etc all need to be described, and the relationships matter. Not discovery but browsing service? Harvest info from a wide variety of sources beyond the data.

Talks about See Also service: other info to pop up (cf Amazon; other data come to you). Maybe could share such a service with the UK; if you’re interested in this then you might also be interested in this other stuff happening in the UK…

What do researchers get? Local managed data repository (from code); persistent identified data, and data discovery. Mission is more researchers sharing/re-using more data more often. Lower costs and raise benefits.

Q&A (so Ross can go home to bed): Robin Rice from Edinburgh: the Repository word, many think about it in terms of publications, how does it differ? Problem word, with overloaded usage. Not necessarily a big issue; may use different underlying technology, but that’s not that important. Different tools and indexes maybe, but conceptually not that different.

Kevin Schurer again: Likes talk but fundamentally disagrees with one thing: quantity over quality? Our funders would prefer to ingest 10 datasets a year with huge use and relevance than 4,000 that don’t. It’s not really about quantity or quality but impact. Could be a data car-boot sale rather than data Sotheby’s. Ross likes an argument but does disagree. Eg Monash group with carefully crafted crystallography datasets, but want to get people through the door and then move up in quality, driven by discipline and community.

John Wood, what if institution doesn’t have capability to hold the data. Ross, many looking for funding to do so, but they could if they wanted enough. ANDS not the source of such funding.

Peter Burnhill again: mentioned Geospatial data, as in See also, could you say more about spatial infrastructure work going in and around ANDS. Ross: two areas: marine and earth sciences, with slightly different take on the problem. ANDS doesn’t mind which way wins, want to make sure that it’s engaged with an international perspective, particularly data that’s not intrinsically geo-referenced but geo-ref adds value, eg Atlas of Living Australia.

Mark Thorley from NERC again: any work to add value to data; good work on discovery and persistence, but they find researchers don’t want data in the form generated, but in a way more convenient for re-use. Who does this? Ross: mainly outside; have some funding for such developments but oriented to communities that want to use the data rather than providers. Two areas of struggle: data mining, how useful is that (versus model-based), secondly data visualisation. Brokering conversations but favouring community rather than technology.

Now closer to home, Carlos Morais Pires from EU perspective: capacity-building; domains of infrastructures, and then renewed forward strategy.

Capacities programme versus research programme: a new thing in FP7. What can e-Infrastructures (not ICTs) do for science? Parsimony: sharing resources; increase the user base; but… science is excellence. E-Science double face: e-infrastructures change science, also scientists change e-infrastructures. Shows a 5-layer model, e-Science on top of comms, on theory, on experiment, on ???

Domains: innovating science process: global virtual research communities; accessing knowledge (science data); sharing resources (grid); linking (Geant); designing future facilities (novel e-infrastructures). Aha the programme we heard of before: DEISA and petascale machine.

Information continuums most interesting for today’s conference: between past and present, between raw data and results, between science disciplines, between institutions, between research & education. Nice picture of an integrated view showing LHC, astronomy, biology and health/clinical data on same GRID (CR: be afraid?).

Looking ahead: continue dialogue with horizon of 2020+. Early calls in FP7 heavily over-subscribed by factors of 5 or 8, means very interesting proposals not funded. Some work still for next 2 years. New resolution hope will be adopted next week, and hope EU funding will reflect importance.

Stefan Winkler-Nees from DFG, German approaches and policies. All sitting in the same boat, facing the same problems. 2003, DFG signed OA Berlin Declaration. 2006, DFG position paper on funding priorities through 2015. Last year new national priority initiative on research data. Principle is that DFG tries to support scientists, bringing them together with information experts from libraries, archives and IT (didn’t say Computing Science). Criteria; substantial added value for supra-regional infrastructures, contributions to advancement of scientific infrastructure, science relevance, etc. Subtext: complex funding infrastructure; key is the alliance of German science organisations. June 22, 2008: national Priority Initiative “Digital Information”. Don’t forget legal frameworks.

Have working group on primary research data, drafting policy to promote the need for action and demonstrate usefulness.

Prime objective: transparency, traceability and follow-up use. Honour principles of safeguarding good scientific practice. Can publication of data become a self-evident part of the science recognition system?

What’s on the agenda: disciplines, working across sectors, encouraging publication of datasets, and exploring the international perspective.

Q&A again: Stephen Pinfield from Nottingham, interested in incentivising. Stick approach from funders, maybe other incentives, eg improved citations (CR, see http://digitalcuration.blogspot.com/2008/09/many-citations-flow-from-data.htm, quote "At least 7 of the top 20 UK-based Authors of High-Impact Papers, 2003-07 ranked by citations ... published data sets"). Schurer, what about the JISC Data Oscars; Malcolm Read maybe the data ostriches would be easier?

Burnhill again: reward system key, data must get some sort of recognition; lots of bad data just as bad papers in journals.

UKRDS conference 2

John Coggins, VP for Life Sciences from Glasgow on researcher’s perspective. Wants to prompt us to think, about “culture change” and resources.

Volume & complexity of digital is rapidly increasing. In many fields data are rarely used outside lab/dept. Data management skills under-developed, research data often unstructured & poorly documented & inaccessible. Represents huge un-tapped resource. Never even fully analysed by researchers who generate it. Could be much added value from data mining & combining datasets. (CR from break conversation with James Currall: need lots of additional tacit context knowledge to understand properly.) Most data stored locally and often lost. Most researchers say they are willing to share data, usually through informal peer exchange networks (CR: important distinction!). Some of those who deposit data regularly do so because journals won’t publish unless they do (CR: Bermuda agreement in bio-sciences; Alma Swan says compliance lower than expected).

Funders want to protect & enhance their investment by making data widely available. Must be agreement among researchers on quality and format. Slow to achieve, cf story about Protein Data Bank. Story about surplus computers from particle physicists in the 1970s, playing with 7 protein structures; lots of subsequent arguments about the best way to store this stuff in those primitive computers; now 36 years on, 44,000 structures available, huge investment since then (eg early stuff re-examined in light of later advances). Essential to invest in data management, curation & storage as well as for easy access. Needs international collaboration & substantial funding for permanent delivery, eg EBI.

Researchers want confidence their data will be permanently stored & accessible; that costs will be met; be able to access other people’s data, preferably on world-wide basis; and training.

Complications: commercially sensitive data, personal data, eg patient data, etc.

All Universities should have access to a UK-wide service. FCs and RCs likely to provide funding (goodwill but not yet much in £ notes). What about private sector? Charities? Other government departments like DEFRA (no-one from DEFRA in the room). These departments not celebrated for willingness to commit resources to research sector.

Way forward: research community would welcome UK-wide approach. Need more consistency, will take time. Researchers need to improve their skills for managing & using data. Significant building blocks in place. International dimension important. Culture change and mindset critical and must be gradually changed, needs investment in training & infrastructure, & spreading best practice. This is the way forward but won’t be easy.

David Lynn from Wellcome Trust (independent charity interested in biomedical research) for a funder’s perspective. (CR: is bio-science TOO good an example; poster child from which wrong inferences can be drawn?) Talking about size & rate of data, about integrating large datasets to gain insights into complex systems (CR: via Google Earth!!! Reminds me of NERC meeting last week again!) Now on immense potential to link research papers and data (CR: yeah!). Also growth in traffic of researchers accessing data from others (via web traffic on Sanger Institute site).

Meeting challenges: infrastructure: key data resources need coordinated and long-term sustainable funding. ELIXIR programme aiming to build sustainable infrastructure across Europe. Technical and cultural challenges: coordination & advocacy from key communities (funders, institutions, publishers etc). Data security challenge, important for some parts of bio-medical community. Recent high profile incidents in UK. Recent reports from Academy of Medical Sciences, Council of Sci & tech, Thomas/Walport review, US Inst of Medicine. Calls for safe havens & codes of practice. So management and governance of such data will be a key concern to retain public confidence. Mentions Bermuda Principles (1996) & Fort Lauderdale principles (2003); need to involve researchers in all such discussions.

Agrees with coordinated approach to preserving key research data & ensure long-term value; must meet needs of researchers & funders. Devil is in the detail; must develop in way that truly adds value; must link effectively with existing activity & expertise; will need buy-in of ALL major funders & stakeholders; will need to accommodate differences in approach between funders & disciplines; & must appropriately resource initiative.

Would like clarification of detail on Pathfinder study, but like to see it go forward; full specification, but with careful assessment of funding implications.

Q&A session. Malcolm Atkinson again. PDB took place initially at national lab at Brookhaven; perhaps the researchers had a bit of time. Those who put together the data for re-use don’t get an immediate reward. Those who get the immediate reward don’t pay attention to the long term but very focused on immediate need. PhD skills means PhD courses take longer. Where does the extra resource come from to invest in this change in researchers, and how to invest in incentive (promotion, RAE etc) to make sure this happens? John Coggins: it’s going to be hard. Was working at Brookhaven at the time of PDB, did manage to publish wee paper, but true, none of those involved are now household names. Focus on VFM from government doesn’t help as too short term. David Lynn: 3 ways to recognise work: institutions themselves think about their own promotion criteria, FCs think about REF and consider metrics for such activities other than direct research; research funders now requiring data management plans.

Michael Jubb from RIN: Tension between disciplinary difference versus one size fits all. Coggins: consistency means common attitude that it’s worth sharing data and accessible; fierce arguments on data structures etc; seems to be blaming arts & humanities for being poorly funded and behind the digital times (CR: largely not fair!). Lynn: engage with researchers not in the abstract.

Bryan Lawrence again: no new money just old money re-purposed. Always tension between research, teaching & data, and only 1 pot for all of it, so where decision point lands will differ.

Peter Burnhill, EDINA: statement makers versus artisans? Observing what’s happening, can assume the internet is always on. Need to think about what it means to publish data which includes archival issues. Not just take away, but software access. Variety out there, will be drivers, researchers are driven. What constitutes well-published data including some notion of designated community to which you are publishing. Lynn, funders are increasing expectation on what researchers do in this respect, but can only drive this so far. Wellcome want researchers to tell us in DM plans what their approach will be to curating the data, making it available (including when); this DM plan is then peer-reviewed (larger grants, population studies etc) as part of the selection process. Coggins: well run labs with careful research systems means data better documented. Burnhill, provenance an issue, yes. Need to understand the drivers for someone wanting to re-analyse someone else’s data.

Kevin Schurer again: reflection on Qs so far. Making fruits of research available not a new conundrum, RS celebrating 350 years. The ways this is done very segmented: researcher, publisher, copy-editor; very segmented process. Problem with data (varies from disciplines): PI often the data generator, publisher & often distributor as well. UKRDS provides possibility to separate those things out again. Agreement from panel.

UKRDS conference 1

(Andy Powell liveblogging this at http://tinyurl.com/dhfdbn)

Chaired by John Wood, who chairs the UKRDS Steering Committee, talking big numbers: 20 PByte per year from CERN, 200 PByte per year from some German project…

Jean Sykes, Project Director on the Challenge of Data Lifecycle Management. The challenge includes the data deluge, but really it’s about research data as an untapped resource, due to lack of coherent/consistent policies & standards, not to mention infrastructure. Request to libraries for storage in repositories, no added value. But it’s the management of data through the whole lifecycle from creation through to re-use, with storage as one part, that is needed.

Cliff Lynch, Nature: “digital data are so easily shared and replicated and so recombinable, they present tremendous re-use opportunities…”

Four case studies: Bristol, Leeds, Leicester, Oxford (doing their own study at the same time).

Survey results: data volume 360% in 3 years; 50% of data has useful life of up to 10 years, 26% indefinite retention value. Most data held locally (often precarious). 43% think their research could be improved by access to more data.

DCC Lifecycle model “provides a useful standard for data management”. DRAMBORA gets a mention too. JISC Information Environment & JANET provide the infrastructure. JISC & RIN studies provide context.

So a more coherent UK-wide approach is feasible & desirable.

US model: NSF Datanets $100M over 5 years, second round in process. Australia, centralised top-down (but also distributed??).

UKRDS to address sustainability, not just about storage. Many building blocks in place but gaps to be filled, so embrace rather than replace existing facilities, to leverage research value.

Next we have John Milner on the Pathfinder approach. Aim to build a fully functional service but with only a limited population of stakeholders. Comes up with that boil the ocean metaphor that I’m sorry, I hate! Hope is we can build something that scales; something more than a conventional pilot. So start and continue, rather than start and re-start. Pathfinders will be the case studies plus Cardiff, Glasgow & Edinburgh (gets 3 HE funders on board).

Service requires clear expectation for both client and server on what it offers and what it doesn’t. Hmmm unreadable stakeholder diagram, maybe in the notes. Segmented into early and later adopters. Another diagram: simplified view of research service and research data sharing process, plus UKRDS services and administration. Lots of this is advisory, policy, strategy etc. Not much if anything I can see about storage or curation. I’m not quite getting it yet; no long term without short term!!! Front office/back office division; maybe the back office does the hard stuff?

Diagram emerges showing overlap between UKRDS, DCC and RIN. Either alarming or comforting for us, depends how you read it! Also nominate UKDA to help.

DCC & UKDA can provide infrastructure & tools… RIN with ongoing requirements planning & QA…

Summing up as feasible… but not yet addressed where the CURATION is done!

Now Q&A… John Wood on EU 5th freedom: freedom of knowledge!

Malcolm Atkinson, e-Science envoy: glad to see integrated approach, but bit worried maybe arbitrary boundary. Value of data limited if can’t compute on it. Need to think carefully about computational access at same time as data management; hoped to see linkage with e-Science such as NGS; integrated approach necessary. John Wood agrees; EU programme called PREIS??? Always 1 world-leading computer in EU though may move around.

Kevin Schurer, UKDA still confused about process; second slide on “demand-led”. Something missing: the supply; have we over-estimated relationship between demand and supply. 5 Veg a day example; people say yes but don’t do it. Similar for research data: agree they should share but don’t do the necessary: documentation, metadata, etc, so getting useful data out of researchers is hard (this is bitter experience talking!). Do local institutions have enough skill to support this? JM answers: demand is for the service, reservations are right, but focus is on gaps. Primary requirement is for researchers to do coherent data management plans so that’s where we start. Execution of DM plans another thing? JS: we don’t know until we try it. (CR surely part of the answer is building it into the research process through automation?) Kevin’s response: fear is 4 institutions not the most typical in the country, so can it scale up to the wider group? Leicester as the non-Russell Group uni.

Mark Thorley, NERC, and OECD WG on data management guidelines. Solution fundamentally flawed? NERC spends more in 1 year than UKRDS on 5 years, but still only scratching the surface. Naïve on what can be done. Need to talk now in DETAIL on people with real experience of difficulties of getting researchers engaged in process. Somewhat fudgy answer: have to start somewhere.

Bryan Lawrence, Director of Environmental archival: miss Mark’s point: worth doing but still hopelessly naïve, and frightened by the lack of realisation that first attempt is bound to be wrong and may need to be thrown away and re-done. JM answer: point is that commitments to researchers have to be honoured. BL: make sure the scope of what you’re committing reflects the scope of what you can do.

Monday 23 February 2009

Repositories and preservation

I have a question about how repository managers view their role in relation to long term preservation.

I’m a member of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (hereafter BRTF). At our monthly teleconference last week, we were talking about preservation scenarios, and I suggested the Institutional Repository system, adding that my investigations had shown that repository managers did not (generally) feel they had long term preservation in their brief. There was some consternation at this, and a question as to whether this was based on UK repositories, as there was an expressed feeling that US repositories generally would have preservation as an aim.

My comment was based on a number of ad hoc observations and discussions over the years. But more recently I reported in an analysis of commentary on my Research Repository System ideas on discussions that had taken place on Ideascale last year, during preparatory work for a revision of the JISC Repositories Roadmap.

In this Ideascale discussion, I put forward an Idea relating to Long Term preservation: “The repository should be a full OAIS preservation system”, with the text:

“We should at least have this on the table. I think repositories are good for preservation, but the question here is whether they should go much further than they currently do in attempting to invest now to combat the effects of later technology and designated community knowledge base change...”

See http://jiscrepository.ideascale.com/akira/dtd/2276-784. This Idea turned out to be the most unpopular Idea in the entire discussion, now having gathered only 3 votes for and 16 votes against (net -13).

Rather shocked at this, I formulated another Idea, see http://jiscrepository.ideascale.com/akira/dtd/2643-784: “Repository should aspire to make contents accessible and usable over the medium term”, with the text:

“A repository should be for content which is required and expected to be useful over a significant period. It may host more transient content, but by and large the point of a repository is persistence. While suggesting a repository should be a "full OAIS" has not proved acceptable to this group so far, investment in a repository and this need for persistence suggest that repository managers should aim to make their content both accessible and usable over the medium (rather than short) term. For the purposes of this exercise, let's suggest factors of around 3: short term 3 years, medium term around 10 years, long term around 30 years plus. Ten years is a reasonable period to aspire to; it justifies investment, but is unlikely to cover too many major content migrations.

“To achieve this, I think repository management should assess their repository and its policies. Using OAIS at a high level as a yard stick would be appropriate. Full compliance would not be required, but thought to each major concept and element would be good practice.”

This Idea was much more successful, with 13 votes for and only one vote against, for a net positive 12 votes. (For comparison, the most popular Idea, “Define repository as part of the user’s (author/researcher/learner) workflow” received 31 votes for and 3 against, net 28.)
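For anyone wanting to check the arithmetic, here is a minimal sketch in Python of the net scores quoted above. The vote counts are the ones reported in this post; the shortened labels and the function name are my own, for illustration only.

```python
# Vote counts as reported in the post; labels are shortened for illustration.
ideas = {
    "Full OAIS preservation system": (3, 16),
    "Accessible and usable over the medium term": (13, 1),
    "Repository as part of the user's workflow": (31, 3),
}


def net_votes(votes_for, votes_against):
    """Net score as used in the post: votes for minus votes against."""
    return votes_for - votes_against


# Rank the Ideas from most to least popular by net score.
ranked = sorted(ideas.items(), key=lambda item: net_votes(*item[1]), reverse=True)

for title, (up, down) in ranked:
    print(f"{net_votes(up, down):+4d}  {title}  ({up} for, {down} against)")
```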

Now it may be that the way the first Idea was phrased was the cause of its unpopularity. It appears that the 4 letters OAIS turn a lot of people off!

So, here are 3 possible statements:

1) My repository does not aim for accessibility and/or usability of its contents beyond the short term (say 3 years)
(http://jiscrepository.ideascale.com/akira/dtd/14100-784 )

2) My repository aims for accessibility and/or usability of its contents for the medium term (say 4 to 10 years)
(http://jiscrepository.ideascale.com/akira/dtd/14101-784 )

3) My repository aims for accessibility and/or usability of its contents for the long term (say greater than 10 years).
(http://jiscrepository.ideascale.com/akira/dtd/14102-784 )

Could repository managers tell me which they feel is the appropriate answer for them? Just click on the appropriate URI and vote it up (you may have to register, I’m not sure).

(ermmm, I hope JISC doesn’t mind my using the site like that… I think it’s within the original spirit!)

(This was also a post to the JISC-Repositories list)

Wednesday 18 February 2009

Notes from NERC Data Management workshop 1

David Bloomer, NERC CIO (and Finance Director) opened the workshop, and talked about data acquisition, data curation, data access, and data exploitation, in the context of developing NERC Information Strategy. Apparently NERC does not currently have an Information Strategy, as the last effort was thrown out in Council. Clearly from his point of view, the issue was about working out whether data management is being done well enough, and how it can be done better within the resources available.

There were some interesting comments about licensing and charging: principles that I summarise as:

encouraging re-use and re-purposing,
all free for teaching and research,
that the licence and its cost depend on the USE and not the USER, and
that NERC should support all kinds of data users equally.

Not sure yet of the full implications of this; it clearly doesn’t mean that all data is freely accessible to everyone! However, it sounds like a major improvement over recent practices, with some NERC bodies charging high prices for some of their data.

In the second session, after my talk which I have already blogged (although not the 20 minutes of panic when first the PowerPoint transferred from my Mac would not load up on Windows, and second my Mac would not recognise the external video for the projectors!), there were two short talks by Dominic Lowe from BADC on Climate Science Modelling Language, and Tim Duffy from BGS on GeoSciML.

My notes on the former are skeletal in the extreme (ie nonsense!), other than that it is an application of GML, and is based on a data model expressed with UML. However, I picked up a bit more about the latter.

The scope of GeoSciML is the scientific information linked to the geography, except the cartography. It is mainly interpretive or descriptive, and so far includes 25 internationally-agreed vocabularies. Taken 5 years or so to develop to this point. Based around GML, using UML for modelling, using OGC web services to render maps, format, query etc. Provides interoperability and perhaps a “unified” view, but does not require change to local (end) systems, nor to local schemas. Part of the claimed value of the process was exposing that they did not understand their data as well as they thought!

Tuesday 17 February 2009

Data Curation for Integrated Science: the 4th Rumsfeld Class!

I had to talk today to a workshop of NERC (Natural Environment Research Council) Data Managers, on data curation for integrated science. Integrated science does seem to be a term that comes up a lot in environmental sciences, for good reasons when contemplating global change. However, there doesn’t seem to be a good definition of integrated science on the web, so I had to come up with my own for the purposes of the talk: “The application of multiple scientific disciplines to one or more core scientific challenges”. The point of this was that scientists operating in integrated science MUST be USING unfamiliar data. Then the implications for data managers, particularly environmental data managers, were that they must make their data available for unfamiliar users. What does this imply?

By some strange mental processes, and a fortuitous Google search, this led me to the Poetry & Philosophy of Donald H Rumsfeld, as exposed to the world by Hart Seely, initially in Slate in April 2003, and now published in a book, “Pieces of Intelligence”, by Seely. These poems are well worth a look for their own particular delight, but the one I was looking for you will probably have heard of in various guises, The Unknown:

‘As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.’
—Feb. 12, 2002, Department of Defense news briefing

Now this insightful (!) poem (set to music by Bryant Kong, available at http://www.stuffedpenguin.com/rumsfeld/lyrics.htm) perhaps defines 3 epistemological classes:

Known knowns,
Known unknowns, and
Unknown unknowns

Logically there should be a 4th Rumsfeld class: the unknown knowns. And I think this class is especially important for data management for unfamiliar users.

The problem is that in many research projects, there are too many people who “know too much”; with so much shared knowledge, much is un-documented. In OAIS terms, we are looking here at a small, tight Designated Community with a shared Knowledge Base, and consequently little need for Representation Information. In integrated science, and particularly environmental sciences, as the Community broadens and time extends, effectively the need for Representation Information increases. I’m using the terms very broadly here, and RepInfo can be interpreted here in many different ways. But the requirement is to make more explicit the tacit knowledge, the unknown knowns implicit in the data creation and acquisition process.

Interestingly, subsequent speakers seemed to pick up on the idea of making explicit the unknown knowns, so maybe the 4th Rumsfeld Class is here to stay in NERC!

Friday 13 February 2009

Repositories and the web

There has been an interesting series of posts on repository architectures, implementations and implications, by Andy Powell on the eFoundations blog. He started by wondering if multiple deposit diminished Googlejuice, and when I used an article of mine in comments, to illustrate that Google did cope better than expected (in some cases, at least, we were to discover), he had a look at the relevant DSpace-based repository and reported his horror at what he saw. A later post looking at an ePrints-based repository showed continuing concern, although slightly less so. Both posts attracted multiple comments, including an interesting extempore sketch of an OAI-ORE-based solution to one identified problem, by Herbert Van de Sompel.

The discussion then kicked across into a closed JISC repository discussion list. It's not fair to report that conversation here, but I can at least quote from one of my comments:

"Part of Andy's original point (he's been saying this a lot), was that the repository platforms we use (DSpace, ePrints et al) are not good web-natives. He did a second post looking at [an] ePrints-based site [...], which showed some improvement over DSpace, but some related issues.

[...] Let's remember the list of problems that Andy spotted (in only a cursory examination; there may be more). He's at the DSpace splash page for my item:
a) PDF instead of (or not as well as) HTML for the deposited item
b) confusion over 4 different URIs
c) wrong Title, so Delicious etc won't work well
d) no embedded metadata in the HTML
e) no cross-linking of keywords (cf Flickr tags)
f) ditto author, publisher
g) unhelpful link text for PDF and additional material etc.
I think if those responsible for these repositories were building them today as web sites, they would not have constructed them with (all of) those deficiencies. (Some are, I think, matters of design choice, or at least in the vanguard of web thinking, eg (e).) But the repository platform brings (or is intended to bring) advantages of workflow, scalability, additional features such as OAI-PMH if it is actually useful, etc. It's supposed to save the repository manager the effort of building a purpose-built web site, and to provide some consistency across different repositories, as opposed to the general confusion of University web sites.

So one approach, as I suggested in an earlier email, is simply to invest to improve the repository platforms, so that they are better web natives, using up to date technologies, and so present a more appropriate web presence. Many of those issues could be solved with some software effort, benefiting everyone who uses DSpace (and similar effort could improve ePrints etc). Just add money and time.

But the multi-URI thing is a bit more of a problem; it sounds like the repository model we use is conceptually broken here. Herbert's solution sounds like a major design change for repository world: a move from depositing items in a FRBR sense (I know Andy has some issues with this) to depositing something much more akin to works, instantiated as Resource Maps. I think this is sufficiently different that there would be some serious debate about it. It certainly sounds like it would make sense, although I'm not sure how much good it will do in Googlejuice terms (Google dropped the OAI-PMH maps, so no huge reason to believe they will be interested in OAI-ORE Resource Maps per se, although this may not be the issue).

It would be very difficult for repository managers to craft those Resource Maps by hand in today's world (although clearly Oxford is doing SOMETHING with them). However, it does sound like the sort of thing that repository platforms could do fairly automatically for you. But again there would be significant development work involved. So in this case, we first have an effort to refine Herbert's sketch to a shared view across the repository world, followed by an effort to implement in major platforms. So add money twice, and more time.

Of course, Herbert's probably built it by now!

So my proposal to JISC would be:
invest in upgrading the common UK repository platform software for better web-nativeness (!)
invest in an effort along the lines of previous SWAP/CRIG approaches to get consensus in the Resource Map approach
invest in upgrading the common UK repository platform software to support this approach."

I'm pretty sure some of these issues are also part of what reduces the usefulness of these platforms for data sharing.
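To make problem (d) in that list a little more concrete, here is a rough sketch in Python of what "embedded metadata in the HTML" could look like when generated by a platform. It uses the widely seen "DC.element" convention for Dublin Core in HTML meta elements; the function name and the sample record are hypothetical, and this is not a description of what DSpace or ePrints actually emit.

```python
from html import escape


def dublin_core_meta(record):
    """Render a flat metadata record as HTML <meta> elements.

    Uses the common "DC.<element>" naming convention; a list value
    (e.g. several creators) yields one <meta> element per value.
    """
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in record.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            name = escape(str(element), quote=True)
            content = escape(str(v), quote=True)
            lines.append(f'<meta name="DC.{name}" content="{content}">')
    return "\n".join(lines)


# Hypothetical record, for illustration only.
example = {
    "title": "An example article title",
    "creator": ["A. Author", "B. Author"],
    "format": "application/pdf",
}

print(dublin_core_meta(example))
```

The output would sit in the head of the item's splash page, where harvesters and bookmarking tools can pick it up without parsing the visible page.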

Open Office as a document migration on demand tool- again

We’ve seen suggestions in comments on this blog, and on other blogs, that code is better than specifications as representation information, and that well-used, running open source code is better than proprietary code. We’ve also had assertions that documents should be preserved in their original format, rather than migrated on ingest (I’ve some reservations on this in some cases for data, but as long as the original form is ALSO preserved, it’s fine).

The appropriate strategy for documents in obsolete formats would therefore seem to be to preserve in the original format and migrate on demand, from the original format to a current format, when an actual customer wants to use it. This process should always be left as late as possible, based on the possibility that the migration tool will improve and render the document better with later versions (and allowing the cost to be placed onto the user, not the archive, if appropriate). By the way, this exactly parallels the case in real archives; they don't translate their documents from old Norse to modern English each time the latter changes. If you want to read them, go learn old Norse, or hope that someone has earned a brownie point by translating it for publication somewhere...

I have suggested a couple of times that a plausible hypothesis for “office documents” (ie text documents, spreadsheets, simple drawings, presentations, simple databases), is that OpenOffice.org should be the migration tool of choice. After all, it supposedly reads files in 180+ different file formats, it is open source, it is widely used, it is actively developed, and it can produce output in at least one internationally standardised format. I’ve noted already that it isn’t perfect; the Mac version, for instance, fails to open my problem PowerPoint version 4 files (to be fair, it doesn’t claim to). But perhaps it’s worth taking a look again at the range of formats it claims to deal with. All these figures relate to vanilla OpenOffice.org 3.0.0.0 (build 9358) for the Mac (the first fully native, up to date Mac version I've been able to lay my hands on).
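The preserve-the-original, migrate-on-demand strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the command line shown is the LibreOffice-style headless conversion (`soffice --headless --convert-to`), which I am assuming is available on the machine and which is not something I have verified for the OpenOffice.org 3.0 build discussed here. The function names are my own, and the converter is passed in as a parameter so a different tool could be substituted.

```python
import subprocess
from pathlib import Path


def soffice_command(source, outdir, target_format="odt"):
    """Build a headless conversion command (assumed LibreOffice-style CLI)."""
    return ["soffice", "--headless", "--convert-to", target_format,
            "--outdir", str(outdir), str(source)]


def run_soffice(source, outdir, target_format):
    """Default converter: shell out to soffice and fail loudly on error."""
    subprocess.run(soffice_command(source, outdir, target_format), check=True)


def migrate_on_demand(original, outdir, target_format="odt", convert=run_soffice):
    """Produce a current-format copy of `original` at the moment of request.

    The original file is only read, never changed or removed. Nothing is
    cached, so every request uses whatever version of the tool is installed.
    """
    original = Path(original)
    outdir = Path(outdir)
    if not original.is_file():
        raise FileNotFoundError(original)
    outdir.mkdir(parents=True, exist_ok=True)
    convert(original, outdir, target_format)
    migrated = outdir / f"{original.stem}.{target_format}"
    if not migrated.exists():
        raise RuntimeError(f"conversion produced no output for {original}")
    return migrated
```

A call such as `migrate_on_demand("archive/letter.doc", "requests/1234")` (both paths hypothetical) would leave the archived file untouched and hand the user a fresh ODF copy. Not caching is a deliberate choice here, in line with leaving migration as late as possible.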

So, a health warning: this rather long post goes into more detail on something I've covered before!

The single OpenOffice.org application opens 6 main classes of document: text document, spreadsheet, presentation, drawing, database and formula (the list of file formats has moved to an appendix of the Getting Started guide). In each case, a majority (or close to it) of the supported formats are OpenOffice native formats or those of its predecessors, plus a good selection of Microsoft Office formats (Office 6.0/95/97/XP/2000 and the XML versions from Office 2003 and Office 2007). Since these have been the dominant office suites for the past 10 years or so, the large majority of documents from that period should be readable.

The most well-known (and presumably widely-used) remaining supported word processing format is WordPerfect; not clear which versions (.wpd). Then there are some interesting ones: for example DocBook, and the Chinese-developed Uniform Office Format, the Korean Hangul WP 97, AportisDoc for the Palm, Pocket Word, and the Czech T602. Interesting that: significant investment to ensure these minority but presumably significant formats can be handled.

Similarly for the spreadsheets, as well as the various native and Microsoft formats, there is also support for two earlier significant players: Lotus 1-2-3 and Quattro Pro 6.0, also dBase (.dbf) and Data Interchange Format (DIF), not to mention CSV. It’s interesting that dBase is treated as a spreadsheet rather than a database; I wonder what the limitations are.

Presentations are more limited; apart from the basic OpenOffice and MS Office variants, they include Computer Graphics Metafile (CGM) as a presentation format but not a drawing format, which is a bit odd. PDF is also included; well, it does do presentations, and they seem to have some advantages over PowerPoint.

Graphics formats have always been popular in the open source community, so it’s not surprising that a wide range of formats is supported for graphics. Aside from several OpenOffice formats, these include a few surprises such as AutoCAD’s DXF Interchange Format, and Kodak PhotoCD, as well as a large range of usual suspects (GIF, PNG, TIFF, JPEG, many bit-mapped formats).

Finally the only database supported is the OpenOffice native database (remembering that dBase is apparently supported, as a spreadsheet, presumably with limitations). I tried to open a Microsoft Access database from a previous computer (Win95?), without success. Old databases do tend to be a bit of a problem; I have heard there are significant compatibility problems even between successive versions of Access. And Formula supports a couple of OpenOffice formats, plus MathML, which should be good for scientific use today.

So, for nothing, you get a migration tool that deals with a substantial proportion of current or recent documents. I don’t have enough experience yet to judge how effective it is. I did try a trivial round-trip test: opening a Microsoft Word 2004 for Mac document in OpenOffice, saving as native OpenOffice, then re-opening and saving as Word again, followed by a document compare in Word; it revealed very small layout differences in nested bullets (which resulted in pagination changes), and a few minor changes in styles. Not quite a fully reversible migration, but the result was a perfectly acceptable rendition of the original.

Now a migrate on demand tool is only useful in this role as long as (or if) the original file format is supported. If you are interested in older documents, from what one might call the baroque period of early personal and office documents (say from the invention of microcomputers for home use, through early “personal computers”, up to the big shakeouts of the mid to late 1990s), you will find OpenOffice rather less helpful as a migrate on demand tool. On one argument, this doesn’t matter much, as comparatively speaking such formats represent a tiny minority of surviving documents (unproven but pretty safe assertion!). However, this class of baroque period documents is starting to become important to archives (real archives, not collections of backups, or even digital preservation repositories), as they begin to collect them as part of the “papers” of eminent individuals. See for example, the Digital Lives project mentioned here before.

So, here are two proposals (for both of which, specifications as well as known working code would be useful!):

Funders, Foundations etc - please fund efforts to add input filters to OpenOffice for such older document formats, and
Computing Science departments - please set group assignments that would result in components of such filters being contributed to the OpenOffice effort.

Collectively, we might suggest the underlying effort here as an OpenOffice Legacy Files Project. Does anyone know how to set up such a project?

BTW after my last posting on this topic, a linkback led me to a post where Leslie Johnston mentioned Conversions Plus as having been a life-saver on several occasions. It’s a commercial tool, so maybe there are licence and survivability issues, but the list of formats it claims is impressive. In the Word Processing area alone, you get:

3 versions of Ami Pro
2 versions of AppleWorks
ClarisWorks 1.0 - 5.0
3 versions of MacWrite
DCA-RFT
3 versions of Multimate
Many versions of Word, back to MS Word DOS 5.5
Several versions of MS Works
PerfectWorks
WordPerfect for DOS and Windows
WordPerfect Works
WordStar for DOS
Several versions of Lotus Word Pro

Is a tool like this a better bet than OpenOffice for migration on demand? In the longer term, I don’t think so, even if it might be more helpful in the short term. You’d have to be convinced that the company will still exist to supply it, and that it will still run on your then current hardware. It might, but the odds seem somewhat better for a very popular open source application like OpenOffice.

But in the end, one way or another, you pays your money and you place your bets!

Thursday 5 February 2009

A National Research Data Infrastructure?

Two weeks without a post? My apologies! How about this piece of speculation that has been brewing at the back of my mind for some time...

It is clear that a national research data infrastructure is needed, but there are problems with all of the approaches taken so far to address this. Subject data centres provide subject domain curation expertise, but there are scalability issues across the domain spectrum: it appears unlikely that research funders will extend their funding to a much larger set of these data centres (indeed the AHDS experience might suggest a concern to cut back). Institutional data repositories are being explored, but while disclosing institutional data outputs might provide sustainability incentives, and such data repositories might be managed at a storage level by developments from existing institutional library/archive and IT support services, it is difficult to see how domain expertise can be brought to bear from so many domains across so many disciplines. Meanwhile, various of the studies done by UKOLN/DCC with Southampton University suggest the value of laboratory or project repositories in assisting with curation in a more localised context.

To square this circle, perhaps we have to realise that the storage infrastructure and the curation expertise are orthogonal issues. It is reasonable to suggest that institutions, faculties, departments, laboratories or projects should manage data repositories, databases etc with varying degrees of persistence. But in terms of the curation of the data objects (ie aspects of appraisal, selection, retention, transformation, combination, description, annotation, quality etc), somehow expertise from each domain as a whole has to be brought to bear. It is tempting to think of this as a parallel with the "editorial board" function of a "virtual data journal". This would clearly only be scalable if it were managed across the sector, rather than individually for each data repository.

So we might suggest a federation of repositories on the one hand, and a collective organisation (or set of differing collective organisations) of curation expertise in different disciplines or domains on the other hand; the latter is referred to below as national curation mechanisms.

In such a system, we might see roles for some of the main stakeholders as follows:

Research funders define their policies and mandates, and their compliance mechanisms; Research Information Network to participate and assist?
Publishers likewise define their own policies and mandates with regard to data supporting publication; JISC and RIN could assist coordination here.
Research Institutions define their own policies, establish local research data infrastructure, and encourage appropriate researchers to participate in national curation mechanisms (could participation in these become an element of researcher prestige as membership of editorial boards currently is?).
Researchers are responsible for their own good practice in managing and curating their data (possibly in laboratory or project data repositories or databases), and where appropriate for participating in national curation mechanisms.
Subject/discipline domain data centres are responsible for managing and curating data in their domain, as required by their funders. They also assist in defining good practice in their domains and more widely, undertake community proxy roles (NSB, 2005), and participate in national curation mechanisms.
A number of bodies such as the Digital Curation Centre and the proposed UK Research Data Service could undertake some coordination roles in this scenario, and could also undertake good practice dissemination and skills development through knowledge exchange activities.

(NSB. (2005). Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century. Retrieved from http://www.nsf.gov/pubs/2005/nsb0540/)

What do you think? Is that at all plausible?
