Tuesday, 6 October 2009

iPres 2009: Micah Altman Keynote on Open Data

Open Data is at the intersection of scientific practice, technology, and library/archival practice. Altman claims that data are the nucleus of scientific collaboration, and that data are needed for scientific replication. Work is not science merely because it is scientific; it becomes science through community acceptance. Without the data, the community can't do that work.

Open data also support new forms of science & education, such as data-intensive science, which in turn promotes inter-disciplinarity. Open data also democratise science: crowd-sourcing, citizen science, re-use in developing countries, etc. Mentions Jean-Claude Bradley's Open Lab Notebook, Galaxy Zoo, etc.

Open data can be scientific insurance; that little extra bit of explanation makes your own data more re-usable, and can give your project extended life after initial funding ends.

Data access is key to understanding social policy. Governments attempt to control data access “to evade accountability”.

Why do we need infrastructure? [Huh?] While many large data sets are in public archives, many datasets are hard to find. There are problems even in professional data archives: links, identifiers, access control, etc. So, the core requirements are…

  • Stakeholder incentives
  • Dissemination, including metadata & documentation
  • Access control
  • Provenance: chain of control, verification of metadata & the bits
  • Persistence
  • Legal protection
  • Usability
  • Business model…

Institutional barriers: no-one (yet?) gets tenure for producing large datasets [CR: not sure that’s right; in some fields, eg genomics, data papers are amongst the highest cited]. Disciplinary versus institutional loyalties complicate deposit decisions. Funding is always an issue, and potential legal issues raise their heads: copyright, database rights, privacy/confidentiality etc.

Social Science was amongst the first disciplines to establish shared data archives (eg ICPSR, UKDA etc), in the 1960s [CR: I believe as an access mechanism originally: to share decks of cards!]. These hold mostly traditional data, not far beyond the quantitative. More recently, community data collections have been established, eg GenBank; success varies greatly from field to field. Institutional repositories mostly preserve outputs rather than data, and most have only comparatively small collections; so far they provide only bit-level preservation, are mostly not designed to capture tacit knowledge, and have limited support for data. More recently still, virtual hosted archives are appearing: institutionally supported but depositor-branded (?), eg the Dataverse Network at Harvard, Data360, Swivel. Some of these have already gone out of business; what does that do to trust in the persistence of service & data? Can you self-insure through replication?

Cloud computing models are interesting, but mostly beta, and often dead on arrival or soon after. What about storing data in social networks (which are often in/on the cloud)? Mostly they don’t really support data (yet), but they do “leverage” that allegiance to a scientific community.

Altman illustrated a wide range of legal issues affecting data: not just intellectual property, but also open access, confidentiality, privacy, defamation, contract. The traditional way of handling some of this was de-identification of data; unfortunately this works less and less well, with several cases of re-identification published recently (eg the Netflix Prize problem, Narayanan et al). [CR: refreshing to hear a discussion that is realistic about the impossibility of complete openness!]
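One way to see why de-identification fails: even with names and other direct identifiers removed, combinations of "quasi-identifiers" can still single out individuals. A minimal sketch of the standard k-anonymity check (the function, column names and sample records below are my own illustration, not from the talk):

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Smallest group size over the quasi-identifier columns.
    A release is k-anonymous only if every combination of
    quasi-identifier values is shared by at least k records."""
    groups = Counter(tuple(r[c] for c in quasi_identifiers) for r in rows)
    return min(groups.values())

# "De-identified" records: direct identifiers already removed.
records = [
    {"zip": "02138", "age": 34, "sex": "F", "diagnosis": "flu"},
    {"zip": "02138", "age": 34, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "age": 61, "sex": "M", "diagnosis": "diabetes"},
]
# The third record is unique on (zip, age, sex): anyone who knows
# those three facts about a neighbour can re-identify the row.
assert k_anonymity(records, ["zip", "age", "sex"]) == 1
```

Linkage attacks such as the Netflix one work exactly like this: an auxiliary dataset supplies the quasi-identifier values, and any group of size one is re-identified.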

So instead of de-identifying at the end, we’re going to have to build in confidentiality (of access) from the beginning! Current Open Access licences don’t cover all IP rights (as these vary so widely), don’t protect against 3rd-party liability, and are often mutually incompatible.

Altman ended on issues at the intersections, starting with data citation: “a real mess”. At the least there should be some form of persistent identifier. He proposed the UNF (Universal Numeric Fingerprint) as a robust, coded data integrity check (approximation, normalisation, fingerprinting, representation). Technology can facilitate persistent identifiers [CR: not a technology issue!], deep citation (of subsets), and versioning. Scientific practices evolve: replication standards, scientific publication standards.
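The four UNF steps (approximation, normalisation, fingerprinting, representation) can be sketched roughly as below. This is a toy illustration, not the real UNF specification: the function name, the choice of seven significant digits, SHA-256, and the 128-bit truncation are all my assumptions.

```python
import base64
import hashlib

def toy_unf(values, digits=7):
    """Toy data fingerprint in the spirit of UNF (NOT the real spec):
    1. approximate: round each number to `digits` significant figures;
    2. normalise: render each in a canonical exponential string form;
    3. fingerprint: hash the concatenated canonical strings;
    4. represent: encode a truncated hash as printable base64."""
    canonical = "".join(f"{v:+.{digits - 1}e}\n" for v in values)
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.b64encode(digest[:16]).decode("ascii")

# Semantically equal data yield the same fingerprint, regardless of
# insignificant trailing digits; genuinely different data do not.
assert toy_unf([0.3, 1.0]) == toy_unf([0.30000000001, 1.0])
assert toy_unf([0.3, 1.0]) != toy_unf([0.4, 1.0])
```

The point of the approximation/normalisation steps is that the fingerprint depends on the data's meaning, not its storage format, so it can verify a citation against any faithful copy of the dataset.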

There is a virtuous circle here: publish data, get data cited, encourages more data publication and citation!

Next came BKN, which sounds like a Mendeley/Zotero/Delicious-like system for transforming the treatment of bibliographies & structured lists of information.

The Dataverse Network: an open-source, federated web 2.0 data network, and a gateway to >35,000 social science studies. Now being extended towards network data. It has endowed hosting.

DataPASS, a broad-based collaboration for preservation.

Syndicated Storage Project: replication ameliorates institutional risk to preservation. Virtual organisations need policy-based, auditable, asymmetric replication commitments; the idea is to formalise these commitments and layer them on top of LOCKSS. Just funded by IMLS to take the prototype, make it easier to configure, open-source it, etc.
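A formalised, auditable, asymmetric replication commitment might be sketched as below. Everything here (the host names, the commitment table, the audit function) is my own hypothetical illustration of the concept, not the project's actual design or the LOCKSS protocol.

```python
import hashlib

def fingerprint(blob: bytes) -> str:
    """Content hash used to audit a replica against the owner's copy."""
    return hashlib.sha256(blob).hexdigest()

# Asymmetric commitments: host-b promises to replicate collection-1
# for host-a, but host-a has made no reciprocal promise.
commitments = {
    "host-b": {"collection-1"},
    "host-a": set(),
}

def audit(owner_copy: dict, replica_copy: dict, promised: set) -> list:
    """Return the committed objects whose replica is missing or corrupt.
    Object ids look like 'collection/name'; only objects in a promised
    collection are within the scope of this replica's commitment."""
    failures = []
    for obj_id, blob in owner_copy.items():
        if obj_id.split("/")[0] not in promised:
            continue  # outside this replica's commitment
        replica_blob = replica_copy.get(obj_id)
        if replica_blob is None or fingerprint(replica_blob) != fingerprint(blob):
            failures.append(obj_id)
    return failures

owner = {"collection-1/doc": b"data", "collection-2/doc": b"other"}
replica = {"collection-1/doc": b"data"}
# collection-2 is outside the commitment, so only collection-1 is audited:
assert audit(owner, replica, commitments["host-b"]) == []
```

The "auditable" part is the point: a third party holding only the commitment table and the fingerprints can verify whether each promise is being kept, without trusting either host's say-so.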

Prognostication: archiving workflow must extend backwards to research data collection [CR: Yeah!!!]. Data dissemination & preservation will increasingly take a hybrid approach. Strengthening the links from publication to data makes science more accountable. Effective preservation & dissemination is a co-evolutionary process: technology, institutions & practice all change in reaction to each other!

Question: what do you mean by extending backwards? Archiving is often only done when the research is finished; it becomes another chore, and we lose the opportunity to capture context. If instead the archive can tap into the research grid, the workflow can be captured in the archive.

Question (CR): is there a depositor/re-user asymmetry? It does exist; data citation can help with this!

