Tuesday, 14 October 2008

ARROW Repositories day: 2

Lynda Cheshire speaking as part of “the researcher’s view”, talking about the view from a qualitative researcher working with the Australian Social Science Data Archive (ASSDA), based at ANU, established 1981, with about 3,000 datasets. Most notable studies are election studies, opinion polls and social attitudes surveys, mostly from government sources. Not much qualitative data yet, but they have grants to expand this, including new nodes, and the qualitative archive (AQuA) to be at UQ. Not just the data but tools as well, based on existing UK and US qualitative equivalents.

Important because much qualitative data is held by researchers on disk, in filing cabinets, lofts, garages, plastic bags! Archiving can support re-use for research, but also for teaching purposes. Underlying issues (she says) are epistemological and philosophical. Eg quantitative about objective measurements, but qualitative about how people construct meaning. Many cases (breadth) vs few cases (depth). Reliability vs authenticity. Detached vs involved.

Recent consultation through focus groups: key findings included epistemological opposition to qualitative archiving (or perhaps re-use), because of loss of context; data are personal and not to be shared (the researcher-subject implied contract); some virtues of archiving were recognised; concerns about ethical/confidentiality challenges; challenges of informed consent (difficult as archiving might make it harder to gather extremely sensitive data, but re-use might avoid having to interview more people about traumatic events); whose data is it (the subject potentially has ownership rights in transcripts, while the researcher’s field notes potentially include personal commentary); access control and condition issues; additional burden of preparing the data for deposit.

The task ahead: develop preservation aspects (focus on near retirees?), and data sharing/analysis under certain conditions. Establish protocols for data access, IPR, ethics etc. Refine ethical guidelines. Assist with project development to integrate this work.

Ashley Buckle from Monash on a personal account of challenges for data-driven biomedical research. Explosion in amount of data available. Raw (experimental) data must be archived (to reproduce the experiment). Need for standardised data formats for exchange. Need online data analysis tools to go alongside the data repositories. In this field, there’s high throughput data, but also reliable annotation on low volume basis by expert humans. Federated solutions as possible approaches for Petabyte scale data sets.

Structural Biology pipeline metaphor; many complex steps involved in the processes, maybe involving different labs. Interested in refolding phase; complex and rate-limiting. They built their own database (REFOLD), with a simple interface for others to add data. Well-cited, but few deposits from outside (<1%). Spotted that the database was in some ways similar to a lab notebook, so started building tools for experimentalists, capturing the data as a sideline (way to go, Ashley!). Getting the data out of journals is inadequate. So maybe the journal IS the database? Many of the processes are the same.

Second issue: crystallography of proteins. Who holds the data? On the one hand, the lab… but labs handle it pretty badly (CDs, individuals’ filestore, etc). Maybe the Protein Data Bank? But they want the refined rather than the raw data. Maybe institutional libraries? TARDIS project providing tools for data deposit and discovery, working with ARCHER, ARROW and Monash library... This field does benefit from standards such as MIAME, MIAPE etc, which are quite important in making stuff interoperable. Ashley's working with Simon Coles etc in the UK (who's mostly at the small molecule end).

So how to go forward? Maybe turning these databases into data-oriented journals, with peer review built in etc would be a way to go? Certainly it's a worry to me that the Nucleic Acids field in general lists >1,000 databases; there has to be a better way than turning everything into Yet Another Database...


  1. It was great to hear more about the Monash Library/Ashley Buckle collaboration - a successful partnership between library and subject specialists. Do you think this is a case which worked in a particular context or illuminates a more generalisable model (or both, or something else)? I wonder about the degree to which university libraries are set up to know what needs to be known to manage a range of domain data over time. I imagine that as well as technology and policy, data management requires knowledge of data and documentation standards (where they exist) at a level like that of Alma Swan's classification of a data scientist - and that it must also be of assistance to know about the processes involved in creating/collecting/generating this data.

    At the library where I work (University of Sydney) we have several librarians with research degrees who have a high level of domain knowledge, who contribute to collaborative library/academic projects and who guide librarians who have lesser levels of subject knowledge. We also have metadata/cataloging staff who know a great deal about a particular documentation standard and its relatives, and our library and repository systems are based on MARC and DC. The repository primarily contains electronic versions of traditional library formats such as articles and books which are suitable for description using familiar documentation standards. For the library system we source collection materials which have already been subject to an external and trusted review process. We have a substantial (primarily humanities) electronic text collection, and this rests on one encoding standard (TEI).

    Maintenance of related standards, services, collections and access mechanisms costs a lot and is primarily about delivery of broad-based services rather than specialist infrastructure. Although we've worked collaboratively and successfully with academics on digital projects using 'unfamiliar' documentation and encoding standards, this has required considerable investment on the part of the library.

    I can picture domain-specific data management services supported by subject specialists (some of whom may be 'scholar librarians'), and I could also see my library deciding (for whatever reason) to invest in skills development/acquisition in a particular domain to enable curation of its data. I find it more difficult to imagine what a generic/institutional data archive would look like. If I were to imagine a library-managed service at Sydney, it would rest on an assurance of access to practitioners with required subject-based data management skills. If we're going to call something an archive, this assumes that we're going to invest in order to know what we think we need to know to manage access over time. In the case of the library system and repository, we think we know what we need to know, but for data management across domains I think we'd need to know a whole lot more.

  2. Rowan, on your specific question on the Buckle collaboration: I think this kind of repository/researcher collaboration is essential for proper curation of data. You absolutely need the participation of the domain scientists, to understand the requirements of the data, and the domain metadata that's required. Library metadata has its place and role, but domain metadata is essential for re-use. However, the details must vary in every case.

    The big question, I think, is whether the Library (or whoever is running the data repository) has the will and can marshal the budget to enable and empower these collaborations. Monash did, and it sounds likely from your comment that Sydney is heading in that direction.


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.