Friday 31 October 2008

Interactive Science Publishing

Open Access News yesterday reported the launch of a new DSpace-based repository for article-related datasets by the Optical Society of America. They're calling this "Interactive Science Publishing" (ISP). The catch? The repository requires a closed-source download to get hold of the data, and the use of further closed-source software to view/manipulate it.

So, one step forward, one step back? Read the original post here.

Tuesday 14 October 2008

Open Access Day

It's Open Access Day today. I've been at the Australian ARROW Repositories Day, and no-one mentioned it.

Why do I care about Open Access? It's always seemed obvious to me since I first heard of the idea. I've never cared what colour it was, nor how it was done. But I've not published in a toll publication since that day, except for a book chapter where I reserved the rights and put it in the repository. It's fair, it's just; you pay my wages, and you should be able to know what you're getting. But it's easy for me; risk is low, career not at stake, and publishing this way is part of my job. I understand why people with more at stake are more cautious. It's a long road to full acceptance, and we're not doing badly.

Hobbes's Leviathan: "the life of man, solitary, poor, nasty, brutish, and short". Could be a metaphor for open access repositories? Nah. On the other hand, if you start from there, hey, everything begins to look rosy (;-)!

Read some nice posts on various blogs about OA day, some quite long and well-written. I'm keeping this short and nearly empty, like so many repositories. Feels better that way. Sigh...

ARROW Repositories day: 3

Dr Alex Cook from the Australian Research Council (a money man! Important!) talking on Excellence in Research for Australia (ERA), the Accessibility Framework and ASHER. ERA appears to be like the UK’s erstwhile RAE, and will use existing HE Research Data Collection rules for publication and research income information where possible. Eight clusters of disciplines have been identified. Currently looking at the bibliometric and other indicators, which will be discipline-specific (principles, methodologies and a matrix showing which are used where). Developing the System to Evaluate Excellence of Research (SEER), which will involve institutions uploading their data; sounds like something that MUST (and they say will) work with repositories (the RAE in the UK too often went for separate, non-interoperable bibliographic databases, which meant doing the same thing twice). Copyright still an issue (I wonder if they are brave enough to take an NIH mandate approach? See below… not so explicit, but some pressure that way). Where a research output is required for review purposes, institutions will be required to store and reference their Research Output Digital Asset. Repositories are a natural home for this.

The Accessibility Framework requires research outputs to be made sufficiently accessible to allow maximum use of the research. ASHER is a short-term (2-year?) funding stream to help institutions make their systems and repositories more suitable for working in this new framework.

Andrew Treloar talking about ANDS, the Australian National Data Service, and its implications for repositories. Mentions again the Australian Code for the Responsible Conduct of Research. Institutions need to think about their obligations under this code, obligations that are quite significant! Good story about the Hubble telescope: data must be made available at worst 6 months after capture; most published data from Hubble is not “first use”! (Would this frighten researchers? I put all the work in, but someone else just does the analysis and gets the credit?)

Structure of ANDS: developing frameworks (eg encouraging moves towards discipline-acceptable default sharing practices), providing utilities (building and delivering national technical services to support the data commons, eg discovery, persistent identifiers, and a collections registry), seeding the commons, and building capabilities, plus service development activities. The ISO 2146 information model is important here (I think I’ve already talked about this stuff from the iPres posts [update: no, it was from the e-Science All Hands Meeting, but the link still points to the right place]).
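For readers who haven’t met ISO 2146: its information model is built around a small set of object classes (collections, parties, activities and services) linked together in a registry. A toy sketch of the idea in Python follows; the class and field names are my own illustration, not the actual ANDS schema, and the identifiers are invented:

```python
from dataclasses import dataclass, field

# Toy types echoing the ISO 2146 object classes (collection, party,
# activity, service); field names are illustrative only, not a real schema.

@dataclass
class Party:
    identifier: str          # e.g. a persistent ID for a research group
    name: str

@dataclass
class Collection:
    identifier: str          # persistent identifier for the dataset
    title: str
    owner: Party             # link to the responsible party
    related_services: list = field(default_factory=list)    # e.g. discovery endpoints
    related_activities: list = field(default_factory=list)  # e.g. funding projects

# A registry entry would then tie these together:
group = Party("hdl:102.100/0001", "Example Research Group")      # invented IDs
dataset = Collection("hdl:102.100/0042", "Example survey dataset", owner=group)
```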

The Australian Strategic Roadmap Review talks about a national data fabric, based on institutional nodes, with a coordination component to integrate eResearch activities, and sees expertise as part of infrastructure. There’s also a review of the National Innovation System: ensure data get into repositories, try to get data more freely available.

Implications for repositories: ANDS dependent on repositories, but doesn’t fund repositories! May need range of repository types for storing data. Big R and little r repositories: real scale issues in data repositories. Lots of opportunities for groups to take part in consultation etc. Different but related discovery service. Links to Research Information Systems. Persistent Identifier Service (in collaboration with Digital Education Revolution). (I worry about this: surely persistence requires local commitment, rather than remote services?)
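For anyone who hasn’t met one, a persistent identifier service is essentially a level of indirection: you publish the identifier, and a resolver maps it to the current location. A minimal sketch using the public Handle HTTP proxy; the identifier here is invented, so a real one would be needed for this to actually resolve:

```python
import urllib.request
import urllib.error

# Invented Handle for illustration; a real identifier is needed to resolve.
handle = "102.100.100/1234"
req = urllib.request.Request(f"https://hdl.handle.net/{handle}", method="HEAD")
try:
    with urllib.request.urlopen(req) as resp:
        # The proxy redirects to wherever the identifier currently points;
        # urlopen follows the redirect, so resp.url is the current location.
        print(resp.url)
except urllib.error.HTTPError as err:
    print(f"Could not resolve {handle}: HTTP {err.code}")
```

Note that persistence still depends on someone locally keeping the mapping up to date, which is exactly the worry above: the remote service only provides the indirection, not the commitment.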

Panel discussion… question about CAIRS, the Consortium of Australian Institutional Repositories, funding left over from ARROW, being run by CAUL, the Australian University Librarians, which now has an Invitation to Offer to find someone to run the proposed service.

Some questions about data, and whether they should be in repositories or in filestore. Scale issues (numbers, size, rapidity of deposit and re-use etc) suggest filestore, but there are then maybe issues about integrity and authenticity. There will be ways of fixing these, but they imply new solutions that we don’t yet have. The data world will soon go beyond the current EPrints/DSpace/Fedora realms.
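On the integrity point: one standard piece of the answer is fixity checking, recording a checksum at deposit time and re-computing it later to detect corruption. A minimal sketch:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 checksum, reading in chunks to cope with large files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: Path, recorded: str) -> bool:
    """Fixity check: does the file still match the checksum recorded at deposit?"""
    return sha256_of(path) == recorded
```

Authenticity (who created this, and has it been tampered with deliberately?) needs more than checksums, of course, but fixity is the piece the plain filestore option lacks.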

A few more questions, too hard to summarise, on issues such as the importance of early career scientists, on the nature of raw data etc. But overall, an extremely interesting day!

ARROW Repositories day: 2

Lynda Cheshire speaking as part of “the researcher’s view”, talking about the view from a qualitative researcher working with the Australian Social Science Data Archive (ASSDA), based at ANU, established 1981, with about 3,000 datasets. The most notable studies are election studies, opinion polls and social attitudes surveys, mostly from government sources. Not much qualitative data yet, but they have grants to expand this, including new nodes, and the qualitative archive (AQuA) to be at UQ. Not just the data but tools as well, based on existing UK and US qualitative equivalents.

Important because much qualitative data is held by researchers on disk, in filing cabinets, lofts, garages, plastic bags! Archiving can support re-use for research, but also for teaching purposes. Underlying issues (she says) are epistemological and philosophical. Eg quantitative about objective measurements, but qualitative about how people construct meaning. Many cases (breadth) vs few cases (depth). Reliability vs authenticity. Detached vs involved.

Recent consultation through focus groups: key findings included epistemological opposition to qualitative archiving (or perhaps re-use), because of loss of context; data are personal and not to be shared (the researcher-subject implied contract); some virtues of archiving were recognised; concerns about ethical/confidentiality challenges; challenges of informed consent (difficult as archiving might make it harder to gather extremely sensitive data, but re-use might avoid having to interview more people about traumatic events); whose data is it (the subject potentially has ownership rights in transcripts, while the researcher’s field notes potentially include personal commentary); access control and condition issues; additional burden of preparing the data for deposit.

The task ahead: develop preservation aspects (focus on near retirees?), and data sharing/analysis under certain conditions. Establish protocols for data access, IPR, ethics etc. Refine ethical guidelines. Assist with project development to integrate this work.

Ashley Buckle from Monash on a personal account of challenges for data-driven biomedical research. Explosion in amount of data available. Raw (experimental) data must be archived (to reproduce the experiment). Need for standardised data formats for exchange. Need online data analysis tools to go alongside the data repositories. In this field, there’s high throughput data, but also reliable annotation on low volume basis by expert humans. Federated solutions as possible approaches for Petabyte scale data sets.

Structural Biology pipeline metaphor; many complex steps involved in the processes, maybe involving different labs. Interested in the refolding phase; complex and rate-limiting. They built their own database (REFOLD), with a simple interface for others to add data. Well-cited, but few deposits from outside (<1%). Spotted that the database was in some ways similar to a lab notebook, so started building tools for experimentalists, capturing the data as a sideline (way to go, Ashley!). Getting the data out of journals is inadequate. So maybe the journal IS the database? Many of the processes are the same.

Second issue: crystallography of proteins. Who holds the data? On the one hand, the lab… but handle it pretty badly (CDs, individuals’ filestore, etc). Maybe the Protein Data Bank? But they want the refined rather than the raw data. Maybe institutional libraries? TARDIS project providing tools for data deposit and discovery, working with ARCHER, ARROW and Monash library... This field does benefit from standards such as MIAME, MIAPE etc, which are quite important in making stuff interoperable. Ashley's working with Simon Coles etc in the UK (who's mostly at the small molecule end).

So how to go forward? Maybe turning these databases into data-oriented journals, with peer review built in etc would be a way to go? Certainly it's a worry to me that the Nucleic Acids field in general lists >1,000 databases; there has to be a better way than turning everything into Yet Another Database...

ARROW Repositories day: 1

I’ve been giving a talk about the Research Repository System ideas at the ARROW repository day in Brisbane, Australia (which is partly why there has been a gap in posting recently). Here are some notes on the other talks.

Kate Blake from ARROW is talking about metadata. Particularly important for data, which cannot speak for itself. Metadata thought of as a compound object that comprises some parts for “library management” issues (things like author, title, keyword) for the whole document and/or its parts, plus University management parts, such as evidence records for research quality management purposes. These link to metadata that applies to the community of practice, eg the appropriate metadata for an X-ray image. Have the content (maybe a PDF), its rich metadata (Kate used MARC/XML as an example, which surprised me, since she also suggested this group was specific to the content), lightweight descriptive metadata, technical metadata (file size, type etc), administrative metadata, eg rights or other kinds of institutional metadata, preservation metadata such as PREMIS, and both internal and external relationship metadata. METS is one way to wrap this complex set of metadata and provide a structural map (there are others). (Worrying that this seems like a very large quantity of metadata for one little object…) Aha, she’s pointing out that aggregating these into repositories, and these repositories together across search services, leads to problems of duplication, inconsistency, waste of effort, etc. So lots of work trying to unravel this knot, lots of acronyms: RDF, FRBR, SWAP, DCMI AM, SKOS etc…
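To make the METS point concrete, here is a much-simplified sketch of the wrapper structure Kate described; a real METS document would carry the MARC/XML, PREMIS and technical metadata inside these sections rather than leaving them empty:

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS)

mets = ET.Element(f"{{{METS}}}mets")
# Descriptive metadata section: would wrap e.g. the MARC/XML or Dublin Core.
ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="dmd1")
# Administrative metadata section: rights, technical and PREMIS preservation metadata.
ET.SubElement(mets, f"{{{METS}}}amdSec", ID="amd1")
# File section: the content files themselves (e.g. the PDF).
file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
grp = ET.SubElement(file_sec, f"{{{METS}}}fileGrp")
ET.SubElement(grp, f"{{{METS}}}file", ID="file1", MIMETYPE="application/pdf")
# Structural map: ties the metadata and files together into one object.
smap = ET.SubElement(mets, f"{{{METS}}}structMap")
div = ET.SubElement(smap, f"{{{METS}}}div")
ET.SubElement(div, f"{{{METS}}}fptr", FILEID="file1")

print(ET.tostring(mets, encoding="unicode"))
```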

FRBR makes the distinction between the work, its expressions, manifestations and items. SWAP is a profile for scholarly works; since it treats them as text works, it’s not much use for data.
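A toy illustration of the four FRBR levels (my own sketch, nothing to do with SWAP’s actual vocabulary):

```python
from dataclasses import dataclass

# Toy model of the four FRBR levels; names and fields are illustrative only.

@dataclass
class Work:            # the abstract intellectual creation
    title: str

@dataclass
class Expression:      # a realisation of the work, e.g. the accepted version
    work: Work
    version: str

@dataclass
class Manifestation:   # an embodiment of the expression, e.g. the publisher PDF
    expression: Expression
    format: str

@dataclass
class Item:            # a single copy, e.g. the file held in one repository
    manifestation: Manifestation
    location: str

paper = Work("A study of repositories")
accepted = Expression(paper, "accepted manuscript")
pdf = Manifestation(accepted, "application/pdf")
copy = Item(pdf, "https://repository.example.org/id/123")  # invented URL
```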

Names appear in multiple places in the metadata, and different parts have different rules. Do we have a name agent service (registries)? We need services to gather metadata automatically; that way you might introduce consistency and interoperability.

Kylie Pappalardo from QUT’s OAK Law project on legal issues in managing research data so it can be included in a repository and accessed by others. Government statements in favour of openness (eg Carr: More than one way to innovate, also the “Venturous Australia” strategy). To implement these policies we need changes to practice and culture, institutional engagement, legal issues being addressed, etc. Data are surrounded by law (!): copyright, contract, patents, policies, confidentiality, privacy, moral rights. Conflicting legal rights: who can do what with the data? QUT has OAK Law and also the Legal Framework for e-Research project.

An online survey in May 2007 drew 176 respondents. 50 were depositing data in a database; of those, 46% said the data were openly available and 46% required some or complete restrictions. 54% said their organisation did NOT have a data policy at all; where there was a policy, most only gave guidelines. 55% said they prepared plans for data management; two thirds of these at the time of the proposal, the balance later. Planning should happen early, not least because data management costs money and should be part of the proposal; disputes can also be hard to resolve later. 7% felt that clearer info on sharing and re-use would help, and 90% wanted a “plain English” guide (who wouldn’t?). Lawyer language doesn’t help, so researchers make their own informal agreements… maybe OK if nothing goes wrong.

The group has a report: analysis of Legal Context of Infrastructure for Data Access and Re-use in Collaborative Research. Also Practical Data Management: a Legal and Policy Guide. They have some tools, including a “simple” Data Management (legal) toolkit to fill in to gather information about (eg) copyright ownership etc.

Peter Sefton of USQ talking about OAI-ORE, and what it can do for us. Making the point that we build things from a wide variety of standard components, which (mostly) work pretty well together, eg found bricks in a garden wall… OAI-PMH mostly works, moving metadata from one place to another. But it’s just the messenger, not the message. So a harvest of metadata across multiple repositories shows wide variations in the keywords, subjects etc. Problems with XACML for defining access policies: no standardisation on the names of subgroups, so in the end it’s no use for search. Point being that these standards may appear important but not work well in practice.
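The keyword observation is easy to reproduce for yourself. A minimal OAI-PMH harvest of Dublin Core records might look like the sketch below; the endpoint URL is a placeholder, and any OAI-PMH base URL would do:

```python
import urllib.request
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Placeholder endpoint; substitute a real OAI-PMH base URL.
base = "https://repository.example.org/oai"
url = f"{base}?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(url) as resp:
    tree = ET.parse(resp)

# Pull out the dc:subject values; harvested across several repositories,
# these keywords vary wildly, which is exactly the problem Peter describes.
for subject in tree.iter(f"{{{DC}}}subject"):
    print(subject.text)
```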

So on to ORE (Object Re-use and Exchange)… Pete asks “shouldn’t it be exchange, then re-use?”. ORE view of a blog post: a resource map describes the post, but in fact it’s an aggregation (compound object) of HTML text, a couple of images, comments in separate HTML, etc. The aggregation does have a URI, but does not have a fetchable reality (the resource map does). Can get complex very rapidly. See Repository Challenge at OpenRepositories 2008 in Southampton, and the ORE Challenge at RepoCamp 2008. USQ participating with Cambridge in JISC-funded TheOREM project, also internal image project called The Fascinator. Based on his ICE system that has been mentioned before, integrated with ORE tools to push stuff into the repository. Can have the repository watching so it can get it for itself.
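To make the resource map idea concrete, here is a sketch of the blog post example built with the third-party rdflib library; the URIs are invented for illustration:

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

# Invented URIs: the resource map is the fetchable document that describes
# the aggregation, which is the abstract compound object (the "blog post").
rem = URIRef("http://example.org/post/rem")
agg = URIRef("http://example.org/post/aggregation")

g = Graph()
g.bind("ore", ORE)
g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
# The aggregated parts: the post body, an image, and its comments page.
for part in ("post.html", "figure1.png", "comments.html"):
    g.add((agg, ORE.aggregates, URIRef(f"http://example.org/post/{part}")))

print(g.serialize(format="turtle"))
```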

ORE can: supplement OAI-PMH for moving content around; improve research tools like Zotero; replace the use of METS packages; allow “thesis by publication” more elegantly; and pave the way for a repository architecture that understands content models (no more discussion of atomistic versus compound objects).

Wednesday 8 October 2008

DCC Curation Lifecycle Model

I have not written much in this blog about Digital Curation Centre products, but I think it’s time to remedy that, and mention some of them. In particular, I wanted to mention the DCC Curation Lifecycle Model, which is attracting widespread interest. Primarily put together by Sarah Higgins with input from colleagues across the DCC and external experts, it is, like all such models, a compromise between succinctness and completeness. Sarah has run a couple of workshops on it, including one in the US at JCDL 08 in Pittsburgh, and the response was extremely positive.

I hope to mention later how we will be using it to structure information on standards, and we are expecting to use it as an entry point to the DCC web site and the DCC DIFFUSE Standards Frameworks Project. In addition, I learned only recently that the more detailed proposals for the UK Research Data Service (which went to their Steering Committee a week or so back) lean heavily on the model (not apparent from their interim report). It is also used to explain the roles of data managers and data scientists in the JISC report “Skills, Role & Career Structure of Data Scientists & Curators: Assessment of Current Practice & Future Needs” by Alma Swan and Sheridan Brown.

Here is the graphic summary of the model:

[Image: graphic summary of the DCC Curation Lifecycle Model]
I won’t attempt here to explain it in detail, as there’s sufficient additional information on the DCC web site and the longer IJDC article about the model.