Digital Curation Blog: May 2008

Wednesday 28 May 2008

Wikiproteins...

Genome Biology has an article by Barend Mons, Michael Ashburner et al: "Calling on a million minds for community annotation in WikiProteins". From the abstract:

"WikiProteins enables community annotation in a Wiki-based system. Extracts of major data sources have been fused into an editable environment that links out to the original sources. Data from community edits create automatic copies of the original data. Semantic technology captures concepts co-occurring in one sentence and thus potential factual statements. In addition, indirect associations between concepts have been calculated. We call on a 'million minds' to annotate a 'million concepts' and to collect facts from the literature with the reward of collaborative knowledge discovery."

I'll say just a bit more on the Wikiproteins effort below, but I was also interested in this from the introduction:

"The exploding number of papers abstracted in PubMed [...] has prompted many attempts to capture information automatically from the literature and from primary data into a computer readable, unambiguous format. When done manually and by dedicated experts, this process is frequently referred to as 'curation'. The automated computational approach is broadly referred to as text mining."

I've been increasingly concerned recently to understand better the use of the word curation in this sense, which dates back to at least 1993, preceding our use of the term by a decade (eg 'curated databases' in genomics, etc). We try to cover this sense through the 'adding value' part of our definition ("Digital curation is maintaining and adding value to a trusted body of digital information for current and future use"), although I'm not sure it captures it fully.

Back at Wikiproteins, the idea is to combine the two approaches (manual curation by experts and sophisticated text mining). Jimmy Wales of the Wikimedia Foundation is one of the authors of the paper, which adds an interesting dimension. The approach is based on "a software component called Knowlets™. [...] Scientific publications contain many re-iterations of factual statements. The Knowlet records relationships between two concepts only once. The attributes and values of the relationships change based on multiple instances of factual statements (...), increasing co-occurrence (...) or associations (...). This approach results in a minimal growth of the 'concept space' as compared to the text space..."
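The paper does not describe how Knowlets work internally, so the following is only a toy sketch of the general idea quoted above: keep one record per pair of concepts, and update its attributes each time the pair turns up again, rather than storing every sentence. The class name, the counter names and the sample sentences are all made up for illustration and are not taken from WikiProteins.

```python
from collections import defaultdict
from itertools import combinations


class ConceptSpace:
    """Toy store (hypothetical): one record per unordered concept pair."""

    def __init__(self):
        # Key: frozenset of two concept names. Value: counters for that pair.
        self.relations = defaultdict(lambda: {"co_occurrence": 0, "facts": 0})

    def add_sentence(self, concepts, is_factual=False):
        # Each unordered pair of distinct concepts in a sentence co-occurs once.
        for a, b in combinations(sorted(set(concepts)), 2):
            record = self.relations[frozenset((a, b))]
            record["co_occurrence"] += 1
            if is_factual:
                record["facts"] += 1


# Three invented "sentences", each reduced to the concepts it mentions.
sentences = [
    (["BRCA1", "breast cancer"], True),
    (["breast cancer", "BRCA1"], True),
    (["BRCA1", "DNA repair", "breast cancer"], False),
]

space = ConceptSpace()
for concepts, is_factual in sentences:
    space.add_sentence(concepts, is_factual)

pair = space.relations[frozenset(("BRCA1", "breast cancer"))]
print(len(space.relations), pair)
```

The pair that recurs in all three sentences still occupies a single record, with only its counters changing, which is the sense in which the 'concept space' grows more slowly than the text.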

This is extraordinarily interesting, and I'm sure we'll hear much more about it in the near future. I particularly like the approach to expert-based quality control. There must be questions about long term sustainability, both organisationally and technically, but sceptics continue to be amazed at the sustainability of other kinds of Open activities!

Archiving images

One of our Associates recently posted a query on the DCC forum about image archiving formats:

'For long term preservation of digital images should we be archiving the proprietary .RAW files that come out of the digital SLRs we use or should we be converting these to the more open uncompressed .TIFF format and archiving those? Or indeed should we be archiving both?

Whilst they are proprietary, uncompressed .RAW files contain much useful information that both helps us manage the image and may help us preserve the image in the future. They are no larger than typical TIFF files, well, they needn't be. However, future version migration seems inevitable.

TIFF files offer flexibility of working, and many tools are able to manipulate and format shift TIFF to produce a range of dissemination manifestations. However, if we work on the do it once, do it right principle then long term storage becomes an issue of both capacity and cost.

Storing both TIFF and RAW only exacerbates the storage issue, plus maybe creates long term management issues.

What would you do? Store both, store the more open and flexible format or boldly go with the richness of RAW?

Your thoughts and comments would be appreciated.'

As image preservation is relevant to those curating scientific data, I thought I'd reproduce his post (with permission of course) here to try and stimulate a bit more debate.

First off - does it really have to be an 'either or' option? Why not store both? The cost of storage seems to just keep on coming down, and most architectures are more than capable of catering for multiple representations of an object. 'Do it once, do it right' is great if you can. But the 'lots of copies keeps stuff safe' principle has obvious benefits too. Storing multiple representations gives you more options in the future, and whilst it's true that managing additional files may require additional effort, I'm not sure that the potentially minimal costs of this would negate the potential benefits of multiple storage. (Reminder to self - read up on costs!)

It's probably an opportune moment to go back to your user and preservation requirements too, to determine what you really want from your image archive and to what extent each format meets these requirements.

In terms of potential migrations, it's not inconceivable that we'll eventually come up with a better format than TIFF, so migration is potentially an issue regardless of which format you go for. More immediate - for me anyway - would be the length of time between migrations, and the ease of migration from one format to another. These issues would need assessing too if going for one or the other (and it wouldn't be a bad thing to be aware of them even if you choose to store both versions).

Finally, Adobe recently submitted their 'DNG Universal RAW format' to ISO, so the issue of this one (because it seems there are several RAW formats) being a proprietary format may not be a lengthy concern. I'm not that familiar with DNG RAW so I don't know how much extra information it may contain when compared to TIFF. Another thing to add to my 'to-do' list...

I'm sure our Associate would be grateful for any more input so do feel free to leave comments and I'll make sure he gets them.

Friday 23 May 2008

Data management & curation skills: survey

As the research process produces increasing volumes of digital data, the issue of who will look after the data throughout its life cycle comes into sharper focus. As a result, the JISC is funding work designed to better understand the skills, roles, career patterns and training needs of data managers and data curators.

Your views are important in this respect, and we will be very grateful if you would take a few minutes to consider and respond to an online survey on the DCC website. Your responses will help training providers address specific training requirements for digital preservation and data management across different sectors.

The survey is available from now until June 6th on the DCC website at http://www.dcc.ac.uk/jisc/data_projects_questionnaire/

Thursday 8 May 2008

JISC/CNI meeting, July 2008

This year's JISC/CNI meeting will take place in Belfast from 10 - 11 July. Quoting from the website:

"The meeting will bring together experts from the United States, Europe and the United Kingdom. Parallel sessions will explore and contrast major developments that are happening on both sides of the Atlantic. It should be of interest to all senior management in information systems in the education community and those responsible for delivering digital services and resources for learning, teaching and research."

The programme looks to have a bit of something for everyone, including a session on scientific data in repositories and another on infrastructures to support research and learning. More details, including venue and costs, are available from the conference webpages.

Sunday 4 May 2008

On hearing the first cuckoo in spring

We heard our first cuckoo a week ago (27 April). It set me thinking: "The Times has been publishing 'first cuckoo' letters for a hundred years or so. Surely by now someone will have turned it into a database, as some kind of climate change proxy?".

A Google search - nay, several Google searches - found little direct evidence of this. Best I could find was Nature's Calendar, a site from the Woodland Trust for the UK Phenology Network; the page pointed to defaults to plants, but you can select birds and then cuckoos, and see reports of first sighting by date on the UK map. You can even compare different years; I compared 2008 with 1998, and there were several times more cuckoo reports in the later year, but maybe this is because 1998 was the earliest year and their network has become better entrenched since then. Nature's Calendar also reports average sightings by year, which did show a later tendency, but usually such averages are suspect if you can't see the underlying data. Anyway, a far cry from my hope of a hundred year proxy database!

But then, shock horror, I discovered the underlying premise is false; the Times has not published first cuckoo letters since 1940 or so. They do occasionally publish something, eg:

'On April 21, 1972, Mr Wadham Sutton got away with a letter to the Editor reading: “Sir, Today I heard a performance of Delius’s On Hearing the First Cuckoo in Spring. This is a record.”'

Bah, humbug!

Tranche

Tranche sounds interesting. This possibly over-ambitious sound-bite from its web page:

"This project's goal is to solve the problems commonly associated with sharing scientific data, letting you and your collaborators focus on using the data."

In effect it is an encrypted, highly distributed file sharing system, independent of any central authority, and suitable for "any size of file", but maybe there are still some sharing problems left for the scientists (;-)?

The Science Commons blog reports:

"The National Cancer Institute will soon be using Tranche to store and share mouse proteomic data from its Mouse Proteomic Technologies Initiative (MPTI). Tranche, a free and open source file sharing tool for scientific data, was one of the earliest testers of CC0. Many thanks to Tranche for providing us with such valuable early feedback on CC0."

Tranche uploads are known as projects, and there are apparently 5399 of them so far. The largest number of replicas appears to be 3, and 44 projects are reported as having missing chunks.

Friday 2 May 2008

Science publishing, workflow, PDF and Text Mining

… or, A Semantic Web for science?

It’s clear that the million or so scientific articles published each year contain lots of science. Most of that science is accessible to scientists in the relevant discipline. Some may be accessible to interested amateurs. Some may also be accessible (perhaps in a different sense) to robots that can extract science facts and data from articles, and record them in databases.

This latter activity is comparatively new, and (not surprisingly) hits some problems. Articles, like web pages, are designed for human consumption, and not for machine processing. We humans have read many like them; we know which parts are abstracts, which parts are text, which headings, which references. We can read the tables, find the intersections and think about what the data points mean. We can look at the graphs, the spectra etc, and relate them to the author’s arguments. Most of these tasks are hard for robots. But with a little bit of help and persistence, plus some added “understanding” of genre and even journal conventions, etc, robots can sometimes do a pretty good job.

However, most science articles are published in PDF. And PDF does not make the robot’s task easy; in fact, PDF often makes it very hard (not necessarily through deliberate obscurity, but perhaps as a side-effect of the process leading to the PDF).

Peter Murray-Rust has been leading a number of one-man campaigns (actually they all involve many more than one man, but he is often the vocal point-person). One such campaign, based on attempts to robotically mine chemical literature, can be summed up as “PDF is a hamburger, and we’re trying to turn it back into a cow” (the campaign is really about finding better semantic alternatives to PDF). I've referred to his arguments in the past, and we've been having a discussion about it over the past few days (see here, its comments, and here).

I have a lot of sympathy with this viewpoint, and it’s certainly true that PDF can be a hamburger. But since scientists and publishers (OK, mostly publishers) are not yet interested in abandoning PDF, which has several advantages to counter its problems, I’m also interested in whether, and if so how, PDF could be improved to be more fit for the scientific purpose.

One way might be that PDF could be extended to allow for the incorporation of semantic information, in the same way that HTML web pages can be extended, eg through the use of microformats or RDFa, etc. If references to a gene could be tagged according to the Gene Ontology, references to chemicals tagged according to the agreed chemical names, InChIs etc, then the data mining robots would have a much easier job. Maybe PDF already allows for this possibility?
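To give a feel for how much easier the robot's job gets on the HTML side, here is a minimal sketch using only Python's standard library. It assumes a page whose terms are wrapped in span elements carrying RDFa-style property and resource attributes; the "ex:" vocabulary, the sample sentence and the identifiers are invented for illustration, and nested spans are not handled.

```python
from html.parser import HTMLParser


class TaggedTermExtractor(HTMLParser):
    """Collect (property, resource, text) for each tagged <span>."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self._open = None   # attributes of the span being read, if any
        self._text = []     # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "property" in attrs:
            self._open = attrs
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._open is not None:
            text = "".join(self._text).strip()
            self.terms.append(
                (self._open["property"], self._open.get("resource"), text)
            )
            self._open = None


# Hypothetical article fragment; the markup vocabulary is an assumption.
page = """
<p>The enzyme contributes to
<span property="ex:process" resource="GO:0006281">DNA repair</span>
in the presence of
<span property="ex:chemical" resource="InChI=1S/H2O/h1H2">water</span>.</p>
"""

extractor = TaggedTermExtractor()
extractor.feed(page)
terms = extractor.terms
for term in terms:
    print(term)
```

No guessing at layout or genre conventions is needed here: the identifiers travel with the text. Whether PDF could carry equivalent tagging is exactly the open question.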

PMR argues quite strongly that PDF is by design unfit for our purpose (in this case, holding scientific information such that it can reliably be extracted by text mining robots); that PDF's determined page-orientation and lack of structural and semantic significance doom such attempts to failure. He also argues strongly that the right current solution is to use XML… or perhaps XHTML for science writing.

I don’t know. He might be right. But being right is not necessarily going to persuade thousands of journal editors and hundreds of thousands of scientists to mend their ways and write/publish in XML.

I think we should tackle this in several ways:

try to persuade publishers to publish their XML (often NLM XML) versions of articles as well as the PDFs
try to persuade publishers who don’t have an XML format to release HTML versions as well as (or instead of) PDFs
tackle more domain ontologies to get agreements on semantics
work on microformats and related approaches to allow semantics to be silently encoded in documents
try to persuade authors to use semantic authoring tools (where they exist), and publishers to accept these
try to persuade Adobe to extend PDF to include semantic micro-metadata, and to help provide tools to incorporate it, and to extract it.

Might that work? Well, it’s a broad front and a lot of work, but it might work better than pursuing only one of them… But if we got even part way, we might really be on the way towards a semantic web for science...
