Wednesday, 28 May 2008


Genome Biology has an article by Barend Mons, Michael Ashburner et al: "Calling on a million minds for community annotation in WikiProteins". From the abstract:
"WikiProteins enables community annotation in a Wiki-based system. Extracts of major data sources have been fused into an editable environment that links out to the original sources. Data from community edits create automatic copies of the original data. Semantic technology captures concepts co-occurring in one sentence and thus potential factual statements. In addition, indirect associations between concepts have been calculated. We call on a 'million minds' to annotate a 'million concepts' and to collect facts from the literature with the reward of collaborative knowledge discovery. "
I'll say just a bit more on the WikiProteins effort below, but I was also interested in this from the introduction:
"The exploding number of papers abstracted in PubMed [...] has prompted many attempts to capture information automatically from the literature and from primary data into a computer readable, unambiguous format. When done manually and by dedicated experts, this process is frequently referred to as 'curation'. The automated computational approach is broadly referred to as text mining."
I've recently become increasingly keen to understand better the use of the word curation in this sense (eg 'curated databases' in genomics, etc), which dates back to at least 1993, preceding our use of the term by a decade. We try to cover this sense through the 'adding value' part of our definition ("Digital curation is maintaining and adding value to a trusted body of digital information for current and future use"), although I'm not sure the definition captures it fully.

Back at WikiProteins, the idea is to combine the two approaches (manual curation by experts and sophisticated text mining). Jimmy Wales of the Wikimedia Foundation is one of the authors of the paper, which adds an interesting dimension. The approach is based on "a software component called Knowlets™. [...] Scientific publications contain many re-iterations of factual statements. The Knowlet records relationships between two concepts only once. The attributes and values of the relationships change based on multiple instances of factual statements (...), increasing co-occurrence (...) or associations (...). This approach results in a minimal growth of the 'concept space' as compared to the text space..."
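To make the "records relationships between two concepts only once" idea concrete, here is a rough sketch of my own. It is not the Knowlet software, whose internals the quoted passage does not describe. It assumes concept recognition has already been done, so each sentence arrives as a list of concept labels, and it keeps just one attribute per relationship: a co-occurrence count. The function name `build_concept_space` and the sample concepts are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_concept_space(sentences):
    """Map each unordered concept pair to its co-occurrence count.

    `sentences` is a list of lists of concept labels (hypothetical input;
    a real system would first need to recognise concepts in text).
    """
    space = defaultdict(int)
    for concepts in sentences:
        # Sorting makes (a, b) and (b, a) the same key, so a relationship
        # is stored once however many times it is restated.
        for pair in combinations(sorted(set(concepts)), 2):
            space[pair] += 1
    return dict(space)


# Three sentences, two of which restate the same relationship.
sentences = [
    ["BRCA1", "breast cancer"],
    ["BRCA1", "breast cancer", "DNA repair"],
    ["breast cancer", "BRCA1"],
]
space = build_concept_space(sentences)

for pair, count in sorted(space.items()):
    print(pair, count)
```

Restating a relationship only changes the count on an existing entry, so the number of stored relationships grows more slowly than the number of sentences, which is roughly the 'concept space' versus 'text space' point in the quotation.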

This is extraordinarily interesting, and I'm sure we'll hear much more about it in the near future. I particularly like the approach to expert-based quality control. There must be questions about long-term sustainability, both organisationally and technically, but sceptics continue to be amazed at the sustainability of other kinds of Open activities!

  1. Chris, back in the mid-1990s JISC's eLib programme funded the Open Journal project. You may recall it. There are echoes of that project in the approach to 'knowlets', and the way this information is presented in the Concept Web Linker. In the project we did this by creating linkbases from which links were superimposed on digitised journal content (there were few e-journals then). Like the current work, our major exemplar was in the area of life sciences. We thought this was quite a compelling approach that was ahead of its time, as we are now witnessing. Unfortunately few of the Open Journal demos that were created survive, largely because we were working with closed data, provided by the project's publisher partners, that we did not have permission to use beyond the project. What's changed since then is that we have vastly increased digital content, open access to much of it, and the wiki framework to support collaborative working on this data. What might follow is a great example of the Semantic Web in action, especially if this framework really does attract a 'million minds' to work on it. It's good to see these ideas re-emerging.


