It’s clear that the million or so scientific articles published each year contain lots of science. Most of that science is accessible to scientists in the relevant discipline. Some may be accessible to interested amateurs. Some may also be accessible (perhaps in a different sense) to robots that can extract science facts and data from articles, and record them in databases.
This latter activity is comparatively new, and (not surprisingly) runs into problems. Articles, like web pages, are designed for human consumption, not for machine processing. We humans have read many like them; we know which parts are abstracts, which are body text, which are headings, which are references. We can read the tables, find the intersections and think about what the data points mean. We can look at the graphs, the spectra and so on, and relate them to the author’s arguments. Most of these tasks are hard for robots. But with a little help and persistence, plus some added “understanding” of genre and even journal conventions, robots can sometimes do a pretty good job.
However, most science articles are published in PDF. And PDF does not make the robot’s task easy; in fact, it often makes it very hard (not necessarily because anyone set out to be obscure, but as a side-effect of the process that produces the PDF).
Peter Murray-Rust has been leading a number of one-man campaigns (actually they all involve many more than one man, but he is often the vocal point-person). One such campaign, based on attempts to robotically mine the chemical literature, can be summed up as “PDF is a hamburger, and we’re trying to turn it back into a cow” (the campaign is really about finding better semantic alternatives to PDF). I've referred to his arguments in the past, and we've been discussing them over the past few days (see here, its comments, and here).
I have a lot of sympathy with this viewpoint, and it’s certainly true that PDF can be a hamburger. But since scientists and publishers (OK, mostly publishers) are not yet interested in abandoning PDF, which has several advantages to set against its problems, I’m also interested in whether, and if so how, PDF could be improved to be more fit for the scientific purpose.
One way might be to extend PDF to allow for the incorporation of semantic information, in the same way that HTML web pages can be extended, e.g. through the use of microformats or RDFa. If references to a gene could be tagged according to the Gene Ontology, and references to chemicals tagged with agreed chemical names, InChIs and so on, then the data-mining robots would have a much easier job. Maybe PDF already allows for this possibility?
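To make that concrete, here’s a minimal sketch of the kind of inline tagging I have in mind: RDFa-style attributes on an XHTML fragment, plus a few lines of Python standing in for the mining robot. The chem: vocabulary and property name are invented purely for illustration; a real deployment would use an agreed ontology.

```python
# A toy XHTML fragment where a chemical mention carries a machine-readable
# identifier in RDFa-style attributes. The "chem" vocabulary is invented
# purely for this sketch; a real deployment would use an agreed ontology.
import xml.etree.ElementTree as ET

fragment = """
<p xmlns="http://www.w3.org/1999/xhtml"
   xmlns:chem="http://example.org/chem#">
  The reaction was carried out in
  <span property="chem:compound"
        content="InChI=1S/CH4O/c1-2/h2H,1H3">methanol</span>
  at room temperature.
</p>
"""

ns = {"x": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(fragment)

# The robot reads the identifier straight from the attributes instead of
# trying to recognise "methanol" in the prose.
for span in root.findall(".//x:span[@property]", ns):
    print(span.text, "->", span.get("property"), "=", span.get("content"))
```

The same trick could in principle work in any format that lets authors attach attributes to spans of text; the question is whether PDF can be made to carry the equivalent.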
PMR argues quite strongly that PDF is by design unfit for our purpose (in this case, holding scientific information so that it can reliably be extracted by text-mining robots); that PDF's resolutely page-oriented design and its lack of structural and semantic significance doom such attempts to failure. He also argues strongly that the right current solution is to use XML… or perhaps XHTML for science writing.
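To see what “structural significance” buys a robot, here’s a much-simplified fragment in the spirit of NLM/JATS-style article XML (it is not a valid instance of any real schema), and the trivially easy extraction it allows:

```python
# In structurally marked-up XML, the abstract, sections and citations are
# explicit elements, so a robot never has to guess which lines are headings.
# This fragment only gestures at NLM/JATS; the element names are simplified.
import xml.etree.ElementTree as ET

article = """
<article>
  <front>
    <abstract><p>We report an improved synthesis of compound X.</p></abstract>
  </front>
  <body>
    <sec><title>Methods</title><p>Compound X was dissolved in methanol.</p></sec>
    <sec><title>Results</title><p>The yield improved markedly <xref rid="b1"/>.</p></sec>
  </body>
</article>
"""

root = ET.fromstring(article)
print("Abstract:", root.findtext("front/abstract/p"))
for sec in root.findall("body/sec"):
    print("Section:", sec.findtext("title"))
```

Getting the same information back out of a typeset PDF means reconstructing it from page positions and font sizes, which is exactly the hamburger problem.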
I don’t know. He might be right. But being right is not necessarily going to persuade thousands of journal editors and hundreds of thousands of scientists to mend their ways and write/publish in XML.
I think we should tackle this in several ways:
- try to persuade publishers to publish their XML (often NLM XML) versions of articles as well as the PDFs
- try to persuade publishers who don’t have an XML format to release HTML versions as well as (or instead of) PDFs
- tackle more domain ontologies to get agreement on semantics
- work on microformats and related approaches to allow semantics to be silently encoded in documents
- try to persuade authors to use semantic authoring tools (where they exist), and publishers to accept these
- try to persuade Adobe to extend PDF to include semantic micro-metadata, and to help provide tools to incorporate and extract it (a rough sketch of the idea follows this list)
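On that last point: PDF can already carry an embedded XMP metadata packet (RDF/XML), which at least shows there is somewhere for machine-readable assertions to live; what we don’t yet have is an agreed way to attach fine-grained science metadata to particular spans of text. XMP packets are designed to be findable by scanning the raw bytes (they are usually left uncompressed for that reason), so as a rough sketch, here is how a robot might fish one out without a PDF library. The filename is hypothetical, and the packet it finds would normally hold only document-level metadata such as title and author.

```python
# A rough sketch: locate an XMP packet (RDF/XML metadata) in a PDF by
# scanning for its "xpacket" markers. Not every PDF stores the packet
# uncompressed, so this is best-effort rather than a reliable extractor.
import re

def extract_xmp(pdf_path):
    """Return the first XMP packet found in the PDF's raw bytes, or None."""
    with open(pdf_path, "rb") as f:
        data = f.read()
    match = re.search(rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
                      data, re.DOTALL)
    return match.group(1).decode("utf-8", errors="replace") if match else None

xmp = extract_xmp("some_article.pdf")  # hypothetical filename
if xmp:
    print(xmp[:500])  # RDF/XML that a mining robot could then parse
```

Whether Adobe would extend this mechanism, or something like it, down to per-mention micro-metadata is exactly the open question.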