Friday, 2 May 2008

Science publishing, workflow, PDF and Text Mining

… or, A Semantic Web for science?

It’s clear that the million or so scientific articles published each year contain lots of science. Most of that science is accessible to scientists in the relevant discipline. Some may be accessible to interested amateurs. Some may also be accessible (perhaps in a different sense) to robots that can extract science facts and data from articles, and record them in databases.

This latter activity is comparatively new, and (not surprisingly) hits some problems. Articles, like web pages, are designed for human consumption, and not for machine processing. We humans have read many like them; we know which parts are abstracts, which parts are text, which headings, which references. We can read the tables, find the intersections and think about what the data points mean. We can look at the graphs, the spectra etc, and relate them to the author’s arguments. Most of these tasks are hard for robots. But with a little bit of help and persistence, plus some added “understanding” of genre and even journal conventions, etc, robots can sometimes do a pretty good job.

However, most science articles are published in PDF. And PDF does not make the robot’s task easy; in fact, PDF often makes it very hard (not necessarily to be deliberately obscure, but perhaps as side-effects of the process leading to the PDF).
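
To make the difficulty concrete: a PDF without structural tagging typically gives a robot runs of text placed at page coordinates, not rows and columns. The sketch below is a minimal illustration under stated assumptions. The fragment list, its coordinates, and the `rebuild_table` helper are all invented for this example and do not come from any real PDF library. It guesses at rows by grouping fragments whose vertical positions are close together.

```python
def rebuild_table(fragments, y_tolerance=2.0):
    """Guess table rows from (x, y, text) fragments.

    Fragments whose y values lie within y_tolerance of the first fragment
    in a row are treated as the same row; cells are ordered left to right.
    PDF page coordinates grow upwards, so larger y means higher on the page.
    """
    rows = []  # each entry: (y of first fragment in the row, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical fragments, as a text extractor might report them (made-up values).
fragments = [
    (200, 700.5, "Yield (%)"),
    (72, 700.0, "Compound"),
    (72, 685.0, "A"),
    (200, 685.0, "91"),
    (200, 669.4, "78"),
    (72, 670.0, "B"),
]

for row in rebuild_table(fragments):
    print(row)
```

Even this toy version depends on a hand-picked tolerance, and it would break on multi-line cells, merged cells, or two-column page layouts, which is roughly why untagged tables are hard for robots.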

Peter Murray-Rust has been leading a number of one-man campaigns (actually they all involve many more than one man, but he is often the vocal point-person). One such campaign, based on attempts to robotically mine chemical literature, can be summed up as “PDF is a hamburger, and we’re trying to turn it back into a cow” (the campaign is really about finding better semantic alternatives to PDF). I've referred to his arguments in the past, and we've been having a discussion about it over the past few days (see here, its comments, and here).

I have a lot of sympathy with this viewpoint, and it’s certainly true that PDF can be a hamburger. But since scientists and publishers (OK, mostly publishers) are not yet interested in abandoning PDF, which has several advantages to counter its problems, I’m also interested in whether, and if so how, PDF could be improved to be more fit for scientific purposes.

One way might be that PDF could be extended to allow for the incorporation of semantic information, in the same way that HTML web pages can be extended, eg through the use of microformats or RDFa, etc. If references to a gene could be tagged according to the Gene Ontology, references to chemicals tagged according to the agreed chemical names, InChIs etc, then the data mining robots would have a much easier job. Maybe PDF already allows for this possibility?
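
As a rough illustration of why such tagging helps on the HTML side, here is a minimal sketch using only Python's standard library. The `ex:chemical` and `ex:goTerm` property names, the sample sentence, and the `extract_annotations` helper are assumptions made up for this example; a real deployment would use an agreed vocabulary and a full RDFa processor rather than this simplified reader, which only handles un-nested `span` elements.

```python
from html.parser import HTMLParser


class AnnotationCollector(HTMLParser):
    """Collect spans that carry RDFa-style `property` and `resource` attributes."""

    def __init__(self):
        super().__init__()
        self.annotations = []
        self._open = None  # the annotated span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "property" in attrs:
            self._open = {
                "property": attrs["property"],
                "resource": attrs.get("resource"),
                "text": "",
            }

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._open is not None:
            self.annotations.append(self._open)
            self._open = None


def extract_annotations(html):
    collector = AnnotationCollector()
    collector.feed(html)
    collector.close()
    return collector.annotations


# Hypothetical article fragment; the property names are invented for illustration.
SAMPLE_HTML = (
    '<p>The product was washed with '
    '<span property="ex:chemical" resource="InChI=1S/H2O/h1H2">water</span> '
    'before we assayed '
    '<span property="ex:goTerm" '
    'resource="http://purl.obolibrary.org/obo/GO_0006915">apoptosis</span>.</p>'
)

for annotation in extract_annotations(SAMPLE_HTML):
    print(annotation["property"], annotation["resource"], annotation["text"])
```

The robot no longer has to guess that "water" is a chemical or which gene-related concept "apoptosis" refers to; the identifiers travel with the text while the human reader sees an ordinary sentence.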

PMR argues quite strongly that PDF is by design unfit for our purpose (in this case, holding scientific information such that it can reliably be extracted by text mining robots); that PDF's determined page-orientation and lack of structural and semantic significance doom such attempts to failure. He also argues strongly that the right current solution is to use XML… or perhaps XHTML for science writing.

I don’t know. He might be right. But being right is not necessarily going to persuade thousands of journal editors and hundreds of thousands of scientists to mend their ways and write/publish in XML.

I think we should tackle this in several ways:
  • try to persuade publishers to publish their XML (often NLM XML) versions of articles as well as the PDFs
  • try to persuade publishers who don’t have an XML format to release HTML versions as well as (or instead of) PDFs
  • tackle more domain ontologies to get agreements on semantics
  • work on microformats and related approaches to allow semantics to be silently encoded in documents
  • try to persuade authors to use semantic authoring tools (where they exist), and publishers to accept these
  • try to persuade Adobe to extend PDF to include semantic micro-metadata, and to help provide tools to incorporate it, and to extract it.
Might that work? Well, it’s a broad front and a lot of work, but it might work better than pursuing only one of them… But if we got even part way, we might really be on the way towards a semantic web for science...


  1. I thought this might be of interest to all readers

  2. A colleague, Colin Neilson, wrote in an email to me: "When I look at scientific articles published in PDF format they often fail to have much PDF tagging present. For instance tabular material is not tagged. This means it is very difficult to extract the data (although it can be read by a human reader). If a table is tagged in PDF it is possible to simply copy and paste the table into an Excel spreadsheet. I think this would open up some programming approach to semi-automated data extraction. However many PDF articles delivered for scholarly publishers through electronic publication do not seem to be tagged."

    "There are three types of Adobe PDF documents: unstructured, structured, and tagged. Without tags although paragraphs and basic text formatting are recognised [...] there is no recognition of lists or tables as "logical" structures. In the case of our example scientific article getting the data from the table would be very difficult to program since there is no logical table even though a human reader can see a table and understand its data in the context of the article."

  3. The analogy "PDF is a hamburger, and we're trying to turn it back into a
    cow" might provide a nice sound bite, but I'm afraid all it does is
    highlight Murray-Rust's (M-R) lack of understanding of what PDF is, and
    more importantly what PDF could become to support open science.

    First, PDF should be viewed as a "cattle shed" for it was conceived as a
    container mechanism to support document workflows and not as a one-to-one
    replacement for PostScript. Second, PDF has itself been implemented in an
    XML-friendly form, see: and . Thus PDF vs. XML arguments
    fall under the category of "bad science" and Swift is
    probably spinning in his grave.

    M-R is quite right to complain that the cattle-shed is currently stocked
    with hamburgers rather than cattle, but that is not the fault of PDF. With
    the right authoring tools, it is possible to construct a PDF that has the
    look and feel of a traditional typeset document and yet have it act as a
    container for ancillary material to serve various niche needs. M-R has
    picked up on the need for robots to harvest data from scientific papers,
    which could easily be handled using embedded file streams, using his
    preferred XML, but this is just one application. There are many others to

    So as to avoid being unduly subjective, I will elaborate
    using a concrete example.

    My background is computational fluid dynamics, specifically high speed
    flows. A search at, using jj quirk, will unearth a
    paper I wrote way back in 1992 "A contribution to the great Riemann Solver
    Debate." This paper currently has close to 300 citations and yet it almost
    never found its way into print. As a fresh PhD I was saddened by the
    polarization of the review process: one reviewer loved the manuscript, the
    other hated it so much he reportedly refused to provide a review.
    Fortunately, the editor went with the favourable review and the work was
    published, as is, with no modifications whatsoever.

    This episode, which lasted close to two years, left such an impression
    that ever since I have been dreaming of self-substantiating, computational
    documents that would allow the interested reader to sample the reported
    work first-hand, right down to its smallest detail.

    This application is very different from M-R's. Instead of providing
    the reader with data, I want to be able to provide the reader
    with the means of regenerating the data; not just today or tomorrow,
    but at any future date. The background motivations for wanting to do this
    are covered elsewhere. See for example

    The zeroth-order point to take on board here, is that with the burgeoning
    of computational science the quality of published papers is very variable
    and an uncomfortable amount has to be taken on trust. Consequently
    open science should not hinge solely on the harvesting of data. It should
    also be about raising and maintaining computational standards.
    See for instance, "Toward Scientific Numerical Modelling"
    and the recent special issue of Computing in Science and Engineering
    January/February 2009. Additionally, in these
    economically uncertain times, there is also the
    beleaguered tax-payer to consider (see
    as well as the thorny problem of how can an economy guarantee
    the technical competence of its workforce (see for instance and

    For anyone prepared to scratch below the surface, PDF and related technologies
    (such as Air and Alchemy) hold a lot of promise for tackling these broader
    issues head on. The PDF format is far from perfect and God knows how many
    bug reports I've filed over the years, but I believe its scientific strengths
    outweigh its scientific weaknesses. Of course, on the larger scale of things,
    what I believe is immaterial. But anyone who stumbles across this
    comment might want to consider the following thought experiment.

    Last summer I gave a plenary lecture at a UK e-Science meeting, see . The meeting
    was sponsored by The Royal Society which will celebrate its 350th
    anniversary in 2010, and I have to say my presentation was a
    shambles owing to an unco-operative projector and a 100+F fever.
    But the thought experiment to try is sound enough: what will a
    scientific historian in 2360, the 700th anniversary of The Royal
    Society, make of a paper authored today?

    The purpose of looking so far into the future is to move beyond
    today's format wars in an attempt to determine what form a scientific
    document should take so as to leave a lasting legacy. We might like
    to kid ourselves that as a society we are scientifically
    sophisticated, but I can't help but feel future historians
    will condemn us as being remarkably shallow in our document thinking.

    James Quirk
    March 6th 2009
    Los Alamos


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.