Digital Curation Blog: Science publishing, workflow, PDF and Text Mining

Friday, 2 May 2008

Science publishing, workflow, PDF and Text Mining

… or, A Semantic Web for science?

It’s clear that the million or so scientific articles published each year contain lots of science. Most of that science is accessible to scientists in the relevant discipline. Some may be accessible to interested amateurs. Some may also be accessible (perhaps in a different sense) to robots that can extract science facts and data from articles, and record them in databases.

This latter activity is comparatively new, and (not surprisingly) hits some problems. Articles, like web pages, are designed for human consumption, and not for machine processing. We humans have read many like them; we know which parts are abstracts, which parts are text, which headings, which references. We can read the tables, find the intersections and think about what the data points mean. We can look at the graphs, the spectra etc, and relate them to the author’s arguments. Most of these tasks are hard for robots. But with a little bit of help and persistence, plus some added “understanding” of genre and even journal conventions, etc, robots can sometimes do a pretty good job.

However, most science articles are published in PDF. And PDF does not make the robot’s task easy; in fact, PDF often makes it very hard (not necessarily through any deliberate attempt to obscure, but perhaps as a side-effect of the process that produces the PDF).

Peter Murray-Rust has been leading a number of one-man campaigns (actually they all involve many more than one man, but he is often the vocal point-person). One such campaign, based on attempts to robotically mine chemical literature, can be summed up as “PDF is a hamburger, and we’re trying to turn it back into a cow” (the campaign is really about finding better semantic alternatives to PDF). I've referred to his arguments in the past, and we've been having a discussion about it over the past few days (see here, its comments, and here).

I have a lot of sympathy with this viewpoint, and it’s certainly true that PDF can be a hamburger. But since scientists and publishers (OK, mostly publishers) are not yet interested in abandoning PDF, which has several advantages to counter its problems, I’m also interested in whether, and if so how, PDF could be improved to be more fit for the scientific purpose.

One way might be that PDF could be extended to allow for the incorporation of semantic information, in the same way that HTML web pages can be extended, eg through the use of microformats or RDFa, etc. If references to a gene could be tagged according to the Gene Ontology, references to chemicals tagged according to the agreed chemical names, InChIs etc, then the data mining robots would have a much easier job. Maybe PDF already allows for this possibility?
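To make the idea concrete, here is a minimal sketch of what a robot's job looks like once the semantics are embedded in the markup. It assumes an RDFa-like convention in which an element carries a `property` attribute naming the kind of thing and a `content` attribute holding its identifier. The `ex:` property names, the sample sentences and the helper names are all invented for illustration; they are not an agreed vocabulary, and a real RDFa processor does considerably more than this.

```python
from html.parser import HTMLParser

# Invented XHTML fragment. The "ex:" property names are hypothetical,
# chosen only to show the shape of the markup.
SAMPLE = (
    '<p>The sample contained <span property="ex:inchi" '
    'content="InChI=1S/CH4/h1H4">methane</span>. The gene is involved in '
    '<span property="ex:goTerm" content="GO:0006915">apoptosis</span>.</p>'
)


class AnnotationCollector(HTMLParser):
    """Collects (property, identifier, visible text) for tagged elements.

    Simplification: the annotation is closed by the next end tag, so
    elements nested inside an annotated element are not handled.
    """

    def __init__(self):
        super().__init__()
        self.annotations = []
        self._open = None  # annotation currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "property" in attrs and "content" in attrs:
            self._open = [attrs["property"], attrs["content"], ""]

    def handle_data(self, data):
        # Only text inside an annotated element is recorded.
        if self._open is not None:
            self._open[2] += data

    def handle_endtag(self, tag):
        if self._open is not None:
            self.annotations.append(tuple(self._open))
            self._open = None


def extract_annotations(html):
    """Return every annotation found in the given markup, in document order."""
    collector = AnnotationCollector()
    collector.feed(html)
    collector.close()
    return collector.annotations


for prop, identifier, text in extract_annotations(SAMPLE):
    print(prop, identifier, text)
```

The point of the sketch is that the robot no longer has to guess that "methane" is a chemical: the author (or the authoring tool) has already said so, and attached an identifier the robot can look up.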

PMR argues quite strongly that PDF is by design unfit for our purpose (in this case, holding scientific information such that it can reliably be extracted by text mining robots); that PDF's determined page-orientation and lack of structural and semantic significance doom such attempts to failure. He also argues strongly that the right current solution is to use XML… or perhaps XHTML for science writing.

I don’t know. He might be right. But being right is not necessarily going to persuade thousands of journal editors and hundreds of thousands of scientists to mend their ways and write/publish in XML.

I think we should tackle this in several ways:

try to persuade publishers to publish their XML (often NLM XML) versions of articles as well as the PDFs
try to persuade publishers who don’t have an XML format to release HTML versions as well as (or instead of) PDFs
tackle more domain ontologies to get agreements on semantics
work on microformats and related approaches to allow semantics to be silently encoded in documents
try to persuade authors to use semantic authoring tools (where they exist), and publishers to accept these
try to persuade Adobe to extend PDF to include semantic micro-metadata, and to help provide tools to incorporate it, and to extract it.

Might that work? Well, it’s a broad front and a lot of work, but it might work better than pursuing only one of them… But if we got even part way, we might really be on the way towards a semantic web for science...
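As an illustration of why the first item in the list above matters, here is a small sketch of how little work a robot has to do when it is given XML rather than a PDF. The fragment below uses element names from the NLM journal article tag set (`article-title`, `table-wrap`, and the XHTML-style table elements), but the article itself, its values and the `read_tables` helper are invented for this example, and real NLM XML carries namespaces, attributes and nested markup that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified article in the style of NLM XML.
SAMPLE = """<article>
  <front><article-meta><title-group>
    <article-title>An invented example article</article-title>
  </title-group></article-meta></front>
  <body><sec>
    <table-wrap id="t1">
      <table>
        <thead><tr><th>Compound</th><th>Melting point (K)</th></tr></thead>
        <tbody>
          <tr><td>A</td><td>300</td></tr>
          <tr><td>B</td><td>350</td></tr>
        </tbody>
      </table>
    </table-wrap>
  </sec></body>
</article>"""


def read_tables(xml_text):
    """Return the article title and each table as a header plus rows."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//article-title")
    tables = []
    for table in root.iter("table"):
        # Header cells and body cells are explicit elements, so no
        # guessing from page positions is needed.
        header = [th.text for th in table.findall("./thead/tr/th")]
        rows = [
            [td.text for td in tr.findall("td")]
            for tr in table.findall("./tbody/tr")
        ]
        tables.append({"header": header, "rows": rows})
    return title, tables


title, tables = read_tables(SAMPLE)
print(title)
for table in tables:
    print(table["header"])
    for row in table["rows"]:
        print(row)
```

The table's structure is stated in the markup, so the rows and columns come out directly; with an untagged PDF the same robot would have to reconstruct them from the positions of text on the page.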

3 comments:

Bio Saga, 5 May 2008 at 10:09
I thought this might be of interest to all readers
http://lukeskywaran.blogspot.com/search/label/digitized
Chris Rusbridge, 9 May 2008 at 15:16
A colleague, Colin Neilson, wrote in an email to me: "When I look at scientific articles published in PDF format they often fail to have much PDF tagging present. For instance tabular material is not tagged. This means it is very difficult to extract the data (although it can be read by a human reader). If a table is tagged in PDF it is possible to simply copy and paste the table into an Excel spreadsheet. I think this would open up some programming approach to semi-automated data extraction. However many PDF articles delivered for scholarly publishers through electronic publication do not seem to be tagged."

"There are three types of Adobe PDF documents: unstructured, structured, and tagged. Without tags although paragraphs and basic text formatting are recognised [...] there is no recognition of lists or tables as "logical" structures. In the case of our example scientific article getting the data from the table would be very difficult to program since there is no logical table even though a human reader can see a table and understand its data in the context of the article."
Anonymous, 6 March 2009 at 20:02
The analogy "PDF is a hamburger, and we're trying to turn it back into a
cow" might provide a nice sound bite, but I'm afraid all it does is
highlight Murray-Rust's (M-R) lack of understanding of what PDF is, and
more importantly what PDF could become to support open science.

First, PDF should be viewed as a "cattle shed" for it was conceived as a
container mechanism to support document workflows and not as a one-to-one
replacement for PostScript. Second, PDF has itself been implemented in an
XML-friendly form, see: http://labs.adobe.com/technologies/mars and
http://blogs.adobe.com/insidepdf/2007/09 . Thus PDF vs. XML arguments
fall under the category of "bad science" and Swift is
probably spinning in his grave.

M-R is quite right to complain that the cattle-shed is currently stocked
with hamburgers rather than cattle, but that is not the fault of PDF. With
the right authoring tools, it is possible to construct a PDF that has the
look and feel of a traditional typeset document and yet have it act as a
container for ancillary material to serve various niche needs. M-R has
picked up on the need for robots to harvest data from scientific papers,
which could easily be handled using embedded file streams, using his
preferred XML, but this is just one application. There are many others to
consider.

So as to avoid being unduly subjective, I will elaborate
using a concrete example.

My background is computational fluid dynamics, specifically high speed
flows. A search at scholar.google.com, using jj quirk, will unearth a
paper I wrote way back in 1992 "A contribution to the great Riemann Solver
Debate." This paper currently has close to 300 citations and yet it almost
never found its way into print. As a fresh PhD I was saddened by the
polarization of the review process: one reviewer loved the manuscript, the
other hated it so much he reportedly refused to provide a review.
Fortunately, the editor went with the favourable review and the work was
published, as is, with no modifications whatsoever.

This episode, which lasted close to two years, left such an impression
that ever since I have been dreaming of self-substantiating, computational
documents that would allow the interested reader to sample the reported
work first-hand, right down to its smallest detail.

This application is very different from M-R's. Instead of providing
the reader with data, I want to be able to provide the reader
with the means of regenerating the data; not just today or tomorrow,
but at any future date. The background motivations for wanting to do this
are covered elsewhere. See for example http://www.amrita-ebook.org/doc/amr2003
and http://reproducibleresearch.org.

The zeroth-order point to take on board here is that, with the burgeoning
of computational science, the quality of published papers is very variable
and an uncomfortable amount has to be taken on trust. Consequently
open science should not hinge solely on the harvesting of data. It should
also be about raising and maintaining computational standards.
See for instance, "Toward Scientific Numerical Modelling"
http://nato-rto-latex.googlecode.com/files/RTO-MP-AVT-147-P-17-Kleb.pdf
and the recent special issue of Computing in Science and Engineering
(http://cise.aip.org) January/February 2009. Additionally, in these
economically uncertain times, there is also the
beleaguered tax-payer to consider (see http://www.taxpayeraccess.org/frpaa/)
as well as the thorny problem of how an economy can guarantee
the technical competence of its workforce (see for instance
http://www.futureofinnovation.org/ and
http://royalsociety.org/document.asp?tip=0&id=7322).

For anyone prepared to scratch below the surface, PDF and related technologies
(such as Air and Alchemy) hold a lot of promise for tackling these broader
issues head on. The PDF format is far from perfect and God knows how many
bug reports I've filed over the years, but I believe its scientific strengths
outweigh its scientific weaknesses. Of course, on the larger scale of things,
what I believe is immaterial. But anyone who stumbles across this
comment might want to consider the following thought experiment.

Last summer I gave a plenary lecture at a UK e-Science meeting, see
http://www.allhands.org.uk/2008/programme/jamesquirk.cfm . The meeting
was sponsored by The Royal Society which will celebrate its 350th
anniversary in 2010, and I have to say my presentation was a
shambles owing to an unco-operative projector and a 100+F fever.
But the thought experiment to try is sound enough: what will a
scientific historian in 2360, the 700th anniversary of The Royal
Society, make of a paper authored today?

The purpose of looking so far into the future is to move beyond
today's format wars in an attempt to determine what form a scientific
document should take so as to leave a lasting legacy. We might like
to kid ourselves that as a society we are scientifically
sophisticated, but I can't help but feel future historians
will condemn us as being remarkably shallow in our document thinking.

James Quirk
March 6th 2009
Los Alamos

Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.
