Monday, 6 April 2009

Semantically richer PDF?

PDF is very important for the academic world, being the document format of choice for most journal publishers. Not everyone is happy about that, partly because reading page-oriented PDF documents on screen (especially that expletive-deleted double-column layout) can be a nightmare, but also because PDF documents can be a bit of a semantic desert. Yes, you can include links in modern PDFs, and yes, you can include some document or section metadata. But tagging the human-readable text with machine-readable elements remains difficult.

In XHTML there are various ways to do this, including microformats (see Wikipedia). For example, you can use the hCard microformat to encode identifying contact information about a person (confusingly, hCard is based on the vCard standard). However, relatively few microformats have been agreed. For example, last time I checked, development of the hCite microformat for encoding citations appeared to be progressing rather slowly, and was still some way from agreement.
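To give a flavour of what a microformat looks like, here is a minimal Python sketch that generates an hCard fragment. The class names (vcard, fn, org, email) are the ones hCard uses; the function name and the person are invented for illustration.

```python
from html import escape


def hcard(name, org, email):
    """Return an XHTML fragment marking up a person with hCard class names."""
    # html.escape also escapes quotes, so the values are safe inside attributes.
    return (
        '<div class="vcard">'
        f'<span class="fn">{escape(name)}</span>, '
        f'<span class="org">{escape(org)}</span>, '
        f'<a class="email" href="mailto:{escape(email)}">{escape(email)}</a>'
        "</div>"
    )


# Hypothetical person, for illustration only.
print(hcard("Jane Doe", "Example University", "jane@example.org"))
```

The human-readable text stays exactly as it was; only the class attributes tell a machine which string is the name and which is the organisation.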

The alternative, more general approach seems to be to use RDF; this is potentially much more useful for the wide range of vocabularies needed in scholarly documents. RDFa is a mechanism for including RDF in XHTML documents (W3C, 2008).

RDF has advantages in that it is semantically rich, precise, and susceptible to reasoning, but syntax-free (or perhaps, realisable with a range of notations, cf N3 vs RDFa vs RDF/XML). With RDF you can distinguish “He” (the chemical element) from “he” (the pronoun), and associate the former with its standard identifier, chemical properties etc. For the citation example, the CLADDIER project made suggestions and gave examples of encoding a citation in RDF (Matthews, Portwin, Jones, & Lawrence, 2007).
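As a rough illustration of the “He” example, the Python sketch below writes two RDF statements in N-Triples, one of the simpler notations. The example.org URIs for the text span and for helium are made up; the dcterms:subject and rdfs:label predicates are real vocabulary terms, but choosing them here is my assumption, not an agreed convention.

```python
def ntriple(subject, predicate, obj, literal=False):
    """Serialise one RDF statement as a single N-Triples line."""
    if literal:
        # Escape backslashes and double quotes inside a literal value.
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_term = f'"{escaped}"'
    else:
        obj_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} ."


# Hypothetical identifiers: a span of text in a paper, and the element helium.
SPAN = "http://example.org/paper1#span42"
HELIUM = "http://example.org/elements/helium"

# Real vocabulary terms (Dublin Core Terms and RDF Schema).
DCT_SUBJECT = "http://purl.org/dc/terms/subject"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def helium_triples():
    """State that the tagged span is about helium, and give helium a label."""
    return [
        ntriple(SPAN, DCT_SUBJECT, HELIUM),
        ntriple(HELIUM, RDFS_LABEL, "helium", literal=True),
    ]


for line in helium_triples():
    print(line)
```

The same two statements could equally be written in N3, RDF/XML or RDFa, which is the sense in which RDF is syntax-free.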

PDF can include XMP metadata, which is XML-encoded and based on RDF (Adobe, 2005). Job done? Unfortunately not yet, as far as I can see. XMP applies to metadata at the document or major component level. I don’t think it can easily apply to fine-grained elements of the text in the way I’ve been suggesting (in fact the specification says “In general, XMP is not designed to be used with very fine-grained subcomponents, such as words or characters”). Nevertheless, it does show that Adobe is sympathetic towards RDF.
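To make the document-level nature of XMP concrete, here is a minimal Python sketch that builds the RDF/XML core of an XMP packet. The x:xmpmeta wrapper, the rdf:Description with an empty rdf:about, and the xmp:CreatorTool property follow the XMP specification as I understand it; the tool name is invented, and the xpacket processing instructions that normally surround the packet are left out.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
X_NS = "adobe:ns:meta/"
XMP_NS = "http://ns.adobe.com/xap/1.0/"


def build_xmp(creator_tool):
    """Return a minimal XMP metadata block as an XML string."""
    # Register the conventional prefixes so the output is readable.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("x", X_NS)
    ET.register_namespace("xmp", XMP_NS)

    xmpmeta = ET.Element(f"{{{X_NS}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{RDF_NS}}}RDF")
    # An empty rdf:about means "the resource containing this packet",
    # i.e. the document (or component) as a whole.
    desc = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""}
    )
    tool = ET.SubElement(desc, f"{{{XMP_NS}}}CreatorTool")
    tool.text = creator_tool
    return ET.tostring(xmpmeta, encoding="unicode")


# Hypothetical tool name, for illustration only.
print(build_xmp("Hypothetical Authoring Tool 1.0"))
```

Note that the subject of every statement is the containing document or component; nothing in this structure points at a particular word or phrase on the page, which is the limitation described above.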

Can we add RDF tagging associated with arbitrary strings in a PDF document in any other ways? It looks like the right place would be in PDF annotations; this is where links are encoded, along with other options like text callouts. I wonder if it is possible simply to insert some arbitrary RDF in a text annotation? This could look pretty ugly, but I think annotations can be set as hidden, and there may be an alternate text representation possible. It might be possible to devise an appropriate convention for an RDF annotation, or use the extensions/plugin mechanism that PDF allows. A disadvantage of this is that PDF/A (ISO, 2005) disallows extensions to PDF as defined in PDF Reference (Adobe, 2007), and PDF/A is important for long-term archiving (ie such extensions would not be compatible with long-term archiving). I don’t know whether we could persuade Adobe to add this to a later version of the standard. If something like this became useful and successful, time would be on our side!
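Purely as a thought experiment, the Python sketch below writes out what such an annotation dictionary might look like in raw PDF syntax. The /Type, /Subtype, /Rect, /F and /Contents keys and the Hidden flag value are taken from the PDF Reference; putting RDF in /Contents is only the speculation above, not an existing convention, and the function names, rectangle and payload are invented. It assumes an ASCII payload and does not produce a complete PDF file.

```python
HIDDEN_FLAG = 2  # annotation flag bit 2 ("Hidden") in the PDF Reference


def pdf_literal(text):
    """Write text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = (
        text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    )
    return f"({escaped})"


def rdf_text_annotation(rect, rdf_payload):
    """Return a text annotation dictionary carrying RDF in its /Contents entry.

    rect is (x1, y1, x2, y2) in PDF page coordinates and would cover the
    string being tagged. This is a hypothetical convention, not a standard.
    """
    x1, y1, x2, y2 = rect
    return (
        "<< /Type /Annot /Subtype /Text "
        f"/Rect [{x1} {y1} {x2} {y2}] "
        f"/F {HIDDEN_FLAG} "
        f"/Contents {pdf_literal(rdf_payload)} >>"
    )


# Hypothetical payload: one N-Triples statement about the covered text.
payload = (
    "<http://example.org/paper1#span42> "
    "<http://purl.org/dc/terms/subject> "
    "<http://example.org/elements/helium> ."
)
print(rdf_text_annotation((72, 700, 90, 712), payload))
```

A reading tool would then need to know that a hidden text annotation with this kind of content is meant as RDF about the text under its rectangle, which is exactly the convention that would have to be agreed.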

What RDF syntax or notation should be used? To be honest, I have no idea; I would assume that something compatible with what’s used in XMP would be appropriate; at least the tools that create the PDF should be capable of handling it. However, this is less help in deciding than one might expect, as the XMP specification says “Any valid RDF shorthand may be used”. Nevertheless, in XMP the RDF is embedded in XML, which would make both RDF/XML and RDFa possibilities.

So, we have a potential place to encode RDF; now we need a way to get it into the PDF, and then ways to process it when the PDF is read by tools rather than humans (ie text mining tools). In Chemistry, options for the encoding are beginning to appear. We assume that people do NOT author in PDF; they write using Word or OpenOffice (or perhaps LaTeX, but that’s another story).

Of relevance here is the ICE-TheOREM work between Peter Murray-Rust’s group at Cambridge, and Pete Sefton’s group at USQ; this approach is based on either MS Word or OpenOffice for the authors (of theses, in that particular project), and produces XHTML or PDF, so it looks like a good place to start. Peter MR is also beginning to talk about the Chem4Word project they have had with Microsoft, “an Add-In for Word2007 which provides semantic and ontological authoring for chemistry”. And the ChemSpider folk have ChemMantis, a “document markup system for Chemistry-related documents”. In each of these cases, the authors must have some method of indicating their semantic intentions, but in each case, that is the point of the tools. So there’s at least one field where some base semantic generation tools exist that could be extended.

PDFBox seems to be a common tool for processing PDFs once created; I know too little about it to know if it could easily be extended to handle RDF embedded in this way.

So I have two questions. First, is this bonkers? I’ve had some wrong ideas in this area before (eg I thought for a while that Tagged PDF might be a way to achieve this). My second question is: anyone interested in a rapid innovation project under the current JISC call, to prototype RDF in PDF files via annotations?

References:

Adobe. (2005). XMP Specification. San Jose.
Adobe. (2007). PDF Reference and related Documentation.
ISO. (2005). ISO 19005-1:2005 Document management — Electronic document file format for long-term preservation — Part 1: Use of PDF 1.4 (PDF/A-1).
Matthews, B., Portwin, K., Jones, C., & Lawrence, B. (2007). CLADDIER Project Report III: Recommendations for Data/Publication Linkage: STFC, Rutherford Appleton Laboratory.
W3C. (2008). RDFa Primer: Bridging the Human and Data Webs. Retrieved 6 April, 2009, from http://www.w3.org/TR/xhtml-rdfa-primer/

5 comments:

  1. Possibly related is GeoPDF, a way of embedding projection metadata using a "dictionary entry" (I don't know exactly what that means) to georeference a PDF document. "Georeference" in this context means to spatialize the entire document, i.e., to map between PDF coordinate space and geographic coordinates, as opposed to tagging geographic placenames in the document (which is what microformats have been proposed for in the HTML world, and which is closer to the subject of this blog post). For more info on GeoPDF, probably easiest to look at the GeoPDF wikipedia page. -Greg Janée

  2. Chris, I think that this is an idea worth exploring, but the big issue is authoring tools. Given that most docs start life in word processors, or increasingly in online editors like Google docs or CMSs, how can people capture semantics, and cite stuff? I am working on an idea that this could all be done using hyperlinks. Citations certainly could, and I know that endpoints like georeferences could be linked. But what about other assertions, such as linking a name to an online service that can tie together a relation "dc:creator" with another endpoint such as an author's web page?

    If we can capture semantics like that then yes, they probably could be embedded in PDF later using something like annotations. One simple thing we do with our ICE project is to turn links into endnotes for the PDF - so that might be another human-readable approach to express semantics.

    Of course none of this is of any use and will therefore not be adopted unless there are services which make use of the semantics, such as repositories that automatically understand the metadata. Journal or thesis submission guidelines might also help as the purpose is then clear - do it our way or no publication.

  3. @Peter_Sefton, I did mention a few chemical semantic authoring systems, and it's also possibly worth exploring Bryan Lawrence's "just do it" approach at http://home.badc.rl.ac.uk/lawrence/blog/2008/04/23/creating_rdfa.

    If we all agreed with the chicken and egg worry, it would be hard to make progress. I do think that RDF is important, and semantically rich journal articles are important, and that if we can create semantically rich journal articles based on RDF we will have the prospect of much better science in the future!

  4. Great article!

    Just a couple of points...

    First, there are ways (in the existing PDF file format) to incorporate arbitrary information (be it XML/RDF/XMP or other formats) into a PDF and associate it with specific content elements. It's called "Tagged PDF", has been part of PDF since 1.4, is fully compatible with PDF/A-1 and is how things such as reflow and accessibility work in PDF today. However, there aren't any standards for what to put there (beyond the structure elements defined in ISO 32000-1, aka ISO PDF) so that it can be used outside of a closed workflow.

    You'll be happy to know that I've been working on exactly this problem recently - trying to standardize more semantic information in PDF at more fine grained levels. I have a proposal pending for the next revision of ISO 32000 that will help.

    Finally, I'd like to correct something you said which is "A disadvantage of this is that PDF/A (ISO, 2005) disallows extensions to PDF as defined in PDF Reference (Adobe, 2007)". This is not actually true. PDF/A-1 prohibits SPECIFIC THINGS - but if something isn't specifically prohibited, then it's permitted. HOWEVER, a "conforming reader" is not required to process any of it.

    Leonard Rosenthol
    PDF Standards Architect, Adobe Systems
    ISO Project Leader, ISO 19005 (PDF/A)

  5. @Leonard, thanks for your comment. Sorry if I misread the PDF/A spec in relation to annotations. I was relying on the first sentence of section 6.5.2 of ISO 19005-1:2005, namely "Annotation types not defined in PDF Reference shall not be permitted". I meant to put this in the context of an extension for some sort of "semantic annotation" type that might contain the RDF, but that sentence seemed to make such an idea less desirable.

    I've been interested in Tagged PDFs for a while, although they appear not to be widely used, and I got the impression, probably wrongly, that Adobe had decided to do accessibility in different ways. From what I remember, it looked like it made sense to Tag at the section/subsection level, but almost certainly not at the level of a string that one wishes to markup as an object such as a chemical compound, chemical reaction, etc.

    So I am extremely pleased that Adobe has a proposal for more fine-grained semantic information for the main ISO 32000 standard! Let's hope we can get something like that widely implemented, as it would surely complement and encourage the semantic authoring tools that are beginning to emerge, and will be important in the world of research. I even started to wonder about GRDDL-for-PDF!


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.