Digital Curation Blog: eJournals


Monday, 6 April 2009

Semantically richer PDF?

PDF is very important for the academic world, being the document format of choice for most journal publishers. Not everyone is happy about that, partly because reading page-oriented PDF documents on screen (especially that expletive-deleted double-column layout) can be a nightmare, but also because PDF documents can be a bit of a semantic desert. Yes, you can include links in modern PDFs, and yes, you can include some document or section metadata. But tagging the human-readable text with machine-readable elements remains difficult.

In XHTML there are various ways to do this, including microformats (see Wikipedia). For example, you can use the hCard microformat to encode identifying contact information about a person (confusingly, hCard is based on the vCard standard). However, relatively few microformats have been agreed. Last time I checked, development of the hCite microformat for encoding citations appeared to be progressing rather slowly, and was still some way from agreement.
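As a rough illustration of the microformat idea, the sketch below marks up a contact using hCard class names and reads the values back with Python's standard HTML parser. The class names vcard, fn and org come from hCard; the person, the organisation and the helper names are invented for the example. It is a toy extractor, not a conforming hCard parser.

```python
from html.parser import HTMLParser

# Invented sample data; only the class names (vcard, fn, org) come from hCard.
XHTML = """
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example University</span>
</div>
"""

class HCardParser(HTMLParser):
    """Toy extractor: keeps the text of elements whose class is an hCard property."""
    PROPERTIES = {"fn", "org"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None  # hCard property whose text is being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.PROPERTIES:
                self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None

def parse_hcard(markup):
    """Return a dict of the hCard properties found in the markup."""
    parser = HCardParser()
    parser.feed(markup)
    return parser.card

print(parse_hcard(XHTML))
```

The human-readable text and the machine-readable labels live in the same markup, which is exactly what is hard to achieve in PDF.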

The alternative, more general approach seems to be to use RDF; this is potentially much more useful for the wide range of vocabularies needed in scholarly documents. RDFa is a mechanism for including RDF in XHTML documents (W3C, 2008).

RDF has advantages in that it is semantically rich, precise and susceptible to reasoning, but syntax-free (or perhaps, realisable with a range of notations, cf N3 vs RDFa vs RDF/XML). With RDF you can distinguish “He” (the chemical element) from “he” (the pronoun), and associate the former with its standard identifier, chemical properties etc. For citations, the CLADDIER project made suggestions and gave examples of encoding a citation in RDF (Matthews, Portwin, Jones, & Lawrence, 2007).
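To make the “He” example concrete, here is a minimal sketch that holds a few RDF statements as subject/predicate/object tuples and writes them out as N-Triples-style lines. Every URI is a made-up placeholder under example.org, and the vocabulary terms (symbol, atomicNumber, denotes) are assumptions for illustration, not taken from any real chemistry ontology.

```python
# Every URI here is a made-up placeholder under example.org, for illustration only.
EX = "http://example.org/"

class Literal(str):
    """Marks a plain string value, as opposed to a URI."""

HELIUM = EX + "element/He"

TRIPLES = [
    # Facts about the element itself.
    (HELIUM, EX + "vocab/symbol", Literal("He")),
    (HELIUM, EX + "vocab/atomicNumber", Literal("2")),
    # One particular word in a (hypothetical) paper refers to the element.
    (EX + "paper1#word42", EX + "vocab/denotes", HELIUM),
]

def term(value):
    """Render a URI in angle brackets or a literal in double quotes."""
    if isinstance(value, Literal):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{value}>"

def to_ntriples(triples):
    """Serialise triples as one 'subject predicate object .' line each."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)

print(to_ntriples(TRIPLES))
```

The third statement is the interesting one: it ties a specific string in the text to an identifier, which is the fine-grained tagging that a pronoun "he" elsewhere in the same document would simply not carry.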

PDF can include XMP metadata, which is XML-encoded and based on RDF (Adobe, 2005). Job done? Unfortunately not yet, as far as I can see. XMP applies to metadata at the document or major component level. I don’t think it can easily apply to fine-grained elements of the text in the way I’ve been suggesting (in fact the specification says “In general, XMP is not designed to be used with very fine-grained subcomponents, such as words or characters”). Nevertheless, it does show that Adobe is sympathetic towards RDF.
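For a sense of what document-level XMP looks like, the sketch below parses a small hand-written XMP packet with Python's standard XML library and reads the Dublin Core title. The packet is an invented sample (the title is just borrowed from this post) and the function name is hypothetical; the namespace URIs are the standard RDF, Dublin Core and XMP ones.

```python
import xml.etree.ElementTree as ET

# Hand-written sample packet; the content is invented for illustration.
XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Semantically richer PDF?</rdf:li>
        </rdf:Alt>
      </dc:title>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def document_title(packet):
    """Return the first dc:title value in an XMP packet, or None if absent."""
    root = ET.fromstring(packet)
    item = root.find(".//dc:title/rdf:Alt/rdf:li", NS)
    return item.text if item is not None else None

print(document_title(XMP))
```

Note that the rdf:Description is about the document as a whole (rdf:about is empty); nothing in the packet points at a particular word or phrase in the text.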

Can we add RDF tagging associated with arbitrary strings in a PDF document in any other ways? It looks like the right place would be in PDF annotations; this is where links are encoded, along with other options like text callouts. I wonder if it is possible simply to insert some arbitrary RDF in a text annotation? This could look pretty ugly, but I think annotations can be set as hidden, and there may be an alternate text representation possible. It might be possible to devise an appropriate convention for an RDF annotation, or use the extensions/plugin mechanism that PDF allows. A disadvantage of this is that PDF/A (ISO, 2005), which is important for long-term archiving, disallows extensions to PDF as defined in the PDF Reference (Adobe, 2007), so such extensions would not be compatible with long-term archiving. I don’t know whether we could persuade Adobe to add this to a later version of the standard. If something like this became useful and successful, time would be on our side!
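Purely as a thought experiment, a hidden text annotation carrying RDF might look something like the dictionary generated below. This is a sketch of one possible convention, not an existing standard or tool: the function names are hypothetical, the RDF payload is a placeholder, and the code only builds the PDF dictionary syntax as a string rather than writing a valid PDF file. It assumes an ASCII payload. The keys used (Type, Subtype, Rect, F, Contents) and the Hidden flag value are from the PDF Reference.

```python
HIDDEN = 2  # annotation flag bit 2 (Hidden) in the PDF Reference

def pdf_literal(text):
    """Escape text for use as a PDF literal string, i.e. wrapped in ( )."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"

def rdf_text_annotation(rect, rdf):
    """Build the dictionary syntax for a hidden text annotation holding RDF.

    rect is (x1, y1, x2, y2) in page coordinates, covering the tagged string.
    """
    x1, y1, x2, y2 = rect
    return (
        "<< /Type /Annot /Subtype /Text "
        f"/Rect [{x1} {y1} {x2} {y2}] "
        f"/F {HIDDEN} "
        f"/Contents {pdf_literal(rdf)} >>"
    )

# Placeholder payload: one statement about whatever text the rectangle covers.
RDF = '<#this> <http://example.org/vocab/denotes> <http://example.org/element/He> .'

print(rdf_text_annotation((72, 700, 90, 712), RDF))
```

The rectangle is what associates the RDF with a particular string on the page, much as a link annotation's rectangle marks the clickable region.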

What RDF syntax or notation should be used? To be honest, I have no idea; I would assume that something compatible with what’s used in XMP would be appropriate; at least the tools that create the PDF should be capable of handling it. However, this is less help in deciding than one might expect, as the XMP specification says “Any valid RDF shorthand may be used”. Nevertheless, in XMP RDF is embedded in XML, which would make both RDF/XML and RDFa possibilities.

So, we have a potential place to encode RDF. Now we need a way to get it into the PDF, and then ways to process it when the PDF is read by tools rather than humans (ie text mining tools). In Chemistry, options for the encoding are beginning to appear. We assume that people do NOT author in PDF; they write using Word or OpenOffice (or perhaps LaTeX, but that’s another story).

Of relevance here is the ICE-TheOREM work between Peter Murray-Rust’s group at Cambridge, and Pete Sefton’s group at USQ; this approach is based on either MS Word or OpenOffice for the authors (of theses, in that particular project), and produces XHTML or PDF, so it looks like a good place to start. Peter MR is also beginning to talk about the Chem4Word project they have had with Microsoft, “an Add-In for Word2007 which provides semantic and ontological authoring for chemistry”. And the ChemSpider folk have ChemMantis, a “document markup system for Chemistry-related documents”. In each of these cases, the authors must have some method of indicating their semantic intentions, but in each case, that is the point of the tools. So there’s at least one field where some base semantic generation tools exist that could be extended.

PDFBox seems to be a common tool for processing PDFs once created; I know too little about it to know if it could easily be extended to handle RDF embedded in this way.

So I have two questions. First, is this bonkers? I’ve had some wrong ideas in this area before (eg I thought for a while that Tagged PDF might be a way to achieve this). My second question is: anyone interested in a rapid innovation project under the current JISC call, to prototype RDF in PDF files via annotations?

References:

Adobe. (2005). XMP Specification. San Jose.
Adobe. (2007). PDF Reference and related Documentation.
ISO. (2005). ISO 19005-1:2005 Document management — Electronic document file format for long-term preservation — Part 1: Use of PDF 1.4 (PDF/A-1).
Matthews, B., Portwin, K., Jones, C., & Lawrence, B. (2007). CLADDIER Project Report III: Recommendations for Data/Publication Linkage: STFC, Rutherford Appleton Laboratory.
W3C. (2008). RDFa Primer: Bridging the Human and Data Webs. Retrieved 6 April, 2009, from http://www.w3.org/TR/xhtml-rdfa-primer/

Tuesday, 31 March 2009

More on the ICTHES journals

I've had 3 responses by email to yesterday's post on the ICTHES journals (some responding to an associated email from me on the same issue). I'll summarise the two where quote permission was not explicit, and quote the third at length.

Adam Farquhar of the BL told me he had discussed it with their serials processing team under the voluntary scheme for legal deposit of digital material, and they will download the material into the BL's digital archive, where it will become accessible in the reading rooms (in due course, I guess). Wider access to such open access material should be available later under their digital library programme.

Tony Kidd, of the University of Glasgow and UKSG suggested that an OpenLOCKSS type approach might be feasible. This is consistent with the email from Vicky Reich of LOCKSS; she told me I could post her response. So here it is:

"UK-LOCKSS can, and should, preserve the four ICTHES journals.
First step: Contact the publisher and ask them to leave the content online long enough for it to be ingested.
Second step: Ask the publisher to put online a LOCKSS permission statement.
Third step: Someone on the LOCKSS team does a small amount of technical work to get content ingested.
With these minimal actions, the content would be available to those institutions who are preserving it in their LOCKSS box.

If librarians want to rehost this O/A content for others, there are two additional requirements:
a) the content has to be licensed to allow re-publication by someone other than the original copyright holder. This is best done via a Creative Commons license.
b) institutions who hold the content have to be willing to bear the cost of hosting the journals on behalf of the world.
Librarians, even those who advocate open access have not taken coordinated steps to ensure the OA literature remains viable over the long term. Librarians are motivated to ensure perpetual access to very expensive subscription literature, but ensuring the safety of the OA literature is not a priority because... it's available, and it's free. [...]

When the majority of librarians who think open access is a "good idea" step up and preserve this content (and I don't mean shoving individual articles into institutional repositories), then we will be well on our way to building needed infrastructure"

See also the comment from Gavin Baker to yesterday's post, which I think backs up Vicky's last point:

"I've thought for a while that archiving OA journals should be a goal of the library and OA community, maybe via a consortium which would harvest new issues of journals listed in the DOAJ. (We can treat as separate, for these purposes, the question of short-term archiving in case a journal goes under from the question of long-term preservation.) Is there a reason why this approach isn't undertaken? Do people assume that any OA journal worth archiving is already being archived by somebody somewhere?"

Let's be quite clear: contrary to my simplistic assumptions, the Internet Archive is NOT undertaking this task!

Monday, 30 March 2009

Charity closing, possible loss of 4 OA titles

I note from Gavin Baker's [not Peter Suber's; my mistake- CR] blog entry that the charity ICTHES is closing, and as a result its 4 OA journals may disappear. I have checked the Internet Archive and, in case we should be complacent about that as a system of preservation, found that only 1 of the 18 issues from the 4 titles had actually been gathered there.

The Journals are

I see from Suncat that these titles are variously held by BL, Cambridge, Oxford and NLS, so I guess they are regarded as serious titles.

Since UKSG is now in progress, I wondered if I could challenge UKSG on what it (or we, the community) can and/or should and/or will do about this! Would there be any opportunity in the programme to discuss this? (BTW unfortunately I am not able to come to Torquay, so I'm niggling, and indeed watching the #uksg tweets, from a distance.)

Options for action that I can see include

a) some kind of sponsored crawl by Internet Archive

b) an emergency sponsored crawl by UKWAC or one of its participants (which may of course already have happened),

c) an urgent approach by a group of those participating in LOCKSS for the charity to join the programme (which may be stymied by lack of development effort and time); this would only make the content available to participants, I think

d) Ditto for CLOCKSS, which at least might have the resources to make available publicly on a continuing basis,

e) sponsored ingest into something like Portico; again, only available to participants as I understand it

f) tacitly suggest libraries grab copies of the 18 or so PDFs, or

g) get a group of libraries to offer to host a historical archive of the titles for the charity...

h) appraise the titles as not worth preserving, and consign to the bitbin of history

i) ummm, errr, dither...

PS this blog entry is based on an email sent to the conference organisers and others, unfortunately after the conference has started. I have already had one response, from the BL, suggesting they would discuss with their journals people...
