Digital Curation Blog: RDFa

Showing posts with label RDFa. Show all posts

Tuesday, 2 February 2010

Thursday, 21 January 2010

Linked data and staff contact pages

You may remember that I am interested in the extent to which we should use Semantic Web (or Linked Data) on the DCC web site. After some discussions, I reached the conclusion that we should do so, but the tools were not ready yet (this isn’t quite an Augustinian “Oh Lord, make me good but not yet”; specifically, we are moving our web site to Drupal 6, the Linked Data stuff will not be native until Drupal 7, and our consultants are not yet up to speed with Linked Data). I have to say that not all our staff are convinced of the benefits of using RDF etc on the web site, and I have had a mental note to write more about this, real soon now.

I was reminded of this recently. I wanted to phone a colleague who worked at UKOLN, one of our partners, and I didn’t have his details in my address book. So I looked on their web site and navigated to his contacts page. Once there I copied his details into the address book, before lifting the phone to give him a ring. After the call (he wasn’t there; the snow had closed the office), I thought about that process. I had to copy all those details! Wouldn’t it be great if I could just import them somehow? How could that be? UKOLN have expertise in such matters, so I tweeted Paul Walk (now Deputy Director, previously technical manager) asking whether they had considered making the details accessible as Linked Data using something like FOAF. You can guess I’m not fully up to speed with this stuff, but I’m certainly trying to learn!

Paul replied that they had considered putting microformats into the page (I guess this is the hCard microformat), and then asked me whether my address book understood RDF, or if I was going to script something? I was pretty sure the answer to the second part was “no” as I suspect such scripting currently is beyond me, and told Paul that I was using MacOSX 10.6 Address Book; it says nothing about RDF, but will import a vcard. I was thinking that if there was appropriate stuff (either hCard microformat or RDFa with FOAF) on the page, I might find an app somewhere that would scrape it off and make a vcard I could import.

Paul’s final tweet was: “@cardcc see the use-case, not sure it's a 'linked data' problem though. What are the links that matter if you're scraping a single contact?”

Well, I couldn’t think of a 140-character answer to that question, which seemed to raise issues I had not thought about properly. What are the links that matter? Was it linked data, or just coded data that I wanted? Is this really a semantic web question rather than linked data? Or is it a RDF question? Or a vocabulary question? Gulp!

After some thought, perhaps Paul was as constrained by his 140 characters as I was. Surely a contacts page contains both facts and links within itself. See the Wikipedia page on FOAF for examples of a FOAF file in turtle for Jimmy Wales; the coverage is pretty much like a contacts page.

So Paul’s contact page says he works for UKOLN at the University of Bath, and gives the latter’s address (I guess formally speaking he works in UKOLN, an administrative unit, and is employed by the University); that his position in UKOLN is Deputy Director, that his phone, fax and email addresses are x, y and z. All of these are relationships between facts, expressible in the FOAF vocabulary. With RDFa, that information could be explicitly encoded in the HTML of the page and understood by machines, rather than inferred from the co-location of some characters on the page (the human eye is much better at such inferences). So there’s RDF, right there. Is that Linked Data? Is it Semantic Web? I’m not really sure.

More to the point, would it have been any greater use to me if it had been so encoded? A FOAF-hunting spider could traverse the web and build up a network of people, and I might be able to query that network, and even get the results downloaded in the form of a vcard that I could import into my Mac Address Book. That sounds quite possible, and the tools may already exist. Or, there may exist an app (what we used to call a Small Matter Of Programming, or a SMOP) that I could point at a web page with FOAF RDFa on it. Perhaps that’s what Paul was after in relation to scripting. Maybe the upcoming Dev8D might find this an interesting task to look at?

What other things could be done with such a page? Well, Paul or others might use it to disambiguate the many Paul Walk alter egos out there. You’ll see I have a simple link to Paul’s contact page above, but if this blog were RDF-enabled, perhaps we could have a more formal link to the assertions on the page, eg to that Paul Walk’s phone number, that Paul Walk’s email address, etc.

Well I’m not sure if this makes sense, and it does feel like one of those “first fax machine” situations. However FOAF has been around for a long while now. Does that mean that folk don’t perceive an advantage in such formal encodings to balance their costs, or is this an absence of value because of a lack of exploitable tools? If so, anyone going to Dev8D want to make an app for me?

(It’s also possible of course that Paul doesn’t want his details to be spidered up in this way, but I guess none of us should put contact details on the web if that’s our position.)

By the way, I found a web page called FOAF-a-matic that will create FOAF RDF for you. Here's an extract from what it created for me, in RDF:

<foaf:Person rdf:ID="me"> <foaf:name>Chris Rusbridge</foaf:name> <foaf:title>Mr</foaf:title> <foaf:givenname>Chris</foaf:givenname>
<foaf:family_name>Rusbridge</foaf:family_name>
<foaf:mbox rdf:resource="mailto:c.rusbridge@xxxxx"/> <foaf:workplaceHomepage rdf:resource="http://www.dcc.ac.uk/"/>
</foaf:Person>

What could I do with that now?

Monday, 6 April 2009

Semantically richer PDF?

PDF is very important for the academic world, being the document format of choice for most journal publishers. Not everyone is happy about that, partly because reading page-oriented PDF documents on screen (especially that expletive-deleted double-column layout) can be a nightmare, but also because PDF documents can be a bit of a semantic desert. Yes, you can include links in modern PDFs, and yes, you can include some document or section metadata. But tagging the human-readable text with machine-readable elements remains difficult.

In XHTML there are various ways to do this, including microformats (see Wikipedia). For example, you can use the hcard microformat to encode identifying contact information about a person (confusingly hcard is based on the vcard standard). However, there are relatively few microformats agreed. For example, last time I checked, development of the hcite microformat for encoding citations appeared to be progressing rather slowly, and still some way from agreement.

The alternative, more general approach seems to be to use RDF; this is potentially much more useful for the wide range of vocabularies needed in scholarly documents. RDFa is a mechanism for including RDF in XHTML documents (W3C, 2008).

RDF has advantages in that it is semantically rich, precise, susceptible to reasoning, but syntax-free (or perhaps, realisable with a range of notations, cf N3 vs RDFa vs RDF/XML). With RDF you can distinguish “He” (the chemical element) from “he” (the pronoun), and associate the former with its standard identifier, chemical properties etc. For the citation example, the CLADDIER project made suggestions and gave examples, for example of encoding a citation in RDF (Matthews, Portwin, Jones, & Lawrence, 2007).

PDF can include XMP metadata, which is XML-encoded and based on RDF (Adobe, 2005). Job done? Unfortunately not yet, as far as I can see. XMP applies to metadata at the document or major component level. I don’t think it can easily apply to fine-grained elements of the text in the way I’ve been suggesting (in fact the specification says “In general, XMP is not designed to be used with very fine-grained subcomponents, such as words or characters”). Nevertheless, it does show that Adobe is sympathetic towards RDF.

Can we add RDF tagging associated with arbitrary strings in a PDF document in any other ways? It looks like the right place would be in PDF annotations; this is where links are encoded, along with other options like text callouts. I wonder if it is possible simply to insert some arbitrary RDF in a text annotation? This could look pretty ugly, but I think annotations can be set as hidden, and there may be an alternate text representation possible. It might be possible to devise an appropriate convention for a RDF annotation, or use the extensions/plugin mechanism that PDF allows. A disadvantage of this is that PDF/A (ISO, 2005) disallows extensions to PDF as defined in PDF Reference (Adobe, 2007), but PDF/A is important for long-term archiving (ie that such extensions are not compatible with long-term archiving). I don’t know whether we could persuade Adobe to add this to a later version of the standard. If something like this became useful and successful, time would be on our side!

What RDF syntax or notation should be used? To be honest, I have no idea; I would assume that something compatible with what’s used in XMP would be appropriate; at least the tools that create the PDF should be capable of handling it. However, this is less help in deciding than one might expect, as the XMP specification says “Any valid RDF shorthand may be used”. Nevertheless, in XMP RDF is embedded in XML, which would make both RDF/XML and RDFa possibilities.

So, we have a potential place to encode RDF, now we need a way to get it into the PDF, and then ways to process it when the PDF is read by tools rather than humans (ie text mining tools). In Chemistry, there are beginning to be options for the encoding. We assume that people do NOT author in PDF; they write using Word or OpenOffice (or perhaps LaTeX, but that’s another story).

Of relevance here is the ICE-TheOREM work between Peter Murray-Rust’s group at Cambridge, and Pete Sefton’s group at USQ; this approach is based on either MS Word or OpenOffice for the authors (of theses, in that particular project), and produces XHTML or PDF, so it looks like a good place to start. Peter MR is also beginning to talk about the Chem4Word project they have had with Microsoft, “an Add-In for Word2007 which provides semantic and ontological authoring for chemistry”. And the ChemSpider folk have ChemMantis, a “document markup system for Chemistry-related documents”. In each of these cases, the authors must have some method of indicating their semantic intentions, but in each case, that is the point of the tools. So there’s at least one field where some base semantic generation tools exist that could be extended.

PDFBox seems to be a common tool for processing PDFs once created; I know too little about it to know if it could easily be extended to handle RDF embedded in this way.

So I have two questions. First, is this bonkers? I’ve had some wrong ideas in this area before (eg I thought for a while that Tagged PDF might be a way to achieve this). My second question is: anyone interested in a rapid innovation project under the current JISC call, to prototype RDF in PDF files via annotations?

References:

Adobe. (2005). XMP Specification. San Jose.
Adobe. (2007). PDF Reference and related Documentation.
ISO. (2005). ISO 19005-1:2005 Document management — Electronic document file format for long-term preservation — Part 1: Use of PDF 1.4 (PDF/A-1).
Matthews, B., Portwin, K., Jones, C., & Lawrence, B. (2007). CLADDIER Project Report III: Recommendations for Data/Publication Linkage: STFC, Rutherford Appleton Laboratory.
W3C. (2008). RDFa Primer: Bridging the Human and Data Webs. Retrieved 6 April, 2009, from http://www.w3.org/TR/xhtml-rdfa-primer/

Friday, 2 May 2008

Science publishing, workflow, PDF and Text Mining

… or, A Semantic Web for science?

It’s clear that the million or so scientific articles published each year contain lots of science. Most of that science is accessible to scientists in the relevant discipline. Some may be accessible to interested amateurs. Some may also be accessible (perhaps in a different sense) to robots that can extract science facts and data from articles, and record them in databases.

This latter activity is comparatively new, and (not surprisingly) hits some problems. Articles, like web pages, are designed for human consumption, and not for machine processing. We humans have read many like them; we know which parts are abstracts, which parts are text, which headings, which references. We can read the tables, find the intersections and think about what the data points mean. We can look at the graphs, the spectra etc, and relate them to the author’s arguments. Most of these tasks are hard for robots. But with a little bit of help and persistence, plus some added “understanding” of genre and even journal conventions, etc, robots can sometimes do a pretty good job.

However, most science articles are published in PDF. And PDF does not make the robot’s task easy; in fact, PDF often makes it very hard (not necessarily to be deliberately obscure, but perhaps as side-effects of the process leading to the PDF).

Peter Murray-Rust has been leading a number of one-man campaigns (actually they all involve many more than one man, but he is often the vocal point-person). One such campaign, based on attempts to robotically mine chemical literature can be summed up as “PDF is a hamburger, and we’re trying to turn it back into a cow” (the campaign is really about finding better semantic alternatives to PDF). I've referred to his arguments in the past, and we've been having a discussion about it over the past few days (see here, its comments, and here).

I have a lot of sympathy with this viewpoint, and it’s certainly true that PDF can be a hamburger. But since scientists and publishers (OK, mostly publishers) are not yet interested in abandoning PDF, which has several advantages to counter its problems, I’m also interested in whether and if so, how PDF could be improved to be more fit for the scientific purpose.

One way might be that PDF could be extended to allow for the incorporation of semantic information, in the same way that HTML web pages can be extended, eg through the use of microformats or RFDa, etc. If references to a gene could be tagged accordning to the Gene Ontology, references to chemicals tagged according to the agreed chemical names, InChis etc, then the data mining robots would have a much easier job. Maybe PDF already allows for this possibility?

PMR argues quite strongly that PDF is by design unfit for our purpose (in this case, holding scientific information such that it can reliably be extracted by text mining robots); that PDF's determined page-orientation and lack of structural and semantic significance doom such attempts to failure. He also argues strongly that the right current solution is to use XML… or perhaps XHTML for science writing.

I don’t know. He might be right. But being right is not necessarily going to persuade thousands of journal editors and hundreds of thousands of scientists to mend their ways and write/publish in XML.

I think we should tackle this in several ways:

try to persuade publishers to publish their XML (often NLM XML) versions of articles as well as the PDFs
try to persuade publishers who don’t have a XML format to release HTML versions as well as (or instead of PDFs)
tackle more domain ontologies to get agreements on semantics
work on microformats and related approaches to allow semantics to be silently encoded in documents
try to persuade authors to use semantic authoring tools (where they exist), and publishers to accept these
try to persuade Adobe to extend PDF to include semantic micro-metadata, and to help provide tools to incorporate it, and to extract it.

Might that work? Well, it’s a broad front and a lot of work, but it might work better than pursuing only one of them… But if we got even part way, we might really be on the way towards a semantic web for science...

Tuesday, 18 March 2008

Novartis/Broad Institute Diabetes data

Graham Pryor spotted an item on the CARMEN blog, pointing to a Business Week article (from 2007, we later realised) about a commercial pharma (Novartis) making research data from its Type 2 Diabetes studies available on the web. This seemed to me an interesting thing to explore (as a data person, not a genomics scientist), both for what it was, and for how they did it.

I could not find a reference to these data on the Novartis site, but I did find a reference to a similar claim dating back to 2004, made in the Boston Globe and then in some press releases from the Broad Institute in Cambridge, MA, referring to their joint work with Novartis (eg initial announcement, first results and further results). The first press release identified David Altshuler as the PI, and he was kind enough to respond to my emails and point me to their pages that link to the studies and to the results they are making available.

Why make the data available? The Boston Globe article said "Commercially, the open approach adopted by Novartis represents a calculated gamble that it will be better able to capitalize on the identification of specific genes that play a role in Type 2 diabetes. The firm already has a core expertise in diabetes. Collaborating on the research will give its scientists intimate knowledge of the results."

The Business Week article said "...the research conducted by Novartis and its university partners at MIT and Lund University in Sweden merely sets the stage for the more complex and costly drug identification and development process. According to researchers, there are far more leads than any one lab could possibly follow up alone. So by placing its data in the public domain, Novartis hopes to leverage the talents and insights of a global research community to dramatically scale and speed up its early-stage R&D activities."

Thus far, so good. Making data available for un-realised value to be exploited by others is at the heart of the digital curation concept. There are other comments on these announcements that cynically claim that the data will have already been plundered before being made accessible; certainly the PIs will have first advantage, but there is nothing wrong with that. The data availability itself is a splendid move. It would be very interesting to know if others have drawn conclusions from the data (I did not see any licence terms, conditions, or even requests such as attribution, although maybe this is assumed as scientific good practice in this area).

Business Week go on to draw wider conclusions:

"The Novartis collaboration is just one example of a deep transformation in science and invention. Just as the Enlightenment ushered in a new organizational model of knowledge creation, the same technological and demographic forces that are turning the Web into a massive collaborative work space are helping to transform the realm of science into an increasingly open and collaborative endeavor. Yes, the Web was, in fact, invented as a way for scientists to share information. But advances in storage, bandwidth, software, and computing power are pushing collaboration to the next level. Call it Science 2.0."

I have to say I'm not totally convinced about the latter phrase. Magazines like Business Week do like buzz-words like Science 2.0, but so far comparatively little science is affected by this kind of "radicalsharing". Genomics is definitely one of the poster children in this respect, but the vast majority of science continues to be lab or small group based with an orientation towards publishing results as papers, not data.

So what have they made available? There are 3 diabetes projects listed:

The second of these does not appear to have data available online. The 3rd project has results data in the form of an Excel spreadsheet, with 20 columns and 1294 rows; the data appear relatively simple (a single sheet, with no obvious formulae or Excel-specific issues that I could see), and could probably have been presented just as easily as CSV or another text variant. There's a small amount of header text in row 2 that spans columns, plus some colour coding, that may have justified the use of Excel. Short to medium term access to these data should be simple.

The first project shows two different types of results, with a lot more data: Type 2 Diabetes results and Related Traits results. The Type 2 Diabetes results comprise a figure in JPEG or PDF, plus data in two forms: a HTML table of "top single-marker and multi-marker results", and a tab-delimited text file (suitable for analysis with Haploview 4.0) of "all single-marker and multi-marker results". These data are made available both as the initial release of February 2007, and an updated release from March 2007. There is a link to Instructions for using the results files, effectively short-hand instructions for feeding the data into Haploview and doing some analyses on them. The HTML table is just that; data in individual cells are numbers or strings, without any XML or other encoding. There are links to entries in NCBI, HapMap and Ensembl, however.

The Related Traits results also come in an initial release (also February 2007) and an updated release from September 2007. The results again have a summary, a table this time but still in JPEG or PDF form. The detailed results are more complex; there is a HTML table of traits in 4 groups (Glucose, Obesity, Lipid and Blood Pressure), and for each trait (eg Fasting Glucose) up to 4 columns of data. The first column is a description of the trait as a PDF, the next is a link to a HTML Table of Top Single Marker Results for Association, the next is a link to a text Table of All Single Marker Results for Association, and the last is a link to a text table of Phenotype summary statistics by genotype (both these have the same format as above, although the latter has different columns).

It seems clear that there is a lot of data here; how useful they are to other scientists is not for me to judge. Certainly a scientist looking through these pages could form judgments on the usefuleness and relevance of these data to his or her work. There's not much to help a robot looking for science data from the Internet. I'm not sure what form such information might take, although there are examples in Chemistry. Perhaps the data cells should be automatically encoded according to a relevant ontology, so that the significance of the data travels with them. Possibly microformats or RDFa could have (or come to have) some relevance. However, both the HTML and text formats are very durable, (more so than the Excel format for project 3) and should be easily accessible (or transformed into later forms) at least as long as the Broad Institute wishes to continue to make them available.

Wednesday, 29 August 2007

IJDC again

At the end of July I reported on the second issue of the International Journal of Digital Curation (IJDC), and asked some questions:

"We are aware, by the way, that there is a slight problem with our journal in a presentational sense. Take the article by Graham Pryor, for instance: it contains various representations of survey results presented as bar charts, etc in a PDF file (and we know what some people think about PDF and hamburgers). Unfortunately, the data underlying these charts are not accessible!

"For various reasons, the platform we are using is an early version of the OJS system from the Public Knowledge Project. It's pretty clunky and limiting, and does tend to restrict what we can do. Now that release 2 is out of the way, we will be experimenting with later versions, with an aim to including supplementary data (attached? External?) or embedded data (RDFa? Microformats?) in the future. Our aim is to practice what we may preach, but we aren't there yet."

I didn't get any responses, but over on the eFoundations blog, Andy Powell was taking us to task for only offering PDF:

"Odd though, for a journal that is only ever (as far as I know) intended to be published online, to offer the articles using PDF rather than HTML. Doing so prevents any use of lightweight 'semantic' markup within the articles, such as microformats, and tends to make re-use of the content less easy."

His blog is more widely read than this one, and he attracted 11 comments! The gist of them was that PDF plus HTML (or preferably XML) was the minimum that we should be offering. For example, Chris Leonard [update, not Tom Wilson! See end comments] wrote:

"People like to read printed-out pdfs (over 90% of accesses to the fulltext are of the pdf version) - but machines like to read marked-up text. We also make the xml versions availble for precisely this purpose."

Cornelius Puschmann [update, not Peter Sefton] wrote:

"Yeah, but if you really want semantic markup why not do it right and use XML? The problematic thing with OJS (at least to some extent) is/was that XML article versions are not the basis for the "derived" PDF and HTML, which deal almost purely with visuals. XML is true semantic markup and therefore the best way to store articles in the long term (who knows what formats we'll have 20 years from now?). HTML can clearly never fill that role - it's not its job either. From what I've heard OJS will implement XML (and through it neat things such as OpenOffice editing of articles while they're in the workflow) via Lemon8 in the future."

Bruce D'Arcus [update, not Jeff] says:

"As an academic, I prefer the XHTML + PDF option myself. There are times I just want to quickly view an article in a browser without the hassle of PDF. There are other times I want to print it and read it "on the train."

"With new developments like microformats and RDFa, I'd really like to see a time soon where I can even copy-and-paste content from HTML articles into my manuscripts and have the citation metadata travel with it."

Jeff [update, not Cornelius Puschmann] wrote:

"I was just checking through some OJS-based journals and noticed that several of them are only in PDF. Hmmm, but a few are in HTML and PDF. It has been a couple of years since I've examined OJS but it seems that OJS provides the tools to generate both HTML and PDF, no? Ironically, I was going to do a quick check of the OJS documentation but found that it's mostly only in PDF!

"I suspect if a journal decides not to provide HTML then it has some perceived limitations with HTML. Often, for scholarly journals, that revolves around the lack of pagination. I noticed one OJS-based journal using paragraph numbering but some editors just don't like that and insist on page numbers for citations. Hence, I would be that's why they chose PDF only."

I think in this case we used only PDF because that was all our (old) version of the OJS platform allowed. I certainly wanted HTML as well. As I said before, we're looking into that, and hope to move to a newer version of the platform soon. I'm not sure it has been an issue, but I believe HTML can be tricky for some kinds of articles (Maths used to be a real difficulty, but maybe they've fixed that now).

I think my preference is for XHTML plus PDF, with the authoritative source article in XML. I guess the workflow should be author-source -> XML -> XHTML plus PDF, where author-source is most likely to be MS Word or LaTeX... Perhaps in the NLM DTD (that seems to be the one people are converging towards, and it's the one adopted by a couple of long term archiving platforms)?

But I'm STILL looking for more concrete ideas on how we should co-present data with our articles!

[Update: Peter Sefton pointed out to me in a comment that I had wrongly attributed a quote to him (and by extension, to everyone); the names being below rather than above the comments in Andy's article. My apologies for such a basic error, which also explains why I had such difficulty finding the blog that Peter's actual comment mentions; I was looking in someone else's blog! I have corrected the names above.

In fact Peter's blog entry is very interesting; he mentions the ICE-RS project, which aims to provide a workflow that will generate both PDF and HTML, and also bemoans how inhospitable most repository software is to HTML. He writes:

"It would help for the Open Access community and repository software publishers to help drive the adoption of HTML by making OA repositories first-class web citizens. Why isn't it easy to put HTML into Eprints, DSpace, VITAL and Fez?

"To do our bit, we're planning to integrate ICE with Eprints, DSpace and Fedora later this year building on the outcomes from the SWORD project – when that's done I'll update my papers in the USQ repository, over the Atom Publishing Protocol interface that SWORD is developing."

So thanks again Peter for bringing this basic error to my attention, apologies to you and others I originally mis-quoted, and I look forward to the results of your efforts! End Update]

Digital Curation Blog

Tuesday, 2 February 2010

More on contact pages and linked data