Friday 24 July 2009

Semantic Web of Linked Data for Research?

In the beginning was the World Wide Web. Then we were going to have the Semantic Web. (Then we had Web 2.0, but that’s another story.) But maybe the Semantic Web wasn’t semantic enough for some, so they changed the name to Linked Data, and it began to take off a little more. Now there’s an argument about whether all linked data are Linked Data!

The debate started with Andy Powell asking on Twitter what name we should use when all the conditions for Linked Data are met except one: the requirement that data be expressed in standards, specifically RDF (see Andy's summary). Tim Berners-Lee had suggested there were four principles for Linked Data:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
  4. Include links to other URIs, so that they can discover more things.
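
To make the four principles a little more concrete, here is a minimal sketch in Python. It is a toy, not anyone's actual implementation: the "Web" is just an in-memory dictionary, the example.org URIs and the helper names (look_up, linked_uris, crawl) are invented for illustration, and only the two FOAF property URIs are real vocabulary terms.

```python
# Toy in-memory stand-in for the Web: each HTTP URI "dereferences" to a set of
# (subject, predicate, object) triples. All example.org URIs are made up.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

ALICE = "http://example.org/people/alice"  # principles 1 and 2: HTTP URIs as names
BOB = "http://example.org/people/bob"

WEB = {
    ALICE: {(ALICE, FOAF_NAME, "Alice"), (ALICE, FOAF_KNOWS, BOB)},
    BOB: {(BOB, FOAF_NAME, "Bob")},
}


def look_up(uri):
    """Principle 3: looking up a URI returns useful information (triples)."""
    return WEB.get(uri, set())


def linked_uris(uri):
    """Principle 4: objects that are themselves HTTP URIs lead to more things."""
    return sorted(
        o for (_, _, o) in look_up(uri)
        if o.startswith("http://") and o != uri
    )


def crawl(start):
    """Follow links from a starting URI, collecting every triple discovered."""
    seen, todo, triples = set(), [start], set()
    while todo:
        uri = todo.pop()
        if uri in seen:
            continue
        seen.add(uri)
        triples |= look_up(uri)
        todo.extend(linked_uris(uri))
    return triples
```

Starting from the made-up URI for Alice, crawl(ALICE) picks up Bob's description as well, simply because Alice's data links to it. A real client would of course fetch the URIs over HTTP rather than from a dictionary.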

There were quite strong divisions; one group says roughly: “Linked Data is a brand and a definition; live with it”, while the other group says something like “Linked Data can afford to be inclusive, and will benefit from that” (both of these are extreme simplifications). I’ve read all the remarks and they’re pretty convincing; I mostly agree with them (not much help to you, gentle reader!). Paul Walk's summary is quite balanced. However, I particularly liked a comment made on someone else’s blog post by Dan Brickley, who should know about RDF (quoted by Andy in the post mentioned above):

“I have no problem whatsoever with non-RDF forms of data in “the data Web”. This is natural, normal and healthy. Statistical information, geographic information, data-annotated SVG images, audio samples, JSON feeds, Atom, whatever.

We don’t need all this to be in RDF. Often it’ll be nice to have extracts and summaries in RDF, and we can get that via GRDDL or other methods. And we’ll also have metadata about that data, again in RDF; using SKOS for indicating subject areas, FOAF++ for provenance, etc.

The non-RDF bits of the data Web are – roughly – going to be the leaves on the tree. The bit that links it all together will be, as you say, the typed links, loose structuring and so on that come with RDF. This is also roughly analogous to the HTML Web: you find JPEGs, WAVs, flash files and so on linked in from the HTML Web, but the thing that hangs it all together isn’t flash or audio files, it’s the linky extensible format: HTML. For data, we’ll see more RDF than HTML (or RDFa bridging the two). But we needn’t panic if people put non-RDF data up online…. it’s still better than nothing. And as the LOD scene has shown, it can often easily be processed and republished by others. People worry too much! :)”

I think this makes lots of sense for research data. I’ve been wondering for some time how RDF fits into the world of research data. I asked the NERC Data Managers at their meeting earlier this year, and the general consensus appeared to be that RDF was good for the metadata, but not the actual research data. This seems reasonable and is consistent with Dan’s view above.

But it does rather raise the question of exactly what kinds of data RDF IS suitable for. It begins to look as if it is good for isolated facts, simple relationships and descriptive data. While RDF can probably encode most things you would put in databases or scientific datasets, in general it would be very difficult to express everything those databases and datasets can express, and there would be a massive explosion of triples if one tried.
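
A rough sketch of where that explosion comes from, assuming the most naive mapping I can think of: one triple per cell of a table. The readings, the column names and the example.org base URI are all invented for illustration, and rows_to_triples is a hypothetical helper rather than any standard conversion.

```python
def rows_to_triples(base_uri, rows):
    """Turn a list of dict rows into (subject, predicate, object) triples.

    Each row becomes a subject URI, each column a predicate URI,
    and each cell value an object: one triple per cell.
    """
    triples = []
    for i, row in enumerate(rows):
        subject = f"{base_uri}/row/{i}"
        for column, value in row.items():
            triples.append((subject, f"{base_uri}/column/{column}", value))
    return triples


# Invented sample readings, purely for illustration.
readings = [
    {"station": "A1", "depth_m": 10, "temp_c": 8.4, "salinity": 35.1},
    {"station": "A1", "depth_m": 20, "temp_c": 7.9, "salinity": 35.2},
    {"station": "B7", "depth_m": 10, "temp_c": 9.1, "salinity": 34.8},
]

triples = rows_to_triples("http://example.org/dataset/ctd", readings)
print(len(triples))  # 3 rows x 4 columns = 12 triples
```

Under this mapping the triple count is simply rows times columns, so a dataset with a million rows and twenty columns would come out at twenty million triples before any units, provenance or structure had been described.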

To answer Andy’s original question (what name…), although I was taken with the idea of linked data, it’s clearly too easy to confuse with Linked Data. So I think I’d go with Paul Walk’s suggestion of Web of Data, or interchangeably Dan Brickley's data Web. If we can weave research data into a Web of Data, we’ll be doing well!

3 comments:

  1. i think "simple relationships and descriptive data" covers quite a bit of research data as well. i've been involved in publishing linguistic data on the web (http://wals.info/), and we are right now planning to make this data available as linkeddata as well - not least to make it easier for ourselves to aggregate and mix with other data later on.

    so i guess RDF is useful for research data as well, in particular its unique quality of making a posteriori merging easy (see http://www.betaversion.org/~stefano/linotype/news/304/).

  2. As Dan Chudnov noted on Twitter (http://twitter.com/dchud/status/2903133946), it looks like timbl added a subtle change between two versions of the Linked Data design issues write-up (compare http://www.w3.org/DesignIssues/LinkedData.html and http://web.archive.org/web/20080208081824/http://www.w3.org/DesignIssues/LinkedData.html). In part, this change was to the third principle. The most recent version explicitly mentions RDF and SPARQL, while the earlier versions do not.

  3. RDF is essential as a base layer to orient ourselves towards and as an integration medium. It's a canvas to paint on. The point is not to convert everything to RDF and keep it as RDF, but rather to create the various formats we use in a form so that they can be virtualized, atomized and then integrated in a rich RDF graph form. If you think about what Cambridge Semantics' Anzo and the Semantic Discovery System are doing, these are both tools that assume format heterogeneity on the input side, and move that heterogeneity to a unified, virtual, analyzable RDF format on the output side. Once that processing happens, you can cross content/data silos and do some powerful kinds of analysis, not to mention mere retrieval. So the goal should be RDF-friendly data, Web pages (RDFa embedded in HTML5, ultimately?), and other content, including PDFs. The more RDF-friendly (and well described in terms of the W3C Sem Web stack) the "data" are, the better for Web-scale integration and analysis purposes. We discuss the overall value of a W3C-compliant Sem Web stack from an enterprise perspective at http://www.pwc.com/us/en/technology-forecast/spring2009/index.jhtml.


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.