Wednesday 31 March 2010

Linked Data and Reality

I have a copy of the really interesting book “Data and Reality” by William Kent. It’s interesting at several levels; first published in 1978, this appears to be a “print-on-demand” version of the second edition from 1987. Its imprint page simply says “Copyright © 1998, 2000 by William Kent”.

The book is full of really scary ways in which the ambiguity of language can cause problems for what Kent often calls “data processing systems”. He quotes Metaxides:

“Entities are a state of mind. No two people agree on what the real world view is”
Here’s an example from Kent, on the first page:
“Becoming an expert in data structures is… not of much value if the thoughts you want to express are all muddled”
But it soon becomes clear that most of us are all too easily muddled, at least when we remember that
“... the thing that makes computers so hard is not their complexity, but their utter simplicity… [possessing] incredibly little ordinary intelligence”
I do commend this book to those (like me) who haven’t had formal training in data structures and modelling.

I was reminded of this book by the very interesting attempt by Brian Kelly to find out whether Linked Data could be used to answer a fairly simple question. His challenge was ‘to make use of the data stored in DBpedia (which is harvested from Wikipedia) to answer the query

“Which town or city in the UK has the highest proportion of students?”’
He has written some further posts on the process of answering the query, and attempting to debug the results.

So what was the answer? The query produced the answer Cambridge. That’s a little surprising, but for a while you might convince yourself it’s right; after all, it’s not a large town and it has two universities based there. The table of results shows the student population as 38,696, while the population of the town is… hang on… 12? So the percentage of students is 3224%. Yes, something is clearly wrong here, and Brian goes on to investigate a bit more. No clear answer yet, although it begins to look as if the process of going from Wikipedia to DBpedia might be involved. Specifically, Wikipedia gives (gave, it might have changed) “three population counts: the district and city population (122,800), urban population (130,000), and county population (752,900)”. But querying DBpedia gave him “three values for population: 12, 73 and 752,900”.
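For concreteness, here is a minimal sketch, in Python with the SPARQLWrapper library, of the kind of query involved. This is not Brian’s actual query: the property names (dbo:numberOfStudents, dbo:city, dbo:populationTotal) are illustrative assumptions, and the DBpedia vocabulary has changed over the years. The point it makes is mechanical: if DBpedia holds several population values for one town, each value comes back as its own row, and the absurd percentages follow automatically.

```python
# Hedged sketch only: the property names are assumptions, not Brian's real query.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?town ?students ?population WHERE {
        ?university dbo:numberOfStudents ?students ;
                    dbo:city ?town .
        ?town dbo:populationTotal ?population .
    }
""")

for row in sparql.query().convert()["results"]["bindings"]:
    students = int(float(row["students"]["value"]))
    population = int(float(row["population"]["value"]))
    # Several population figures for the same town mean several result rows here,
    # and a stray value like 12 turns a plausible 31% into the reported 3224%.
    if population > 0:
        print(row["town"]["value"], round(100.0 * students / population, 1))
```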

There is of course something faintly alarming about this. What’s the point of Linked Data if it can so easily produce such stupid results? Or worse, produce seriously wrong but not quite so obviously stupid results? But in the end, I don’t think this is the right reaction. If we care about our queries, we should care about our sources; we should use curated resources that we can trust. Resources from, say… the UK government?

And that’s what Chris Wallace has done. He used pretty reliable data (although the Guardian’s in there somewhere ;-), and built a robust query. He really knows what he’s doing. And the answer is… drum roll… Milton Keynes!

I have to admit I’d been worrying a bit about this outcome. For non-Brits, Milton Keynes is a New Town north-west of London with a collection of concrete cows, more roundabouts than anywhere else (except possibly Swindon, but that’s another story), and some impeccable transport connections. It’s also home to Britain’s largest university, the Open University. The trouble is, very few of those students live in Milton Keynes, or even come to visit for any length of time (just the odd Summer School), as the OU operates almost entirely by distance learning. So if you read the query as “Which town or city in the UK is home to one or more universities whose registered students divided by the local population gives the largest percentage?”, then the answer would be fine.

And hang on again. I just made explicit a transition that has been implicit so far: we’ve been talking about students, and I’ve turned that into university students. We can be pretty sure that’s what Brian meant, but it’s not what he asked. If you start to include primary and secondary school students, I couldn’t guess which town you’d end up with (it might even be Milton Keynes, with its youngish population).

My sense of Brian’s question is “Which town or city in the UK is home to one or more university campuses whose registered full or part time (non-distance) students divided by the local population gives the largest percentage?”. Or something like that (remember Metaxides, above). Go on, have a go at expressing your own version more precisely!
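To see how much hangs on that choice of wording, here is a toy sketch in Python. The town names, field names and numbers are entirely made up for illustration; the only point is that “proportion of students” isn’t computable until you decide which students count and which population you divide by.

```python
# Made-up data purely for illustration; no real towns or figures here.
towns = [
    {"name": "TownA", "population": 100000, "registered": 40000, "distance": 36000},
    {"name": "TownB", "population": 100000, "registered": 20000, "distance": 1000},
]

def student_proportion(town, count_distance_learners):
    """Registered students (optionally minus distance learners) over local population."""
    students = town["registered"]
    if not count_distance_learners:
        students -= town["distance"]
    return students / town["population"]

for count_distance in (True, False):
    winner = max(towns, key=lambda t: student_proportion(t, count_distance))
    label = "counting distance learners" if count_distance else "excluding distance learners"
    print(f"{label}: {winner['name']}")
# counting distance learners: TownA
# excluding distance learners: TownB
```

Same towns, same data, two defensible answers.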

The point is, these things are hard. Understanding your data structures and their semantics, understanding the actual data and their provenance, understanding your questions, expressing them really clearly: these are hard things. That’s why informatics takes years to learn properly, and why people worry about how the parameters in a vCard should be expressed in RDF. It matters, and you can mess up if you get it wrong.

People sometimes say there’s so much dross and rubbish on the Internet that searches such as those Google provides are no good. But in fact, with text, the human reader is mostly extraordinarily good at distinguishing dross from diamonds. A couple of side searches will usually clear up any doubts.

But people don’t read data in that way; it’s automated systems and SPARQL queries that do the reading. We ought to remember a lot more from William Kent about the ambiguities of concepts, but especially that bit about computers possessing incredibly little ordinary intelligence. I’m beginning to worry that Linked Data may be slightly dangerous except for very well-designed systems and very smart people…

6 comments:

  1. I guess this pretty much sums up my general view of the internet, which is that the challenge is not in finding the answer, but in asking the right question.

    My concern with linked data is that all we do is shift the level of human interaction up a notch, so the involvement is in validating the recursive variations in the query.

  2. An interesting post, Chris!

    You said, “I’m beginning to worry that Linked Data may be slightly dangerous except for very well-designed systems and very smart people…”

    My goodness, imagine how dangerous and chaotic it would be if random people were allowed to write and publish documents onto interconnected networks of computers! We'd surely never be able to find what we were looking for! If and when we did, how could we possibly determine the authority under which the artifacts were published? Oh my!

    The answer is, we would avail ourselves of the properties of networks, which when allowed to work at scale tend to ensure that mostly the right things happen, provided we don't muck things up! And of course that is where we are today with the Web of Documents, and where we'll be soon with the Web of Data.

    Scale is important; Brian Kelly's challenge was a bit like trying to use Wikipedia in its very early days, or possibly even like answering questions using, er, AltaVista. Google works not only because of PageRank, but because PageRank (and other secret sauces) are working over an extraordinarily large field of data. See for example Peter Norvig's great talk on Theorizing from Data for more on the importance of scale w.r.t. datasets.

    In conclusion, to be fair I think Brian's challenge should be seen as only a benchmark, a sampling of the effectiveness of linked data practices today. Do it again next month... next year... in five years, and see whether linked data really was dangerous, and whether carefully-crafted systems really were necessary!

  3. Spot on, Chris, and thanks for the link. It's great to see one of my heroes and his seminal work mentioned: "Data and Reality" is a book I advise anyone modeling data to read, a "threshold concept" in that you can never look at data modeling in the same way after reading it. However, it can trigger 'analysis paralysis'.
    I'm reminded of a little poem by one Mrs Edmund Craster:
    Poem – The centipede
    by Mrs Edmund Craster (d. 1874)


    A centipede was happy quite,
    Until a toad in fun
    Said ‘Pray which leg moves after which?’
    This raised her doubts to such a pitch
    She fell exhausted in a ditch,
    Not knowing how to run.

    While lying in this plight,
    A ray of sunshine caught her sight;
    She dwelt upon its beauties long,
    Till breaking into happy song,
    Unthinking she began to run,
    And quite forgot the croaker’s fun.

    Kent died in 2005 and stopped writing in the mid '90s, leaving a great corpus of work. As with another hero of mine, Michael Jackson (of JSP and JSD fame), I often wonder how fundamental ideas could be curated so that they don't have to be rediscovered by each generation; or perhaps that's just the way knowledge works. But surely digital curation is only valuable for the content of its ideas?

    Chris

  4. @bitwacker, you're right to chastise me! I mentioned in the Twitter promo that maybe that last sentence was not quite right. There's nothing wrong with Linked Data of course; they are just data. But there do seem to be rather greater opportunities for error in the context-free world of SPARQL. I shouldn't have confused the two... oh, muddle, muddle!!!

  5. While there may not be anything theoretically wrong with Linked Data, Chris, I think what you're pointing out (quite rightly) is a rather dramatic gap between theory and operating practice. Anyone who's been through a basic course on descriptive and inferential statistics is aware of the ways in which results are deeply influenced by data collection practices and methods, and that trying to combine data sets collected at different times, with different methods, under different operating assumptions is a good way to produce an unholy mess. While there's nothing wrong with Linked Data in theory, the realities of how data is produced are, I believe, going to make the actual practice of linking data infinitely more difficult than its proponents suspect, and the problem actually gets worse as you try to scale up, not better.

    I don't want to sound defeatist on this. Linked data has already produced some fascinating and useful results, but I am detecting a certain "hey this is *easy*" attitude sinking into discussions on linked data. A bit more caution isn't a bad thing in this case. As you commented, much of the Linked Data effort is operating under an assumption that context doesn't (shouldn't?) matter. But the context in which data is produced matters a lot when you start trying to link it all up. And the obvious errors (population 12?) aren't the ones we need to worry about the most.

  6. "While there may not be anything theoretically wrong with Linked Data..."

    Read Kent's book! The only conclusion that one can make is that the Semantic Web would appear wholly "theoretically wrong" to him, being analogous to a mixture of the pure binary and pseudo-binary models that he spent so long warning against. Of particular use would be his discussion of irreducible tuples and why being n-ary is so important.


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.