Thursday 6 October 2011

Thoughts before "The Future of the Past of the Web"

Tomorrow I'm going to be in London for a joint JISC/DPC event on web archiving, "The Future of the Past of the Web" (hashtag #fpw11 if you're so inclined.) It's the third in an occasional series; I gave the closing presentation at the second event and I have been asked to be on a closing panel this time round. One of the things we've been asked to reflect on is what changes have taken place since the last event and how far our expectations have been realised. I thought it would be useful to set my thoughts on this down in advance, partly to help me articulate my own thinking. It will be interesting to see how various views develop during the panel session tomorrow.

Image Courtesy Martin Dodge's Cybergeography collection
Looking back at my concerns in mid-2009 I'm greatly reassured. There were a number of worrying trends apparent in web archives at that time and an apparent lack of bold vision in how we might use web archives in the future - or even in the present. My fear was that the collecting policies, preservation policies and interfaces offered were all taking a very human and document-centric view of what a web archive should do. In OAIS terms, the Designated Community was people who wanted to view individual old web pages having done a search for a particular site, or possibly for a keyword of some sort. The National Archives had taken one incremental but powerful step beyond that, automatically linking archived web pages to 404 pages on government web sites via simple plugins for Apache & IIS, but in the end this still involved serving individual pages for people to read.

That's a valid use case, but by no means the only one. I set out a few other things we might want to be able to do but could not with the interfaces that web archives gave us.

  • What search results would we have got on the web of 1998 using the search engines of 1998?

  • What results would we have got using current search engines on the web of 1998?

  • How can we visualise the set of links to or from a particular site changing over time?

  • Treating the web as a corpus of text over time, how can we track the emergence of words or concepts and their movement from specialist vocabulary to general use?

  • As historians of technology, how can we use a web archive to track things like the emergence of PNG as an image format and the decline of XPM (the original icon format for graphical browsers such as Mosaic)?


I also wanted to show how open APIs or RESTful interfaces can allow others to develop innovative ways to view content. Since there weren't any web archives with such interfaces I fell back on demonstrating the point with Flickr, more particularly with the simple visual beauty that is TagGalaxy. TagGalaxy shows how the ability to search and retrieve images and tags lets someone else build a completely different interface to the Flickr repository, one which minimises textual interaction and which encourages serendipitous discovery. It would have been wonderful to be able to do that with a web archive. Similarly, if Brian Kelly had been able to say to the Internet Archive 'give me all the versions of the home page of the University of Bath between these dates' in a single interaction, it would have been much easier for him to build the informative animation he used in his own presentations for JISC PoWR. I could go on, and at the time I did.
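To make the 'all the versions between these dates' request concrete, here is a minimal sketch in Python of what the answering side of such an interaction might do. Everything in it is an assumption made for illustration: the `Capture` type, the `versions_between` function, the archive.example addresses and the capture dates are invented, and none of them corresponds to a real archive's API or holdings.

```python
from datetime import date
from typing import List, NamedTuple


class Capture(NamedTuple):
    """One archived copy of a page (hypothetical record layout)."""
    captured_on: date
    archived_url: str


def versions_between(captures: List[Capture], start: date, end: date) -> List[Capture]:
    """Return the captures taken from start to end inclusive, oldest first."""
    return sorted(c for c in captures if start <= c.captured_on <= end)


# Invented sample data standing in for an archive's list of captures.
captures = [
    Capture(date(2004, 6, 1), "http://archive.example/20040601/http://www.bath.ac.uk/"),
    Capture(date(1998, 2, 14), "http://archive.example/19980214/http://www.bath.ac.uk/"),
    Capture(date(2001, 9, 30), "http://archive.example/20010930/http://www.bath.ac.uk/"),
    Capture(date(2008, 1, 5), "http://archive.example/20080105/http://www.bath.ac.uk/"),
]

for capture in versions_between(captures, date(2000, 1, 1), date(2005, 12, 31)):
    print(capture.captured_on.isoformat(), capture.archived_url)
```

The point is only that one call returns the whole ordered list, so a client building an animation does not have to discover each version by browsing.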

Much of what I hoped for then has happened. The architecture of Memento makes it straightforward to view collections of web archives as a single entity from some viewpoints. Projects funded by "Digging Into Data" have shown the power of large web collections in viewing the web as data at many levels. And although most (all?) web archives are not yet offering the APIs or interfaces that would permit us to do some of the things above, I think they at least accept that these are valid aspirations.

Moreover, web archiving has moved from being a specialist concern to something that appears in the letters pages of national newspapers. That, and the type of talks we're going to hear tomorrow, show how far we've moved in 2 1/2 years. I'm quietly confident that things are getting better.

Friday 30 April 2010

DCC survey on social media use

The DCC is currently trying to gather information on the use of social media by those looking at research data management issues. It's already been publicised through a number of routes, so you may already be aware of it. If not, please give us 5 minutes of your time (yes, really 5 minutes - perhaps even less!) to answer a few questions on the survey page we set up:

http://www.surveymonkey.com/s/8KXJDMW

There's no need to identify yourself and we'll only be using the data in aggregate form.

Wednesday 31 March 2010

Linked Data and Reality

I have a copy of the really interesting book “Data and Reality” by William Kent. It’s interesting at several levels; first published in 1978, this appears to be a “print-on-demand” version of the second edition from 1987. Its imprint page simply says “Copyright © 1998, 2000 by William Kent”.

The book is full of really scary ways in which the ambiguity of language can cause problems for what Kent often calls “data processing systems”. He quotes Metaxides:

“Entities are a state of mind. No two people agree on what the real world view is”
Here’s an example of Kent from the first page:
“Becoming an expert in data structures is… not of much value if the thoughts you want to express are all muddled”
But it soon becomes clear that most of us are all too easily muddled, at least when
“... the thing that makes computers so hard is not their complexity, but their utter simplicity… [possessing] incredibly little ordinary intelligence”
I do commend this book to those (like me) who haven’t had formal training in data structures and modelling.

I was reminded of this book by the very interesting attempt by Brian Kelly to find out whether Linked Data could be used to answer a fairly simple question. His challenge was ‘to make use of the data stored in DBpedia (which is harvested from Wikipedia) to answer the query

“Which town or city in the UK has the highest proportion of students?”’
He has written some further posts on the process of answering the query, and attempting to debug the results.

So what was the answer? The query produced the answer Cambridge. That’s a little surprising, but for a while you might convince yourself it’s right; after all, it’s not a large town and it has two universities based there. The table of results shows the student population as 38,696, while the population of the town is… hang on… 12? So the ratio of students to residents is about 3,225 to 1, or some 322,467%. Yes, something is clearly wrong here, and Brian goes on to investigate a bit more. No clear answer yet, although it begins to look as if the process of going from Wikipedia to DBpedia might be involved. Specifically, Wikipedia gives (gave, it might have changed) “three population counts: the district and city population (122,800), urban population (130,000), and county population (752,900)”. But querying DBpedia gave him “three values for population: 12, 73 and 752,900”.
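The arithmetic is easy to check, and even a crude plausibility test would have flagged the result. A minimal sketch in Python, using only the figures quoted above; the function names and the 100% ceiling are my own illustrative assumptions and are not part of Brian's query.

```python
def student_percentage(students: int, population: int) -> float:
    """Return students as a percentage of the local population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100.0 * students / population


def is_plausible(percentage: float, ceiling: float = 100.0) -> bool:
    """Assumed rule of thumb: resident students cannot outnumber residents."""
    return 0.0 <= percentage <= ceiling


# Student count with the DBpedia population, then with Wikipedia's district figure.
dbpedia = student_percentage(38_696, 12)
wikipedia = student_percentage(38_696, 122_800)

print(f"population 12:      {dbpedia:,.0f}%  plausible={is_plausible(dbpedia)}")
print(f"population 122,800: {wikipedia:.1f}%  plausible={is_plausible(wikipedia)}")
```

A check like this catches the obviously stupid answer; it does nothing for a wrong answer that happens to fall below the ceiling.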

There is of course something faintly alarming about this. What’s the point of Linked Data if it can so easily produce such stupid results? Or worse, produce seriously wrong but not quite so obviously stupid results? But in the end, I don’t think this is the right reaction. If we care about our queries, we should care about our sources; we should use curated resources that we can trust. Resources from, say… the UK government?

And that’s what Chris Wallace has done. He used pretty reliable data (although the Guardian’s in there somewhere ;-), and built a robust query. He really knows what he’s doing. And the answer is… drum roll… Milton Keynes!

I have to admit I’d been worrying a bit about this outcome. For non-Brits, Milton Keynes is a New Town north west of London with a collection of concrete cows, more roundabouts than anywhere (except possibly Swindon, but that’s another story), and some impeccable transport connections. It’s also home to Britain’s largest University, the Open University. The trouble is, very few of those students live in Milton Keynes, or even come to visit for any length of time (just the odd Summer School), as the OU operates almost entirely by distance learning. So if you read the query as “Which town or city in the UK is home to one or more universities whose registered students divided by the local population gives the largest percentage?”, then it would be fine.

And hang on again. I just made an explicit transition there that has been implicit so far. We’ve been talking about students, and I’ve turned that into university students. We can be pretty sure that’s what Brian meant, but it’s not what he asked. If you start to include primary and secondary school students, I couldn’t guess which town you’d end up with (and it might even be Milton Keynes, with a youngish population).

My sense of Brian’s question is “Which town or city in the UK is home to one or more university campuses whose registered full or part time (non-distance) students divided by the local population gives the largest percentage?”. Or something like that (remember Metaxides, above). Go on, have a go at expressing your own version more precisely!

The point is, these things are hard. Understanding your data structures and their semantics, understanding the actual data and their provenance, understanding your questions, expressing them really clearly: these are hard things. That’s why informatics takes years to learn properly. Why people worry about how the parameters in a VCard should be expressed in RDF. It matters, and you can mess up if you get it wrong.

People sometimes say there’s so much dross and rubbish on the Internet, that searches such as Google provides are no good. But in fact with text, the human reader is mostly extraordinarily good at distinguishing dross from diamonds. A couple of side searches will usually clear up any doubts.

But people don’t do data well. Automated systems do, SPARQL queries do. We ought to remember a lot more from William Kent, about the ambiguities of concepts, but especially that bit about computers possessing incredibly little ordinary intelligence. I’m beginning to worry that Linked Data may be slightly dangerous except for very well-designed systems and very smart people…

Tuesday 9 March 2010

When data shouldn’t be open?

There is a lot of momentum these days behind making data accessible, available, and re-usable. Increasingly people want open data; Science Commons have been recommending using CC0 to make the fully open status of data clear. More recently the Panton Principles start:

“Science is based on building on, reusing and openly criticising the published body of scientific knowledge.

For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is crucial that science data be made open.”

We’ve been big fans of Open Access at the DCC since its early days. We use a Creative Commons licence for our content by default. This blog was one of the earliest to be specific about a Creative Commons licence not only for the core text that we write, but also for the comments that you might add here.

So we strongly support the Open Data approach… where possible. For of course in some areas of science and research, there are data that cannot be open. Usually this is because the data are sensitive. They could be personal data, protected under Data Protection laws. Sensitive personal data (such as medical record data) has extra requirements under those laws. They could be financial microdata, commercially sensitive. Or perhaps data with strong commercial exploitation potential. They could be anthropological data, sensitive through cultural requirements. Research needs to go anywhere, whatever the issues; we can’t be constrained to only research where the data can be open.

So perhaps it’s as simple as that: some science should have open data, and some should have closed data?

Well, maybe not. Because the underlying issue of the Panton Principles must still apply. Research should be verifiable, whether through repeatable experiments or through re-analysable data. Unverifiable research is, well, unreliable, perhaps indistinguishable from fraud. Some access is needed; perhaps we should think of even sensitive data as Less Open Data rather than closed data.

So how do you go about dealing with sensitive data? Keep it secure, transfer it securely, provide access under strict licences and controls in data enclaves, aggregate, de-identify, anonymise: there are plenty of tricks in the book. That’s the topic of the 4th Research Data Management Forum starting tomorrow in Manchester. I hope to have more to write about what we learn later.

A Blue Ribbon for Sustainability?

When we talk about long term digital preservation, about access for the future, about the digital records of science, or of government, or of companies, or the designs of ships or aircraft, the locations of toxic wastes, and so on being accessible for tens or hundreds of years, we are often whistling in the dark to keep the bogeys at bay. These things are all possible, and increasingly we know how to achieve them technically. But much more than non-digital forms, the digital record needs to be continuously sustained, and we just don’t know how to assure that. Providing future access to digital records needs action now and into that future to provide a continuous flow of the necessary will, community participation, energy and (not least) money. Future access requires a sustainable infrastructure. Ensuring sustainability is one of the major unsolved problems in providing future access through digital preservation.

For the past two years I have been lucky enough to be a member of the grandly named Blue Ribbon Task Force on Sustainable Digital Preservation and Access, along with a stellar cast of experts in preservation, in the library and archives worlds, in data, in movies… and in economics. Co-chaired by Fran Berman (previously of SDSC, now of RPI) and Brian Lavoie of OCLC, the Task Force produced an Interim Report (PDF) a year ago, and has just released its Final Report (Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digital Information, also PDF). (The Task Force was itself sustained by an equally stellar cast of sponsors, including the US National Science Foundation and the Andrew W. Mellon Foundation, in partnership with the Library of Congress, the UK’s JISC, the Council on Library and Information Resources, and NARA.)

Sustainability is often equated to keeping up the money supply, but we think it’s much more than that. The Task Force specifically looks at economic sustainability; it says early in the Executive Summary that it’s about

“… mobilizing resources—human, technical, and financial—across a spectrum of stakeholders diffuse over both space and time.”

If you want a FAQ on funding your project over the long term you won’t find it here. Nor will you find a list of benefactors, or pointers to tax breaks, or arguments for your Provost. Instead you should find a report that helps you think in new ways about sustainability, and apply that new thinking to your particular domain. For one of our major conclusions is that there are no general, across the board answers.

One of the great things about this Task Force was its sweeping ambition. Not just content with bringing together a new economics of sustainable digital preservation, but thinking so broadly. This was never about some few resources, or this Repository or that Archive, it was about the preservation and long term access of major areas of our intellectual life, like scholarly communication, like research data, like commercially owned cultural content (the movie industry is part of this), and the blogosphere and variants (collectively produced web content). Looking at those four areas holistically rather than as fragments forced us to recognise how different they are, and how much those differences affect their sustainability. They aren’t the only areas, and indeed further work on other areas would be valuable, but they were enough to make the Task Force think differently from any activity I have taken part in before.

The report is, to my mind, exceedingly well written, thanks to Abby Smith Rumsey; it far exceeds the many rather muddled conversations we had during our investigations. It has many quotable quotes; among my favourites is

“When making the case for preservation, make the case for use.”

Reading the report is not without its challenges, as you might expect. It has to marry two technical vocabularies and make them understandable to both communities. I’ve been living partly in this world for two years, and still sometimes stumble over it; I remember many times screwing up my forehead, raising my hand and asking “Tell us again, what’s a choice variable?” And the reader will have to think about things like derived demand for depreciable durable assets, nonrival in consumption, temporally dynamic and path-dependent, not to mention the free rider problem. These concepts are there for a reason however; get them straight and you’ll understand the game a lot better.

And there are, not surprisingly, big underlying US-based assumptions in places, although the two resident Brits (myself and Paul Ayris of UCL) did manage to inject some internationalism. Further work grounded in other jurisdictions would be extremely valuable.

Overall I don’t think this report is too big an ask for anyone anywhere who is serious about understanding the economic sustainability of digital preservation and future access to digital materials. I hope you find the great value that I believe exists here.

Monday 1 March 2010

DCC: A new phase, a new perspective, a new Director

As the DCC begins its third phase today, I am delighted to announce the appointment of our new Director, Kevin Ashley, who will succeed me upon my retirement in April 2010.

Kevin Ashley has been Head of Digital Archives at the University of London Computer Centre (ULCC) since 1997, during which time his multi-disciplinary group has provided services related to the preservation and reusability of digital resources on behalf of other organisations, as well as conducting research, development and training. The group has operated the National Digital Archive of Datasets for The National Archives of the UK for over twelve years, delivering customised digital repository services to a range of organisations. As a member of the JISC's Infrastructure and Resources Committee, the Advisory Council for ERPANET, plus several advisory boards for data and archives projects and services, Kevin has contributed widely to the research information community. As a firm and trusted proponent of the DCC we look forward to his energetic leadership in this new phase of our evolution.

So far so press release. But I'd go further. I can't tell you how pleased I am with this appointment. As some readers will know, I have personally lobbied all and any potential candidates for this post since before I officially announced I was leaving. I understand we had some excellent candidates (I wasn't directly involved), more than one of whom might have made an excellent Director. But I'm particularly pleased at Kevin's appointment for several reasons: he is well engaged in the community (including good connections with JISC, our major funder), he's tough enough to keep this tricky collaboration thing going, he has an excellent technical understanding, and he has great experience of actually managing this stuff in all its crusty awfulness. I particularly remember his discussion (on a visit to the Edinburgh Informatics Database Group) about issues like how best to deal with an archived dataset where they came across the characters "five" in a field defined as numeric! You can make it work or make it a record but not both...

So congratulations Kevin, and good luck!


Thursday 4 February 2010

Persistent identifiers workshop comes round again

It seems to be the one event that people think is important enough to go to, even though they fear in their hearts that, yet again, not a lot of progress will be made. Most of those at yesterday’s JISC-funded Persistent Identifiers workshop had been to several such meetings before. For my part, I learned quite a lot, but the slightly flat outcome was not all that unexpected. It’s not quite Groundhog Day, as things do move forward slightly from one meeting to the next.

Part of the trouble is in the name. There is this tendency to think that persistent identifiers can be made persistent by some kind of technical solution. To my mind this is a childish belief in the power of magic, and a total abdication of responsibility; the real issues with “persistent” identifiers are policy and social issues. Basically, far too many people just don’t get some simple truths. If you have a resource which has been given some kind of identifier that resolves to its address (so people can use it), and you change that address without telling those who manage the identifier/resolution, then the identifier will be broken. End of, as they say!
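The failure mode can be shown with a toy model. This is a minimal sketch in Python, assuming a resolver is nothing more than a lookup table from identifier to address; the class, the identifier and the example.org addresses are all invented for illustration and do not describe how any real resolution service works.

```python
from typing import Dict, Optional


class ToyResolver:
    """Hypothetical resolver: a table of where each resource is believed to be."""

    def __init__(self) -> None:
        self.table: Dict[str, str] = {}     # what the resolver has been told
        self.location: Dict[str, str] = {}  # where the resource really lives

    def publish(self, identifier: str, address: str) -> None:
        self.table[identifier] = address
        self.location[identifier] = address

    def move(self, identifier: str, new_address: str, notify: bool) -> None:
        """Move the resource; the table changes only if its managers are told."""
        self.location[identifier] = new_address
        if notify:
            self.table[identifier] = new_address

    def dereference(self, identifier: str) -> Optional[str]:
        """Return the address if the resource is still there, otherwise None."""
        address = self.table.get(identifier)
        if address is None or address != self.location.get(identifier):
            return None
        return address


resolver = ToyResolver()
resolver.publish("id:paper-1", "http://old.example.org/paper-1")

# The resource moves and nobody tells the resolver: the identifier breaks.
resolver.move("id:paper-1", "http://new.example.org/paper-1", notify=False)
print(resolver.dereference("id:paper-1"))

# Exactly the same technology, but the policy is followed: it works again.
resolver.move("id:paper-1", "http://new.example.org/paper-1", notify=True)
print(resolver.dereference("id:paper-1"))
```

Nothing in the code distinguishes the two cases except the `notify` flag, which stands for a human obligation rather than a technical feature.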

This applies whether you have an externally managed identifier (DOI, Handle, PURL) or an internally managed identifier (eg a well-designed HTTP URI… Paul Walk threatened to throw a biscuit at the first person to mention “Cool URLs”, but had to throw it at himself!).

Now clearly some identifiers have traction in some areas. Thanks to the efforts of CrossRef and its member publishers, the DOI is extremely useful in the scholarly journal literature world. You really wouldn’t want to invent a new identifier for journal articles now, and if you have a journal that doesn’t use DOIs (ahem!), you would be well-advised to sign up. It looks very affordable for a small publisher: $275 per year plus $1 per article.

Even for such a well-established identifier, with well-defined policies and a strong set of social obligations, things do go wrong. I give you Exhibit A, for example, in which Bryan Lawrence discovers that dereferencing a DOI for a 2001 article on his publications list leads to "Content not found" (apologies for the “acerbic” nature of my comment there). It looks like this was due to a failure of two publishers to handle a journal transfer properly; the new publisher made up a new DOI for the article, and abandoned the old one. Aaaaarrrrrrggggghhhhhhh! Moving a resource and giving it a new DOI is a failure of policy and social underpinning (let alone competence) that no persistent identifier scheme can survive! CrossRef does its best to prevent such fiascos occurring, but see social issues above. People fail to understand how important this is, or simple things like: the DOI prefix is not part of your brand!

Whether a DOI is the right identifier to use for research data seems to me a much more open question. The issue here is whether the very different nature of (at least some kinds of) research data would make the DOI less useful. The DataCite group is committed to improving the citability of research data (which I applaud), but also seems to be committed to use of the DOI, which is a little more worrying. While the DOI is clearly useful for a set of relatively small, unchanging digital objects published in relatively small numbers each year (eg articles published in the scholarly literature), is it so useful for a resource type which varies by many orders of magnitude in terms of numbers of objects, rate of production, size of object, granularity of identified subset, and rate of change? In particular, the issue of how a DOI should relate to an object that is constantly changing (as so many research datasets do) appears relatively un-examined.

There was some discussion, interesting to me at least, on the relationships of DOIs to the Linked Data world. If you remember, in that world things are identified by URIs, preferably HTTP URIs. We were told (via the twitter backchannel, about which I might say more later) that DOIs are not URIs, and that the dx.doi.org version is not a DOI (nor presumably is the INFO URI version). This may be fact, but seems to me rather a problem, as it means that "real DOIs" don't work as first-class citizens of a Linked Data world. If the International DOI Foundation were to declare that the HTTP version was equivalent to a DOI, and could be used wherever a DOI could be used, then the usefulness of the DOI as an identifier in a Linked Data world might be greatly increased.
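The equivalence I have in mind is mechanically trivial, which is rather the point. A minimal sketch in Python, assuming the dx.doi.org proxy form mentioned above; the helper names are hypothetical, the DOI in the example is made up, and the test that a DOI starts with "10." and contains a "/" is a deliberately loose approximation rather than the full DOI syntax.

```python
from urllib.parse import quote, unquote

PROXY = "http://dx.doi.org/"  # the HTTP proxy form discussed above


def doi_to_http_uri(doi: str) -> str:
    """Turn a bare DOI (optionally prefixed with 'doi:') into its HTTP form."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    prefix, slash, suffix = doi.partition("/")
    if not (prefix.startswith("10.") and slash and suffix):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URI path, keeping '/'.
    return PROXY + quote(doi, safe="/")


def http_uri_to_doi(uri: str) -> str:
    """Recover the bare DOI from its HTTP proxy form."""
    if not uri.startswith(PROXY):
        raise ValueError(f"not a {PROXY} URI: {uri!r}")
    return unquote(uri[len(PROXY):])


uri = doi_to_http_uri("doi:10.5555/example-123")  # made-up DOI
print(uri)
print(http_uri_to_doi(uri))
```

Because the mapping is reversible, declaring the two forms equivalent would be a policy decision, not an engineering one.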

A question that’s been bothering me for a while is when an “arms-length” scheme, like PURL, Handle, DOI etc is preferable to a well-managed local HTTP identifier. We know that such well-managed HTTP identifiers can be extremely persistent; as far as I know all of the eLib programme URIs established by UKOLN in 1995 still work, even though UKOLN web infrastructure has completely changed (and I suspect that those identifiers have outlasted the oldest extant DOI, which must have happened after 1998). Such a local identifier remains under your control, free of external costs, and can participate fully in the Linked Data world; these are quite significant advantages. It seems to me that the main advantage of the set of “arms-length” identifiers is that they are independent of the domain, so they can be managed even if the original domain is lost; at that point, a HTTP URI redirect table could not be set up. So I’m afraid I joked on twitter that perhaps “use of a DOI was a public statement of lack of confidence in the future of your organisation”. Sadly I missed waving the irony flag on this, so it caused a certain amount of twitter outrage that was unintentional!

In fact the twitter backchannel was extremely interesting. Around a third or so of the twits were not actually at the meeting, which of course was not apparent to all. And it is in the nature of a backchannel to be responding to a heard discourse, not apparent to the absent twits; in other words, the tweets represent a flawed and extremely partial view of the meeting. Some of those who were not present (who included people in the DOI world, the IETF and big publishers) seemed to get quite the wrong end of the stick about what was being said. On the other hand, some external contributions were extremely useful and added value for the meat-space participants!

I will end with one more twitter contribution. We had been talking a bit about the publishing world, and someone asked how persistent are academic publishers. The tweet came back from somewhere “well, their salespeople are always ringing us up ;-) !”