tag:blogger.com,1999:blog-13039753712941582462024-03-11T10:02:17.695+00:00Digital Curation BlogBlog inspired by the Digital Curation Centre to discuss issues relating to the curation and long term preservation of digital science and research data.Graham Pryorhttp://www.blogger.com/profile/12394604548989689232noreply@blogger.comBlogger281125tag:blogger.com,1999:blog-1303975371294158246.post-71214894948834182092011-10-06T16:04:00.006+01:002011-10-06T17:56:01.860+01:00Thoughts before "The Future of the Past of the Web"Tomorrow I'm going to be in London for a joint JISC/DPC event on web archiving, "<a href="http://www.jisc.ac.uk/events/2011/10/futureoftheweb.aspx">The Future of the Past of the Web</a>" (hashtag #fpw11 if you're so inclined.) It's the third in an occasional series; I gave the closing presentation at <a href="http://www.dpconline.org/events/previous-events/425-missing-links-the-enduring-web">the second event</a> and I have been asked to be on a closing panel this time round. One of the things we've been asked to reflect on is what changes have taken place since the last event and how far our expectations have been realised. I thought it would be useful to set my thoughts on this down in advance, partly to help me articulate my own thinking. 
It will be interesting to see how various views develop during the panel session tomorrow.<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAH8PcBQLSHXP1EBFJxH_te5M17u4suUqngL1pRc0Jb_9BiwD3HRXxUhVLa1-JGXqym8kxyttq_molEk_UxoeyKMZRZo6gJqF2OnOrWqfz8OgJJAFN69aH2cwAQ5v6jLAewQ1b7EXmaZE/s1600/cybergeography.PNG"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 288px; height: 320px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAH8PcBQLSHXP1EBFJxH_te5M17u4suUqngL1pRc0Jb_9BiwD3HRXxUhVLa1-JGXqym8kxyttq_molEk_UxoeyKMZRZo6gJqF2OnOrWqfz8OgJJAFN69aH2cwAQ5v6jLAewQ1b7EXmaZE/s320/cybergeography.PNG" border="0" alt="Image Courtesy Martin Dodge's Cybergeography collection" id="BLOGGER_PHOTO_ID_5660422924726185522" caption="From Martin Dodge's Cybergeography collection"/></a><br />Looking back at my concerns in mid-2009 I'm greatly reassured. There were a number of worrying trends apparent in web archives at that time and an apparent lack of bold vision in how we might use web archives in the future - or even in the present. My fear was that the collecting policies, preservation policies and interfaces offered were all taking a very human and document-centric view of what a web archive should do. In OAIS terms, the Designated Community was people who wanted to view individual old web pages having done a search for a particular site, or possibly for a keyword of some sort. The National Archives had taken one incremental but powerful step beyond that, automatically linking archived web pages to 404 pages on government web sites via simple plugins for Apache & IIS, but in the end this still involved serving individual pages for people to read.<br /><br />That's a valid use case, but by no means the only ones. 
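The National Archives' 404-rescue idea generalises nicely. Here is a minimal sketch of the pattern in Python, using the Memento convention of a "TimeGate" that takes an original URL plus an Accept-Datetime header and redirects to the nearest archived capture; the TimeGate address is an assumption for illustration, and this is not the actual TNA plugin logic:

```python
from calendar import timegm
from datetime import datetime, timezone
from email.utils import formatdate

# Hypothetical TimeGate endpoint (an assumption for this sketch, not the
# address the TNA plugins actually used).
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"

def memento_request(dead_url, when):
    """Build the request a 404 handler might issue to find an archived copy."""
    # Accept-Datetime must be an RFC 1123 date, per the Memento convention.
    stamp = formatdate(timegm(when.utctimetuple()), usegmt=True)
    return TIMEGATE + dead_url, {"Accept-Datetime": stamp}

url, headers = memento_request(
    "http://www.example.gov.uk/old-page",
    datetime(2009, 6, 1, tzinfo=timezone.utc),
)
# A real handler would GET `url` with `headers` and follow the redirect to
# the nearest capture, instead of serving a bare 404.
print(headers["Accept-Datetime"])  # Mon, 01 Jun 2009 00:00:00 GMT
```

In a server, those few lines would sit inside the error handler, which is essentially all the Apache and IIS plugins needed to do.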
I set out a few other things we might want to be able to do but could not with the interfaces that web archives gave us.<br /><ul><br /><li>What search results would we have got on the web of 1998 using the search engines of 1998?</li><br /><li>What results would we have got using current search engines on the web of 1998?<br /></li><br /><li>How can we visualise the set of links to or from a particular site changing over time?</li><br /><li>Treating the web as a corpus of text over time, how can we track the emergence of words or concepts and their movement from specialist vocabulary into general use?</li><br /><li>As historians of technology, how can we use a web archive to track things like the emergence of PNG as an image format and the decline of XPM (the original icon format for graphical browsers such as Mosaic)?</li><br /></ul><br /> I also wanted to show how open APIs or RESTful interfaces can allow others to develop innovative ways to view content. Since there weren't any web archives with such interfaces I fell back on demonstrating the point with Flickr, more particularly with the simple visual beauty that is <a href="http://www.taggalaxy.de/">TagGalaxy</a>. TagGalaxy shows how the ability to search and retrieve images and tags lets someone else build a completely different interface to the Flickr repository, one which minimises textual interaction and which encourages serendipitous discovery. It would have been wonderful to be able to do that with a web archive. Similarly, if <a href="http://ukwebfocus.wordpress.com/">Brian Kelly</a> had been able to say to the Internet Archive 'give me all the versions of the home page of the University of Bath between these dates' in a single interaction, it would have been much easier for him to build the informative animation he used in his own presentations for <a href="http://jiscpowr.jiscinvolve.org/">JISC PoWR</a>. I could go on, and at the time I did.<br /><br />Much of what I hoped for then has happened.
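For the record, the 'all the versions of the home page between these dates' interaction is exactly what a capture-index API makes possible. Here is a sketch in Python against a CDX-style index of the kind the Internet Archive now exposes; the endpoint and parameter names are my assumptions about that public API, so treat them as illustrative:

```python
from urllib.parse import urlencode

# CDX-style capture index endpoint (my assumption about the Internet
# Archive's public API; check current documentation before relying on it).
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def captures_query(site, start, end):
    """Build a query for every capture of `site` between two YYYYMMDD dates."""
    params = {
        "url": site,
        "from": start,     # inclusive start date
        "to": end,         # inclusive end date
        "output": "json",  # one row per capture: timestamp, URL, status, ...
    }
    return CDX_ENDPOINT + "?" + urlencode(params)

# The single interaction needed for a homepage-over-time animation:
query = captures_query("www.bath.ac.uk", "19980101", "20051231")
print(query)
```

One request, one list of captures: the animation problem becomes a rendering problem rather than a harvesting one.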
The architecture of Memento makes it straightforward to view collections of web archives as a single entity from some viewpoints. Projects funded by "<a href="http://www.diggingintodata.org/">Digging Into Data</a>" have shown the power of large web collections in viewing the web as data at many levels. And although most (all?) web archives are not yet offering the APIs or interfaces that would permit us to do some of the things above, I think they at least accept that these are valid aspirations.<br /><br />Moreover, web archiving has moved from being a specialist concern to something that appears in the <a href="http://jiscpowr.jiscinvolve.org/wp/2010/01/12/web-archiving-in-the-wider-world/">letters pages of national newspapers</a>. That, and the type of talks we're going to hear tomorrow, show how far we've moved in 2 1/2 years. I'm quietly confident that things are getting better.Kevin Ashleyhttp://www.blogger.com/profile/15371869767865369079noreply@blogger.com2tag:blogger.com,1999:blog-1303975371294158246.post-13561736424523712442010-04-30T09:50:00.003+01:002010-04-30T14:16:24.408+01:00DCC survey on social media useThe DCC is currently trying to gather information on the use of social media by those looking at research data management issues. It's already been publicised through a number of routes, so you may already be aware of it. If not, please give us 5 minutes of your time (yes, really 5 minutes - perhaps even less!) 
to answer a few questions on the survey page we set up:<br /><br /><a href="http://www.surveymonkey.com/s/8KXJDMW">http://www.surveymonkey.com/s/8KXJDMW</a><br /><br />There's no need to identify yourself and we'll only be using the data in aggregate form.Kevin Ashleyhttp://www.blogger.com/profile/15371869767865369079noreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-65303788110228534582010-03-31T10:29:00.002+01:002010-03-31T10:55:53.786+01:00Linked Data and Reality<!--StartFragment--> <p class="MsoNormal">I have a copy of the really interesting book “Data and Reality” by William Kent. It’s interesting at several levels; first published in 1978, this appears to be a “print-on-demand” version of the second edition from 1987. Its imprint page simply says “Copyright © 1998, 2000 by William Kent”. </p><p class="MsoNormal">The book is full of really scary ways in which the ambiguity of language can cause problems for what Kent often calls “data processing systems”. He quotes Metaxides:</p><blockquote> “Entities are a state of mind. No two people agree on what the real world view is”</blockquote> Here’s an example from Kent’s first page: <blockquote>“Becoming an expert in data structures is… not of much value if the thoughts you want to express are all muddled”</blockquote>But it soon becomes clear that most of us are all too easily muddled, at least when<blockquote> “... the thing that makes computers so hard is not their complexity, but their utter simplicity… [possessing] incredibly little ordinary intelligence”</blockquote>I do commend this book to those (like me) who haven’t had formal training in data structures and modelling. <p></p> <p class="MsoNormal">I was reminded of this book by the very interesting attempt by Brian Kelly to find out whether Linked Data could be used to answer a fairly simple question.
His <a href="http://ukwebfocus.wordpress.com/2010/02/12/a-challenge-to-linked-data-developers/">challenge</a> was ‘to make use of the data stored in <a href="http://dbpedia.org/About">DBpedia</a> (which is harvested from <a href="http://www.wikipedia.org/">Wikipedia</a>) to answer the query<span style="font-size:11.0pt;font-family:Verdana;color:#0E0E0E;mso-ansi-language:EN-US"> </span><i></i></p><i><blockquote>“Which town or city in the UK has the highest proportion of students?"</blockquote></i><span style="font-style:normal">He has written some further posts on the process of <a href="http://ukwebfocus.wordpress.com/2010/02/19/response-to-my-linked-data-challenge/">answering</a> the query, and attempting to <a href="http://ukwebfocus.wordpress.com/2010/02/24/approaches-to-debugging-the-dbpedia-query/">debug</a> the results.</span><p></p> <p class="MsoNormal">So what was the answer? The query produced the answer Cambridge. That’s a little surprising, but for a while you might convince yourself it’s right; after all, it’s not a large town and it has two universities. The table of results shows the student population as 38,696, while the population of the town is… hang on… 12? So the “proportion” of students works out at more than 3,000 students per resident. Yes, something is clearly wrong here, and Brian goes on to investigate a bit more. No clear answer yet, although it begins to look as if the process of going from Wikipedia to DBpedia might be involved. Specifically, Wikipedia gives (gave, it might have changed) “three population counts: the district and city population (122,800), urban population (130,000), and county population (752,900)”. But querying DBpedia gave him “<span lang="EN-US" style="mso-ansi-language:EN-US">three values for population: 12, 73 and 752,900”.</span></p> <p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language:EN-US">There is of course something faintly alarming about this. What’s the point of Linked Data if it can so easily produce such stupid results?
Or worse, produce seriously wrong but not quite so obviously stupid results? But in the end, I don’t think this is the right reaction. If we care about our queries, we should care about our sources; we should use curated resources that we can trust. Resources from, say… the UK<span style="mso-spacerun: yes"> </span>government? </span></p> <p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language:EN-US">And that’s what </span><span lang="EN-US"><a href="http://kitwallace.posterous.com/university-problem-again">Chris Wallace has done</a></span><span lang="EN-US" style="mso-ansi-language:EN-US">. He used pretty reliable data (although the Guardian’s in there somewhere ;-), and built a robust query. He really knows what he’s doing. And the answer is… drum roll… Milton Keynes!</span></p> <p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language:EN-US">I have to admit I’d been worrying a bit about this outcome. For non-Brits, Milton Keynes is a New Town north west of London with a collection of </span><span lang="EN-US"><a href="http://en.wikipedia.org/wiki/Concrete_Cows">concrete cows</a></span><span lang="EN-US" style="mso-ansi-language:EN-US">, more roundabouts than anywhere (except possibly Swindon, but that’s another story), and some impeccable transport connections. It’s also home to Britain’s largest University, the </span><span lang="EN-US"><a href="http://www.open.ac.uk/">Open University</a></span><span lang="EN-US" style="mso-ansi-language:EN-US">. The trouble is, very few of those students live in Milton Keynes, or even come to visit for any length of time (just the odd Summer School), as the OU operates almost entirely by distance learning. So if you read the query as “Which town or city in the UK is home to one or more universities whose registered students divided by the local population gives the largest percentage?”, then it would be fine.</span></p> <p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language:EN-US">And hang on again. 
I just made an explicit transition there that has been implicit so far. We’ve been talking about students, and I’ve turned that into university students. We can be pretty sure that’s what Brian meant, but it’s not what he asked. If you start to include primary and secondary school students, I couldn’t guess which town you’d end up with (and it might even be Milton Keynes, with a youngish population).</span></p> <p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language:EN-US">My sense of Brian’s question is “Which town or city in the UK is home to one or more university campuses whose registered full or part time (non-distance) students divided by the local population gives the largest percentage?”. Or something like that (remember Metaxides, above). Go on, have a go at expressing your own version more precisely!</span></p> <p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language:EN-US">The point is, these things are hard. Understanding your data structures and their semantics, understanding the actual data and their provenance, understanding your questions, expressing them really clearly: these are hard things. That’s why informatics takes years to learn properly. Why people worry about how the </span><span lang="EN-US"><a href="http://www.w3.org/Submission/2010/SUBM-vcard-rdf-20100120">parameters in a VCard should be expressed in RDF</a></span><span lang="EN-US" style="mso-ansi-language:EN-US">. It matters, and you can mess up if you get it wrong.</span></p> <p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language:EN-US">People sometimes say there’s so much dross and rubbish on the Internet, that searches such as Google provides are no good. But in fact with text, the human reader is mostly extraordinarily good at distinguishing dross from diamonds. A couple of side searches will usually clear up any doubts.</span></p> <p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language:EN-US">But people don’t do data well. 
Automated systems do; SPARQL queries do. We ought to remember a lot more from William Kent, about the ambiguities of concepts, but especially that bit about<span style="mso-spacerun: yes"> </span>computers possessing incredibly little ordinary intelligence. I’m beginning to worry that Linked Data may be slightly dangerous except for very well-designed systems and very smart people…</span><o:p></o:p></p> <!--EndFragment-->Unknownnoreply@blogger.com6tag:blogger.com,1999:blog-1303975371294158246.post-8767214288397092772010-03-09T22:50:00.003+00:002010-03-09T23:01:36.601+00:00When data shouldn’t be open?<!--StartFragment--> <p class="MsoNormal">There is great momentum these days behind making data accessible, available and re-usable. Increasingly people want open data; Science Commons have been recommending using CC0 to make the fully open status of data clear. More recently the <a href="http://pantonprinciples.org/">Panton Principles</a> start:</p> <p class="MsoNormal"></p><blockquote><p class="MsoNormal">“Science is based on building on, reusing and openly criticising the published body of scientific knowledge.<o:p></o:p></p> <p class="MsoNormal">For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is crucial that science data be made <a href="http://opendefinition.org/"><b>open</b></a>.”<o:p></o:p></p> <p class="MsoNormal"></p></blockquote><p class="MsoNormal">We’ve been big fans of Open Access at the DCC since its early days. We use a Creative Commons licence for our content by default. This blog was one of the earliest to be specific about a Creative Commons licence not only for the core text that we write, but also for the comments that you might add here.</p><p class="MsoNormal"><o:p></o:p></p> <p class="MsoNormal">So we strongly support the Open Data approach… where possible. For of course in some areas of science and research, there are data that cannot be open.
Usually this is because the data are sensitive. They could be personal data, protected under Data Protection laws. Sensitive personal data (such as medical record data) have extra requirements under those laws. They could be financial microdata, commercially sensitive. Or perhaps data with strong commercial exploitation potential. They could be anthropological data, sensitive through cultural requirements. Research needs to go anywhere, whatever the issues; we can’t be constrained to research only where the data can be open.</p><p class="MsoNormal"><o:p></o:p></p> <p class="MsoNormal">So perhaps it’s as simple as that: some science should have open data, and some should have closed data?</p> <p class="MsoNormal">Well, maybe not. Because the underlying issue of the Panton Principles must still apply. Research should be verifiable, whether through repeatable experiments or through re-analysable data. Unverifiable research is, well, unreliable – perhaps indistinguishable from fraud. Some access is needed; perhaps we should think of even sensitive data as Less Open Data rather than closed data.</p> <p class="MsoNormal">So how do you go about dealing with sensitive data? Keep it secure, transfer securely, provide access under strict licences and controls in data enclaves, aggregate, de-identify, anonymise: there are plenty of tricks in the book. That’s the topic of the 4<sup>th</sup> Research Data Management Forum starting tomorrow in Manchester.
I hope to have more to write later about what we learn.<o:p></o:p></p> <!--EndFragment-->Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-66867842915266689462010-03-09T11:21:00.003+00:002010-03-09T11:33:02.691+00:00A Blue Ribbon for Sustainability?<!--StartFragment--> <p class="MsoNormal">When we talk about long term digital preservation, about access for the future, about the digital records of science, or of government, or of companies, or the designs of ships or aircraft, the locations of toxic wastes, and so on being accessible for tens or hundreds of years, we are often whistling in the dark to keep the bogeys at bay. These things are all possible, and increasingly we know how to achieve them technically. But much more than non-digital forms, the digital record needs to be continuously sustained, and we just don’t know how to assure that. Providing future access to digital records needs action now and into that future to provide a continuous flow of the necessary will, community participation, energy and (not least) money. Future access requires a sustainable infrastructure. Ensuring sustainability is one of the major unsolved problems in providing future access through digital preservation.</p> <p class="MsoNormal">For the past two years I have been lucky enough to be a member of the grandly named <a href="http://brtf.sdsc.edu/">Blue Ribbon Task Force on Sustainable Digital Preservation and Access</a>, along with a stellar cast of experts in preservation, in the library and archives worlds, in data, in movies… and in economics.
Co-chaired by Fran Berman (previously of SDSC, now of RPI) and Brian Lavoie of OCLC, the Task Force produced an <a href="http://brtf.sdsc.edu/biblio/BRTF_Interim_Report.pdf">Interim Report</a> (PDF) a year ago, and has just released its <a href="http://brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf">Final Report</a> (Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digital Information, also PDF). (The Task Force was itself sustained by an equally stellar cast of sponsors, including the US National Science Foundation and the Andrew W. Mellon Foundation, in partnership with the Library of Congress, the UK’s JISC, the Council on Library and Information Resources, and NARA.)</p> <p class="MsoNormal">Sustainability is often equated to keeping up the money supply, but we think it’s much more than that. The Task Force specifically looks at economic sustainability; it says early in the Executive Summary that it’s about</p> <p class="MsoNormal" style="mso-pagination:none;mso-layout-grid-align:none;text-autospace:none"></p><blockquote><p class="MsoNormal" style="mso-pagination:none;mso-layout-grid-align:none;text-autospace:none">“<span lang="EN-US" style="mso-ansi-language:EN-US">… mobilizing resources—human, technical, and financial—across a spectrum of stakeholders diffuse over both space and time.”</span></p> <p class="MsoNormal"></p></blockquote><p class="MsoNormal">If you want a FAQ on funding your project over the long term you won’t find it here. Nor will you find a list of benefactors, or pointers to tax breaks, or arguments for your Provost. Instead you should find a report that helps you think in new ways about sustainability, and apply that new thinking to your particular domain. For one of our major conclusions is that there are no general, across-the-board answers.</p> <p class="MsoNormal">One of the great things about this Task Force was its sweeping ambition.
It was not content just with bringing together a new economics of sustainable digital preservation; it thought broadly too. This was never about some few resources, or this Repository or that Archive; it was about the preservation and long term access of major areas of our intellectual life, like scholarly communication, like research data, like commercially owned cultural content (the movie industry is part of this), and the blogosphere and variants (collectively produced web content). Looking at those four areas holistically rather than as fragments forced us to recognise how different they are, and how much those differences affect their sustainability. They aren’t the only areas, and indeed further work on other areas would be valuable, but they were enough to make the Task Force think differently from any activity I have taken part in before.</p> <p class="MsoNormal">The report is, to my mind, exceedingly well written, thanks to Abby Smith Rumsey; it far exceeds the many rather muddled conversations we had during our investigations. It has many quotable quotes; among my favourites is</p> <p class="MsoNormal"></p><blockquote><p class="MsoNormal">“When making the case for preservation, make the case for use.”</p> <p class="MsoNormal"></p></blockquote><p class="MsoNormal">Reading the report is not without its challenges, as you might expect. It has to marry two technical vocabularies and make them understandable to both communities. I’ve been living partly in this world for two years, and still sometimes stumble over it; I remember many times screwing up my forehead, raising my hand and asking “Tell us again, what’s a choice variable?” And the reader will have to think about things like derived demand for depreciable durable assets, nonrival in consumption, temporally dynamic and path-dependent, not to mention the free rider problem.
These concepts are there for a reason however; get them straight and you’ll understand the game a lot better.</p> <p class="MsoNormal">And there are not surprisingly big underlying US-based assumptions in places, although the two resident Brits (myself and Paul Ayris of UCL) did manage to inject some internationalism. Further work grounded in other jurisdictions would be extremely valuable.</p> <p class="MsoNormal">Overall I don’t think this report is too big an ask for anyone anywhere who is serious about understanding the economic sustainability of digital preservation and future access to digital materials. I hope you find the great value that I believe exists here.</p> <!--EndFragment-->Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-20861379673472879302010-03-01T20:23:00.004+00:002010-03-01T20:39:03.907+00:00DCC: A new phase, a new perspective, a new Director<div>As the DCC begins its third phase today, I am delighted to announce the appointment of our new Director, Kevin Ashley, who will succeed me upon my retirement in April 2010.</div><div><br /></div><div>Kevin Ashley has been Head of Digital Archives at the University of London Computer Centre (ULCC) since 1997, during which time his multi-disciplinary group has provided services related to the preservation and reusability of digital resources on behalf of other organisations, as well as conducting research, development and training. The group has operated the National Digital Archive of Datasets for The National Archives of the UK for over twelve years, delivering customised digital repository services to a range of organisations. As a member of the JISC's Infrastructure and Resources Committee, the Advisory Council for ERPANET, plus several advisory boards for data and archives projects and services, Kevin has contributed widely to the research information community. 
As a firm and trusted proponent of the DCC, we look forward to his energetic leadership in this new phase of our evolution.</div><div><br /></div><div>So far so press release. But I'd go further. I can't tell you how pleased I am with this appointment. As some readers will know, I have personally lobbied all and any potential candidates for this post since before I officially announced I was leaving. I understand we had some excellent candidates (I wasn't directly involved), more than one of whom might have made an excellent Director. But I'm particularly pleased at Kevin's appointment for several reasons: he is well engaged in the community (including good connections with JISC, our major funder), he's tough enough to keep this tricky collaboration thing going, he has an excellent technical understanding, and he has great experience of actually managing this stuff in all its crusty awfulness. I particularly remember his discussion (on a visit to the Edinburgh Informatics Database Group) about issues like how best to deal with an archived dataset where they came across the characters "five" in a field defined as numeric! You can make it work or make it a record but not both...</div><div><br /></div><div>So congratulations Kevin, and good luck!</div><div><br /></div><div><br /></div>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-1303975371294158246.post-4869537031205209892010-02-04T19:46:00.005+00:002010-02-04T20:06:32.523+00:00Persistent identifiers workshop comes round again<!--StartFragment--> <p class="MsoNormal">It seems to be the one event that people think is important enough to go to, even though they fear in their hearts that, yet again, not a lot of progress will be made. Most of those at yesterday’s JISC-funded Persistent Identifiers workshop had been to several such meetings before. For my part, I learned quite a lot, but the slightly flat outcome was not all that unexpected.
It’s not quite Groundhog Day, as things do move forward slightly from one meeting to the next.</p> <p class="MsoNormal">Part of the trouble is in the name. There is this tendency to think that persistent identifiers can be made persistent by some kind of technical solution.<span style="mso-spacerun: yes"> </span>To my mind this is a childish belief in the power of magic, and a total abrogation of responsibility; the real issues with “persistent” identifiers are policy and social issues. Basically, far too many people just don’t get some simple truths. If you have a resource which has been given some kind of identifier that resolves to its address (so people can use it), and you change that address without telling those who manage the identifier/resolution, then the identifier will be broken. End of, as they say! </p><p class="MsoNormal">This applies whether you have an externally managed identifier (DOI, Handle, PURL) or an internally managed identifier (eg a well-designed HTTP URI… Paul Walk threatened to throw a biscuit at the first person to mention “Cool URLs”, but had to throw it at himself!).</p> <p class="MsoNormal">Now clearly some identifiers have traction in some areas. Thanks to the efforts of <a href="http://www.crossref.org/">CrossRef</a> and its member publishers, the DOI is extremely useful in the scholarly journal literature world. You really wouldn’t want to invent a new identifier for journal articles now, and if you have a journal that doesn’t use DOIs (ahem!), you would be well-advised to sign up. It looks very affordable for a small publisher: $275 per year plus $1 per article.</p> <p class="MsoNormal">Even for such a well-established identifier, with well-defined policies and a strong set of social obligations, things do go wrong. 
I give you <a href="http://home.badc.rl.ac.uk/lawrence/blog/2009/06/04/unhinged_doi%3A_who_ya_gonna_call%3F">Exhibit A</a>, for example, in which Bryan Lawrence discovers that dereferencing a DOI for a 2001 article on his publications list leads to "Content not found" (apologies for the “acerbic” nature of my comment there). It looks like this was due to a failure of two publishers to handle a journal transfer properly; the new publisher made up a new DOI for the article, and abandoned the old one. Aaaaarrrrrrggggghhhhhhh! Moving a resource and giving it a new DOI is a failure of policy and social underpinning (let alone competence) that no persistent identifier scheme can survive! CrossRef does its best to prevent such fiascos occurring, but see social issues above. People fail to understand how important this is, or simple things like: the DOI prefix is not part of your brand!</p> <p class="MsoNormal">Whether a DOI is the right identifier to use for research data seems to me a much more open question. The issue here is whether the very different nature of<span style="mso-spacerun: yes"> </span>(at least some kinds of) research data would make the DOI less useful. The <a href="http://www.datacite.org/">DataCite</a> group is committed to improving the citability of research data (which I applaud), but also seems to be committed to use of the DOI, which is a little more worrying. While the DOI is clearly useful for a set of relatively small, unchanging digital objects published in relatively small numbers each year (eg articles published in the scholarly literature), is it so useful for a resource type which varies by many orders of magnitude in terms of numbers of objects, rate of production, size of object, granularity of identified subset, and rate of change? 
In particular, the issue of how a DOI should relate to an object that is constantly changing (as so many research datasets do) appears relatively unexamined.</p><p class="MsoNormal">There was some discussion, interesting to me at least, on the relationships of DOIs to the Linked Data world. If you remember, in that world things are identified by URIs, preferably HTTP URIs. We were told (via the twitter backchannel, about which I might say more later) that DOIs are not URIs, and that the dx.doi.org version is not a DOI (nor presumably is the INFO URI version). This may be fact, but seems to me rather a problem, as it means that "real DOIs" don't work as first-class citizens of a Linked Data world. If the International DOI Foundation were to declare that the HTTP version was equivalent to a DOI, and could be used wherever a DOI could be used, then the usefulness of the DOI as an identifier in a Linked Data world might be greatly increased.</p> <p class="MsoNormal">A question that’s been bothering me for a while is when an “arms-length” scheme, like PURL, Handle or DOI, is preferable to a well-managed local HTTP identifier. We know that such well-managed HTTP identifiers can be extremely persistent; as far as I know all of the eLib programme URIs established by UKOLN in 1995 still work, even though UKOLN web infrastructure has completely changed (and I suspect that those identifiers have outlasted the oldest extant DOI, which must have happened after 1998). Such a local identifier remains under your control, free of external costs, and can participate fully in the Linked Data world; these are quite significant advantages. It seems to me that the main advantage of the set of “arms-length” identifiers is that they are independent of the domain, so they can be managed even if the original domain is lost; at that point, an HTTP URI redirect table could not be set up.
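The mapping in question is trivial to state, which is rather the point. A sketch of the proxy form under discussion (dx.doi.org is the real resolver; whether its output 'is' a DOI is precisely the dispute):

```python
# Sketch: the HTTP-proxy form of a DOI. Whether this URI *is* the DOI, or
# merely resolves it, is exactly the point at issue in the discussion above.
def doi_to_http_uri(doi, proxy="http://dx.doi.org/"):
    """Map a bare DOI to the URI that the dx.doi.org proxy will resolve."""
    return proxy + doi

# A Linked Data client could then treat the identifier as an ordinary
# dereferenceable HTTP URI, following the proxy's redirect to the publisher.
uri = doi_to_http_uri("10.1000/182")  # the DOI Handbook's own DOI
print(uri)  # http://dx.doi.org/10.1000/182
```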
So I’m afraid I joked on twitter that perhaps “use of a DOI was a public statement of lack of confidence in the future of your organisation”. Sadly I missed waving the irony flag on this, so it caused a certain amount of unintentional twitter outrage!</p> <p class="MsoNormal">In fact the twitter backchannel was extremely interesting. Around a third or so of the twits were not actually at the meeting, which of course was not apparent to all. And it is in the nature of a backchannel to be responding to a heard discourse, not apparent to the absent twits; in other words, the tweets represent a flawed and extremely partial view of the meeting. Some of those who were not present (who included people in the DOI world, the IETF and big publishers) seemed to get quite the wrong end of the stick about what was being said. On the other hand, some external contributions were extremely useful and added value for the meat-space participants!</p> <p class="MsoNormal">I will end with one more twitter contribution. We had been talking a bit about the publishing world, and someone asked how persistent academic publishers are. The tweet came back from somewhere: “well, their salespeople are always ringing us up ;-)”!</p> <!--EndFragment-->Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-1303975371294158246.post-91681827378587605322010-02-02T19:38:00.003+00:002010-02-02T20:21:59.832+00:00More on contact pages and linked data<!--StartFragment--> <p class="MsoNormal">I wrote about RDF-encoding contact information a little <a href="http://digitalcuration.blogspot.com/2010/01/linked-data-and-staff-contact-pages.html">earlier</a> and had some very helpful comments. On reflection, and after exploring the “View Source” options for a couple of institutional contact pages, I’ve had some further thoughts. </p><p class="MsoNormal"><o:p></o:p></p> <p class="MsoNormal">- Contacts pages are rarely authored by hand; they are nearly always created on the fly from an underlying database.
This makes them natural for expressing in RDF (or microformats). It’s just a question of tweaking the way the HTML wrapper is assembled. Bath University’s <a href="http://www.bath.ac.uk/contact">Person Finder</a> pages do encode their data in <a href="http://microformats.org/wiki/hcard#Specification">microformats</a>.</p> <p class="MsoNormal">- I wondered why more universities don’t encode their data in microformats or (even better) in RDF for Linked Data. One possible answer is that the contact pages were probably one of the earliest examples of constructing web pages from databases. It works, it ain’t broke, so they haven’t needed to fix it! If so, a reasonable case would need to be made for any change, but once made it would be comparatively cheap to carry out.</p> <p class="MsoNormal">- A second problem is that it is not at all clear to me what the best encoding and vocabulary for institutional (or organisational unit) contact pages might be. So maybe it’s even less surprising that things have not changed. To say I'm confused is putting it mildly! So what follows lists some of the options after further (but perhaps not complete) investigation...</p> <p class="MsoNormal">One approach is the <a href="http://microformats.org/wiki/hcard#Specification">hCard microformat</a>, based on the widely used vCard specification, <a href="http://www.ietf.org/rfc/rfc2426.txt">RFC2426</a> (this is what Bath uses). That’s fine as far as it goes, but microformats don’t seem to fit directly in the Linked Data world. I’m no expert (clearly!), but in particular, microformats don’t use URIs for the names of things, and don’t use RDF.
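To show how small that tweak to the HTML wrapper could be, here is a sketch of rendering one database row as an hCard fragment (the class names vcard, fn, org, tel and email do come from the hCard spec; the row structure and field names are invented for illustration):

```python
# Render one (invented) database row as an hCard-annotated HTML fragment.
# The class names (vcard, fn, org, tel, email) are from the hCard spec.
def render_hcard(row: dict) -> str:
    return (
        '<div class="vcard">'
        '<span class="fn">{name}</span>, '
        '<span class="org">{org}</span>, '
        '<span class="tel">{tel}</span>, '
        '<a class="email" href="mailto:{email}">{email}</a>'
        '</div>'
    ).format(**row)

row = {"name": "A. Person", "org": "UKOLN", "tel": "+44 1225 000000",
       "email": "a.person@example.ac.uk"}
print(render_hcard(row))
```

The underlying query and data stay exactly as they are; only the template changes.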
They appear useful for extracting information from a web page, but not much beyond that (I guess I stand to be corrected here!).</p> <p class="MsoNormal">Looking at RDF-based encodings, there are options based on vCard, there are FOAF and SIOC (both really coming from a social networking viewpoint), and there’s the Portable Contacts specification.</p> <p class="MsoNormal">Given that vCard is a standard for contact information, it would seem sensible to look for a vCard encoding in RDF. It turns out that there are two RDF encodings of vCard, one supposedly deprecated, and the other apparently unchanged since 2006. I now discover an activity to formalise a W3C approach in this area, with a <a href="http://www.w3.org/Submission/vcard-rdf">draft submission</a> to W3C edited by Renato Iannella and dating only from last December (2009), but I would need a W3C username and password to see the latest version, so I can't tell how it's going.</p> <p class="MsoNormal">Someone asked me a while ago who sets the standards for Linked Data vocabularies. My response at the time was that the users did, by choosing which specification to adopt. At the time, FOAF seemed to have most mentions in this general area, and I rather assumed (see the previous post) that it would have the appropriate elements. However, the “Friend of a Friend” angle really does seem to dominate; this vocabulary does seem to be more about relationships, and to be lacking in some of the elements needed for a contacts page. I suspect this might have stemmed from a desire to stop people compromising their privacy in a spam-laden world. However, those of us in public service posts often need to expose our contact details.
That said, FOAF does have email as foaf:mbox, which apparently includes phone and fax as well, as you can see from the sample FOAF extract in my earlier post.</p> <p class="MsoNormal">In a tweet Dan Brickley suggested: “We'll probably round out FOAF’s address book coverage to align with Portable Contacts spec”, so I had a look at the latter. The main web site didn’t answer, but Google’s <a href="http://66.102.9.132/search?q=cache:E5kvxT0KfKoJ:portablecontacts.net/draft-spec.html+portable+contacts+spec&cd=1&hl=en&ct=clnk&client=safari">cache provided me with a draft spec</a>, which does appear to have the elements I need. </p> <p class="MsoNormal">What elements do I need for a contact page? Roughly I would want some or all of:</p> <p class="MsoNormal"></p><ul><li>Name</li><li>Job title/role in DCC (my virtual organisation)</li><li>(Optional job title/role in home organisation)</li><li>Organisational unit/Organisation</li><li>Address/location</li><li>Phone/fax numbers</li><li>Email address</li></ul><p></p> <p class="MsoNormal">So what could I do if this information were expressed in RDF in the contact pages for a partner institution (say UKOLN at Bath)? Well, presumably the DCC contact pages would be based on a database showing the staff who work on the DCC, with the contact information directly extracted from the remote pages (either linked in real time or perhaps cached in some way). And if Bath changed their telephone numbers again, our contact details would remain up to date. But there's more. Given that there are some staff members who have roles in several projects, it would be easy to see what the linkages were between the DCC and other projects (eg RSP in the past, or I2S2 now).
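A toy illustration, with entirely invented names and data, of the kind of question that becomes easy to ask once contact and project information is held as subject–predicate–object triples:

```python
# Toy triple store: (subject, predicate, object). All data invented.
triples = [
    ("person:alice", "worksFor", "org:ukoln"),
    ("person:alice", "memberOf", "project:dcc"),
    ("person:alice", "memberOf", "project:i2s2"),
    ("person:bob",   "worksFor", "org:ukoln"),
    ("person:bob",   "memberOf", "project:dcc"),
]

def objects(s, p):
    """All objects o such that (s, p, o) is in the store."""
    return {o for (s2, p2, o) in triples if s2 == s and p2 == p}

# Which people link the DCC to other projects? Follow the graph edges.
dcc_people = {s for (s, p, o) in triples
              if p == "memberOf" and o == "project:dcc"}
links = {(s, o) for s in dcc_people
         for o in objects(s, "memberOf") if o != "project:dcc"}
print(links)  # {('person:alice', 'project:i2s2')}
```

In a real Linked Data setting the subjects would be HTTP URIs and the triples would be harvested from the partner sites rather than typed in, but the join is the same.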
Part of the point of Linked Data (rather than microformats) is that one can reason with it; follow the edges of the great global graph…</p> <p class="MsoNormal">And perhaps I would be able to find a simple app that extracts a vCard from the contact page to import into my Mac’s Address Book, which is where I started this search from! You wouldn’t think it would be hard, would you? I mean, this isn’t rocket science, surely?<o:p></o:p></p> <!--EndFragment-->Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-1303975371294158246.post-70792899555040293722010-01-21T16:38:00.003+00:002010-01-21T17:04:58.581+00:00Digimap is 10<!--StartFragment--> <p class="MsoNormal">I had a very enjoyable day yesterday helping <a href="http://edina.ac.uk/">EDINA</a> celebrate <a href="http://edina.ac.uk/event/digimap10/">10 years</a> of the <a href="http://edina.ac.uk/digimap/">Digimap</a> service. What began as an <a href="http://www.ukoln.ac.uk/services/elib/">eLib</a> project and experiment with 6 Universities in 1996 has grown to a mature service with over 100,000 users, 45,000 of them active, in pretty much every UK University, and soon in UK schools as well.</p> <p class="MsoNormal">In 1996 I was Programme Director of the eLib Programme, and my earliest email about Digimap was from the JISC money man, Dave Cook, on 30 January 1996 to Peter Burnhill of the Edinburgh Data Library (as it then was). Dave told Peter we were interested in his idea (for an Images project!) but had a few concerns (that the <a href="http://www.ordnancesurvey.co.uk/">Ordnance Survey</a> might not agree to let us use their mapping data; it’s hard to remember now how difficult some of those 1990s persuasions were!). Three days later, Dave was offering real money, although it had to be spent by 20 March that year. 
Done!</p> <p class="MsoNormal">By late 1997 the <a href="http://www.ukoln.ac.uk/services/elib/projects/digimap/">Digimap project</a> (*) had a trial service; I remember experimenting with it and having some problems (this was with Netscape 3 on a PowerMac Duo or something like that; woefully under-powered in retrospect). By the end of 1999, they were moving to a new GIS system, and we were beginning to discuss turning Digimap into a service, and that went live in January 2000. They had to get 37 subscribing Universities by a particular deadline, and I think they managed 39 somewhat ahead of it. </p><p class="MsoNormal"><o:p></o:p></p> <p class="MsoNormal">Since then the service has grown in scope, quality, usage and value. In my personal opinion (full disclosure, I’m not neutral here, having been associated with it through advisory groups of various kinds throughout its life), Digimap is the best service funded by JISC. Best in quality, best in professionalism, best in innovation, best in support. A lot of people deserve credit for that, and everyone at EDINA should be extremely proud of what they have created. By the way, the OS have managed some major shifts in attitude over the years, from suspicious tolerance through to strong support, and the success is partly down to them, and to the efforts of the negotiators in what is now <a href="http://www.jisc-collections.ac.uk/">JISC Collections</a>.</p><p class="MsoNormal">As well as various forms of OS mapping for GB (whose trademark names always escape me... and it is GB rather than UK, for weird historical reasons), Digimap now offers 4 “epochs” of historic maps from Landmark, plus Geology maps from BGS and Marine maps from SeaZone.
Due to licence restrictions it is only available to registered staff and students at subscribing UK institutions, but I hope that those of you unlucky enough not to fall in that category can soon read more about it on the pages to be put up related to the celebration.</p> <p class="MsoNormal">Digimap has been a bit clunky at times compared with the innovations introduced by some others, but with the new underlying GIS, the interfaces are being upgraded; they now have “slippy maps” (called Digimap Roam) on the base service, and it looks really smart and much more functional. It's tough for a small group to keep up with the likes of Google, Yahoo and MS! Soon this slippy map interface will be extended to the Historic service (“Ancient Roam”?), Geology (“Rock’n Roam”?) and Marine (a rather dull “H2Roam”!)… I think those might be internal names, but if you can complete the set with an even punnier marine name, who knows, they might keep them!</p> <p class="MsoNormal">The day was good fun, and we heard quite a bit about what Digimap is and how it is being used (far more widely than geography departments). The most exciting was a student project using Digimap and a GPS for a light aircraft CFIT-avoidance system (CFIT is Controlled Flight Into Terrain, referred to as “having a bad day”!). We heard from the data suppliers, with a bit more about what’s coming. It was interesting to hear the OS man talking about moves towards Linked Data; I wasn’t sure how that would square with the closed access, but I think I muddled my question (confused Linked Data with OGC web services, I suspect). The service providers didn’t appear to be talking to each other about Linked Data; talking to each other might be a good start.</p> <p class="MsoNormal">A highlight was the closing keynote from Vanessa Lawrence, CEO of OS, and clearly a passionate advocate for her organisation.
Choosing her words very carefully (she is not allowed to influence anyone) she outlined the government’s open data initiative and the <a href="http://www.communities.gov.uk/publications/corporate/ordnancesurveyconsultation">consultation</a> on its implications for the OS; this consultation closes late March 2010, but she urged us to make any responses, whether collectively or as private citizens, well before then. The consultation isn’t simply “should we open up access to OS data?”, it’s much more “how can we open up access to OS data and still sustain the quality of the data into the future”.</p> <p class="MsoNormal">The celebration ended with a reception and dinner, with an amusing after-dinner talk by <a href="http://www.amazon.co.uk/Map-Addict-Mike-Parker/dp/0007300840">Mike Parker, author of Map Addict</a>. All in all, a very enjoyable and worthwhile day to celebrate a significant anniversary.</p> <p class="MsoNormal">PS the twitter tag is <a href="http://twitter.com/#search?q=%23digimap10">#digimap10</a>; I’m not going to tag the post with it, as I’ve got far too many one-time tags that are a pain to manage…</p><p class="MsoNormal">PPS (*) Unfortunately the original Digimap project pages seem to have vanished, and the earliest Wayback Machine gathers appear to be faulty; the first successful gather I can find is</p><p class="MsoNormal"><a href="http://web.archive.org/web/20011021051021/edina.ac.uk/digimap/">http://web.archive.org/web/20011021051021/edina.ac.uk/digimap/</a></p><p class="MsoNormal">...
which seems to refer to the service, not the project.</p> <!--EndFragment-->Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-13490756156602887062010-01-21T14:42:00.003+00:002010-01-21T15:14:45.361+00:00Linked data and staff contact pages<!--StartFragment--> <p class="MsoNormal">You may remember that I am interested in the <a href="http://digitalcuration.blogspot.com/2009/08/dcc-web-site-and-linked-data.html">extent to which we should use Semantic Web</a> (or <a href="http://linkeddata.org/">Linked Data</a>) on the DCC web site. After some discussions, I reached the conclusion that we should do so, but the tools were not ready yet (this isn’t quite an Augustinian “Oh Lord, make me good but not yet”; specifically, we are moving our web site to Drupal 6, the Linked Data stuff will not be native until Drupal 7, and our consultants are not yet up to speed with Linked Data). I have to say that not all our staff are convinced of the benefits of using <a href="http://www.w3.org/RDF/">RDF</a> etc on the web site, and I have had a mental note to write more about this, real soon now.</p> <p class="MsoNormal">I was reminded of this recently. I wanted to phone a colleague who worked at <a href="http://www.ukoln.ac.uk/">UKOLN</a>, one of our partners, and I didn’t have his details in my address book. So I looked on their web site and navigated to his contacts page. Once there I copied his details into the address book, before lifting the phone to give him a ring. After the call (he wasn’t there; the snow had closed the office), I thought about that process. I had to copy all those details! Wouldn’t it be great if I could just import them somehow? How could that be? 
UKOLN have expertise in such matters, so I tweeted <a href="http://www.ukoln.ac.uk/ukoln/staff/p.walk/">Paul Walk</a> (now Deputy Director, previously technical manager) asking whether they had considered making the details accessible as Linked Data using something like <a href="http://xmlns.com/foaf/spec/">FOAF</a>. You can guess I’m not fully up to speed with this stuff, but I’m certainly trying to learn!</p> <p class="MsoNormal">Paul replied that they had considered putting <a href="http://microformats.org/">microformats</a> into the page (I guess this is the <a href="http://microformats.org/wiki/hcard">hCard</a> microformat), and then asked me whether my address book understood RDF, or if I was going to script something? I was pretty sure the answer to the second part was “no” as I suspect such scripting currently is beyond me, and told Paul that I was using MacOSX 10.6 Address Book; it says nothing about RDF, but will import a <a href="http://microformats.org/wiki/rfc-2426">vcard</a>. I was thinking that if there was appropriate stuff (either hCard microformat or <a href="http://www.w3.org/TR/xhtml-rdfa-primer/">RDFa</a> with FOAF) on the page, I might find an app somewhere that would scrape it off and make a vcard I could import.</p> <p class="MsoNormal">Paul’s final tweet was: “@cardcc see the use-case, not sure it's a 'linked data' problem though. What are the links that matter if you're scraping a single contact?”</p> <p class="MsoNormal">Well, I couldn’t think of a 140-character answer to that question, which seemed to raise issues I had not thought about properly. What are the links that matter? Was it linked data, or just coded data that I wanted? Is this really a semantic web question rather than linked data? Or is it a RDF question? Or a vocabulary question? Gulp!</p> <p class="MsoNormal">After some thought, perhaps Paul was as constrained by his 140 characters as I was. Surely a contacts page contains both facts and links within itself. 
See the <a href="http://en.wikipedia.org/wiki/FOAF_(software)">Wikipedia page on FOAF</a> for examples of a FOAF file in turtle for Jimmy Wales; the coverage is pretty much like a contacts page.</p> <p class="MsoNormal">So Paul’s contact page says he works for UKOLN at the University of Bath, and gives the latter’s address (I guess formally speaking he works in UKOLN, an administrative unit, and is employed by the University); that his position in UKOLN is Deputy Director, that his phone, fax and email addresses are x, y and z. All of these are relationships between facts, expressible in the FOAF vocabulary. With RDFa, that information could be explicitly encoded in the HTML of the page and understood by machines, rather than inferred from the co-location of some characters on the page (the human eye is much better at such inferences). So there’s RDF, right there. Is that Linked Data? Is it Semantic Web? I’m not really sure.</p> <p class="MsoNormal">More to the point, would it have been any greater use to me if it had been so encoded? A FOAF-hunting spider could traverse the web and build up a network of people, and I might be able to query that network, and even get the results downloaded in the form of a vcard that I could import into my Mac Address Book. That sounds quite possible, and the tools may already exist. Or, there may exist an app (what we used to call a Small Matter Of Programming, or a SMOP) that I could point at a web page with FOAF RDFa on it. Perhaps that’s what Paul was after in relation to scripting. Maybe the upcoming Dev8D might find this an interesting task to look at?</p> <p class="MsoNormal">What other things could be done with such a page? Well, Paul or others might use it to disambiguate the many Paul Walk alter egos out there. 
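The last step, turning fields scraped from such a page into something Address Book will import, is small once the data is structured. A sketch with invented input fields (the property names FN, ORG, TITLE, TEL and EMAIL are from the vCard spec, RFC 2426):

```python
# Build a minimal vCard 3.0 from already-extracted contact fields.
# Property names (FN, ORG, TITLE, TEL, EMAIL) are from RFC 2426;
# the input dict is an invented intermediate form, not a standard.
def to_vcard(c: dict) -> str:
    lines = ["BEGIN:VCARD", "VERSION:3.0",
             "FN:" + c["name"],
             "ORG:" + c["org"],
             "TITLE:" + c["title"],
             "TEL;TYPE=WORK:" + c["tel"],
             "EMAIL;TYPE=INTERNET:" + c["email"],
             "END:VCARD"]
    return "\r\n".join(lines)  # vCard lines are CRLF-terminated

card = to_vcard({"name": "P. Walk", "org": "UKOLN;University of Bath",
                 "title": "Deputy Director", "tel": "+44 1225 000000",
                 "email": "p.walk@example.ac.uk"})
print(card)
```

The hard part is not this serialisation but the extraction before it, which is exactly why having the facts marked up in the page in the first place would help.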
You’ll see I have a simple link to Paul’s contact page above, but if this blog were RDF-enabled, perhaps we could have a more formal link to the assertions on the page, eg to that Paul Walk’s phone number, that Paul Walk’s email address, etc.</p> <p class="MsoNormal">Well I’m not sure if this makes sense, and it does feel like one of those “first fax machine” situations. However FOAF has been around for a long while now. Does that mean that folk don’t perceive an advantage in such formal encodings to balance their costs, or is this an absence of value because of a lack of exploitable tools? If so, anyone going to Dev8D want to make an app for me?</p> <p class="MsoNormal">(It’s also possible of course that Paul doesn’t want his details to be spidered up in this way, but I guess none of us should put contact details on the web if that’s our position.)<o:p></o:p></p> By the way, I found a web page called <a href="http://www.ldodds.com/foaf/foaf-a-matic">FOAF-a-matic</a> that will create FOAF RDF for you. 
Here's an extract from what it created for me, in RDF:<blockquote><pre style="white-space: pre-wrap; font-family:'Lucida Grande'; font-size:11px;"><foaf:Person rdf:ID="me">
  <foaf:name>Chris Rusbridge</foaf:name>
  <foaf:title>Mr</foaf:title>
  <foaf:givenname>Chris</foaf:givenname>
  <foaf:family_name>Rusbridge</foaf:family_name>
  <foaf:mbox rdf:resource="mailto:c.rusbridge@xxxxx"/>
  <foaf:workplaceHomepage rdf:resource="http://www.dcc.ac.uk/"/>
</foaf:Person></pre></blockquote><div>What could I do with that now?</div>Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-1303975371294158246.post-91258106575291326262010-01-13T15:34:00.004+00:002010-01-13T16:15:24.088+00:00Scholarly HTML would be nice, but...I'm quite interested in the idea of <a href="http://ptsefton.com/2009/08/19/towards-scholarly-html.htm">Scholarly HTML</a>, as espoused in Pete Sefton's blog, and I've commented on some of Peter Murray-Rust's hamburger PDF comments previously (although I do think a lot of people confuse wild PDF with well-made, should one say Scholarly PDF).
I've always been slightly worried by one thing though.<div><br /></div><div>A well-known advantage of PDF is that it pretty much assures I can save a document, share it, move it around etc and it will still be intact and readable. That's one of the reasons it's so popular.</div><div><br /></div><div>Mostly we don't do that with HTML. Mostly we just point to it. But if I see an article these days, I want it on my computer if I'm allowed; this lets me study it at leisure, drop it in my Mendeley system, etc. As pointed out, that works a treat with PDF, and pretty well with Word or OpenOffice documents as well. This applies even where the document is quite heavily compound, with many embedded images, tables etc.</div><div><br /></div><div>But if I try saving a HTML document to my hard disk, nothing very standard happens. OK, if I use Safari on my Mac, I get a .webarchive file, which is quite nice as I can do all the things with it that I could do with a PDF and Word etc, and when I open it later it will be as it was before, with all the images in place. But neither IE nor Firefox seems capable of opening a .webarchive file.</div><div><br /></div><div>If I try saving the same article from Firefox, I get a .html file with the main article in it, and a directory with associated files in it (eg images). Safari does seem capable of opening this combination, but it's pretty ugly, and hard to move around. I haven't tried IE as I don't have easy access to it.</div><div><br /></div><div>Is there in existence or development a standard approach to packaging the HTML and associated files that would be as convenient as the .webarchive, but usable across all browsers?
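One existing candidate is MHTML (RFC 2557), the multipart/related MIME packaging behind IE's .mht files, though whether it is convenient enough is another matter. A sketch, with invented placeholder content, using Python's standard email library:

```python
# Package an HTML page plus one image as MHTML (RFC 2557):
# a multipart/related MIME message, the format behind IE's .mht files.
# The page and image contents here are invented placeholders.
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Archived article"
msg.set_content('<html><body><img src="cid:fig1"></body></html>',
                subtype="html")
# Attach the image as a related part, referenced by its Content-ID.
msg.add_related(b"\x89PNG placeholder bytes", "image", "png",
                cid="<fig1>")

mht = msg.as_string()  # one self-contained text file, e.g. article.mht
print(msg.get_content_type())
```

Writing `mht` to a single file gives something you can move around like a PDF; the catch, as with .webarchive, is that in 2010 only some browsers will open it.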
If so, Scholarly HTML would be that little bit closer!</div>Unknownnoreply@blogger.com7tag:blogger.com,1999:blog-1303975371294158246.post-56986696786634126162010-01-13T15:24:00.003+00:002010-01-13T15:33:39.515+00:00Persistence of domain names<span class="Apple-style-span" style="font-family: Helvetica; font-size: medium; "><span class="Apple-style-span" style="font-family:'times new roman';">I had a chat before Christmas with Henry Thompson, who works both in Edinburgh Informatics and also on the W3C TAG. Insofar as the Internet is important in sustaining long-term access to information in digital form, there is a sustainability problem that we rather seem to have ignored. Everything on the Internet (literally) depends on domain names, and these are only ever rented. There is no mechanism for permanently reserving a domain name. Domain names can be lost by mistake (overlooking a bill, perhaps having moved in the interim and not informed the relevant domain name registrar), but they can also be lost on business failure. Although domain names can be a business asset, I understand that the registrars have some discretion on transfers, and in particular one cannot make a "domain name will" seeking transfer of the domain name to some benevolent organisation. Note that the rental mechanism does have advantages of its own: the recurring fees sustain the important services that underpin the DNS.<br /><br />There are two kinds of problem, one on a massive scale and one more fine-grained. The massive problem is that the entire infrastructure of the Internet depends on URIs, most of which are http URIs that in turn depend on the domain name system. So there are a number of organisations whose domain names are embedded in that infrastructure in a way and to an extent that is very difficult to change. W3C is clearly such an organisation.
Many of these organisations seem rather fragile (not a comment on W3C, by the way, although its sustainability model is opaque to me). Should they fail and the domain names disappear, the relevant URIs will cease to work and various pieces of Internet machinery will fall apart.</span></span><div><span class="Apple-style-span" style="font-family:'times new roman', serif;"><span class="Apple-style-span" style="font-size: medium;"><br /></span></span></div><div><span class="Apple-style-span" style="font-family: Helvetica; font-size: medium; "><span class="Apple-style-span" style="font-family:'times new roman';">(By the way, this does seem to be one case where a persistent ID that is independent of the original domain, such as a DOI, has advantages over a HTTP URI plus a redirect table. If the domain name no longer exists, you can't get to a redirect, whereas someone can still relink the DOI to a new location.)<br /><br />On the more fine-grained scale, many documents (particularly in HTML) are not easily separable from their location, depending on other local files and documents. In addition of course, documents in some sense exist through their citations or bookmarks, that begin to exist separately from the document. Moving a document to a new domain can make it "fail" or disappear. So sustainability is linked to the domain as well as the other preservation factors.<br /><br />This seems to me to be not at all a technical problem, but it seems to have legal/regulatory, governance, social, business and economic aspects.<br /><br />Among the solutions might be creating a new top level domain designed for persistence, with different rules of succession, etc. Another (either instead of or in conjunction with the first) might be creating an organisation designed for persistence, to hold endowed domain names. 
Somehow the ongoing revenue stream for those underpinning services must be retained indefinitely into the future.<br /><br />We don't think we have the answers, but we do think there is a problem here; I'm not yet sure if we have articulated it accurately at all. I would appreciate any comments. Thanks,</span><br /></span></div>Unknownnoreply@blogger.com8tag:blogger.com,1999:blog-1303975371294158246.post-88736006223363089182010-01-13T13:05:00.001+00:002010-02-02T19:36:25.625+00:00Director of Digital Curation Centre: still time to apply<!--StartFragment--> <p class="MsoNormal"><span class="Apple-style-span" style="font-size:medium;">I’m particularly keen that there be a good slate of candidates for this post, for which applications close on Friday 15 January, 2010. The details can be found at </span><a href="http://www.jobs.ed.ac.uk/vacancies/index.cfm?fuseaction=vacancies.detail&vacancy_ref=3012085"><span class="Apple-style-span" style="font-size:medium;">http://www.jobs.ed.ac.uk/vacancies/index.cfm?fuseaction=vacancies.detail&vacancy_ref=3012085</span></a><span class="Apple-style-span" style="font-size:medium;"> (sorry about the dodgy URL; I hope it works)</span></p> <p class="MsoNormal"><span class="Apple-style-span" style="font-size:medium;">The further details say</span></p> <p class="MsoNormal"></p><blockquote><p class="MsoNormal"><span class="Apple-style-span" style="font-size:medium;">"The mission of the Digital Curation Centre is to help build capacity, capability and skills for data curation across the UK higher education research community, while supporting and promoting emerging data curation practice. It also has a key role in supporting JISC, especially its new research data management programme. 
Overall, the DCC is an agent for change, committed to the diffusion of best practice in the curation of digital research data across the Higher Education sector, and providing an authoritative source of advocacy, resources and guidance to the UK research community. This mission is informed by five priorities:</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size:medium;"><o:p></o:p></span></p> <p class="MsoNormal" style="margin-left:36.0pt;text-indent:-18.0pt;mso-list:l0 level1 lfo1"><span class="Apple-style-span" style="font-size:medium;">•</span><span style="font:7.0pt "Times New Roman""><span class="Apple-style-span" style="font-size:medium;"> </span></span><span class="Apple-style-span" style="font-size:medium;">to identify, gather, record and disseminate curation best practice, providing access to resources, tools, training and information that will equip data practitioners to make informed decisions regarding the management of their data assets;</span><span class="Apple-style-span" style="font-size:medium;"><o:p></o:p></span></p> <p class="MsoNormal" style="margin-left:36.0pt;text-indent:-18.0pt;mso-list:l0 level1 lfo1"><span class="Apple-style-span" style="font-size:medium;">•</span><span style="font:7.0pt "Times New Roman""><span class="Apple-style-span" style="font-size:medium;"> </span></span><span class="Apple-style-span" style="font-size:medium;">to facilitate knowledge exchange between those currently and newly engaged in the generation and management of digital research data;</span><span class="Apple-style-span" style="font-size:medium;"><o:p></o:p></span></p> <p class="MsoNormal" style="margin-left:36.0pt;text-indent:-18.0pt;mso-list:l0 level1 lfo1"><span class="Apple-style-span" style="font-size:medium;">•</span><span style="font:7.0pt "Times New Roman""><span class="Apple-style-span" style="font-size:medium;"> </span></span><span class="Apple-style-span" style="font-size:medium;">to build and support a community of informed 
practitioners that has the capacity to sustain itself, with the capability to manage and curate its data appropriately;</span><span class="Apple-style-span" style="font-size:medium;"><o:p></o:p></span></p> <p class="MsoNormal" style="margin-left:36.0pt;text-indent:-18.0pt;mso-list:l0 level1 lfo1"><span class="Apple-style-span" style="font-size:medium;">•</span><span style="font:7.0pt "Times New Roman""><span class="Apple-style-span" style="font-size:medium;"> </span></span><span class="Apple-style-span" style="font-size:medium;">to identify crucial and important innovations in data curation, and seek additional resources to provide them;</span><span class="Apple-style-span" style="font-size:medium;"><o:p></o:p></span></p> <p class="MsoNormal" style="margin-left:36.0pt;text-indent:-18.0pt;mso-list:l0 level1 lfo1"><span class="Apple-style-span" style="font-size:medium;">•</span><span style="font:7.0pt "Times New Roman""><span class="Apple-style-span" style="font-size:medium;"> </span></span><span class="Apple-style-span" style="font-size:medium;">to support JISC, especially in its repository, preservation and data management programmes.</span><span class="Apple-style-span" style="font-size:medium;"><o:p></o:p></span></p> <p class="MsoNormal"><span class="Apple-style-span" style="font-size:medium;">To achieve this, the Director must be a persuasive advocate for better curation and management of research data, on a national or international scale. 
Able to listen to and engage with researchers and with research management, publishers and research funders, the Director will build a strong, shared vision of the changes needed, and the ability (working with others in the DCC and beyond) to mobilise the community towards that end."</span><span class="Apple-style-span" style="font-size:medium;"><o:p></o:p></span></p> <p class="MsoNormal"></p></blockquote><p class="MsoNormal"><span class="Apple-style-span" style="font-size:medium;">So if this fits you, or you know someone it fits, please persuade that person to apply!</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size:medium;"><o:p></o:p></span></p> <p class="MsoNormal"><span class="Apple-style-span" style="font-size:medium;">By the way, although the DCC may not escape further budget cuts, like all public services in the UK, I have been told that we are funded from core JISC funding rather than capital funding, and as such Phase 3 will not be curtailed as the proposed JISC Managing Research Data programme has been.</span></p><p class="MsoNormal">[NOTE: This vacancy has now CLOSED; no further applications will be accepted!]</p><p class="MsoNormal"><o:p></o:p></p> <!--EndFragment-->Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-56690242726542638422010-01-05T17:12:00.003+00:002010-01-05T17:23:06.885+00:00Digital Curation Centre User Survey 2009: Highlights<!--StartFragment--> <p class="MsoNormal"><span class="Apple-style-span" style="font-size:6;"><span class="Apple-style-span" style="font-size: 19px;"></span></span></p><span class="Apple-style-span" style="font-size:6;"><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">My colleague Angus Whyte has provided the following brief summary of two surveys carried out in Phases 1 and 2 of the Digital Curation Centre, in 2006 and 2009 respectively, as part of our evaluations. 
In retrospect, we might have done better to revise the questions rather more than we did for the second survey; nevertheless I thought it worthwhile to share this with you.</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">Angus writes:</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">In 2009 DCC users were surveyed, repeating a similar survey carried out in 2006. In the highlights below we draw conclusions both from the more recent results and from changes over the three-year period. Both surveys were publicised on the DCC website and via several mailing lists, principally the DCC-Associates and (in 2009) the JISC sponsored Research-Dataman list.</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">Our conclusions take into account that the online questionnaire was self-completed by a self-selected group of respondents (75 in 2009 and 125 in 2006). DCC Associates (640 approx.) provided the bulk of the responses[1]. The results indicated broad patterns, relatively wide differences and consistent responses over the two surveys, even though these are not taken to be statistically representative.</span></p><p class="MsoNormal"><b><span class="Apple-style-span" style="font-size: medium;">Highlights</span></b></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">In both surveys around 90% of respondents are familiar with the term ‘digital curation’ and regard it as a critical issue within their project or unit. 
The DCC is consistently given as the main source of information on curation issues by around 70% of respondents, with “on the job challenges/ research” second at around 60%.</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">Between the two surveys there is a large jump (from 13% to 32%) in the number of respondents indicating that DCC has been “very effective” in raising awareness about digital curation, and those believing it to be “slightly effective” has correspondingly fallen from 53% to 31%.</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">Of a list of DCC resources, five are identified as “most helpful” by at least 1 in 5 of the 2009 survey respondents, these being (in descending order) the DCC website, Briefing Papers (of various sorts), the DCC Curation Lifecycle Model, Case Studies, and the Digital Curation Manual.</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">Respondents universally associate digital curation with “ensuring the long-term accessibility and re-usability of digital information”, and large majorities (around 90%) also relate it to “performing archiving activities on digital information such as selection, appraisal and retention” and “ensuring the authenticity, integrity and provenance of digital information are maintained over time”. 
Rather lower but still significant numbers (around 60%) associate digital curation with “managing digital information from its point of creation” and “managing risks to digital information” – although many more highlight the latter in 2009 (up to 84% from 61%).</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">Curation or preservation addresses risks to the respondents’ organisations with “loss of organisational memory” consistently topping their list (identified by around 75% of respondents) and “business risks” second, identified by just under half, again across both surveys.</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">More than two thirds indicate that their main reasons for curating and preserving digital information are its educational/research or historical value; in both years a minority cites other reasons. Similarly, the main obstacles are indicated as financial or staff resources, with around half also indicating lack of awareness or appropriate policies.</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">For around 40% of respondents, management and preservation of digital information has an indefinite timescale. For a further 15% or so it is “beyond the life of the project/organisation”, and similar numbers indicate these are tasks “for the life of the project/organisation”.</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">The 2009 survey respondents are no strangers to the ‘data deluge’, most dealing with at least 100Gb and some (7%) more than 100Tb. Overall 79% expect this to increase in the next two years, surprisingly 3% do not, while 7% do not know. Most need to manage a mixture of open and proprietary formats, and report a wide variety of formats in use, predominantly common office applications, PDF documents and multimedia formats. 
Curation and preservation challenges are most frequently identified with obsolete proprietary formats. Image, video, and geospatial data are also often identified as challenges, as are web sites combining these. </span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">Respondents were also asked in 2009 about re-use, and around a third indicate that research data is re-used internally, with similar numbers offering data generated by their project/unit for re-use by others, or re-using external data.</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">Access issues facing research projects/units are identified in both surveys, along similar lines; intellectual property rights (e.g. copyright) is the most frequently cited issue, followed by “privacy or ethical issues”; however, “embargo on research findings” is least prevalent, identified by only a fifth of respondents.</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">Asked about funding for curation and preservation, responses show no clear picture. Around half of 2009 respondents indicate funding is “accounted for in project or institutional budget”. A large minority have no explicit funding for curation and preservation, and where resources are available these are pooled from other funded areas (e.g. IT budget for project or organisation) or research grants. Spending on curation/preservation is less than £50,000 (for around half of those respondents who were aware of this). 
Around half are unsure whether spending will increase or decrease, with the remainder being evenly split.</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">Detailed questions and response data are available on request.</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">Angus Whyte, Digital Curation Centre</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size: medium;">[1] The DCC Associates membership list includes UK data organisations, leading data curators, overseas and supranational standards agencies, and industrial/business communities. Currently research data creators are under-represented (information from registration details). </span></p></span><p></p><p class="MsoNormal"><span class="Apple-style-span" style="font-size:6;"><span class="Apple-style-span" style="font-size: 19px; "><br /></span></span></p><div style="mso-element:footnote-list"><div style="mso-element:footnote" id="ftn1"> </div> </div> <!--EndFragment-->Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-36382852281916176642009-12-17T23:07:00.004+00:002009-12-17T23:23:44.382+00:00More activity on semantic publishing<!--StartFragment--> <p class="MsoNormal">If you saw tweets from <a href="http://twitter.com/cardcc">@cardcc</a> today, you might realise I’ve been very interested in a couple of recent developments in semantic publishing. I wrote <a href="http://digitalcuration.blogspot.com/2009/11/data-and-journal-article.html">earlier</a> about linking data to journal articles, including David Shotton’s adventures in semantic publishing. David’s work was one of those included in the review article in the Biochemical Journal by Attwood, Kell, McDermott et al. (2009). The article ranged over the place of ontologies and databases, science blogs, and various experiments. 
These included</p> <p class="MsoNormal"></p><ul><li>RSC and Project Prospect,</li><li>The ChemSpider Journal of Chemistry,</li><li>The FEBS Letters experiment,</li><li>PubMed Central and BioLit,</li><li>Shotton’s PLoS experiment,</li><li>Elsevier Grand Challenge,</li><li>Liquid Publications,</li><li>The semantic Biochemical Journal experiment.</li></ul><p></p> <p class="MsoNormal">The last of these was the real focus of the article; it is available as a PDF, but if read through a special reader called Utopia Documents it displays some active capabilities. These included the ability to visualise and rotate 3-d images of proteins, to see tables represented as graphs (or vice versa) and to link to entries in nucleic acid databases. The capabilities were perhaps a bit awkward to spot and to manipulate, but still interesting. This article is (gold) open access. Other articles in the issue have also been instrumented in this way.</p> <p class="MsoNormal"> It’s clearly early days for Utopia, and I wasn’t wholly impressed with it as a PDF reader, but I was certainly very excited at some of what I read and saw.</p> <p class="MsoNormal">I also read today a very different article (I think not available on open access), by Ruthensteiner and Hess (2008). They describe the process of making 3-d models of biological specimens and presenting them in PDF, readable by a standard Acrobat Reader. The 3-d capability was at least as good as, if not better than, the Utopia results.</p> <p class="MsoNormal">Because it’s getting late, I’ll end with my last tweet: </p><p class="MsoNormal">“My head is <b>spinning</b><span style="font-weight:normal"> with semantic article possibilities. 
I hope some get picked up in new <a href="http://search.twitter.com/search?q=%23jiscmrd">#jiscmrd</a> proposals, see <a href="http://www.jisc.ac.uk/fundingopportunities/funding_calls/2009/12/1409researchdata.aspx">http://www.jisc.ac.uk/fundingopportunities/funding_calls/2009/12/1409researchdata.aspx</a>"<o:p></o:p></span></p> <p class="MsoNormal"><br /></p><p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language:EN-US">Attwood, T. K., Kell, D. B., McDermott, P., Marsh, J., Pettifer, S. R., Thorne, D., et al. (2009). </span><span lang="EN-US"><a href="http://www.biochemj.org/bj/424/0317/bj4240317.htm">Calling International Rescue: knowledge lost in literature and data landslide!</a></span><span lang="EN-US" style="mso-ansi-language:EN-US"> <i>The Biochemical Journal</i></span><span lang="EN-US" style="mso-ansi-language:EN-US">, <i>424</i></span><span lang="EN-US" style="mso-ansi-language:EN-US">(3), 317-33. doi: 10.1042/BJ20091474.<o:p></o:p></span></p> <p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language:EN-US"><span lang="EN-US" style="mso-ansi-language:EN-US">Ruthensteiner, B., & Hess, M. (2008). Embedding 3D models of biological specimens in PDF publications. <i>Microscopy Research and Technique</i></span><span lang="EN-US" style="mso-ansi-language:EN-US">, <i>71</i></span><span lang="EN-US" style="mso-ansi-language:EN-US">(11), 778-86. doi: 10.1002/jemt.20618. </span><span lang="EN-US"><a href="http://www.ncbi.nlm.nih.gov/pubmed/18785246">PubMed abstract</a></span><span lang="EN-US" style="mso-ansi-language:EN-US">.</span></span></p> <p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language:EN-US"> <o:p></o:p></span></p> <p class="MsoNormal"> <o:p></o:p></p> <!--EndFragment-->Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-77121465772700644152009-12-15T23:11:00.003+00:002009-12-15T23:20:29.977+00:00Linked Statistics & other dataSomeone pointed me to the blog Jeni's Musings, written by Jeni Tennison. 
I don't know who Jeni is, but there's some really interesting stuff here, with some obvious links to UK Government Data activity. Among other things, there's a post about <a href="http://www.jenitennison.com/blog/node/132">expressing statistics with RDF</a> (it looks pretty horribly verbose, but it's the first attempt I've seen to address some real data that could be relevant to science research), a thoughtful post about the <a href="http://www.jenitennison.com/blog/node/134">provenance of linked data</a>, and a series of 5 posts (from <a href="http://www.jenitennison.com/blog/node/135">here</a> to <a href="http://www.jenitennison.com/blog/node/139">here</a>) on creating linked data.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-33491997042980723122009-12-09T17:39:00.003+00:002009-12-09T17:49:36.972+00:00Director, The Digital Curation Centre (DCC)<div>So, time to come fully out into the open, after various coy hints over the past week or so. I'm planning to retire around the start of DCC Phase 3. Adverts are starting to appear with the following text:</div><div></div><blockquote><div>"We wish to appoint a new Director to take the DCC forward into an exciting third phase, from March 2010. You must be a persuasive advocate for better management of research data on a national and international scale. Able to listen to and engage with researchers and with research management, publishers and research funders, you will build a strong, shared vision of the changes needed, working with and through the community. You should have a sound knowledge of all aspects of digital curation and preservation, an understanding of higher education structures and processes, and appropriate management skills to be the guiding force in the DCC’s progress as an effective and enduring organisation with an international reputation. </div><div><br /></div><div>"This post is fixed term for three years. 
[I would suggest: in the first instance...]</div><div>Closing date: Friday 15th January 2010."</div></blockquote><div></div><div>This is a great job, and I think an important one. I have been bending the ears of many people in the last couple of weeks to ask them to think of appropriate people to point this advert at (yes, I know that's rotten English; it's been a long week).</div><div><br /></div><div>Further details will be on the University of Edinburgh's jobs web site, <a href="http://www.jobs.ed.ac.uk/">http://www.jobs.ed.ac.uk/</a> (they weren't there when I checked a few minutes ago, maybe tomorrow).</div><div><br /></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-77644849225600661112009-12-08T23:36:00.002+00:002009-12-08T23:39:11.891+00:00Leadership opportunitiesThose interested in leadership in Digital and Data Curation should keep an eye on the relevant UK press and lists over the next week or so for anything of interest...Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-52689060914680163442009-12-08T23:23:00.006+00:002009-12-09T17:14:02.899+00:00Last volume 4 issue of IJDC just published<blockquote></blockquote><blockquote></blockquote>On Monday this week, we published volume 4, issue 3 of <a href="http://www.ijdc.net/">IJDC</a>. From one respect, this was a miracle of speed publishing, as 7 of the peer-reviewed articles had just been delivered the previous week as part of the International Digital Curation Conference. But we also included an independent article, plus 1 peer-reviewed paper and 3 articles with a rather longer gestation, originating in papers at iPres 2008! There are good and bad reasons for that too lengthy delay.<div><br /></div><div>I wrote in the editorial that I would reproduce part of it for this blog, to attract comment, so here that part is.</div><div><div></div><div><blockquote>"But first, some comments on changes, now and in the near future, that are needed. 
One major change is that Richard Waller, our indefatigable Managing Editor, has decided to concentrate his energies on Ariadne. Richard has done a grand job for us over the past few years, in his supportive relationships with authors, his detailed and careful editing, and in commissioning general articles. To quote one author: “I note that the standard of Richard’s reviewing is much better than [a leading publisher's]; they let an article of mine through with very bad mistakes in the references without flagging them for review, and were not so careful about flagging where they had changed my text, not always for the better”. The success of IJDC is in no small way a result of Richard’s sterling efforts over the years. I am very grateful to him, and wish him well for the future: Ariadne authors are very lucky!</blockquote></div><div></div><blockquote><div>"Looking to the future of IJDC, we will have Shirley Keane as Production Editor, working with Bridget Robinson who provides a vital link to the International Digital Curation Conference, and several other members of the DCC community. We are seeking to work more closely with the Editorial Board in the commissioning role and to draw on the significant expertise of this group.</div><div><br /></div><div>"In parallel, we have been reviewing how IJDC works, and are proposing some changes to enhance our business processes and I shall be writing to the Editorial Board shortly. For example, we expect to include articles in HTML- as well as PDF format, to introduce changes to reduce the publishing lead times, and a possible new section with particular practitioner orientation. As part of reduced publishing lead times, we are considering releasing articles once they have been edited after review, leading to a staggered issue which is “closed” once complete. 
I’m planning to repeat this part of the editorial in the Digital Curation Blog [here], perhaps with other suggestions, and comments [here] would be very welcome."</div></blockquote><div></div><div>Oh, we then did a little unashamed puffery...</div><div></div><blockquote><div>"We are, of course, very interested in who is reading IJDC, and the level of impact it is having on the community. In order to find out, Alex Ball from UKOLN/DCC has been trying several different approaches in order to get as full a picture as possible.</div><div><div>One approach we have used is to examine the server log for the IJDC website. The statistics for the period December 2008 to June 2009 show that around 100 people visit the site each day, resulting in about 3,000 papers and articles being downloaded each month. It was pleasing to discover we have a truly global readership; while it is true that a third of our readers are in the US and the UK, our content is being seen in around 140 countries worldwide, from Finland to Australia and from Argentina to Zimbabwe. As one would expect, we principally attract readers from universities and colleges, but we also receive visits from government departments, the armed forces and people browsing at home.</div><div><br /></div><div>"The Journal is also having a noticeable impact on academic work. We have used Google Scholar to collect instances of journal papers, conference papers and reports citing the IJDC. In 2008, there were 44 citations to the 33 papers and articles published in the Journal in 2006 and 2007, excluding self-citations, giving an average of 1.33 citations per paper. Overall, three papers have citation counts in double figures. 
One of our papers (“Graduate Curriculum for Biological Information Specialists: A Key to Integration of Scale in Biology” by Palmer, Heidorn, Wright and Cragin, from Volume 2, Issue 2) has even been cited by a paper in Nature, which gives us hope that digital curation matters are coming to the attention of the academic mainstream."</div></div></blockquote><div><div></div></div><div><div></div><div>OK, so we're not Nature! Nevertheless, we believe there is a valuable role for IJDC, and we'd like your help in making it better. Suggestions please...</div><div><br /></div><div>(I made this plea at our conference, and someone approached me immediately to say our RSS feed was broken. It seems to work, at least from the title page. So if it still seems broken, please get in touch and explain how. Thanks)</div><div><br /></div></div></div>Unknownnoreply@blogger.com5tag:blogger.com,1999:blog-1303975371294158246.post-79105611115268719542009-12-08T21:39:00.003+00:002009-12-09T06:04:39.492+00:00IDCC 09 Delegate Interview: Melissa Cragin and Allen RenearMelissa Cragin and Allen Renear [corrected; apologies] from the University of Illinois, who will be chairing IDCC next year in Chicago, give their reactions to IDCC 09 and a hint of their preparations for next year's event in this final, double-act, video interview....<br /><br /><center><object width="400" height="300"><param name="allowfullscreen" value="true"><param name="allowscriptaccess" value="always"><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=8055927&server=vimeo.com&show_title=1&show_byline=1&show_portrait=0&color=&fullscreen=1"><embed src="http://vimeo.com/moogaloop.swf?clip_id=8055927&server=vimeo.com&show_title=1&show_byline=1&show_portrait=0&color=&fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object></center><br /><br />[<a href="http://vimeo.com/8055927">click here</a> to view this video at <a 
href="http://vimeo.com/">Vimeo</a>]Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-25356179067963969772009-12-08T21:35:00.002+00:002009-12-08T21:38:30.634+00:00IDCC 09 Delegate Interview: Neil GrindleyJISC Programme Manager and session chair Neil Grindley gives us his response to IDCC 09 in this quick video interview as the event draws to a close...<br /><br /><center><object width="400" height="300"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=8055712&server=vimeo.com&show_title=1&show_byline=1&show_portrait=0&color=&fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=8055712&server=vimeo.com&show_title=1&show_byline=1&show_portrait=0&color=&fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object></center><br /><br />[<a href="http://vimeo.com/8055712">click here</a> to view this video at <a href="http://vimeo.com">Vimeo</a>]Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-14198010423926338472009-12-08T21:30:00.002+00:002009-12-08T21:33:52.327+00:00IDCC 09 Keynote: Prof. 
Ed Seidel<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTbYsnXycVFkCgpLqsLum5bZd7ZpHX2zd09GcxzmVFihb20TYV9wdFaeIQBeznfO1hOvHD5w9PZ5CULf0PF5RL-f33ybfGoz26SlRI9WZr13k24RiuP7Hp0liD9vdedvvUrriAKF42wlFt/s1600-h/DSC_0213.JPG"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 200px; height: 133px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTbYsnXycVFkCgpLqsLum5bZd7ZpHX2zd09GcxzmVFihb20TYV9wdFaeIQBeznfO1hOvHD5w9PZ5CULf0PF5RL-f33ybfGoz26SlRI9WZr13k24RiuP7Hp0liD9vdedvvUrriAKF42wlFt/s200/DSC_0213.JPG" border="0" alt="" id="BLOGGER_PHOTO_ID_5412981595681778802" /></a>In a content-packed keynote talk, Ed Seidel wanted to give us a preview of the types of project that are driving the National Science Foundation's need to think about data, the cyber infrastructure questions, the policy questions and the cultural issues surrounding data that are deeply rooted in the scientific community.<br /><br />To illustrate this initially, Seidel gave the example of some visualisation work on colliding black holes that he had conducted whilst working in Germany with data collected in Illinois, explaining that in order to achieve this he had to do a lot of work on remote visualisation and high performance networking – but that moving the data by network to create the visualisations was not practical, so the team had to fly to Illinois to do the visualisations, then bring the data back. He also cited projects that are already expecting to generate an exabyte of data – vastly more than is currently being produced – so the problem of moving data is only going to get bigger.<br /><br />Seidel looked first to the cultural issues that influence scientific methods when it comes to the growing data problem. 
He described the 400-year-old scientific model of collecting data in small groups or as individuals, writing things down in notebooks and using small amounts of data that could be measured in kilobytes in modern terms, with calculations carried out by hand. This had not changed from Galileo and Newton through to Stephen Hawking in the 1970s. However, within 20-30 years, the way of doing science changed – with teams of people working on projects using high performance computers to create visualisations of much larger amounts of data. This is a big culture shift, and Seidel pointed out that many senior scientists are still trained in the old method. You now need larger collaborative teams to solve problems and manage the data volumes to do true data-driven science. He used the example of the Large Hadron Collider, where scientists are looking at generating tens of petabytes of data, which need to be distributed globally to be analysed – with around 15,000 scientists working on around six experiments.<br /><br />Seidel then went on to discuss how he sees this trend of data sharing developing, using the example of the recent challenge of predicting the route of a hurricane. This involved the sharing of data between several communities to achieve all the necessary modelling to respond to the problem in a short space of time. Seidel calls the groups solving these complex problems “grand challenge communities”. The scientists involved will have three or four days to share data and create models and simulations to solve these problems, but will not know each other! The old modality of sharing data with people that you know will not work, and so these communities will have to find ways to come together dynamically to share data if they are going to solve these sorts of problems. 
Seidel predicted that these issues are going to drive both technical development and policy change.<br /><br />To illustrate the types of changes already in the pipeline, Seidel cited colleagues who are playing with the use of social networking technologies to help scientists to collaborate – particularly Twitter and Facebook. Specifically, they have set up a system whereby their simulation code tweets its status, and have also been uploading the visualisation images directly into Facebook in order to share them.<br /><br />Seidel noted that high dimensional, collaborative environments and tremendous amounts of bandwidth are needed, so technical work will be required. The optical networks often don't exist – with universities viewing such systems as plumbing, and funding bodies not looking to support the upgrade of such infrastructure. Seidel argued that we need to find ways to catalyse this sort of investment.<br /><br />To summarise, Seidel highlighted two big challenges in science trends at the moment: multi-skilled collaborations and the dominance of data, which are tightly linked. He explained that he had calculated that compute, data and networks have grown 9-12 orders of magnitude in 20-30 years after 400 years unchanged, which shows the scale of the change and the change in culture that it represents.<br /><br />NSF has a vision document which highlights four main areas – virtual organisations for distributed communities, high performance computing, data visualisation, and learning and work practices. Focusing on the “Data and Visualisation” section, Seidel quoted their dream for data to be routinely deposited in a well-documented form, regularly and easily consulted and analysed by people, and openly accessible, protected and preserved. He admitted this is a dream that is nowhere near being realised yet. He recognised that there need to be incentives for the changes and new tools to deal with the data deluge. 
They are looking to develop a national data framework, but emphasised that the scientific community really needs to take the issues to heart.<br /><br />Taking the role of the scientist, Seidel took us through some of the questions and concerns which a research scientist may raise in the face of this cultural shift. They included concerns about replication of results – which Seidel noted could be a particular problem when services come together in an ad hoc way, but needs to be addressed if the data produced is to be credible.<br /><br />Seidel moved on to discuss the types of data that need to be considered, in which he included software. He stressed that software needs to be considered as a type of data and therefore needs to be given the same kind of care in terms of archiving and maintenance as traditional scientific collection or observation data. He also included publications as data, as many of these are now in electronic form.<br /><br />In discussing the hazards faced, Seidel noted that we are now producing more data each year than we have done in the entirety of human history up to this point – which demonstrates a definite phase change in the amount of data being produced. <br /><br />The next issue of concern Seidel highlighted was that of provenance – particularly how we collect the metadata related to the data that we are trying to move around. He admitted that we simply don't know how to do this annotation at the moment, but this is being worked on.<br /><br />Having identified these driving factors, Seidel explained the observations and workgroup structures that NSF has in place to think more deeply and investigate solutions to these problems, which include the DataNet project. $100 million is being invested in five different projects as part of this programme. Seidel hopes that this investment will help catalyse the development of a data-intensive science culture. 
He made some very “apple-pie” idealistic statements about how the NSF sees data, and then used these to explain why the issues are so hard, emphasising the need to engage the library community, who have been curating data for centuries, and the need to consider how to enforce data being made available post-award.<br /><br />Discussions at the NSF suggest that each individual project should have a data management policy which is then peer-reviewed. There is no consistency at present, but this is the goal.<br /><br />In conclusion, Seidel emphasised that many more difficult cases are coming. However, the benefits of making data available and searchable – potentially with the help of searchable journals and electronic access to data – are great for the progress of science, and the requirement to make many more things available than before is percolating down from the US Government to the funding bodies. Open access to information online is a desirable priority, and clarification of policy will be coming soon.Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-1303975371294158246.post-35941037789294341392009-12-08T15:51:00.002+00:002009-12-08T16:05:01.029+00:00IDCC 09 Keynote: Timo Hannay<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3sxjOkELznxt_tjwFzysw6vKT8dcLZJwkghxdtwQH4Y0DsO4AS4GnT-6oWw9-vP91IXKQnVNfsKbo5fR8b5i3EDTKMR4faqHymjEUSA3gmsCh8xG4B-NGTQuvUp901aal-Fap2mM6TUrS/s1600-h/IMG_1278.JPG"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 200px; height: 150px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3sxjOkELznxt_tjwFzysw6vKT8dcLZJwkghxdtwQH4Y0DsO4AS4GnT-6oWw9-vP91IXKQnVNfsKbo5fR8b5i3EDTKMR4faqHymjEUSA3gmsCh8xG4B-NGTQuvUp901aal-Fap2mM6TUrS/s200/IMG_1278.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5412896908953119874" /></a>Timo Hannay presented a talk entitled “From Web 2.0 to the Global
Database”, providing a publishing perspective on the need for cultural change in scientific communication.<br /><br />Hannay took a step back to take a bigger-picture view. He began with an overview of his work at Nature, noting that the majority of their business is conducted through the web – although not everyone reads the work electronically, they do access the content through the web. He then explained how journals are becoming more structured, with links to supplementary information. He admitted that this information is not yet structured enough, but it is there – making journals more like databases.<br /><br />Hannay moved on to explain that Nature is getting involved in database publishing. They help to curate and peer-review database content and commission additional articles to give context to the data. This is a very different way of being a science publisher – so the change is not just for those doing the science!<br /><br />After taking us through Jim Gray's four scientific paradigms, Hannay asked us to think back to a talk by Clay Shirky in 2001, which led to the idea that the defining characteristic of the computer age is not the devices, but the connections. If a device is not connected to the network, it hardly seems like a computer at all. This led Tim O'Reilly to develop the idea of the Internet Operating System, which morphed into the name “Web 2.0”. O'Reilly looked at the companies that survived and thrived after the dot com bubble and created a list of features which defined Web 2.0 companies, including the Long Tail, software as a service, peer-to-peer technologies, trust systems and emergent data, tagging and folksonomies, and “Data as the new 'Intel Inside'” –
the idea that you can derive business benefit from the data powering services behind the scenes.<br /><br />Whilst we have seen Web 2.0 affect science, science blogging hasn't really taken off as much as it could have done – particularly in the natural sciences – and is still not a mainstream activity. However, Hannay did note some of the long-term changes we are seeing as a result of the web and the tools it brings: increasing specialisation, more information sharing, a smaller 'minimum publishable unit', better attribution, the merging of journals and databases – with journals providing more structure to databases – and new roles for librarians, publishers and others. Hannay asserted that these changes are leading, gradually, to a speeding up of discovery.<br /><br />Hannay took us through some of the resources that are available on the web, from Wikipedia to PubChem and ChemSpider, where the data is structured and annotated through crowdsourcing to make the databases searchable and useable.<br /><br />He asserted that we are moving away from the cottage-industry model of science, with one person doing all the work from designing the experiment to writing the paper. We are now seeing whole teams with specialisms collaborating across time and space in a more industrial-scale science. Different areas of science are at different stages of this transition.<br /><br />Hannay referred to Chris Anderson's claim in Wired magazine that we no longer need theory. He rejected this, but did agree that more is different, so we will be seeing changes. He gave the example of Google, which didn't emerge earlier in the history of the web simply because search of that kind was not useful until the web reached a certain scale.<br /><br />Hannay believes that publishers have a role to play in helping to capture, structure and preserve data.
Journals exist to make information more readable for human beings, but they need to think about how they present information so that both humans and computers can search and access it, as both are now equally important.<br /><br />All human knowledge is interconnected, and the associations between facts are just as important as the facts themselves. As we reach the point when a computer not connected to the network is not really a computer, Hannay hopes we will reach a point where a fact not connected to other facts in a meaningful way will hardly be considered a fact. One link, one tag, one ID at a time, we are building a global database. This may be vast and messy and confusing, but it will be hugely valuable – like the internet itself as the global operating system.Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-1303975371294158246.post-20544051497888939442009-12-08T15:10:00.003+00:002009-12-08T15:37:31.630+00:00IDCC 09: Richard Cable Discusses BBC Lab UK and Citizen Science<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlomt9Z0r7iFTqqDoI3eRvppu8Up5hZozEcxmgeapgEHlh4aEVxRCvYzAZHLaTI-85zk48rEYMxZrr8xsZLwV5YV-9xjqJhu8kPgQ5xK-guw9Q8HRNMX6_X_RmO1d4iFgwlKRzz7k5Sl7B/s1600-h/Richard+Cable.jpg"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 251px; height: 320px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlomt9Z0r7iFTqqDoI3eRvppu8Up5hZozEcxmgeapgEHlh4aEVxRCvYzAZHLaTI-85zk48rEYMxZrr8xsZLwV5YV-9xjqJhu8kPgQ5xK-guw9Q8HRNMX6_X_RmO1d4iFgwlKRzz7k5Sl7B/s320/Richard+Cable.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5412889847458790802" /></a>Richard Cable, the Editor of BBC Lab UK, opened his presentation on the BBC Lab UK initiative – designed to involve the public in science – with the traditional representations of science on television (“Science is Fun” vs “blow things up”).<br /><br />Cable used
this comparison to illustrate that most “citizen science” seems to involve the mass engagement of people with science, whereas Lab UK is aimed at mass participation in science. The project is about new learning: creating scientifically useful surveys and experiments with the BBC audience, online.<br /><br />He discussed the motives of the audience and how these form a fundamental part of both the design of the experiments and the types of experiment that can usefully be conducted. For the audience, the experiment has to be a bit of a “voyage of self-discovery” in which they learn something about themselves as well as contributing data to the wider experiment – a more altruistic motive. Cable emphasised that they work with real scientists, properly designed methodologies, ethics approval and peer-review systems, so that the experiments are built on solid science and therefore make a useful contribution to scientific knowledge, rather than being mere entertainment for the audience.<br /><br />To illustrate, Cable took us through the history of BBC online mass participation experiments which led to the development of the new Lab UK brand. This included their <em>Disgust</em> experiment, which showed users images and asked them to judge whether they would touch the item pictured. This was driven by a television programme, which directed the audience to the website after the show. He also discussed <em>Sex ID</em>, which worked the opposite way round – with the results of the experiment feeding the content of the programme. 250,000 participants got involved over three months, taking a series of short Flash tests which identified the 'sex' of their brains.
This exemplified his point about giving the audience a motive – they learn something about themselves as a result of participation.<br /><br />Continuing this back-story, Cable briefly introduced <em>Stress</em>, the prototype launch for the Lab UK brand itself, which was linked into the BBC's <em>Headroom</em> initiative. He noted that the general public would rather have something that gives lifestyle feedback than something purely sciency. This experiment – a series of Flash tasks and uncomfortable questions – has since been taken down.<br /><br />The more recent <em>Brain Test Britain</em> was a higher-profile experiment launched by the programme <em>Bang Goes The Theory</em>. It was their first longitudinal experiment: the audience were asked to revisit the site over a period of six weeks to participate, rather than following the one-off visit-and-survey model of the previous examples. Allowing for the problem of retention, they were expecting 60,000 participants to help establish whether brain training actually works. This was a proper clinical trial with academic sponsors from the Alzheimer's Society – the results of which will be announced in a programme later next year.<br /><br />The fourth experiment Cable described was <em>The Big Personality Test</em>, linked with the <em>Child Of Our Time</em> series following children born in the year 2000. They used standard, accepted models for measuring personality, giving detailed feedback to participants. They were seeking to answer the question: “Does personality shape your life or does life shape your personality?”. They attracted 100,000 participants in three days, vastly more uptake than expected. The volume of data they have already collected is becoming unmanageable, which means they are having to re-evaluate the duration of the experiment.
<br /><br />In the future, they are hoping to make their experiments social, using Facebook and Twitter as part of the method.<br /><br />Cable summed up these experiments by highlighting the rules they have found they need to apply when designing them: a low barrier to entry, a clear motive for participation, a genuine mass-participation requirement, a sound scientific methodology and an aim that will contribute to new knowledge.<br /><br />Cable went on to discuss the practicalities of how experiments are taken from conception to commissioning. This involves selecting sponsor scientists, who help to design the experiment and analyse the results. He explained the selection process, which entails finding respected scientists who are flexible and adaptable to this experiment format. The role of this “sponsor academic” is to collaborate on experiment design, advise on the ethics processes, interpret the results and then write and publish the peer-reviewed paper resulting from the data.<br /><br />The data collected from these experiments comes in two forms: personally identifiable data and anonymised data. This means that the scientists cannot trace individual participants, but the BBC (or rather, three people within the BBC) can, in the event that they need to manage the database and delete entries if requested. Cable also explained that the data they ask for is driven by the science, not by editorial decisions from television programme makers, using standard measures where possible.<br /><br />Finally, Cable discussed the actual data and the curation issues surrounding it. All the data from Lab UK is stored in one place, connected by the BBC ID system, which enables them to start doing secondary analysis of the data where participants have taken part in multiple experiments. The sponsor academics have a period of exclusivity before the data becomes available for academic and educational purposes only.
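The two-tier data model Cable described – pseudonymous records for researchers, with a separately held key table that lets a small group inside the BBC re-link and delete a participant's data – can be sketched roughly as follows. This is an illustrative guess at the shape of such a system, not the BBC's actual implementation; the identity, the `store_response` and `delete_participant` helpers, and the in-memory dictionaries are all invented for the example.

```python
import secrets

# Illustrative sketch (not the BBC's actual system) of two-tier
# storage: researchers only ever see pseudonymous records, while a
# restricted key table allows re-linking, e.g. to honour deletions.

key_table = {}      # restricted access: real identity -> pseudonym
research_data = {}  # shareable with scientists: pseudonym -> responses

def store_response(identity, responses):
    """File a participant's responses under a stable pseudonym."""
    pseudonym = key_table.setdefault(identity, secrets.token_hex(8))
    research_data[pseudonym] = responses
    return pseudonym

def delete_participant(identity):
    """Honour a deletion request via the restricted key table."""
    pseudonym = key_table.pop(identity, None)
    if pseudonym is not None:
        research_data.pop(pseudonym, None)

pid = store_response("participant@example.com", {"extraversion": 7})
delete_participant("participant@example.com")
```

Keeping the key table organisationally separate from the research data is what allows the scientists' copy to be treated as anonymised while deletion requests remain honourable.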
However, they are still grappling with issues of data visualisation – so that they can make this data comprehensible to the general public – and with data storage, as the BBC does not do long-term data storage. There are precedents, including the <em>People's War</em> project, in which people's memories of World War II were collected and hosted online. This data has now been passed to the British Museum and forms part of their collection. He also noted that there may be demands from the ethics committee on how long they can keep the data before it must be destroyed.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1303975371294158246.post-46089917663719275352009-12-08T13:26:00.001+00:002009-12-08T13:28:58.727+00:00IDCC 09 Interview: Chris RusbridgeChris Rusbridge, Director of the DCC, gives us his views of IDCC 09 and his thoughts about the future of the DCC as it moves towards Phase 3 in this video interview:<br /><br /><center><object width="400" height="300"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=8052753&server=vimeo.com&show_title=1&show_byline=1&show_portrait=0&color=&fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=8052753&server=vimeo.com&show_title=1&show_byline=1&show_portrait=0&color=&fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object></center><br /><br />[<a href="http://vimeo.com/8052753">click here</a> to view this video at <a href="http://vimeo.com">Vimeo</a>]Unknownnoreply@blogger.com0