Thursday 6 October 2011

Thoughts before "The Future of the Past of the Web"

Tomorrow I'm going to be in London for a joint JISC/DPC event on web archiving, "The Future of the Past of the Web" (hashtag #fpw11 if you're so inclined.) It's the third in an occasional series; I gave the closing presentation at the second event and I have been asked to be on a closing panel this time round. One of the things we've been asked to reflect on is what changes have taken place since the last event and how far our expectations have been realised. I thought it would be useful to set my thoughts on this down in advance, partly to help me articulate my own thinking. It will be interesting to see how various views develop during the panel session tomorrow.

Image Courtesy Martin Dodge's Cybergeography collection
Looking back at my concerns in mid-2009 I'm greatly reassured. There were a number of worrying trends apparent in web archives at that time and an apparent lack of bold vision in how we might use web archives in the future - or even in the present. My fear was that the collecting policies, preservation policies and interfaces offered were all taking a very human and document-centric view of what a web archive should do. In OAIS terms, the Designated Community was people who wanted to view individual old web pages having done a search for a particular site, or possibly for a keyword of some sort. The National Archives had taken one incremental but powerful step beyond that, automatically linking archived web pages to 404 pages on government web sites via simple plugins for Apache & IIS, but in the end this still involved serving individual pages for people to read.

That's a valid use case, but by no means the only one. I set out a few other things we might want to be able to do but could not with the interfaces that web archives gave us.

  • What search results would we have got on the web of 1998 using the search engines of 1998?

  • What results would we have got using current search engines on the web of 1998?

  • How can we visualise the set of links to or from a particular site changing over time?

  • Treating the web as a corpus of text over time, how can we track the emergence of words or concepts and their movement from specialist vocabulary to general use?

  • As historians of technology, how can we use a web archive to track things like the emergence of PNG as an image format and the decline of XPM (the original icon format for graphical browsers such as Mosaic)?
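
To make the corpus question in the list above a little more concrete, here is a minimal Python sketch of tracking a term's share of all words year by year. It assumes the archive could hand over plain text already labelled with a capture year, which is exactly the kind of interface the list says was missing. The function name, the input shape and the sample sentences are all invented for illustration, and the tokenisation is deliberately crude.

```python
import re
from collections import Counter


def term_share_by_year(pages, term):
    """Return {year: fraction of that year's tokens equal to `term`}.

    `pages` is an iterable of (year, text) pairs. This shape is an
    assumption made for the sketch, not something any archive provides.
    """
    term = term.lower()
    totals = Counter()  # all tokens seen per year
    hits = Counter()    # tokens matching the term per year
    for year, text in pages:
        # Crude tokenisation: lower-cased runs of ASCII letters and digits.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token == term)
    # Years with no tokens at all are skipped to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Invented sample data standing in for archived page text.
SAMPLE_PAGES = [
    (1998, "a weblog is a new kind of site"),
    (2003, "my blog links to your blog"),
    (2003, "read the blog today"),
]

if __name__ == "__main__":
    for year, share in term_share_by_year(SAMPLE_PAGES, "blog").items():
        print(year, round(share, 3))
```

A real analysis would also need to deal with duplicate captures, boilerplate text and uneven crawl sizes between years, none of which this sketch attempts.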


I also wanted to show how open APIs or RESTful interfaces can allow others to develop innovative ways to view content. Since there weren't any web archives with such interfaces I fell back on demonstrating the point with Flickr, more particularly with the simple visual beauty that is TagGalaxy. TagGalaxy shows how the ability to search and retrieve images and tags lets someone else build a completely different interface to the Flickr repository, one which minimises textual interaction and which encourages serendipitous discovery. It would have been wonderful to be able to do that with a web archive. Similarly, if Brian Kelly had been able to say to the Internet Archive 'give me all the versions of the home page of the University of Bath between these dates' in a single interaction, it would have been much easier for him to build the informative animation he used in his own presentations for JISC PoWR. I could go on, and at the time I did.

Much of what I hoped for then has happened. The architecture of Memento makes it straightforward to view collections of web archives as a single entity from some viewpoints. Projects funded by "Digging Into Data" have shown the power of large web collections in viewing the web as data at many levels. And although most (all?) web archives are not yet offering the APIs or interfaces that would permit us to do some of the things above, I think they at least accept that these are valid aspirations.
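
As a rough illustration of the 'give me all the versions between these dates' request, the sketch below filters a Memento TimeMap by date. It assumes the TimeMap arrives as text in the link format later standardised in RFC 7089, and it works on an invented sample rather than fetching anything over the network. The host names, the sample captures and the function name mementos_between are hypothetical, and the parsing is deliberately simple rather than a complete link-format parser.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# One link entry: <uri> followed by zero or more ;name="value" parameters.
LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def mementos_between(timemap_text, start, end):
    """Return sorted (capture_time, uri) pairs for mementos in [start, end]."""
    found = []
    for uri, raw_params in LINK.findall(timemap_text):
        params = dict(PARAM.findall(raw_params))
        # Keep only entries whose rel value includes the token "memento".
        if "memento" not in params.get("rel", "").split():
            continue
        if "datetime" not in params:
            continue
        captured = parsedate_to_datetime(params["datetime"])
        if start <= captured <= end:
            found.append((captured, uri))
    return sorted(found)


# Invented TimeMap for a hypothetical site and a hypothetical archive.
SAMPLE_TIMEMAP = (
    '<http://www.example.ac.uk/>;rel="original",\n'
    '<http://archive.example.org/timemap/http://www.example.ac.uk/>'
    ';rel="self";type="application/link-format",\n'
    '<http://archive.example.org/20000620180259/http://www.example.ac.uk/>'
    ';rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",\n'
    '<http://archive.example.org/20050101120000/http://www.example.ac.uk/>'
    ';rel="memento";datetime="Sat, 01 Jan 2005 12:00:00 GMT",\n'
    '<http://archive.example.org/20091031000000/http://www.example.ac.uk/>'
    ';rel="last memento";datetime="Sat, 31 Oct 2009 00:00:00 GMT"\n'
)

if __name__ == "__main__":
    start = datetime(2001, 1, 1, tzinfo=timezone.utc)
    end = datetime(2009, 12, 31, tzinfo=timezone.utc)
    for captured, uri in mementos_between(SAMPLE_TIMEMAP, start, end):
        print(captured.isoformat(), uri)
```

With a list like this in hand, building an animation of a home page over time becomes a matter of fetching each URI in order, which is the single-interaction workflow described above.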

Moreover, web archiving has moved from being a specialist concern to something that appears in the letters pages of national newspapers. That, and the type of talks we're going to hear tomorrow, show how far we've moved in 2 1/2 years. I'm quietly confident that things are getting better.

2 comments:

  1. At the UK Web Archive, we're starting to explore some of the questions you raise. We've analysed the UK web domain history (1996-2010), and produced some datasets and visualisations based on it. There's an NGram search interface, which runs a bit slow, but allows you to track the use of terms and phrases over time (http://www.webarchive.org.uk/ukwa/ngramia/). There's an overall domain link analysis, showing how different subdomains are interlinked year by year (http://www.webarchive.org.uk/ukwa/visualisation/ukwa.ds.2/linkage).

    On the digital preservation side, there's also a year-by-year format profile, and some example results including showing how PNG and XBM usage has changed (http://www.webarchive.org.uk/ukwa/visualisation/ukwa.ds.2/fmt).

    Finally, there's also a GeoIndex, which allows the web archive to be linked to locations via postcodes.

    Do let us know if you have any feedback, or anything else you think we should try to do with the archives.

  2. Well, those links didn't really work. Here's a link to the top-level UKWA visualisations page. All the pages I mentioned above are linked from that page.


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.