Looking back at my concerns in mid-2009 I'm greatly reassured. There were a number of worrying trends apparent in web archives at that time and an apparent lack of bold vision in how we might use web archives in the future - or even in the present. My fear was that the collecting policies, preservation policies and interfaces offered were all taking a very human and document-centric view of what a web archive should do. In OAIS terms, the Designated Community was people who wanted to view individual old web pages having done a search for a particular site, or possibly for a keyword of some sort. The National Archives had taken one incremental but powerful step beyond that, automatically linking archived web pages to 404 pages on government web sites via simple plugins for Apache & IIS, but in the end this still involved serving individual pages for people to read.
That's a valid use case, but by no means the only one. I set out a few other things we might want to be able to do but could not with the interfaces that web archives gave us.
- What search results would we have got on the web of 1998 using the search engines of 1998?
- What results would we have got using current search engines on the web of 1998?
- How can we visualise the set of links to or from a particular site changing over time?
- Treating the web as a corpus of text over time, how can we track the emergence of words or concepts and their movement from specialist vocabulary to general use?
- As historians of technology, how can we use a web archive to track things like the emergence of PNG as an image format and the decline of XPM (the original icon format for graphical browsers such as Mosaic)?
I also wanted to show how open APIs or RESTful interfaces can allow others to develop innovative ways to view content. Since there weren't any web archives with such interfaces, I fell back on demonstrating the point with Flickr, and more particularly with the simple visual beauty that is TagGalaxy. TagGalaxy shows how the ability to search and retrieve images and tags lets someone else build a completely different interface to the Flickr repository, one which minimises textual interaction and encourages serendipitous discovery. It would have been wonderful to be able to do that with a web archive. Similarly, if Brian Kelly had been able to say to the Internet Archive 'give me all the versions of the home page of the University of Bath between these dates' in a single interaction, it would have been much easier for him to build the informative animation he used in his own presentations for JISC PoWR. I could go on, and at the time I did.
Much of what I hoped for then has happened. The architecture of Memento makes it straightforward to view collections of web archives as a single entity from some viewpoints. Projects funded by "Digging Into Data" have shown the power of large web collections in viewing the web as data at many levels. And although most (all?) web archives are not yet offering the APIs or interfaces that would permit us to do some of the things above, I think they at least accept that these are valid aspirations.
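To make the 'give me all the versions between these dates' request a little more concrete, here is a minimal sketch of how a client might filter a Memento TimeMap, which lists the archived captures of a page as links carrying a `rel="memento"` relation and a `datetime` attribute. Everything specific in it is an assumption made for illustration: the host `archive.example`, the sample captures, and the helper name `mementos_between` are all hypothetical, and a real client would fetch the TimeMap over HTTP rather than read it from a string.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical TimeMap in link-format; the archive host and captures are invented.
SAMPLE_TIMEMAP = """\
<http://www.bath.ac.uk/>; rel="original",
<http://archive.example/web/20050101000000/http://www.bath.ac.uk/>; rel="first memento"; datetime="Sat, 01 Jan 2005 00:00:00 GMT",
<http://archive.example/web/20070615120000/http://www.bath.ac.uk/>; rel="memento"; datetime="Fri, 15 Jun 2007 12:00:00 GMT",
<http://archive.example/web/20090301000000/http://www.bath.ac.uk/>; rel="last memento"; datetime="Sun, 01 Mar 2009 00:00:00 GMT"
"""

# One link entry: <uri> followed by zero or more ; name="value" attributes.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def mementos_between(timemap_text, start, end):
    """Return (capture time, URI) pairs for mementos captured in [start, end]."""
    results = []
    for uri, raw_attrs in LINK_RE.findall(timemap_text):
        attrs = dict(ATTR_RE.findall(raw_attrs))
        # Skip links that are not archived captures (e.g. rel="original").
        if "memento" not in attrs.get("rel", "").split():
            continue
        if "datetime" not in attrs:
            continue
        captured = parsedate_to_datetime(attrs["datetime"])
        if captured.tzinfo is None:
            # Treat a zone-less timestamp as UTC so comparisons stay valid.
            captured = captured.replace(tzinfo=timezone.utc)
        if start <= captured <= end:
            results.append((captured, uri))
    return sorted(results)


if __name__ == "__main__":
    window_start = datetime(2006, 1, 1, tzinfo=timezone.utc)
    window_end = datetime(2008, 12, 31, tzinfo=timezone.utc)
    for captured, uri in mementos_between(SAMPLE_TIMEMAP, window_start, window_end):
        print(captured.isoformat(), uri)
```

With a list like this in hand, building an animation of a home page over time becomes a matter of fetching each returned URI in order, which is the sort of reuse an open interface is meant to invite.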
Moreover, web archiving has moved from being a specialist concern to something that appears in the letters pages of national newspapers. That, and the type of talks we're going to hear tomorrow, show how far we've moved in 2 1/2 years. I'm quietly confident that things are getting better.