Tuesday, 30 September 2008

iPres 2008 Web archiving

Thorsteinn Hallgrimsson from Iceland, talking about the IIPC. Established to foster development and use of common tools, techniques and standards, and to encourage national libraries to collect web materials. Now has 38 members, heavily concentrated in Europe and the US. Currently in Phase 2 and planning Phase 3; Phase 2 has 4 working groups, on access, harvesting, preservation and standards. Achievements include enhancements to the Heritrix crawler, tools like Web Curator from BL/NZNL, and Netarchive Curator Tool Suite from Denmark, plus access tools including NutchWAX for indexing, and an open source version of the Wayback Machine. Also a WARC standard near the end of the ISO process. Three main approaches to harvesting: bulk; selective based on criteria, e.g. publicly funded stuff; and event-based, such as around an election, disaster, etc. Iceland, with a population of 300,000, already has 400 million documents and 8 TB of data, so larger domains soon get into billions of documents, and many, many terabytes!

Helen Hockx-Yu from the BL talking about web archiving in the UK. Started with UKWAC in 2004, which does permission-based selective archiving. Covers 3,700 unique web sites and >11,000 instances with about 2 TB data. Building an Ongoing Web Archiving Programme for the UK domain. Biggest problem really is that legal deposit legislation is not yet fully implemented, and not expected until 2010! Failing the legislative mandate, permission-based archiving is slow and the typical success rate is 25%. So they suspect valuable web sites are already being lost.

Birgit Henriksen from Denmark: they have been archiving some parts of the web since 1998, and following new legislation, have been doing full Danish domains since 2005. In 2008 2.2 billion objects, 71 TB data, 5 FTE staff. Access is only for research and statistical purposes, which is hard to implement. They do bulk harvests quarterly, selective sites more frequently (sometimes daily), and event-based. Events are things that create debate or are expected to be of importance to Danish history. Preservation involved from the start: redundancy in geography, hardware architecture and vendor, storage media and software, and do active bit preservation based on checksum comparisons.
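As an aside, the checksum-comparison idea can be sketched in a few lines. This is only a minimal illustration of the general technique, not the Danish implementation: the function names, the site labels and the choice of SHA-256 with simple majority voting are all my own assumptions.

```python
import hashlib
from collections import Counter


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of one stored copy."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(copies: dict[str, bytes]) -> dict:
    """Compare checksums of the same object held at several locations.

    `copies` maps a (hypothetical) location name to the bytes stored there.
    A copy is flagged as damaged when its digest differs from the digest
    held by a strict majority of locations.
    """
    digests = {site: sha256_of(data) for site, data in copies.items()}
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        # No strict majority: nothing can be trusted automatically.
        return {"majority": None, "damaged": sorted(copies)}
    damaged = sorted(s for s, d in digests.items() if d != majority_digest)
    return {"majority": majority_digest, "damaged": damaged}


# Example: three copies, one of which has suffered a flipped byte.
report = audit_copies({
    "site_a": b"WARC/1.0 payload",
    "site_b": b"WARC/1.0 payload",
    "site_c": b"WARC/1.0 paylaod",
})
print(report["damaged"])
```

With three or more independent copies, a majority vote lets a damaged copy be identified and then replaced from a good one; with only two copies a mismatch can be detected but not resolved without a separately stored checksum.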

Gildas Illien from BnF in France; they have had a legal deposit law since 2006. Now have 12 billion URIs, 120 TB, 7 FTE, 100 curators and partners involved. Partnered with Internet Archive. Provided in-house access since April 2008. Challenge of change: digital curators not a new job, need to change librarianship. Legal Deposit team as “honest broker” between collections/curators and the IT development and operations.

Colin Webb from NLA in Australia: been doing selective web archiving since 1 April, 1996, with negotiated permissions, quality control and access (however, get about 98% permission success). Also full domain harvests each year since 2005 (1 billion files this year). No legal deposit law (yet, despite lobbying), so have not yet provided access to the full domain crawls. Challenges are interconnected: what we want to collect, what we’re allowed to collect, what we’re able to collect (?) and what we can afford to collect.

Colin then talking about the preservation working group of IIPC, set up in 2007 after the initial concerns to get capture organised had abated. Brief: to identify preservation standards and practices applicable to web archives. Raises questions about whether web preservation differs from “ordinary” digital preservation. What about the scale effects? Work plan includes an annual survey of the web technical environment (browsers needed, etc). A comment from the floor stressed how important this was, and hoped they would publish the results and keep doing so for many years.

Question: can people whose web site is at risk get it preserved? This has started to happen. Also starting to see requests from people who have had disasters, wanting (and getting) help in recovering their web sites! Nice French story about political parties who have re-done their web sites, and now don’t remember what they promised a few years ago!


