Tuesday, 6 October 2009

iPres 2009: Pennock on ArchivePress

Blogs are a new medium but an old genre; witness Samuel Pepys’ diaries, for instance (now also a blog!). But since they are web-based, aren’t they already archived through web archiving? Not really: simple web archiving treats blogs merely as web pages, pages that change but in a sense stay the same. Web archiving also can’t easily respond to triggers, such as RSS feeds announcing new postings. Web archiving approaches are fine as far as they go, but they don’t treat blogs as first-class objects.
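To make the "trigger" idea concrete, here is a minimal Python sketch of how a harvester might spot new postings in an RSS 2.0 feed by remembering which item identifiers it has already seen. This is my own illustration, not anything shown in the talk: the function name `new_item_ids`, the sample feed and the example.org URLs are all hypothetical.

```python
import xml.etree.ElementTree as ET


def new_item_ids(feed_xml, seen_ids):
    """Return identifiers of RSS items not yet in seen_ids, in feed order.

    Hypothetical helper: uses <guid> when present, falling back to <link>.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        item_id = item.findtext("guid") or item.findtext("link")
        if item_id and item_id not in seen_ids:
            fresh.append(item_id)
    return fresh


# Illustrative feed only; a real harvester would fetch this over HTTP.
FEED = """<rss version="2.0"><channel>
<item><guid>http://example.org/2</guid><title>Second</title></item>
<item><guid>http://example.org/1</guid><title>First</title></item>
</channel></rss>"""

seen = {"http://example.org/1"}     # already archived
fresh = new_item_ids(FEED, seen)    # postings that should trigger a harvest
seen.update(fresh)                  # remember them for the next poll
```

A crawler that simply re-fetches pages on a schedule has no equivalent of `fresh`; the feed is what tells the archive that something new exists.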

New approaches can help build new corpora by aggregating blogs, creating a preserved set for institutional records and other purposes. ArchivePress is a JISC Rapid Innovation (JISCRI) project, which once completed will be released as open source. The project started with a small 10-question survey, for which the key question was: which parts of blogs should archiving capture? In descending order the answers were posts, comments, tag & category names, embedded objects, and the blog name & URLs. These findings were broadly in agreement with an earlier survey (see paper for reference).

The project set out to find the significant properties of blogs, which they see as being in the eye of the stakeholder. In the first round these include content (posts, comments, embedded objects), context (including authors & profiles), structure, rendering and behaviour.

To achieve this, they build on the Feed plugin for WordPress, which gathers the content as long as an RSS or Atom feed is available. WordPress is arguably the most widely used blogging platform; it’s open source, it’s GPL-licensed, and it has publicly available schemas.
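For readers who want a feel for what feed-based gathering involves, here is a rough Python sketch that pulls post records out of either an RSS 2.0 or an Atom feed. It is not the ArchivePress code (which builds on a WordPress plugin); the function name `harvest_entries`, the record fields and the sample feeds are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 namespace


def harvest_entries(feed_xml):
    """Return a list of post records from an RSS 2.0 or Atom feed string."""
    root = ET.fromstring(feed_xml)
    entries = []
    if root.tag == "rss":
        for item in root.iter("item"):
            entries.append({
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "published": item.findtext("pubDate", default=""),
                "content": item.findtext("description", default=""),
            })
    elif root.tag == ATOM + "feed":
        for entry in root.findall(ATOM + "entry"):
            link = entry.find(ATOM + "link")
            entries.append({
                "title": entry.findtext(ATOM + "title", default=""),
                "link": link.get("href", "") if link is not None else "",
                # Atom requires <updated>; <published> is optional.
                "published": entry.findtext(ATOM + "published", default="")
                or entry.findtext(ATOM + "updated", default=""),
                "content": entry.findtext(ATOM + "content", default="")
                or entry.findtext(ATOM + "summary", default=""),
            })
    else:
        raise ValueError("unrecognised feed format: " + root.tag)
    return entries


# Illustrative feeds only; real ones carry many more fields.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example blog</title>
<item><title>First post</title><link>http://example.org/1</link>
<pubDate>Tue, 06 Oct 2009 10:00:00 GMT</pubDate>
<description>Hello</description></item></channel></rss>"""

SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom"><title>Example blog</title>
<entry><title>First post</title><link href="http://example.org/1"/>
<updated>2009-10-06T10:00:00Z</updated>
<content>Hello</content></entry></feed>"""

rss_posts = harvest_entries(SAMPLE_RSS)
atom_posts = harvest_entries(SAMPLE_ATOM)
```

Even in this toy version the two formats need separate handling (element names, namespaces, where the link lives), and comments or embedded objects are not covered at all.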

Maureen showed the AP1 demonstrator based on the DCC blogs [disclosure: I’m from the DCC!], including blog posts written today that had already been archived. The AP2 demonstrator (the UKOLN collection) will harvest comments and resolve some rendering and configuration issues from AP1; it will also allow administrators to add new categories (tags?).

It seems to work, although there turned out to be more variation in feed content than expected. Configuration is tricky, so they must make it easier.

