Tuesday, 28 July 2009

My backup rant

I did a presentation on Trust and Digital Archives at the PASIG Malta meeting; not a very good presentation, I felt. But somewhat at the last minute I added a scarcely relevant extra slide on my favourite bête noire: Preservation’s dirty little secret, namely backup, or rather the lack of it. Curation and preservation are about doing better research, and reducing risk to the research enterprise. But all is for nothing if those elementary prior precautions of making proper backups are not observed. You can’t curate what isn’t there! Anyway, that part went down very well.

Of course the obvious reaction at this point might be to say, tsk tsk, proper backups, of course, everyone should do that. But I’m willing to bet that the majority of people have quite inadequate backup for their home computer systems; the systems on which, these days, they keep their photos, their emails, their important documents. Or worse, they think that having uploaded their photos to Flickr means they are backed up and even preserved.

There’s a subtext here. Researchers are bright, they question authority, they are even idiosyncratic. They often travel a lot, away from the office for long periods. They have their own ways of doing things. Yes, some labs will be well organised, with good systems in place. But others will leave many things to the “good sense” of the researchers. In my experience, this means a wide variety of equipment, both desktop and laptop, with several flavours of operating system in the one research group. One I saw recently had pretty much the entire gamut: more than one version of MS Windows, more than one version of Mac OS X, and several versions of Linux, all these on laptops. Desktop machines in that group tended to be better protected, with a corporate Desktop, networked drives and organised backup systems. But the laptops, often the researchers’ primary machines, were very exposed. My own project has Windows and Mac systems (at least), and is complicated by being spread across several institutions.

The "good sense" of researchers apparently leaves a lot to be desired, according to a few surveys we've seen over the past couple of years, including the StORe project (mentioned in an earlier blog post). At a recent meeting, we even heard of examples of IT departments discouraging researchers from keeping their data on the backed-up part of their corporate systems, presumably for reasons of volume, expense, etc.

To take my own example, I have a self-managed Mac laptop with 110 GB of disk or so. My corporate disk quota has been pushed up to a quite generous 10 GB. The simplest backup strategy for me is to rsync key directories onto a corporate disk drive, and let the corporate systems take the backup from there. Someone wrote a tiny script for me that I run in the underlying UNIX system; typically it takes scarcely a couple of minutes to rsync the Desktop, my Documents and Library folders (including email, about 9 GB in all), when in the office. But I keep downloading reports and other documents, and soon I’ll be bumping up against that quota limit again.
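For anyone wanting to try the same approach, here is a minimal sketch in Python of what such a script might look like. It is not the script I actually run: the directory names, the destination path and the helper names (`build_rsync_command`, `tree_size`, `fits_quota`) are all assumptions made for illustration. It builds the rsync command for the key directories and checks whether they still fit within the quota.

```python
import os

# Hypothetical values: the three folders and the 10 GB quota described above.
KEY_DIRS = ["Desktop", "Documents", "Library"]
QUOTA_BYTES = 10 * 1000**3


def build_rsync_command(home, dest, dirs=KEY_DIRS, dry_run=False):
    """Return the rsync argument list that copies `dirs` under `home` into `dest`."""
    cmd = ["rsync", "-a"]  # -a: archive mode (recursive, keeps times and permissions)
    if dry_run:
        cmd.append("--dry-run")  # report what would be copied without copying
    # No trailing slash on the sources, so each folder is recreated by name in dest.
    cmd += [f"{home.rstrip('/')}/{d}" for d in dirs]
    cmd.append(dest.rstrip("/") + "/")
    return cmd


def tree_size(path):
    """Total size in bytes of the files under `path`, skipping symbolic links."""
    total = 0
    for root, _dirs, files in os.walk(path):  # does not descend into linked folders
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def fits_quota(home, dirs=KEY_DIRS, quota=QUOTA_BYTES):
    """True if the chosen folders together are no larger than the quota."""
    return sum(tree_size(f"{home.rstrip('/')}/{d}") for d in dirs) <= quota
```

To run the backup for real, the command list could be passed to `subprocess.run(cmd, check=True)`, with a destination such as a mounted corporate network drive (the mount point will differ from site to site). Checking `fits_quota` first would at least give some warning before the quota limit bites.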

For a while I supplemented this with backup DVDs, and there’s quite a pile of them on my desk. But that no longer works as my needs increase.

Remember, disk is cheap! No-one buys a computer with less than 100 GB these days. My current personal backup solution is to supplement the partial rsync backup with a separate backup (using the excellent Carbon Copy Cloner) to a 500 GB disk kept at home, on a USB 2 port. I back up a bit more (Pictures folders etc), but it’s MUCH slower, taking maybe 12 minutes, most of which seems to be a very laborious trek through the filesystem (rsync clearly does the same task much faster). By the way, that simple, self-powered disk cost less than £100, and a colleague says I should have paid less. I know this doesn't translate easily into budget for corporate systems, but it certainly should.

But this one-off solution still leaves me unable to answer a simple question: are my project’s data adequately backed up? My solution works for someone with a Mac; the software doesn’t work on Windows. It seems to be everyone for themselves. As far as I can see, there is no good, simple, low cost, standardised way to organise backup!

I looked into Cloud solutions briefly, but was rather put off by the clauses in Amazon’s agreements, such as 7.2 ("you acknowledge that you bear sole responsibility for adequate security, protection and backup of Your Content"), or 11.5, disclaiming any warranty "THAT THE DATA YOU STORE WITHIN THE SERVICE OFFERINGS WILL BE SECURE OR NOT OTHERWISE LOST OR DAMAGED" (more on this another time). That certainly makes the appealing idea of Cloud-based backups rather less attractive (although you could perhaps negotiate or design around it).

I think what I need is a service I can subscribe to on behalf of everyone in my project. I want agreed backup policies that allow for the partly disconnected existence experienced by so many these days. I want the server side to be negotiable separately from the client side, and I want clients for all the major OS variants (for us, that’s Windows, Mac OS X and various flavours of Linux). I think this means an API, leaving room for individuals to specify whether it should be a bootable or partial backup, which parts of the system are included or excluded, and for management to specify the overall backup regime (full and incremental backups, cycles, and any “archiving” of deleted files, etc). I want the system to take account of intermittent and variable quality connectivity; I don’t want a full backup to start when I make a connection over a dodgy external wifi network. I don’t want it to work only in the office. I don’t want it to require a server on-site. On the one hand this sounds a lot; on the other, none of this is really difficult (I believe).
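To make that wish list a little more concrete, here is a rough sketch in Python of how the client-side decision might look. Everything in it is hypothetical: the names `Link`, `UserChoices`, `Regime` and `next_action`, the default intervals, and the rule that an incremental backup is acceptable over a poor link are my assumptions, not features of any existing product.

```python
from dataclasses import dataclass, field
from enum import Enum


class Link(Enum):
    OFFLINE = 0
    POOR = 1   # e.g. a dodgy external wifi network
    GOOD = 2   # e.g. the office network or decent home broadband


@dataclass
class UserChoices:
    """Settings left to the individual."""
    bootable: bool = False
    include: list = field(default_factory=lambda: ["Desktop", "Documents"])
    exclude: list = field(default_factory=list)


@dataclass
class Regime:
    """Settings specified by management (illustrative defaults)."""
    full_every_days: int = 30
    incremental_every_days: int = 1
    keep_deleted_days: int = 90


def next_action(regime, link, days_since_full, days_since_any):
    """Decide what to do on connecting: 'full', 'incremental', 'defer' or 'none'.

    days_since_full is None if no full backup has ever completed.
    days_since_any counts from the most recent backup of either kind.
    """
    if link is Link.OFFLINE:
        return "defer"
    if days_since_full is None:
        # Nothing to increment against yet, so wait for a good connection.
        return "full" if link is Link.GOOD else "defer"
    if days_since_full >= regime.full_every_days and link is Link.GOOD:
        return "full"
    # Assumption: an incremental is small enough to attempt on a poor link.
    if days_since_any >= regime.incremental_every_days:
        return "incremental"
    return "none"
```

For instance, `next_action(Regime(), Link.POOR, 45, 2)` falls back to an incremental backup even though a full one is overdue, which is the behaviour I would want on that dodgy wifi network. The individual's choices and the management regime are kept as separate records so that each can be agreed independently.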

Does such a system exist? If not, is defining a system like this a job for SNIA? JISC? Anyone? Is there a demand from anyone other than me?


  1. One quick point, maybe slightly tangential - having multiple versions/types of operating system in use is not a bad thing for preservation, when considering them as storage.

    In fact, it is something that we are doing by design when considering the storage systems - they are not all on the same version, or even same operating systems at the same time.

    Systemic failures can be a killer. For example, ZFS had a bug introduced a year or so ago that caused wide-scale irreversible file corruption - it was quickly undone, but if all the systems were running with the same version of ZFS on the same operating system, *all* the storage would have been affected.

    So by choice, version updates will be rolled out gradually, with a compromise between the number of different setups and their maintenance guiding the variety of storage in use.

    And as for backup - I totally agree, the software in use is woefully inadequate and institutions tend to have a large gap where this type of service should be supported.

  2. Ben, you're right, diversity of OS is a good thing for preservation (although tougher for the backup case). I do remember though that diversity of OS was sometimes a necessary (and expensive) evil in some MIS circumstances, where certain applications could not be migrated forward to more modern OS versions, for reasons like supplier bankruptcy, etc. Although never expressed in that way to me at the time, these were preservation problems!

    But here my focus is on backup, and I don't want to lose that goal...


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.