Tuesday 28 July 2009

My backup rant

I did a presentation on Trust and Digital Archives at the PASIG Malta meeting; not a very good presentation, I felt. But somewhat at the last minute I added a scarcely relevant extra slide on my favourite bête noire: Preservation’s dirty little secret, namely backup, or rather the lack of it. Curation and preservation are about doing better research, and reducing risk to the research enterprise. But all is for nothing if those elementary prior precautions of making proper backups are not observed. You can’t curate what isn’t there! Anyway, that part went down very well.

Of course the obvious reaction at this point might be to say, tsk tsk, proper backups, of course, everyone should do that. But I’m willing to bet that the majority of people have quite inadequate backup for their home computer systems; the systems on which, these days, they keep their photos, their emails, their important documents. Or worse, they think that having uploaded their photos to Flickr means they are backed up and even preserved.

There’s a subtext here. Researchers are bright, they question authority, they are even idiosyncratic. They often travel a lot, away from the office for long periods. They have their own ways of doing things. Yes, some labs will be well organised, with good systems in place. But others will leave many things to the “good sense” of the researchers. In my experience, this means a wide variety of equipment, both desktop and laptop, with several flavours of operating system in the one research group. One I saw recently had pretty much the entire gamut: more than one version of MS Windows, more than one version of MacOS/X, and several versions of Linux, all on laptops. Desktop machines in that group tended to be better protected, with a corporate Desktop, networked drives and organised backup systems. But the laptops, often the researchers’ primary machines, were very exposed. My own project has Windows and Mac systems (at least), and is complicated by being spread across several institutions.

The "good sense" of researchers apparently leaves a lot to be desired, according to a few surveys we've seen over the past couple of years, including the StORe project (mentioned in an earlier blog post). At a recent meeting, we even heard of examples of IT departments discouraging researchers from keeping their data on the backed-up part of their corporate systems, presumably for reasons of volume, expense, etc.

To take my own example, I have a self-managed Mac laptop with 110 GB or so of disk. My corporate disk quota has been pushed up to a quite generous 10 GB. The simplest backup strategy for me is to rsync key directories onto a corporate disk drive, and let the corporate systems take the backup from there. Someone wrote a tiny script for me that I run in the underlying UNIX system; typically, when I’m in the office, it takes scarcely a couple of minutes to rsync the Desktop, my Documents and Library folders (including email, about 9 GB in all). But I keep downloading reports and other documents, and soon I’ll be bumping up against that quota limit again.
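The script itself is nothing special, and I won’t reproduce it exactly, but purely as an illustration, something along these lines would do the same job (the destination path and folder names here are placeholders rather than my actual setup):

    #!/usr/bin/env python3
    # Illustrative sketch only: mirror a few key folders onto a mounted
    # corporate network share, and let the corporate backup take over from there.
    import pathlib
    import subprocess

    HOME = pathlib.Path.home()
    DEST = "/Volumes/CorporateShare/backup/"       # hypothetical mount point
    FOLDERS = ["Desktop", "Documents", "Library"]  # the folders that matter to me

    for folder in FOLDERS:
        # -a preserves permissions and timestamps; --delete keeps the copy an
        # exact mirror, so deletions propagate too.
        subprocess.run(
            ["rsync", "-a", "--delete", str(HOME / folder), DEST],
            check=True,
        )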

For a while I supplemented this with backup DVDs, and there’s quite a pile of them on my desk. But that approach already fails to keep up as my needs increase.

Remember, disk is cheap! No-one buys a computer with less than 100 GB these days. My current personal backup solution is to supplement the partial rsync backup with a separate backup (using the excellent Carbon Copy Cloner) to a 500 GB disk kept at home, connected via USB 2. I back up a bit more (Pictures folders etc), but it’s MUCH slower, taking maybe 12 minutes, most of which seems to be a very laborious trek through the filesystem (rsync clearly does the same task much faster). By the way, that simple, self-powered disk cost less than £100, and a colleague says I should have paid less. I know this doesn't translate easily into budget for corporate systems, but it certainly should.

But this one-off solution still leaves me unable to answer a simple question: are my project’s data adequately backed up? My solution works for someone with a Mac; the software doesn’t work on Windows. It seems to be everyone for themselves. As far as I can see, there is no good, simple, low cost, standardised way to organise backup!

I looked into Cloud solutions briefly, but was rather put off by the clauses in Amazon’s agreements, such as 7.2 ("you acknowledge that you bear sole responsibility for adequate security, protection and backup of Your Content"), or 11.5, disclaiming any warranty "THAT THE DATA YOU STORE WITHIN THE SERVICE OFFERINGS WILL BE SECURE OR NOT OTHERWISE LOST OR DAMAGED" (more on this another time). That certainly makes the appealing idea of Cloud-based backups rather less attractive (although you could perhaps negotiate or design around it).

I think what I need is a service I can subscribe to on behalf of everyone in my project. I want agreed backup policies that allow for the partly disconnected existence experienced by so many of us these days. I want the server side to be negotiable separately from the client side, and I want clients for all the major OS variants (for us, that’s Windows, Mac OS/X and various flavours of Linux). I think this means a defined API, leaving room for individuals to specify whether it should be a bootable or partial backup, which parts of the system are included or excluded, and for management to specify the overall backup regime (full and incremental backups, cycles, and any “archiving” of deleted files, etc). I want the system to take account of intermittent and variable-quality connectivity; I don’t want a full backup to start when I make a connection over a dodgy external wifi network. I don’t want it to work only in the office. I don’t want it to require a server on-site. On the one hand this sounds like a lot; on the other, none of it is really difficult (I believe).
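Just to make that concrete (and purely as a sketch: every name and number below is invented for illustration, not a description of any existing product), the per-machine policy for such a service might look something like this, with management setting the regime and the individual researcher setting the scope:

    # Hypothetical per-machine backup policy for the kind of service described
    # above. Nothing here refers to a real product; it is only an illustration.
    policy = {
        "regime": {                       # set centrally by project management
            "full_backup_every_days": 30,
            "incremental_every_hours": 24,
            "retain_deleted_files_days": 90,
        },
        "scope": {                        # chosen by the individual researcher
            "type": "partial",            # or "bootable"
            "include": ["~/Documents", "~/Pictures", "~/Library/Mail"],
            "exclude": ["~/Downloads", "~/Library/Caches"],
        },
        "connectivity": {                 # cope with intermittent, variable links
            "full_backups_only_on_trusted_networks": True,
            "pause_on_low_bandwidth": True,
        },
    }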

Does such a system exist? If not, is defining a system like this a job for SNIA? JISC? Anyone? Is there a demand from anyone other than me?

Turmoil in discourse a long-term threat?

Lorcan Dempsey mentioned a meeting with Walt Crawford, whom I don't know, in the light of his feeling that "some of the heat had gone out of the blogosphere in general", and reported:
'Walt, whom I was pleased to bump into [...], is probably right to suggest in the comments that some energy around notifications etc has moved to Twitter: "Twitter et al ... have, in a way, strengthened essay-length blogging while weakening short-form blogging (maybe)-and essays have always been harder to do than quick notes"'
That ties in to my experience to some extent. I've just published a blog post from Sun-PASIG in Malta, which ended a month ago (not really an essay, but something where it was hard to get the tone just right), and I have a bunch of other posts in the "part-written" pipeline. Tweets are a lot easier.

But that isn't quite my point here. I'm a little concerned that the new "longevity" threat may not be the encoding of our discourse in obsolete formats, nor even our entrusting it to private providers such as the blog systems (as long as it IS open access, and preferably Creative Commons). The threat may be the way new venues for discourse wax and wane with great rapidity. We can learn to deal with blogs; we can even have a debate on whether the twitterverse is worth saving (or how much of it might be). Do we need to worry about other, more social media (MySpace, Facebook, Flickr and so many lesser pals; so heavily fractured)? They're not speech, they're not scholarly works, but they have some significance somewhere in between (particularly in documenting significant events). We could learn to deal with any small set of them, but by the time we work out how they could be preserved, and how parts might be selected, that set would (as is suggested above for blogs) already be "so last year".

BTW, part of this space is being addressed by the Blue Ribbon Task Force on Sustainable Digital Preservation and Access. I'm attending one of their meetings over the next two days, on my first visit to Ann Arbor, Michigan. Among the things we're looking at are scenarios that currently include social media. I'll try and write a bit more about it, but it's not really the sort of meeting you can blog about freely...

Rosenthal at Sun-PASIG in Malta

I was very pleased to hear David Rosenthal reprise his CNI keynote on digital preservation for the Sun-PASIG meeting in Malta, a few weeks ago now. David is a very original thinker and careful speaker. I’ve fallen into the trap before of mis-remembering him, and then arguing from my faulty version. I even noted two tweets made contemporaneously with his talk that misquoted him and changed the meaning subtly (see below). Luckily, David has made his CNI presentation available in an annotated version on his blog, so I hope I don’t make the same mistake again.

If you were not able to hear this talk, please go read that blog post. David has some important things to say, pretty much all of which I agree strongly with. No real surprise there, as part of the talk at least echoes concerns I expressed in the “Excuse Me…” Ariadne article (Rusbridge, 2006), which on reflection was probably influenced by earlier meetings with David among others.

So here’s the highly condensed version: Jeff Rothenberg was wrong in his famous 1995 Scientific American article (Rothenberg, 1995). The important digital preservation problems for society are not media degradation or media obsolescence or format obsolescence, because important stuff is online (and more or less independent of media), and widely used formats no longer become obsolete the way they did when Jeff wrote the article. The important issue is money, as collecting all we need will be ruinously expensive. Every dollar we spend on non-problems (like protecting against format obsolescence) doesn’t go towards real problems.

And if you are so imbued with conventional preservation wisdom as to think that summary is nonsense, but you haven’t read the blog post, go read it before making up your mind!

David concludes:

"Practical Next Steps

  • Everyone - just go collect the bits: Not hard or costly to do a good enough job; Please use Creative Commons licenses
  • Preserve Open Source repositories: Easy & vital: no legal, technical or scale barriers
  • Support Open Source renderers & emulators
  • Support research into preservation tech: How to preserve bits adequately & affordably? How to preserve this decade's dynamic web of services? Not just last decade's static web of pages"

So what are the limitations of this analysis? My quick summary from a research data viewpoint:

  • Lots of important/valuable stuff is not online
  • Quite a lot of this stuff is not readable with common, open-source-compatible software packages
  • We need to keep contextual metadata as well as the bits for a lot of this stuff… and yes, we do need to learn how to do this in a scalable way.

David clearly concentrates on the online world:

“Now, if it is worth keeping, it is on-line

Off-line backups are temporary”

However, it’s worth remembering Raymond Clarke’s point, in my earlier post from PASIG Malta, about the cost advantages of offline storage. Particularly in the research data world, there is a substantial set of content that exists off-line, or perhaps near-line. Some of the Rothenberg risks still apply to such content. Let’s leave aside for the moment that parallels to the scenario Rothenberg envisages continue to exist: scholars’ works encoded on obsolete digital media are starting to be ingested into archives. But more pressingly, some research projects report that their university IT departments discourage them from using enterprise backup systems for research data, for reasons of capacity limitations. So these data often exist in a ragbag collection of scarcely documented offline media (or may even not be backed up at all). In Big Science, data may be better protected, being sometimes held in large hierarchical storage management systems. A concern I have heard from the managers of such large systems is that the time needed to migrate their substantial data holdings from one generation of storage to the next can approximate the life of the system, i.e. several years. And clearly such systems are more exposed to risk while those lengthy migrations are in progress.

Secondly, David’s comments about format obsolescence apply specifically to common formats. He says “Gratuitous incompatibility is now self-defeating”, and “Open Source renderers [exist] for all major formats” with “Open Source isn't backwards incompatible”. But unfortunately there are valuable resources that remain at risk. There are areas with valuable content not accessible with Open Source renderers (e.g. engineering and architectural design). There are many cases in research where critical analysis codes are written by non-experts, with poor version control, and are poorly documented. And even in the mainstream world, format obsolescence can still occur in minority formats, for all sorts of reasons, including vendor bankruptcy, but also sheer bad design of early versions.

Finally, I’m sure David didn’t really mean “just keep the bits”. Particularly in research, but in many other areas as well, important contextual data and metadata are needed to understand the preserved data, and to demonstrate its authenticity. The task of capturing and preserving these can be the hardest part of curating and preserving the data, precisely because those directly involved need less of the context.

Oh, that double mis-quote? Talking of the difficulty of engaging with costly lawyers, David said “1 hour of 1 lawyer ~ 5TB of disk [-] 10 hours of 1 lawyer could store the academic literature”. One tweet reported this as “Lawyer effects; cost of 10 lawyer hours could save entire academic literature!” and the other as “10 hours of a lawyer's time could preserve the entire academic literature”. See what I mean? Neither “save” nor “preserve” means the same as “store”!

Overall, David does a great job, in his presentation, blog post and other writings, in reminding us not to blindly accept but to challenge preservation orthodoxy. Put simply, we have to think for ourselves.

Rothenberg, J. (1995). Ensuring the longevity of digital documents. Scientific American, 272(1), 42. http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=9501173513&site=ehost-live

(yes, that URL IS the "permanent URL" according to Ebsco!)

Rusbridge, C. (2006). Excuse Me... Some Digital Preservation Fallacies? Ariadne, 46. http://www.ariadne.ac.uk/issue46/rusbridge/

Friday 24 July 2009

Semantic Web of Linked Data for Research?

In the beginning was the World Wide Web. Then we were going to have the Semantic Web. (Then we had Web 2.0, but that’s another story.) But maybe the Semantic Web wasn’t semantic enough for some, so they changed the name to Linked Data, and it began to take off a little more. Now there’s an argument about whether all linked data are Linked Data!

The debate started with Andy Powell asking on Twitter what name we should use when all the conditions for Linked Data are met except for one, namely the requirement that data be expressed in standards, specifically RDF (see Andy's summary). Tim Berners-Lee had suggested there were 4 principles for Linked Data:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
  4. Include links to other URIs, so that they can discover more things.

There were quite strong divisions; one group says roughly: “Linked Data is a brand and a definition; live with it”, while the other group says something like “Linked Data can afford to be inclusive, and will benefit from that” (both of these are extreme simplifications). I’ve read all the remarks and they’re pretty convincing; I mostly agree with them (not much help to you, gentle reader!). Paul Walk's summary is quite balanced. However, I particularly liked a comment made on someone else’s blog post by Dan Brickley, who should know about RDF (quoted by Andy in the post mentioned above):

“I have no problem whatsoever with non-RDF forms of data in “the data Web”. This is natural, normal and healthy. Statistical information, geographic information, data-annotated SVG images, audio samples, JSON feeds, Atom, whatever.

We don’t need all this to be in RDF. Often it’ll be nice to have extracts and summaries in RDF, and we can get that via GRDDL or other methods. And we’ll also have metadata about that data, again in RDF; using SKOS for indicating subject areas, FOAF++ for provenance, etc.

The non-RDF bits of the data Web are – roughly – going to be the leaves on the tree. The bit that links it all together will be, as you say, the typed links, loose structuring and so on that come with RDF. This is also roughly analogous to the HTML Web: you find JPEGs, WAVs, flash files and so on linked in from the HTML Web, but the thing that hangs it all together isn’t flash or audio files, it’s the linky extensible format: HTML. For data, we’ll see more RDF than HTML (or RDFa bridging the two). But we needn’t panic if people put non-RDF data up online…. it’s still better than nothing. And as the LOD scene has shown, it can often easily be processed and republished by others. People worry too much! :)”

I think this makes lots of sense for research data. I’ve been wondering for some time how RDF fits into the world of research data. I asked the NERC Data Managers at their meeting earlier this year, and the general consensus appeared to be that RDF was good for the metadata, but not the actual research data. This seems reasonable and is consistent with Dan’s view above.

But it does rather raise the question of exactly what kinds of data RDF IS suitable for. It begins to look as if it is good for isolated facts, simple relationships and descriptive data. While RDF can probably encode most things you would put in databases or scientific datasets, it would generally struggle to express what those databases and datasets express natively, and there would be a massive explosion of triples if one tried.
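To see why, consider a single row of a typical scientific data table, say one temperature observation. Expressed as RDF (this sketch uses the Python rdflib library, which isn't mentioned anywhere above; the namespace, property names and values are all invented), that one row already becomes four separate triples. Multiply by a few million rows and a few dozen columns and the explosion is obvious:

    # Sketch: one row of a tabular dataset becomes several RDF triples.
    # The namespace, property names and values are all invented.
    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/obs/")
    g = Graph()

    obs = EX["reading42"]                       # one row = one resource
    g.add((obs, RDF.type, EX.TemperatureReading))
    g.add((obs, EX.station, EX["station7"]))    # each cell = one triple
    g.add((obs, EX.timestamp, Literal("2009-07-24T12:00:00")))
    g.add((obs, EX.celsius, Literal(18.3)))

    print(g.serialize(format="turtle"))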

To answer Andy’s original question (what name…), although I was taken with the idea of linked data, it’s clearly too easy to confuse with Linked Data. So I think I’d go with Paul Walk’s suggestion of Web of Data, or interchangeably Dan Brickley's data Web. If we can weave research data into a Web of Data, we’ll be doing well!

Thursday 23 July 2009

IJDC Volume 4(1) was published

That's volume 4, issue 1 of the International Journal of Digital Curation... and I didn't report it here. My apologies for that. It's our biggest issue yet, with 10 peer-reviewed papers and 4 general articles, plus 2 editorials (a guest editorial from Malcolm Atkinson, and a normal one from me). There's some really interesting stuff, mostly from the Digital Curation Conference in Edinburgh last year.

There are still a few papers from last year's conference to come, plus a selection from iPres 2008 at the BL in London. We are also hoping that some papers will emerge from iPres 09, which has just opened registration and will shortly be feeding back the results of its selection process to authors. Still time to submit to this year's Digital Curation Conference, guys (submissions close August 7, 2009).

We have done a couple of interesting analyses on the IJDC. One was a "readership analysis" based on web stats, for the period January-June 2009. Eight out of the ten most downloaded papers in that time were from IJDC 3(2) (the ninth was from 3(1), and the tenth was from IJDC 1). These 10 papers were downloaded an average of just under 440 times each during that period (ranging from 395 to 485 times).

The second was to use Google Scholar to assess citations for the issues up to and including 3(2); issue 4(1) is too recent. I checked the peer-reviewed papers, which GS suggested had been cited 92 times (a maximum of 11 times for that most-downloaded Beagrie article from Issue 1), for an average of 3.3 citations per paper. I also checked the general articles, although I ignored simple reports, editorials and reviews. Counting peer-reviewed papers and checked general articles, there were 142 citations, for an average of 2.7 citations per item.

Only one of those eight most-downloaded papers in issue 3(2) has so far translated those downloads into significant citations: the Cheung paper has 6. But we should give them time, I think; citations per checked item per issue are noticeably lower for more recent items, as you might expect.
  • 4.2 in IJDC 1
  • 3.3 in IJDC 2(1)
  • 4.2 in IJDC 2(2)
  • 2.1 in IJDC 3(1)
  • 1.4 in IJDC 3(2)
By the way, we are particularly proud of one citation of an IJDC paper from a paper in Nature's Big Data issue (Howe et al., The future of biocuration). The citation was of the Palmer et al. paper in IJDC 2(2)... but Google Scholar failed to notice it. So these figures come with a few caveats!

Managing and sharing data

I really like the UK Data Archive publication "Managing and Sharing Data: a best practice guide for researchers". It's been available in print form for a while, although I did hear they had run out and were revising it. Meanwhile, you can get the PDF version from their web site. It is complemented by sections of their web site which reflect, and sometimes expand on, the sections in the Guide. The sections are:
  • Sharing data - why and how?
  • Consent, confidentiality and ethics
  • Copyright
  • Data documentation and metadata
  • Data formats and software
  • Data storage, back-up, and security
This is an excellent resource!

Wednesday 15 July 2009

Digital Curation Conference deadline extended

5th International Digital Curation Conference (IDCC09)
Moving to Multi-Scale Science: Managing Complexity and Diversity.
2 – 4 December 2009, Millennium Gloucester Hotel, London, UK.
**************************************************************************
We are pleased to announce that the Paper Submission date for IDCC09 has been extended by 2 weeks to Friday 7 August 2009: http://www.dcc.ac.uk/events/dcc-2009/call-for-papers/
Remember that submissions should be in the form of a full or short paper, or a one page abstract for a poster, workshop or demonstration.
Presenting at the conference offers you the chance to:-

  • Share good practice, skills and knowledge transfer
  • Influence and inform future digital curation policy & practice
  • Test out curation resources and toolkits
  • Explore collaborative possibilities and partnerships
  • Engage educators and trainers with regard to developing digital curation skills for the future

Speakers at the conference will include:-

  • Timo Hannay – Publishing Director, Nature.com
  • Professor Douglas Kell – Chief Executive of the Biotechnology & Biological Sciences Research Council (BBSRC)
  • Dr. Ed Seidel – Director of the National Science Foundation’s Office of Cyberinfrastructure

All papers accepted for the conference will be published in the International Journal of Digital Curation.

Sent on behalf of the Programme Committee –

co-chaired by Chris Rusbridge, Director of the Digital Curation Centre, Liz Lyon, Director of UKOLN and Clifford Lynch, Executive Director of the Coalition for Networked Information.

Online and offline storage: cost and greenness at Sun PASIG

I was at the Sun Preservation and Archiving SIG meeting in Malta a couple of weeks ago, a very interesting meeting indeed. The agenda and presentations are being mounted here. I’ll try to pick out some points that are worth briefly blogging about, if I can.

Raymond Clarke of Sun and SNIA spoke quite early on about storage (presentation not up yet). You may know that Sun now holds the Internet Archive, apparently in a mobile data centre (basically a standard container in a car park!). The point I was interested in was Raymond’s comments about tape, which he said was the fastest-growing market segment. He said the cost ratio of disk:tape storage was 23:1. And given that storage represents around 40% of the power consumption of data centres, and given our current environmental concerns, it’s striking that the energy ratio for disk:tape is 200:1!

On risk, a separate presentation by Moreira (pdf) also spoke about the green and cost advantages of tape (although he only identified a 3:1 advantage), but I note two tweets from the time:

dkeats: Tapes degrade, data loss happens, there is risk, need to measure quality of tapes in real time, and manage risk. #pasig

cardcc: #pasig Moreira shows green & cost advantages of tape, but concerns on fragility & lifetime 5-10-30 years, but maybe only days if mistreated

Now there was an element of standard vendor pitch towards hierarchical storage management systems. But if you have very large volumes, sufficient value, but a relatively low re-use rate, then despite some of the significant disadvantages of tape, those numbers have to drive you towards a tape solution for preservation and archiving! It certainly will significantly increase the up-front investment cost, the data centre management requirements, and possibly some risk factors, but given sufficient volumes you should make savings overall. And this could apply to quite a bit of research data...

Tuesday 14 July 2009

To render or to compute?

From time to time I hear fruitless discussions about rendering documents versus computing data (maybe this post is another one). Person X says something like “in order to be able to preserve these documents, we need to preserve the means to render them.” But person Y says “ah, but for data rendering is not enough. I don’t want to see the data values, I need to be able to feed them into complex computations. That’s quite different.”

I want to yell and shout (I don’t, but I guess my blood pressure goes up). There may be some kind of essential difference here that these discussions are failing to capture, but to me it’s angels dancing on a pin. Rendering a modern web page involves computations unimaginable barely 20 years ago (other than on Evans and Sutherland high-end graphics systems). You take a substantial amount of data from perhaps dozens of different sources, in different kinds of loosely standardised formats, and you project it onto a user-defined space, dependent on a wide variety of different settings; this is tough! Rendering a web page can also change the state of complex systems (hey, it can move hundreds of pounds from my bank account into an airline’s). Rendering a web page is serious computation.

A significant part of rendering (one might think, THE significant feature of rendering, with the exception of the state-changing role above) is to present information for the human consumer. That’s also a key intent of computation on data. I’ll assert that with few exceptions, ALL computation is ultimately intended to render results for human consumption. You may spend a gazillion petaflops sifting through data and silently chucking out the uninteresting stuff. But sooner or later your computation finds some prima facie evidence of a possible Higgs Boson, and sure as eggs is eggs you want it to tell you. I reckon you’ve just rendered the result of part of your LHC experiment!

So perhaps rendering involves (potentially) multiple inputs with some state change and significant human-readable output. And computation involves (potentially) multiple inputs and outputs and state changes, with eventually some human-readable output. No, I give up. Anyone got a better idea?

Grump.

PS, apologies for the break in postings. I've been rather busy with a re-funding bid. I hope I can do better now! I'm way behind my targets for the period. :-(