Digital Curation Blog: August 2009

Tuesday, 18 August 2009

Off-site disaster recovery and security

I took an interesting help desk call today, from an IT person supporting a research centre, asking if the DCC offered off-site disaster recovery type services (perhaps this was really business continuity support services, but the answer's the same). The short answer is that we don't, but the longer answer was an interesting discussion about the centre's needs. The most interesting thing to me was the strong message that local University systems were not geared up to handle specific services of this type. I've commented before about the difficulties in getting reasonable backup systems in place (see My Backup Rant), and this is one step further. I think this person will be very capable of putting a good local system in place, but given the long-term value and sensitive nature of the data, he needs a good quality service to provide that bit extra. It's looking as if he might have to go to the private sector to achieve this, although there may be services linked to HE, such as AIMES Grid Services (linked to Liverpool University, I believe). Perhaps the Atlas Petabyte Store at STFC is an in-sector service that might do the trick.

There was a question on whether (in the future) it might be sensible for such a project to attempt to get certified to ISO 27000. That's a big task; my guess is it might be too big a hurdle for a research centre to jump. However, I believe that taking an approach linked to ISO 27000, without attempting to go as far as certification, can be extremely valuable. In particular, the ISO 27000 approach involves doing a security risk analysis, building what's referred to as an Information Security Management System (ISMS) to deal appropriately with the risks, and reviewing that ISMS in a continuous improvement cycle known as Plan-Do-Check-Act (PDCA). In my experience, approaching this systematically can reveal that previously unconsidered risks are more significant than the obvious headline risks.

How important is "off-site"? I do think it is important that off-site means not in the same building. But it doesn't necessarily have to mean not in the same city. Here in Edinburgh, I would quite happily regard the main IT services at Kings Buildings, a couple of miles up the road, as off-site. In deciding, you do have to think about the threats; in central London, you might need to think about whether certain types of threats could make larger areas of the city inaccessible, in a way that would probably be less likely in Edinburgh. But if you're locked out of your offices and your local IT services have been destroyed, just having an off-site backup may be comforting but is not going to get you up and running in the short term.

I thought there might be some current JISC information relevant to this, but on a quick scan I wasn't able to identify anything.

Note this is a much simpler service than data preservation or even data curation; much more of a standard commercial offering.

But it is good to see research centres taking these issues seriously.

Friday, 14 August 2009

DCC web site and Linked Data

We at the DCC are in the early stages of refreshing our web site (www.dcc.ac.uk). Nothing you can see yet, but we're talking to a few consultants about what and how we can do better. The ones we have spoken to so far seem pretty clued up on content management systems, and even on web 2.0 approaches. But questions about the role of the Semantic Web or Linked Data get blank looks.

Now our web site is not and will probably never be a major source of data as facts; rather it should contain resources: often documents, sometimes tools, sometimes sharing opportunities. There definitely are facts of various kinds there (which may not be sufficiently explicit yet), such as staff contact details, document metadata, event locations and times, etc. But these are a comparatively small part of the content.

Does this (or anything else) justify investment in building a web site that is based on Linked Data/Semantic Web? What advantages could we get in doing so? What advantages could our users get if we did so?

I would really like to get some views on this!

Monday, 10 August 2009

Forgetting to remember

After a Sunday Times article prompted yesterday's piece of whimsy, a Tweet from my standard Twitter search ( (digital OR data) AND (preservation OR curation), since you ask) produced an interesting article by Chris O'Brien, a columnist for MercuryNews.com: "Time to clean up your digital closet". He goes quite nicely through the various ways in which our personal digital content is more at risk than we might think (media degradation, device and format obsolescence, and the sheer anonymity of large quantities of digital stuff). But he has a prescription for dealing with some of it, part of which I reproduce here (I hope fair use covers this, since you'll have to go to the original to read the rest!):

"However, all is not lost. There are some strategies for storing your digital archives. But you'll have to do a lot of work. You will need to start thinking like a librarian and become an active curator of your files. That means relentlessly organizing, labeling and tagging, backing up and deleting.

The first and most important thing to do is to begin deleting files. Whittle things down to the essentials. What do you really want to maintain and pass along? You must be ruthless and vigilant.

Next, develop a system for organizing files online and offline. If you're going to store stuff on removable media, like DVDs, place them in cases that have extensive labels, and index them. And don't store files like text documents or photos on propriety formats that are not widely adopted. Experts recommend photos in JPG forms and documents in PDF formats or basic text formats.

Label every file and tag them with as much information as you can. Being obsessive now will pay off in the long run. This is a lot of work, which is why you want to cull your archives as much as possible.

Once that's done, make multiple copies. You can also explore "cloud" backup services..."

Thinking like a librarian? Being an active curator of your files? Sounds like a good place to start. Interesting that he sees deleting as being an important part of remembering! We probably need better tools for the average person for a lot of this (eg tagging files in a filestore), but I suspect there's enough around for any reasonably competent researcher to use. However, laziness, forgetfulness and sheer pressure of work are our enemies here. Will we forget to do the things needed to remember?
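To make the "label, tag and copy" advice a little more concrete, here is a minimal sketch of the kind of small tool I have in mind. It is only an illustration under my own assumptions, not anything O'Brien describes: the function names, the ".meta.json" sidecar convention and the choice of SHA-256 are all hypothetical. It writes a small metadata file next to each file you want to keep, and uses a checksum to confirm that a backup copy still matches the original.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(path: Path, tags: list[str], description: str) -> Path:
    """Write '<filename>.meta.json' beside the file (a hypothetical convention).

    The sidecar records a description, de-duplicated sorted tags,
    and a checksum of the file's current content.
    """
    sidecar = path.with_name(path.name + ".meta.json")
    record = {
        "file": path.name,
        "description": description,
        "tags": sorted(set(tags)),
        "sha256": sha256_of(path),
    }
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar


def copy_matches(original: Path, copy: Path) -> bool:
    """Return True if a backup copy has the same content as the original."""
    return sha256_of(original) == sha256_of(copy)
```

Something like write_sidecar(Path("holiday.jpg"), ["family", "2009"], "Beach photo") would do the labelling, and copy_matches could be run from time to time against each of the multiple copies. Plain JSON sidecars are a deliberate choice here: they stay readable even if the tool that wrote them has long since been forgotten.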

Sunday, 9 August 2009

Remembering to forget

Are we getting this digital preservation thing all wrong? An article in today's Sunday Times quotes Viktor Mayer-Schonberger that we're creating a "digital memory that vastly exceeds the capacity of our collective human mind". James Harkin reports that Viktor wants us to forget more. Mind you, the main (and earliest) example given concerns an unfortunate photo from as recently as 2006. I'm sure the lady concerned would have remembered it only too well, whether or not the Internet example had come to light.

It's a daft story, but there is an interesting angle. Many preservation "systems" carry the risk that things will be preserved that some would prefer forgotten (eg the famous Bush speech). When the powerful want to change the record, the Web both facilitates and resists them. Web sites (and archives) are generally under some kind of centralised control, and subject to pressure, which they may or may not be able to resist. There are rumours of web-based reports being retrospectively "fixed". But once reports have got out into the wild, as it were, it is much harder to "fix" them, as the example above shows.

This doesn't mean that archives are a bad thing. They bring professional standards to the keeping of history. But perhaps it's a Good Thing that there's an uncontrollable un-system of citizens keeping (probably illegal) copies of some important and uncomfortable records. Even if it does mean that the lady's embarrassing photo stays around longer than she would like!

Thursday, 6 August 2009

JISC Data Management Infrastructure call

I gather from a tweet from Simon Hodson (Programme Manager at JISC for the Data Management Infrastructure call, 07/09) that 34 bids were received for this call. With maybe 8 able to be funded within the funding envelope, this looks like a very strong chance of getting an extremely interesting programme. I guess quite a few of us will be getting a batch of proposals to review in the next few days; despite a small feeling of dread at the workload (mostly evenings), I usually find I quite enjoy this process! There's so much to learn, even if (theoretically) we are supposed to forget it immediately!

Wednesday, 5 August 2009

Curated databases and data curation

I've been working on an article with a colleague, and came across something that's interesting, and that you might be able to help us with. There does appear to be a distinction between the way curation is used in the bio-sciences, and elsewhere. In particular, the term "curated database" tends to mean a manually constructed database that links literature to data, curated by experts who provide authority (eg see the Wikipedia definition of Biocurator). The earliest mention of the term "curated database" I can find is in the abstract (and only in the abstract) of Larsen et al (1993).

However, the terms "digital curation" and "data curation" tend to mean something different. We in the DCC say "Digital curation is maintaining and adding value to a trusted body of digital information for current and future use; specifically, we mean the active management and appraisal of data over the life-cycle of scholarly and scientific materials". This has a lot more elements of good management about it, and less of the idea of specific curators making judgements based on the literature. It has also been rather conflated with digital preservation.

The earliest reference to digital curation I can find is a report of an invitational meeting held in October 2001, oddly titled "The Digital Curation: digital archives, libraries and e-science seminar" (Beagrie and Pothen, 2001). In the meeting there was some discussion about data curation. The earliest more formal reference to data curation I can find is a technical report from Gray, Szalay et al (2002).

So my challenge is this: are there earlier references to digital (or data) curation, of the second kind?

Beagrie, N., & Pothen, P. (2001). The Digital Curation: digital archives, libraries and e-science seminar. Ariadne. http://www.ariadne.ac.uk/issue30/digital-curation/.

Gray, J., Szalay, A. S., Thakar, A. R., Stoughton, C., & vandenBerg, J. (2002). Online Scientific Data Curation, Publication, and Archiving. Redmond: Microsoft Research. http://arxiv.org/abs/cs.DL/0208012.

Larsen, N., Olsen, G. J., Maidak, B. L., McCaughey, M. J., Overbeek, R., Macke, T. J., et al. (1993). The ribosomal database project. Nucl. Acids Res., 21(13), 3021-3023. http://nar.oxfordjournals.org/cgi/content/abstract/21/13/3021
