Friday 31 August 2007

Archiving Service

The other day I had an interesting discussion with a group of IT facilities management guys here at Edinburgh, who have been asked to write the requirements for a new version of their Archive Service, capable of handling modern requirements for huge amounts of data. I hope to remain involved in what they are planning (which, I would guess, is a long way from being funded). They had some good ideas, with thinking perhaps derived in part from software configuration management systems: checking data in and out, managing versions, and so on.

We spoke a bit about OAIS, and about the issues of making data understandable for the long term. They were somewhat reluctant to go there, not surprisingly; the general IT attitude to "archiving" has tended to be something like: "You give me your blob of data, and I'll give you back an identical blob of data at some future time". IT people are pretty comfortable with that approach (although I also mentioned some emerging issues, such as those raised by David Rosenthal in his blog post about keeping very large data for very long times).

We also discussed the need to make the data accessible from outside the University, to allow it to be linked from publications. This too was a bit outside their previous remit.

It struck me afterwards how much the conversation was shaped by the IT services point of view (I'm not trying to denigrate these individuals at all; as far as I could see they took it all on board). I have talked with a lot of people about "digital preservation" or "repositories"; in most of those conversations, managing the bits themselves is assumed to be pretty much a solved problem, and the conversation takes different forms: about metadata, about formats, about representation information, about emulation versus migration, and so on. The participants tend to come from the library, or from some subject area, but rarely from IT.

I wondered whether it would be possible to imagine the IT guys providing a secure substrate on which a repository service sits, with independence between the two; that would allow everyone to stay in their comfort zone. I don't think OAIS would quite work in those terms, and I suspect there would be dependencies both ways, although I'll have to think more about this. The only counter-example I have heard of is a rumour of Fedora providing high-level services on top of SRB as a secure storage substrate, although I don't have a reference.
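To make the split concrete, here is a minimal sketch (in Python, purely illustrative; the class and method names are my own invention, not any real repository or SRB API) of the narrow contract such a substrate might expose. The IT layer guarantees only bit-level fidelity, verified by checksums, while metadata, versions and representation information live in the layer above.

```python
import hashlib
import uuid


class BitStore:
    """Illustrative storage substrate: its only promise is that the
    bytes you get back are the bytes you deposited (the 'identical
    blob' contract), verified by a fixity check."""

    def __init__(self):
        self._blobs = {}   # blob id -> bytes (stand-in for real storage)
        self._fixity = {}  # blob id -> SHA-256 recorded at deposit

    def deposit(self, blob: bytes) -> str:
        blob_id = str(uuid.uuid4())
        self._blobs[blob_id] = blob
        self._fixity[blob_id] = hashlib.sha256(blob).hexdigest()
        return blob_id

    def retrieve(self, blob_id: str) -> bytes:
        blob = self._blobs[blob_id]
        if hashlib.sha256(blob).hexdigest() != self._fixity[blob_id]:
            raise IOError("fixity check failed for %s" % blob_id)
        return blob


class Repository:
    """Illustrative repository layer: adds item identity, versioning
    and descriptive metadata on top of the substrate, which knows
    nothing about any of them."""

    def __init__(self, store: BitStore):
        self.store = store
        self.catalogue = {}  # item id -> list of (version, blob id, metadata)

    def ingest(self, item_id: str, content: bytes, metadata: dict):
        blob_id = self.store.deposit(content)
        versions = self.catalogue.setdefault(item_id, [])
        versions.append((len(versions) + 1, blob_id, metadata))
```

The point of the sketch is only the direction of the dependency: the repository depends on the substrate, never the reverse. Whether the OAIS functional model can really be cut that cleanly is exactly what I am unsure about.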

It was also interesting how few examples we could think of, in universities, of large-scale, long-term archiving systems for records and data, leaving aside the publication-oriented repository movement.

Can any readers give us pointers?

Wednesday 29 August 2007

IJDC again

At the end of July I reported on the second issue of the International Journal of Digital Curation (IJDC), and asked some questions:
"We are aware, by the way, that there is a slight problem with our journal in a presentational sense. Take the article by Graham Pryor, for instance: it contains various representations of survey results presented as bar charts, etc in a PDF file (and we know what some people think about PDF and hamburgers). Unfortunately, the data underlying these charts are not accessible!

"For various reasons, the platform we are using is an early version of the OJS system from the Public Knowledge Project. It's pretty clunky and limiting, and does tend to restrict what we can do. Now that release 2 is out of the way, we will be experimenting with later versions, with an aim to including supplementary data (attached? External?) or embedded data (RDFa? Microformats?) in the future. Our aim is to practice what we may preach, but we aren't there yet."

I didn't get any responses, but over on the eFoundations blog, Andy Powell was taking us to task for only offering PDF:
"Odd though, for a journal that is only ever (as far as I know) intended to be published online, to offer the articles using PDF rather than HTML. Doing so prevents any use of lightweight 'semantic' markup within the articles, such as microformats, and tends to make re-use of the content less easy."
His blog is more widely read than this one, and he attracted 11 comments! The gist of them was that PDF plus HTML (or preferably XML) was the minimum that we should be offering. For example, Chris Leonard [update, not Tom Wilson! See end comments] wrote:
"People like to read printed-out pdfs (over 90% of accesses to the fulltext are of the pdf version) - but machines like to read marked-up text. We also make the xml versions availble for precisely this purpose."
Cornelius Puschmann [update, not Peter Sefton] wrote:
"Yeah, but if you really want semantic markup why not do it right and use XML? The problematic thing with OJS (at least to some extent) is/was that XML article versions are not the basis for the "derived" PDF and HTML, which deal almost purely with visuals. XML is true semantic markup and therefore the best way to store articles in the long term (who knows what formats we'll have 20 years from now?). HTML can clearly never fill that role - it's not its job either. From what I've heard OJS will implement XML (and through it neat things such as OpenOffice editing of articles while they're in the workflow) via Lemon8 in the future."
Bruce D'Arcus [update, not Jeff] wrote:
"As an academic, I prefer the XHTML + PDF option myself. There are times I just want to quickly view an article in a browser without the hassle of PDF. There are other times I want to print it and read it "on the train."

"With new developments like microformats and RDFa, I'd really like to see a time soon where I can even copy-and-paste content from HTML articles into my manuscripts and have the citation metadata travel with it."
Jeff [update, not Cornelius Puschmann] wrote:
"I was just checking through some OJS-based journals and noticed that several of them are only in PDF. Hmmm, but a few are in HTML and PDF. It has been a couple of years since I've examined OJS but it seems that OJS provides the tools to generate both HTML and PDF, no? Ironically, I was going to do a quick check of the OJS documentation but found that it's mostly only in PDF!

"I suspect if a journal decides not to provide HTML then it has some perceived limitations with HTML. Often, for scholarly journals, that revolves around the lack of pagination. I noticed one OJS-based journal using paragraph numbering but some editors just don't like that and insist on page numbers for citations. Hence, I would be that's why they chose PDF only."
I think in this case we used only PDF because that was all our (old) version of the OJS platform allowed; I certainly wanted HTML as well. As I said before, we're looking into that, and hope to move to a newer version of the platform soon. I'm not sure it has been an issue for us, but I believe HTML can be tricky for some kinds of articles (mathematics used to be a real difficulty, though perhaps MathML support has improved matters by now).

I think my preference is for XHTML plus PDF, with the authoritative source article in XML. I guess the workflow should be author-source -> XML -> XHTML plus PDF, where the author source is most likely to be MS Word or LaTeX... Perhaps the XML should be in the NLM DTD? That seems to be the one people are converging on, and it's the one adopted by a couple of long-term archiving platforms.
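Purely as a sketch of that shape (the converter functions below are hypothetical placeholders, not real tools or any actual OJS facility), the workflow might look like this, with the XML as the single authoritative version and both delivery formats derived from it:

```python
def to_nlm_xml(source_path: str) -> str:
    """Hypothetical converter from the authored document (MS Word or
    LaTeX) to an article marked up in the NLM DTD. In practice this
    is the hard step, done by some conversion toolchain."""
    raise NotImplementedError

def nlm_to_xhtml(article_xml: str) -> str:
    """Render the authoritative XML as XHTML for on-screen reading,
    for example via an XSLT stylesheet."""
    raise NotImplementedError

def nlm_to_pdf(article_xml: str) -> bytes:
    """Render the same XML as paginated PDF for printing, for
    example via XSL-FO."""
    raise NotImplementedError

def publish(source_path: str):
    # Derive both presentation formats from one authoritative XML
    # source; neither the XHTML nor the PDF is ever edited directly.
    article_xml = to_nlm_xml(source_path)
    return nlm_to_xhtml(article_xml), nlm_to_pdf(article_xml)
```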

But I'm STILL looking for more concrete ideas on how we should co-present data with our articles!
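To give one possible shape to the RDFa idea floated above (everything here is invented for illustration: the vocabulary prefix, the property names and the data values), the numbers behind a bar chart such as those in Graham Pryor's article could travel inside the XHTML version as a marked-up table, readable by eye and scrapable by machine:

```python
# Illustrative only: emit the data behind a chart as an XHTML table
# with RDFa-style attributes. "ex:" is an invented vocabulary, and
# the survey numbers are made up.
survey_results = {"Yes": 42, "No": 17, "Don't know": 8}

rows = "\n".join(
    '  <tr><td property="ex:answer">%s</td>'
    '<td property="ex:count">%d</td></tr>' % (answer, count)
    for answer, count in survey_results.items()
)
print('<table typeof="ex:SurveyResult">\n%s\n</table>' % rows)
```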

[Update: Peter Sefton pointed out to me in a comment that I had wrongly attributed a quote to him (and, by extension, had misattributed every quote), the names being below rather than above the comments in Andy's article. My apologies for such a basic error, which also explains why I had such difficulty finding the blog that Peter's actual comment mentions; I was looking in someone else's blog! I have corrected the names above.

In fact Peter's blog entry is very interesting; he mentions the ICE-RS project, which aims to provide a workflow that will generate both PDF and HTML, and also bemoans how inhospitable most repository software is to HTML. He writes:
"It would help for the Open Access community and repository software publishers to help drive the adoption of HTML by making OA repositories first-class web citizens. Why isn't it easy to put HTML into Eprints, DSpace, VITAL and Fez?

"To do our bit, we're planning to integrate ICE with Eprints, DSpace and Fedora later this year building on the outcomes from the SWORD project – when that's done I'll update my papers in the USQ repository, over the Atom Publishing Protocol interface that SWORD is developing."
So thanks again Peter for bringing this basic error to my attention, apologies to you and others I originally mis-quoted, and I look forward to the results of your efforts! End Update]

OAIS review: what's happening?

In June 2006, there was an announcement:
In compliance with ISO and CCSDS procedures, a standard must be reviewed every five years and a determination made to reaffirm, modify, or withdraw the existing standard. The “Reference Model for an Open Archival Information System (OAIS)” standard was approved as CCSDS 650.0-B-1 in January 2002 and was approved as ISO standard 14721 in 2003. While the standard can be reaffirmed given its wide usage, it may also be appropriate to begin a revision process. Our view is that any revision must remain backward compatible with regard to major terminology and concepts. Further, we do not plan to expand the general level of detail. A particular interest is to reduce ambiguities and to fill in any missing or weak concepts. To this end, a comment period has been established.
Comments were required by 30 October 2006. The Digital Preservation Coalition and the Digital Curation Centre ran a joint workshop on 13 October in Edinburgh, and as a result submitted joint comments. Some of these comments were minor but some were quite significant. There are 14 general recommendations, and many detailed updates and clarifications were suggested.

Supporters of OAIS (and I am one) often make great play of how open its process is. In that spirit, I have tried to find out what is happening, and to take part. I understood there was to be an open process, with a wiki and telecons, first to decide whether the standard needs revision, and if so to revise it. Despite numerous attempts, I cannot find out what the current state is, or where or how this is taking place. Does anyone know?

This is a very important standard, and it needs revision to make it more useful in today's environment. We need to make it as good as we can.

Wednesday 22 August 2007

Proportion of research output as data or publication?

My colleague Graham Pryor asks in an email:
"Chris - I have been looking for evidence of the proportion of UK research output that can be categorised as scholarly publications and that which is in the form of data. I have found nothing. It is quite possible that no-one has ever tried to work this out. However, on the off-chance, is this a figure (even an estimate) that you might have come across?"
I think this is a great question, even an "Emperor's New Clothes" question. "Everyone knows" that the data proportion has been increasing, but I know of no estimates of the proportions.

Does anyone else know of such estimates? If not, does anyone have any idea how to set about making one? A very clear definition of data would be needed, perhaps counting externally usable data rather than raw telemetry...

(Apologies for the gap in posting, I have been sequestered on a Greek island for a couple of weeks, and not thinking of you at all!)