Digital Curation Blog: Persistent IDs

Thursday, 4 February 2010

Persistent identifiers workshop comes round again

It seems to be the one event that people think is important enough to go to, even though they fear in their hearts that, yet again, not a lot of progress will be made. Most of those at yesterday’s JISC-funded Persistent Identifiers workshop had been to several such meetings before. For my part, I learned quite a lot, but the slightly flat outcome was not all that unexpected. It’s not quite Groundhog Day, as things do move forward slightly from one meeting to the next.

Part of the trouble is in the name. There is this tendency to think that persistent identifiers can be made persistent by some kind of technical solution. To my mind this is a childish belief in the power of magic, and a total abdication of responsibility; the real issues with “persistent” identifiers are policy and social issues. Basically, far too many people just don’t get some simple truths. If you have a resource which has been given some kind of identifier that resolves to its address (so people can use it), and you change that address without telling those who manage the identifier/resolution, then the identifier will be broken. End of, as they say!

This applies whether you have an externally managed identifier (DOI, Handle, PURL) or an internally managed identifier (eg a well-designed HTTP URI… Paul Walk threatened to throw a biscuit at the first person to mention “Cool URLs”, but had to throw it at himself!).

Now clearly some identifiers have traction in some areas. Thanks to the efforts of CrossRef and its member publishers, the DOI is extremely useful in the scholarly journal literature world. You really wouldn’t want to invent a new identifier for journal articles now, and if you have a journal that doesn’t use DOIs (ahem!), you would be well-advised to sign up. It looks very affordable for a small publisher: $275 per year plus $1 per article.

Even for such a well-established identifier, with well-defined policies and a strong set of social obligations, things do go wrong. I give you Exhibit A, for example, in which Bryan Lawrence discovers that dereferencing a DOI for a 2001 article on his publications list leads to "Content not found" (apologies for the “acerbic” nature of my comment there). It looks like this was due to a failure of two publishers to handle a journal transfer properly; the new publisher made up a new DOI for the article, and abandoned the old one. Aaaaarrrrrrggggghhhhhhh! Moving a resource and giving it a new DOI is a failure of policy and social underpinning (let alone competence) that no persistent identifier scheme can survive! CrossRef does its best to prevent such fiascos occurring, but see social issues above. People fail to understand how important this is, or simple things like: the DOI prefix is not part of your brand!

Whether a DOI is the right identifier to use for research data seems to me a much more open question. The issue here is whether the very different nature of (at least some kinds of) research data would make the DOI less useful. The DataCite group is committed to improving the citability of research data (which I applaud), but also seems to be committed to use of the DOI, which is a little more worrying. While the DOI is clearly useful for a set of relatively small, unchanging digital objects published in relatively small numbers each year (eg articles published in the scholarly literature), is it so useful for a resource type which varies by many orders of magnitude in terms of numbers of objects, rate of production, size of object, granularity of identified subset, and rate of change? In particular, the issue of how a DOI should relate to an object that is constantly changing (as so many research datasets do) appears relatively un-examined.

There was some discussion, interesting to me at least, on the relationships of DOIs to the Linked Data world. If you remember, in that world things are identified by URIs, preferably HTTP URIs. We were told (via the Twitter backchannel, about which I might say more later) that DOIs are not URIs, and that the dx.doi.org version is not a DOI (nor presumably is the INFO URI version). This may be fact, but seems to me rather a problem, as it means that "real DOIs" don't work as first-class citizens of a Linked Data world. If the International DOI Foundation were to declare that the HTTP version was equivalent to a DOI, and could be used wherever a DOI could be used, then the usefulness of the DOI as an identifier in a Linked Data world might be greatly increased.
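To make the distinction between the three forms concrete, here is a minimal sketch that derives the dx.doi.org and INFO URI versions from a bare DOI string. The function name `doi_forms`, the sample DOI `10.1234/example.5678` and the choice to percent-encode everything except "/" are my own assumptions for illustration, not anything the DOI Foundation has specified.

```python
from urllib.parse import quote


def doi_forms(doi):
    """Return the bare DOI plus two URI forms built from it.

    Assumption: percent-encoding everything except "/" is good enough
    for this illustration; real DOIs may need more careful handling.
    """
    encoded = quote(doi, safe="/")
    return {
        "doi": doi,                               # the "real DOI", not a URI
        "http": "http://dx.doi.org/" + encoded,   # HTTP URI, usable as Linked Data
        "info": "info:doi/" + encoded,            # INFO URI version
    }


# Hypothetical DOI, made up for this example.
forms = doi_forms("10.1234/example.5678")
print(forms["http"])
print(forms["info"])
```

Only the second form can be dereferenced by an ordinary web client, which is why its status matters so much for Linked Data use.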

A question that’s been bothering me for a while is when an “arm’s-length” scheme, like PURL, Handle, DOI etc is preferable to a well-managed local HTTP identifier. We know that such well-managed HTTP identifiers can be extremely persistent; as far as I know all of the eLib programme URIs established by UKOLN in 1995 still work, even though UKOLN web infrastructure has completely changed (and I suspect that those identifiers have outlasted the oldest extant DOI, which must have happened after 1998). Such a local identifier remains under your control, free of external costs, and can participate fully in the Linked Data world; these are quite significant advantages. It seems to me that the main advantage of the set of “arm’s-length” identifiers is that they are independent of the domain, so they can be managed even if the original domain is lost; at that point, an HTTP URI redirect table could not be set up. So I’m afraid I joked on Twitter that perhaps “use of a DOI was a public statement of lack of confidence in the future of your organisation”. Sadly I missed waving the irony flag on this, so it caused a certain amount of Twitter outrage that was unintentional!

In fact the Twitter backchannel was extremely interesting. Around a third or so of the twits were not actually at the meeting, which of course was not apparent to all. And it is in the nature of a backchannel to be responding to a heard discourse, not apparent to the absent twits; in other words, the tweets represent a flawed and extremely partial view of the meeting. Some of those who were not present (who included people in the DOI world, the IETF and big publishers) seemed to get quite the wrong end of the stick about what was being said. On the other hand, some external contributions were extremely useful and added value for the meat-space participants!

I will end with one more Twitter contribution. We had been talking a bit about the publishing world, and someone asked how persistent are academic publishers. The tweet came back from somewhere: “well, their salespeople are always ringing us up ;-) !”

Wednesday, 13 January 2010

Persistence of domain names

I had a chat before Christmas with Henry Thompson, who works both in Edinburgh Informatics and also on the W3C TAG. Insofar as the Internet is important in sustaining long term access to information in digital form, there is a sustainability problem that we rather seem to have ignored. Everything on the Internet (literally) depends on domain names, and these are only ever rented. There is no mechanism for permanently reserving a domain name. Domain names can be lost by mistake (overlooking a bill, perhaps having moved in the interim and not informed the relevant domain name registrar), but they can also be lost on business failure. Although domain names can be a business asset, I understand that the registrars have some discretion on transfers, and in particular one cannot make a "domain name will" seeking transfer of the domain name to some benevolent organisation. Note, the mechanism for renting domain names has sustainability advantages, providing sustainability to important services that underpin the DNS.

There are two kinds of problem, one on a massive scale and one more fine-grained. The massive problem is that the entire infrastructure of the Internet depends on URIs, most of which are http URIs that in turn depend on the domain name system. So there are a number of organisations whose domain names are embedded in that infrastructure in a way and to an extent that is very difficult to change. W3C is clearly such an organisation. Many of these organisations seem rather fragile (not a comment on W3C, by the way, although its sustainability model is opaque to me). Should they fail and the domain names disappear, the relevant URIs will cease to work and various pieces of Internet machinery will fall apart.

(By the way, this does seem to be one case where a persistent ID that is independent of the original domain, such as a DOI, has advantages over an HTTP URI plus a redirect table. If the domain name no longer exists, you can't get to a redirect, whereas someone can still relink the DOI to a new location.)
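The difference can be shown with a toy model. This is only a sketch under stated assumptions: the domains, paths and the identifier `10.1234/report` are hypothetical, and `follow_http` and `follow_external` are names invented here, standing in for a browser following redirects and for an arm's-length resolver respectively.

```python
from urllib.parse import urlsplit


def follow_http(url, live_domains, redirect_tables):
    """Follow a URL through per-domain redirect tables.

    Returns the final URL, or None if any domain on the way has lapsed.
    """
    for _ in range(10):  # guard against redirect loops
        parts = urlsplit(url)
        if parts.netloc not in live_domains:
            # The name no longer resolves, so its redirect table is unreachable.
            return None
        target = redirect_tables.get(parts.netloc, {}).get(parts.path)
        if target is None:
            return url  # no redirect entry: this is the final address
        url = target
    return None


def follow_external(identifier, resolver, live_domains, redirect_tables):
    """Resolve via a table that lives outside the original domain."""
    url = resolver.get(identifier)
    if url is None:
        return None
    return follow_http(url, live_domains, redirect_tables)


# Hypothetical situation: old.example.org has lapsed, new.example.net is alive.
live = {"new.example.net"}
tables = {"old.example.org": {"/report": "http://new.example.net/report"}}
resolver = {"10.1234/report": "http://old.example.org/report"}

print(follow_http("http://old.example.org/report", live, tables))   # redirect table unreachable
resolver["10.1234/report"] = "http://new.example.net/report"        # someone relinks the identifier
print(follow_external("10.1234/report", resolver, live, tables))    # resolves again
```

Note that the external identifier is also broken until somebody relinks it; the arm's-length scheme only makes the repair possible, it does not make it happen.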

On the more fine-grained scale, many documents (particularly in HTML) are not easily separable from their location, depending on other local files and documents. In addition of course, documents in some sense exist through their citations or bookmarks, that begin to exist separately from the document. Moving a document to a new domain can make it "fail" or disappear. So sustainability is linked to the domain as well as the other preservation factors.

This seems to me to be not at all a technical problem, but it seems to have legal/regulatory, governance, social, business and economic aspects.

Among the solutions might be creating a new top level domain designed for persistence, with different rules of succession, etc. Another (either instead of or in conjunction with the first) might be creating an organisation designed for persistence, to hold endowed domain names. Somehow the ongoing revenue stream for those underpinning services must be retained indefinitely into the future.

We don't think we have the answers, but we do think there is a problem here; I'm not yet sure if we have articulated it accurately at all. I would appreciate any comments. Thanks,

Monday, 21 April 2008

RLUK launched... but relaunch flawed?

Neil Beagrie reminds us that after 25 years, the Consortium of University (and?) Research Libraries (CURL) has relaunched itself as RLUK:

"On Friday 18th April the Consortium of Research Libraries (CURL) celebrated its 25th anniversary and launched it new organisational title: Research Libraries UK (RLUK). A warm welcome to RLUK and best wishes for the next 25 years!"

Congratulations to them... well, maybe. I had a quick look for some key documents; here's a URL I forwarded to my colleagues a year or so ago: http://www.curl.ac.uk/about/E-ResearchNeedsAnalysisRevised.pdf. Or, more recently, try something on their important HEFCE UKRDS Shared Services Study: http://www.curl.ac.uk/Presentations/Manchester%20November%2007/SykesHEFCEStudy2.pdf. Both give me a big fat "Page not found". In the latter case, when I find their tiny search box, and search for UKRDS, I get "Your search yielded no results".

I am really, desperately sad about this. Remember all the fuss about URNs? Remember all we used to say about persistent IDs? Remember "Cool URIs don't change"? The message is, persistent URIs require commitment; they require care. They don't require a huge amount of effort (it's simply a redirection table, after all). But libraries should be in the forefront of making this work. I have emailed RLUK, without response so far. Come on guys, this is IMPORTANT!
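For anyone who doubts how little is involved, a redirection table is roughly this. It is a minimal sketch, not anyone's actual configuration: the paths and target URLs are hypothetical placeholders (I don't know where the documents above have gone), and `respond` is a name invented here for whatever the web server does with an incoming request.

```python
from urllib.parse import urlsplit

# Hypothetical table: old path on the retired site -> new permanent home.
REDIRECTS = {
    "/about/old-report.pdf": "http://www.example.org/publications/old-report.pdf",
    "/presentations/old-talk.pdf": "http://www.example.org/events/old-talk.pdf",
}


def respond(requested_url, redirects=REDIRECTS):
    """Return (status, location) for a request arriving at the old address."""
    path = urlsplit(requested_url).path
    target = redirects.get(path)
    if target is not None:
        return 301, target   # "Moved Permanently": old links and bookmarks keep working
    return 404, None         # the "Page not found" that visitors get without the table


print(respond("http://old.example.org/about/old-report.pdf"))
print(respond("http://old.example.org/never-existed.pdf"))
```

The hard part is not the code but the commitment: someone has to keep the old domain alive and add a row every time something moves.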

Oh and just in case you think this is isolated, try looking for that really important, seminal archiving report, referenced everywhere at http://www.rlg.org/ArchTF/. I had something to do with the RLG merger into OCLC that caused that particular snafu, and after making my feelings known have been told that "We're taking steps to address not only DigiNews links but those of other pages that are still referred to from other sites and personal bookmarks". The sad thing about that particular report, you might discover, is that it doesn't appear to be archived on the Wayback Machine either, I suspect because it had an FTP URL.

[UPDATE The report does now appear on the OCLC website at http://www.oclc.org/programs/ourwork/past/digpresstudy/default.htm. When I first searched for Waters Garrett from the OCLC home page a few weeks ago, I couldn't find it. I guess they haven't quite got round to building the redirection table yet... but that can take time.]

Grump!
