Thursday 4 February 2010

Persistent identifiers workshop comes round again

It seems to be the one event that people think is important enough to go to, even though they fear in their hearts that, yet again, not a lot of progress will be made. Most of those at yesterday’s JISC-funded Persistent Identifiers workshop had been to several such meetings before. For my part, I learned quite a lot, but the slightly flat outcome was not all that unexpected. It’s not quite Groundhog Day, as things do move forward slightly from one meeting to the next.

Part of the trouble is in the name. There is this tendency to think that persistent identifiers can be made persistent by some kind of technical solution. To my mind this is a childish belief in the power of magic, and a total abdication of responsibility; the real issues with “persistent” identifiers are policy and social issues. Basically, far too many people just don’t get some simple truths. If you have a resource which has been given some kind of identifier that resolves to its address (so people can use it), and you change that address without telling those who manage the identifier/resolution, then the identifier will be broken. End of, as they say!

This applies whether you have an externally managed identifier (DOI, Handle, PURL) or an internally managed identifier (eg a well-designed HTTP URI… Paul Walk threatened to throw a biscuit at the first person to mention “Cool URLs”, but had to throw it at himself!).

Now clearly some identifiers have traction in some areas. Thanks to the efforts of CrossRef and its member publishers, the DOI is extremely useful in the scholarly journal literature world. You really wouldn’t want to invent a new identifier for journal articles now, and if you have a journal that doesn’t use DOIs (ahem!), you would be well-advised to sign up. It looks very affordable for a small publisher: $275 per year plus $1 per article.
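To put a rough number on "affordable", here is a minimal sketch of the arithmetic, assuming only the two fees quoted above and nothing else about CrossRef's pricing. The function name and the figure of 40 articles a year are my own illustration, not anything from CrossRef.

```python
def annual_doi_cost(articles_per_year, annual_fee=275.0, per_article_fee=1.0):
    """Estimate the yearly cost in US dollars from the fees quoted in the post.

    The defaults are the figures mentioned above; real pricing may differ.
    """
    if articles_per_year < 0:
        raise ValueError("articles_per_year must not be negative")
    return annual_fee + per_article_fee * articles_per_year


# A hypothetical small journal publishing 40 articles a year.
print(annual_doi_cost(40))  # 315.0
```

On those assumptions the fixed fee dominates for a small journal, so the per-article charge only starts to matter once output runs into the hundreds.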

Even for such a well-established identifier, with well-defined policies and a strong set of social obligations, things do go wrong. I give you Exhibit A, for example, in which Bryan Lawrence discovers that dereferencing a DOI for a 2001 article on his publications list leads to "Content not found" (apologies for the “acerbic” nature of my comment there). It looks like this was due to a failure of two publishers to handle a journal transfer properly; the new publisher made up a new DOI for the article, and abandoned the old one. Aaaaarrrrrrggggghhhhhhh! Moving a resource and giving it a new DOI is a failure of policy and social underpinning (let alone competence) that no persistent identifier scheme can survive! CrossRef does its best to prevent such fiascos occurring, but see social issues above. People fail to understand how important this is, or simple things like: the DOI prefix is not part of your brand!

Whether a DOI is the right identifier to use for research data seems to me a much more open question. The issue here is whether the very different nature of (at least some kinds of) research data would make the DOI less useful. The DataCite group is committed to improving the citability of research data (which I applaud), but also seems to be committed to use of the DOI, which is a little more worrying. While the DOI is clearly useful for a set of relatively small, unchanging digital objects published in relatively small numbers each year (eg articles published in the scholarly literature), is it so useful for a resource type which varies by many orders of magnitude in terms of numbers of objects, rate of production, size of object, granularity of identified subset, and rate of change? In particular, the issue of how a DOI should relate to an object that is constantly changing (as so many research datasets do) appears relatively un-examined.

There was some discussion, interesting to me at least, on the relationships of DOIs to the Linked Data world. If you remember, in that world things are identified by URIs, preferably HTTP URIs. We were told (via the twitter backchannel, about which I might say more later) that DOIs are not URIs, and that the dx.doi.org version is not a DOI (nor presumably is the INFO URI version). This may be fact, but seems to me rather a problem, as it means that "real DOIs" don't work as first-class citizens of a Linked Data world. If the International DOI Foundation were to declare that the HTTP version was equivalent to a DOI, and could be used wherever a DOI could be used, then the usefulness of the DOI as an identifier in a Linked Data world might be greatly increased.
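To make the distinction between the three forms concrete, here is a minimal sketch of how a bare DOI maps onto the dx.doi.org and info URI versions. The regular expression is only a loose approximation of DOI syntax that I have assumed for illustration, the percent-encoding is deliberately simplistic, and the example DOI is hypothetical.

```python
import re
from urllib.parse import quote

# Loose approximation of DOI syntax (an assumption, not an official rule):
# "10.", a numeric registrant code, a slash, then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,}(\.\d+)*/\S+$")


def looks_like_doi(text):
    """Return True if text resembles a bare DOI rather than a URI."""
    return DOI_PATTERN.match(text) is not None


def doi_to_http_uri(doi, resolver="http://dx.doi.org/"):
    """Build the HTTP URI that corresponds to a bare DOI."""
    if not looks_like_doi(doi):
        raise ValueError("not a bare DOI: " + repr(doi))
    return resolver + quote(doi, safe="/")


def doi_to_info_uri(doi):
    """Build the info URI that corresponds to a bare DOI."""
    if not looks_like_doi(doi):
        raise ValueError("not a bare DOI: " + repr(doi))
    return "info:doi/" + quote(doi, safe="/")


example = "10.1234/example.5678"  # hypothetical DOI, for illustration only
print(doi_to_http_uri(example))   # http://dx.doi.org/10.1234/example.5678
print(doi_to_info_uri(example))   # info:doi/10.1234/example.5678
```

The mapping is purely mechanical, which is why the insistence that only the bare string counts as a "real DOI" looks like a question of declaration rather than of technology.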

A question that’s been bothering me for a while is when an “arms-length” scheme, like PURL, Handle, DOI etc is preferable to a well-managed local HTTP identifier. We know that such well-managed HTTP identifiers can be extremely persistent; as far as I know all of the eLib programme URIs established by UKOLN in 1995 still work, even though UKOLN web infrastructure has completely changed (and I suspect that those identifiers have outlasted the oldest extant DOI, which must have happened after 1998). Such a local identifier remains under your control, free of external costs, and can participate fully in the Linked Data world; these are quite significant advantages. It seems to me that the main advantage of the set of “arms-length” identifiers is that they are independent of the domain, so they can be managed even if the original domain is lost; at that point, an HTTP URI redirect table could not be set up. So I’m afraid I joked on twitter that perhaps “use of a DOI was a public statement of lack of confidence in the future of your organisation”. Sadly I missed waving the irony flag on this, so it caused a certain amount of twitter outrage that was unintentional!
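For what it's worth, the kind of redirect table I have in mind for a well-managed local identifier can be sketched in a few lines. This is an illustration under my own assumptions, not a description of how UKOLN actually does it; the paths and the function name are hypothetical.

```python
def resolve(path, redirects):
    """Follow a local redirect table until a current address is reached.

    redirects maps an old path to the path that replaced it. A path that
    is not in the table is taken to be a current address.
    """
    seen = set()
    while path in redirects:
        if path in seen:
            raise ValueError("redirect loop at " + path)
        seen.add(path)
        path = redirects[path]
    return path


# Hypothetical history: the resource has moved twice since it was published.
redirects = {
    "/elib/papers/": "/services/elib/papers/",
    "/services/elib/papers/": "/archive/elib/papers/",
}

print(resolve("/elib/papers/", redirects))  # /archive/elib/papers/
```

The original identifier keeps working only because somebody added a row to the table at every move, which is the policy and social point again; and if the domain itself is lost, there is nowhere left to host the table.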

In fact the twitter backchannel was extremely interesting. Around a third or so of the twits were not actually at the meeting, which of course was not apparent to all. And it is in the nature of a backchannel to be responding to a heard discourse, not apparent to the absent twits; in other words, the tweets represent a flawed and extremely partial view of the meeting. Some of those who were not present (who included people in the DOI world, the IETF and big publishers) seemed to get quite the wrong end of the stick about what was being said. On the other hand, some external contributions were extremely useful and added value for the meat-space participants!

I will end with one more twitter contribution. We had been talking a bit about the publishing world, and someone asked how persistent are academic publishers. The tweet came back from somewhere “well, their salespeople are always ringing us up ;-) !”

3 comments:

  1. "we were told.. that DOIs are not URIs, and that the dx.doi.org version is not a DOI, nor presumably the Info URI version."

    This sounds like religious dogma to me, not technical infrastructure planning.

    Perhaps the dx.doi.org version and the info URI version are not 'DOIs', but they ARE _both_ URI versions of DOIs. I don't see why they can't be used for a URI identifying the same document identified by the DOI. You register a DOI, you just got for free persistent-ish URIs corresponding to that DOI too. What's wrong with using them?

    Now, the organizational commitment to the persistence of the dx.doi.org URI might be somewhat unclear. It might stop resolving. And the linked data folks love their resolvability. Also, there may very well be OTHER URIs that also resolve to the DOI, the 'dx.doi.org' one is not "canonical", it's just one CrossRef (not the only DOI registrar) provides.

    The info URI is indeed canonical. And it's not resolvable from the start, which is a heretical sin to the RDF religionists, but I'm not so sure it's so bad. Stuart Weibel has written a bit arguing for the business case for unresolvable identifiers in some cases. http://weibel-lines.typepad.com/weibelines/2006/08/uncoupling_iden.html

    The info URI is also the one that has the organizational commitment to persistence (of course committing to persistence for a URI that's not resolvable is less of a commitment).

  2. The question of 'when an "arms-length" scheme, like PURL, Handle, DOI etc is preferable to a well-managed local HTTP identifier' is crucial. I have never heard a good answer to it. I think the only good answer is the one that you give: when the organization goes away, and the domain does as well, that is the only time such a system is useful.

    This is one reason why I am partial to the ARK scheme, if you are going to use a permanent identifier scheme, because this is the primary problem that it addresses.

    As far as I am concerned, DOI & handles & (to some extent) PURLs are products of an earlier time when it was not totally clear how big HTTP would become. It is now clear. There is no going back. Just use HTTP URIs.

    As an aside. You think $275/yr + $1/article is affordable - well, I don't know. That would be a lot for some publishers, especially for something they get for free by having a decently run website.

    bibwild: The reason that a dx.doi.org URI is not a DOI is presumably an issue of syntax.

  3. Yes, a dx.doi.org URI is not a DOI. Instead, it is a URI corresponding to a DOI, that will resolve to the same place as the DOI. And so, what's the problem?


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.