Thursday 21 January 2010

Digimap is 10

I had a very enjoyable day yesterday helping EDINA celebrate 10 years of the Digimap service. What began as an eLib project and experiment with 6 Universities in 1996 has grown to a mature service with over 100,000 users, 45,000 of them active, in pretty much every UK University, and soon in UK schools as well.

In 1996 I was Programme Director of the eLib Programme, and my earliest email about Digimap was from the JISC money man, Dave Cook, on 30 January 1996 to Peter Burnhill of the Edinburgh Data Library (as it then was). Dave told Peter we were interested in his idea (for an Images project!) but had a few concerns (that the Ordnance Survey might not agree to let us use their mapping data; it’s hard to remember now how difficult some of those 1990s persuasions were!). Three days later, Dave was offering real money, although it had to be spent by 20 March that year. Done!

By late 1997 the Digimap project (*) had a trial service; I remember experimenting with it and having some problems (this was with Netscape 3 on a PowerMac Duo or something like that; woefully under-powered in retrospect). By the end of 1999, they were moving to a new GIS system, and we were beginning to discuss turning Digimap into a service, which went live in January 2000. They had to sign up 37 subscribing Universities by a particular deadline, and I think they managed 39, somewhat ahead of it.

Since then the service has grown in scope, quality, usage and value. In my personal opinion (full disclosure, I’m not neutral here, having been associated with it through advisory groups of various kinds throughout its life), Digimap is the best service funded by JISC. Best in quality, best in professionalism, best in innovation, best in support. A lot of people deserve credit for that, and everyone at EDINA should be extremely proud of what they have created. By the way, the OS have managed some major shifts in attitude over the years, from suspicious tolerance through to strong support, and the success is partly down to them, and to the efforts of the negotiators in what is now JISC Collections.

As well as various forms of OS mapping for GB (whose trademark names always escape me... and it is GB rather than UK, for weird historical reasons), Digimap now offers 4 “epochs” of historic maps from Landmark, plus Geology maps from BGS and Marine maps from SeaZone. Due to licence restrictions it is only available to registered staff and students at subscribing UK institutions, but I hope that those of you unlucky enough not to fall in that category can soon read more about it on the pages to be put up related to the celebration.

Digimap has been a bit clunky at times compared with the innovations introduced by some others, but with the new underlying GIS, the interfaces are being upgraded; they now have “slippy maps” (called Digimap Roam) on the base service, and it looks really smart and much more functional. It's tough for a small group to keep up with the likes of Google, Yahoo and MS! Soon this slippy map interface will be extended to the Historic service (“Ancient Roam”?), Geology (“Rock’n Roam”?) and Marine (a rather dull “H2Roam”!)… I think those might be internal names, but if you can complete the set with an even punnier marine name, who knows they might keep them!

The day was good fun, and we heard quite a bit about what Digimap is and how it is being used (far more widely than geography departments). The most exciting was a student project using Digimap and a GPS for a light aircraft CFIT-avoidance system (CFIT is Controlled Flight Into Terrain, referred to as “having a bad day”!). We heard from the data suppliers, with a bit more about what’s coming. It was interesting to hear the OS man talking about moves towards Linked Data; I wasn’t sure how that would square with the closed access, but I think I muddled my question (confused Linked Data with OGC web services, I suspect). The service providers didn’t appear to be talking to each other about Linked Data; doing so might be a good start.

A highlight was the closing keynote from Vanessa Lawrence, CEO of OS, who is clearly extremely supportive of Digimap. Choosing her words very carefully (she is not allowed to influence anyone) she outlined the government’s open data initiative and the consultation on its implications for the OS; this consultation closes in late March 2010, but she urged us to make any responses, whether collectively or as private citizens, well before then. The consultation isn’t simply “should we open up access to OS data?”; it’s much more “how can we open up access to OS data and still sustain the quality of the data into the future?”.

The celebration ended with a reception and dinner, with an amusing after-dinner talk by Michael Parker, author of Map Addict. All in all, a very enjoyable and worthwhile day to celebrate a significant anniversary.

PS the twitter tag is #digimap10; I’m not going to tag the post with it, as I’ve got far too many one-time tags that are a pain to manage…

PPS (*) Unfortunately the original Digimap project pages seem to have vanished, and the earliest Wayback Machine gathers appear to be faulty; the first successful gather I can find is

http://web.archive.org/web/20011021051021/edina.ac.uk/digimap/

... which seems to refer to the service, not the project.

Linked data and staff contact pages

You may remember that I am interested in the extent to which we should use Semantic Web (or Linked Data) technologies on the DCC web site. After some discussions, I reached the conclusion that we should do so, but that the tools were not ready yet (this isn’t quite an Augustinian “Oh Lord, make me good but not yet”; specifically, we are moving our web site to Drupal 6, the Linked Data stuff will not be native until Drupal 7, and our consultants are not yet up to speed with Linked Data). I have to say that not all our staff are convinced of the benefits of using RDF etc on the web site, and I have had a mental note to write more about this, real soon now.

I was reminded of this recently. I wanted to phone a colleague who worked at UKOLN, one of our partners, and I didn’t have his details in my address book. So I looked on their web site and navigated to his contacts page. Once there I copied his details into the address book, before lifting the phone to give him a ring. After the call (he wasn’t there; the snow had closed the office), I thought about that process. I had to copy all those details! Wouldn’t it be great if I could just import them somehow? How could that be? UKOLN have expertise in such matters, so I tweeted Paul Walk (now Deputy Director, previously technical manager) asking whether they had considered making the details accessible as Linked Data using something like FOAF. You can guess I’m not fully up to speed with this stuff, but I’m certainly trying to learn!

Paul replied that they had considered putting microformats into the page (I guess this is the hCard microformat), and then asked me whether my address book understood RDF, or if I was going to script something? I was pretty sure the answer to the second part was “no” as I suspect such scripting currently is beyond me, and told Paul that I was using MacOSX 10.6 Address Book; it says nothing about RDF, but will import a vcard. I was thinking that if there was appropriate stuff (either hCard microformat or RDFa with FOAF) on the page, I might find an app somewhere that would scrape it off and make a vcard I could import.
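For what it’s worth, here’s roughly what I imagine such an app doing, sketched in Python. The sample HTML, name, number and address are all invented for illustration; real pages vary a lot, and a proper microformats parser would handle far more than this toy does:

```python
# Toy sketch: pull hCard microformat properties out of a contacts page
# and emit a vCard that an address book could import. Sample values are
# invented; this handles only a tiny subset of hCard.
from html.parser import HTMLParser


class HCardParser(HTMLParser):
    """Collect text inside elements whose class names match hCard properties."""

    PROPS = {"fn", "org", "tel", "email"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._stack = []  # hCard property names open at this nesting depth

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        self._stack.append([c for c in classes if c in self.PROPS])

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Record the first text seen for each currently open property.
        for props in self._stack:
            for p in props:
                self.fields.setdefault(p, data.strip())


def hcard_to_vcard(html):
    parser = HCardParser()
    parser.feed(html)
    f = parser.fields
    lines = ["BEGIN:VCARD", "VERSION:3.0"]
    if "fn" in f:
        lines.append("FN:" + f["fn"])
    if "org" in f:
        lines.append("ORG:" + f["org"])
    if "tel" in f:
        lines.append("TEL;TYPE=WORK:" + f["tel"])
    if "email" in f:
        lines.append("EMAIL:" + f["email"])
    lines.append("END:VCARD")
    return "\r\n".join(lines)


SAMPLE = """<div class="vcard">
  <span class="fn">Paul Walk</span>, <span class="org">UKOLN</span>,
  <span class="tel">+44 1225 000000</span>,
  <a class="email" href="mailto:p.walk@example.org">p.walk@example.org</a>
</div>"""

print(hcard_to_vcard(SAMPLE))
```

The output is a plain vCard 3.0 text block, which is exactly the sort of thing the Mac Address Book will import.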

Paul’s final tweet was: “@cardcc see the use-case, not sure it's a 'linked data' problem though. What are the links that matter if you're scraping a single contact?”

Well, I couldn’t think of a 140-character answer to that question, which seemed to raise issues I had not thought about properly. What are the links that matter? Was it linked data, or just coded data that I wanted? Is this really a semantic web question rather than linked data? Or is it a RDF question? Or a vocabulary question? Gulp!

After some thought, I decided that perhaps Paul was as constrained by his 140 characters as I was. Surely a contacts page contains both facts and links within itself. See the Wikipedia page on FOAF for an example of a FOAF file in Turtle for Jimmy Wales; the coverage is pretty much like a contacts page.

So Paul’s contact page says he works for UKOLN at the University of Bath, and gives the latter’s address (I guess formally speaking he works in UKOLN, an administrative unit, and is employed by the University); that his position in UKOLN is Deputy Director, that his phone, fax and email addresses are x, y and z. All of these are relationships between facts, expressible in the FOAF vocabulary. With RDFa, that information could be explicitly encoded in the HTML of the page and understood by machines, rather than inferred from the co-location of some characters on the page (the human eye is much better at such inferences). So there’s RDF, right there. Is that Linked Data? Is it Semantic Web? I’m not really sure.
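To make that a bit more concrete, here is my guess at what RDFa using the FOAF vocabulary might look like on such a page, with a toy extractor attached. The markup and values are invented for illustration, and this is nowhere near a full RDFa processor (which would resolve prefixes and build proper triples):

```python
# Sketch: contact facts encoded as RDFa-style property/content attributes,
# plus a toy extractor. Markup and values are illustrative only.
from html.parser import HTMLParser

RDFA = """<div xmlns:foaf="http://xmlns.com/foaf/0.1/" about="#paul">
  <span property="foaf:name">Paul Walk</span> is
  <span property="foaf:title">Deputy Director</span> at
  <span property="foaf:workplaceHomepage" content="http://www.ukoln.ac.uk/">UKOLN</span>.
</div>"""


class PropertyExtractor(HTMLParser):
    """Collect (property, value) pairs from property=/content= attributes
    or from element text -- a tiny subset of RDFa processing."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._pending = None  # property whose value is the element's text

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "property" in a:
            if "content" in a:  # value supplied in an attribute
                self.triples.append((a["property"], a["content"]))
            else:               # value is the element's text content
                self._pending = a["property"]

    def handle_data(self, data):
        if self._pending:
            self.triples.append((self._pending, data.strip()))
            self._pending = None


p = PropertyExtractor()
p.feed(RDFA)
for prop, value in p.triples:
    print(prop, "=", value)
```

The point is simply that the facts become machine-readable attributes rather than inferences from the co-location of characters on the page.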

More to the point, would it have been any greater use to me if it had been so encoded? A FOAF-hunting spider could traverse the web and build up a network of people, and I might be able to query that network, and even get the results downloaded in the form of a vcard that I could import into my Mac Address Book. That sounds quite possible, and the tools may already exist. Or, there may exist an app (what we used to call a Small Matter Of Programming, or a SMOP) that I could point at a web page with FOAF RDFa on it. Perhaps that’s what Paul was after in relation to scripting. Perhaps the upcoming Dev8D would find this an interesting task to look at?

What other things could be done with such a page? Well, Paul or others might use it to disambiguate the many Paul Walk alter egos out there. You’ll see I have a simple link to Paul’s contact page above, but if this blog were RDF-enabled, perhaps we could have a more formal link to the assertions on the page, eg to that Paul Walk’s phone number, that Paul Walk’s email address, etc.

Well, I’m not sure if this makes sense, and it does feel like one of those “first fax machine” situations. However, FOAF has been around for a long while now. Does that mean that folk don’t perceive enough advantage in such formal encodings to balance their costs, or is this an absence of value because of a lack of exploitable tools? If the latter, anyone going to Dev8D want to make an app for me?

(It’s also possible of course that Paul doesn’t want his details to be spidered up in this way, but I guess none of us should put contact details on the web if that’s our position.)

By the way, I found a web page called FOAF-a-matic that will create FOAF RDF for you. Here's an extract from what it created for me, in RDF:
<foaf:Person rdf:ID="me">
  <foaf:name>Chris Rusbridge</foaf:name>
  <foaf:title>Mr</foaf:title>
  <foaf:givenname>Chris</foaf:givenname>
  <foaf:family_name>Rusbridge</foaf:family_name>
  <foaf:mbox rdf:resource="mailto:c.rusbridge@xxxxx"/>
  <foaf:workplaceHomepage rdf:resource="http://www.dcc.ac.uk/"/>
</foaf:Person>
What could I do with that now?
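One possible answer, sketched in Python: convert it into a vcard that Address Book will import. The extract first needs wrapping in an rdf:RDF element with its namespaces declared before it will parse, and I’ve trimmed it to a few properties here:

```python
# Sketch: convert a (trimmed, namespace-wrapped) FOAF-a-matic extract
# into a vCard. The email address stays redacted, as in the post.
import xml.etree.ElementTree as ET

FOAF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                   xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:ID="me">
    <foaf:name>Chris Rusbridge</foaf:name>
    <foaf:mbox rdf:resource="mailto:c.rusbridge@xxxxx"/>
    <foaf:workplaceHomepage rdf:resource="http://www.dcc.ac.uk/"/>
  </foaf:Person>
</rdf:RDF>"""

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
FOAFNS = "{http://xmlns.com/foaf/0.1/}"


def foaf_to_vcard(xml_text):
    person = ET.fromstring(xml_text).find(FOAFNS + "Person")
    lines = ["BEGIN:VCARD", "VERSION:3.0"]
    name = person.findtext(FOAFNS + "name")
    if name:
        lines.append("FN:" + name)
    mbox = person.find(FOAFNS + "mbox")
    if mbox is not None:
        # foaf:mbox is a mailto: URI; strip the scheme for the vCard.
        uri = mbox.get(RDF + "resource")
        lines.append("EMAIL:" + uri.removeprefix("mailto:"))
    home = person.find(FOAFNS + "workplaceHomepage")
    if home is not None:
        lines.append("URL:" + home.get(RDF + "resource"))
    lines.append("END:VCARD")
    return "\r\n".join(lines)


print(foaf_to_vcard(FOAF))
```

So even without a spider, a little glue code gets from FOAF on a page to something my address book understands.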

Wednesday 13 January 2010

Scholarly HTML would be nice, but...

I'm quite interested in the idea of Scholarly HTML, as espoused in Pete Sefton's blog, and I've commented on some of Peter Murray-Rust's hamburger PDF comments previously (although I do think a lot of people confuse wild PDF with well-made, should one say Scholarly, PDF). I've always been slightly worried by one thing though.

A well-known advantage of PDF is that it pretty much assures I can save a document, share it, move it around etc and it will still be intact and readable. That's one of the reasons it's so popular.

Mostly we don't do that with HTML. Mostly we just point to it. But if I see an article these days, I want it on my computer if I'm allowed; this lets me study it at leisure, drop it into my Mendeley system, etc. As pointed out, that works a treat with PDF, and pretty well with Word or OpenOffice documents as well. This applies even where the document is quite heavily compound, with many embedded images, tables etc.

But if I try saving a HTML document to my hard disk, nothing very standard happens. OK, if I use Safari on my Mac, I get a .webarchive file, which is quite nice as I can do all the things with it that I could do with a PDF and Word etc, and when I open it later it will be as it was before, with all the images in place. But neither IE nor Firefox seems capable of opening a .webarchive file.

If I try saving the same article from Firefox, I get a .html file with the main article in it, and a directory with associated files in it (eg images). Safari does seem capable of opening this combination, but it's pretty ugly, and hard to move around. I haven't tried IE as I don't have easy access to it.

Is there in existence or development a standard approach to packaging the HTML and associated files that would be as convenient as the .webarchive, but usable across all browsers? If so, Scholarly HTML would be that little bit closer!
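One existing candidate may be MHTML (RFC 2557, the .mht “web archive” format IE uses): a single MIME multipart/related file bundling the HTML with its images, though I believe cross-browser support is patchy at best right now. Python’s standard email package can build one, roughly like this (the file names and content here are hypothetical):

```python
# Sketch: bundle an HTML page and its resources into a single MHTML
# (RFC 2557) file using the stdlib email package. Names are hypothetical.
from email.message import EmailMessage


def make_mhtml(html, resources):
    """Return an MHTML string for a page plus its resources.

    resources: list of (location_url, mime_type, raw_bytes) tuples.
    """
    msg = EmailMessage()
    msg["Subject"] = "Saved article"
    # The root part is the HTML itself...
    msg.set_content(html, subtype="html", charset="utf-8")
    # ...and each resource becomes a multipart/related sibling,
    # addressed by a Content-Location header as RFC 2557 describes.
    for url, mime, data in resources:
        maintype, subtype = mime.split("/")
        msg.add_related(data, maintype, subtype,
                        headers=["Content-Location: " + url])
    return msg.as_string()


page = '<html><body><img src="figure1.png"></body></html>'
archive = make_mhtml(page, [("figure1.png", "image/png", b"\x89PNG...")])
print(len(archive), "bytes of MHTML")
```

Whether that counts as “usable across all browsers” is exactly the open question, of course.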

Persistence of domain names

I had a chat before Christmas with Henry Thompson, who works both in Edinburgh Informatics and also on the W3C TAG. Insofar as the Internet is important in sustaining long term access to information in digital form, there is a sustainability problem that we rather seem to have ignored. Everything on the Internet (literally) depends on domain names, and these are only ever rented. There is no mechanism for permanently reserving a domain name. Domain names can be lost by mistake (overlooking a bill, perhaps having moved in the interim without informing the relevant domain name registrar), but they can also be lost on business failure. Although domain names can be a business asset, I understand that the registrars have some discretion on transfers, and in particular one cannot make a "domain name will" seeking transfer of the domain name to some benevolent organisation. Note that the rental mechanism does have a sustainability advantage of its own: the ongoing revenue stream sustains the important services that underpin the DNS.

There are two kinds of problem, one on a massive scale and one more fine-grained. The massive problem is that the entire infrastructure of the Internet depends on URIs, most of which are http URIs that in turn depend on the domain name system. So there are a number of organisations whose domain names are embedded in that infrastructure in a way and to an extent that is very difficult to change. W3C is clearly such an organisation. Many of these organisations seem rather fragile (not a comment on W3C, by the way, although its sustainability model is opaque to me). Should they fail and the domain names disappear, the relevant URIs will cease to work and various pieces of Internet machinery will fall apart.

(By the way, this does seem to be one case where a persistent ID that is independent of the original domain, such as a DOI, has advantages over a HTTP URI plus a redirect table. If the domain name no longer exists, you can't get to a redirect, whereas someone can still relink the DOI to a new location.)

On the more fine-grained scale, many documents (particularly in HTML) are not easily separable from their location, depending as they do on other local files and documents. In addition, of course, documents in some sense exist through their citations and bookmarks, which come to exist separately from the document itself. Moving a document to a new domain can make it "fail" or disappear. So sustainability is linked to the domain as well as to the other preservation factors.

This seems to me not to be a technical problem at all; rather, it has legal/regulatory, governance, social, business and economic aspects.

Among the solutions might be creating a new top level domain designed for persistence, with different rules of succession, etc. Another (either instead of or in conjunction with the first) might be creating an organisation designed for persistence, to hold endowed domain names. Somehow the ongoing revenue stream for those underpinning services must be retained indefinitely into the future.

We don't think we have the answers, but we do think there is a problem here; I'm not yet sure if we have articulated it accurately at all. I would appreciate any comments. Thanks,

Director of Digital Curation Centre: still time to apply

I’m particularly keen that there be a good slate of candidates for this post, for which applications close on Friday 15 January, 2010. The details can be found at http://www.jobs.ed.ac.uk/vacancies/index.cfm?fuseaction=vacancies.detail&vacancy_ref=3012085 (sorry about the dodgy URL; I hope it works)

The further details say

"The mission of the Digital Curation Centre is to help build capacity, capability and skills for data curation across the UK higher education research community, while supporting and promoting emerging data curation practice. It also has a key role in supporting JISC, especially its new research data management programme. Overall, the DCC is an agent for change, committed to the diffusion of best practice in the curation of digital research data across the Higher Education sector, and providing an authoritative source of advocacy, resources and guidance to the UK research community. This mission is informed by five priorities:

to identify, gather, record and disseminate curation best practice, providing access to resources, tools, training and information that will equip data practitioners to make informed decisions regarding the management of their data assets;

to facilitate knowledge exchange between those currently and newly engaged in the generation and management of digital research data;

to build and support a community of informed practitioners that has the capacity to sustain itself, with the capability to manage and curate its data appropriately;

to identify crucial and important innovations in data curation, and seek additional resources to provide them;

to support JISC, especially in its repository, preservation and data management programmes.

To achieve this, the Director must be a persuasive advocate for better curation and management of research data, on a national or international scale. Able to listen and engage with researchers and with research management, publishers and research funders, the Director will build a strong, shared vision of the changes needed, and the ability (working with others in the DCC and beyond) to mobilise the community towards that end."

So if this fits you, or you know someone that it fits, please persuade that person to apply!

By the way, although the DCC, like all public services in the UK, may not escape further budget cuts, I have been told that we are funded from core JISC funding rather than capital funding, and as such Phase 3 will not be curtailed in the way the proposed JISC Managing Research Data programme has been.

[NOTE: This vacancy has now CLOSED; no further applications will be accepted!]

Tuesday 5 January 2010

Digital Curation Centre User Survey 2009: Highlights

My colleague Angus Whyte has provided the following brief summary of two surveys carried out in Phases 1 and 2 of the Digital Curation Centre, in 2006 and 2009 respectively, as part of our evaluations. In retrospect, we might have done better to revise the questions for the second survey rather more than we did; nevertheless I thought it worthwhile sharing this with you.

Angus writes:

In 2009 DCC users were surveyed, repeating a similar survey carried out in 2006. In the highlights below we draw conclusions both from the more recent results and from changes over the three-year period. Both surveys were publicised on the DCC website and via several mailing lists, principally the DCC-Associates and (in 2009) the JISC-sponsored Research-Dataman list.

Our conclusions take into account that the online questionnaire was self-completed by a self-selected group of respondents (75 in 2009 and 125 in 2006). DCC Associates (640 approx.) provided the bulk of the responses[1]. The results indicated broad patterns, relatively wide differences and consistent responses over the two surveys, even though these are not taken to be statistically representative.

Highlights

In both surveys around 90% of respondents are familiar with the term ‘digital curation’ and regard it as a critical issue within their project or unit. The DCC is consistently given as the main source of information on curation issues by around 70% of respondents, with “on the job challenges/ research” second at around 60%.

Between the two surveys there is a large jump (from 13% to 32%) in the proportion of respondents indicating that the DCC has been “very effective” in raising awareness about digital curation, while the proportion believing it to be only “slightly effective” has correspondingly fallen from 53% to 31%.

Of a list of DCC resources, five are identified as “most helpful” by at least 1 in 5 of the 2009 survey respondents, these being (in descending order) the DCC website, Briefing Papers (of various sorts), the DCC Curation Lifecycle Model, Case Studies, and the Digital Curation Manual.

Respondents universally associate digital curation with “ensuring the long-term accessibility and re-usability of digital information”, and large majorities (around 90%) also relate it to “performing archiving activities on digital information such as selection, appraisal and retention” and “ensuring the authenticity, integrity and provenance of digital information are maintained over time”. Rather lower but still significant numbers (around 60%) associate digital curation with “managing digital information from its point of creation” and “managing risks to digital information” – although many more highlight the latter in 2009 (up to 84% from 61%).

Curation or preservation addresses risks to the respondents’ organisations, with “loss of organisational memory” consistently topping their list (identified by around 75% of respondents) and “business risks” second, identified by just under half, again across both surveys.

More than two thirds indicate that their main reasons for curating and preserving digital information are its educational/research or historical value; in both years a minority cites other reasons. Similarly, the main obstacles are indicated as financial or staff resources, with around half also indicating lack of awareness or appropriate policies.

For around 40% of respondents, management and preservation of digital information has an indefinite timescale. For a further 15% or so it is “beyond the life of the project/organisation”, and similar numbers indicate these are tasks “for the life of the project/organisation”.

The 2009 survey respondents are no strangers to the ‘data deluge’, most dealing with at least 100GB and some (7%) with more than 100TB. Overall 79% expect this to increase in the next two years; surprisingly, 3% do not, while 7% do not know. Most need to manage a mixture of open and proprietary formats, and report a wide variety of formats in use, predominantly common office applications, PDF documents and multimedia formats. Curation and preservation challenges are most frequently identified with obsolete proprietary formats. Image, video and geospatial data are also often identified as challenges, as are web sites combining these.

Respondents were also asked in 2009 about re-use, and around a third indicate that research data is re-used internally, with similar numbers offering data generated by their project/unit for re-use by others, or re-using external data.

Access issues facing research projects/units are identified in both surveys, along similar lines: intellectual property rights (e.g. copyright) are the most frequently cited issue, followed by “privacy or ethical issues”; “embargo on research findings” is least prevalent, identified by only a fifth of respondents.

Asked about funding for curation and preservation, responses show no clear picture. Around half of 2009 respondents indicate funding is “accounted for in project or institutional budget”. A large minority have no explicit funding for curation and preservation, and where resources are available these are pooled from other funded areas (e.g. IT budget for project or organisation) or research grants. Spending on curation/preservation is less than £50,000 (for around half of those respondents who were aware of this). Around half are unsure whether spending will increase or decrease, with the remainder being evenly split.

Detailed questions and response data are available on request.

Angus Whyte, Digital Curation Centre

[1] The DCC Associates membership list includes UK data organisations, leading data curators, overseas and supranational standards agencies, and industrial/business communities. Currently research data creators are under-represented (information from registration details).