Friday, 23 May 2008
Data management & curation skills: survey
As the research process produces increasing volumes of digital data, the issue of who will look after the data throughout its life cycle comes into sharper focus. As a result, the JISC is funding work designed to better understand the skills, roles, career patterns and training needs of data managers and data curators.
Your views are important in this respect, and we would be very grateful if you would take a few minutes to consider and respond to an online survey on the DCC website. Your responses will help training providers address specific training requirements for digital preservation and data management across different sectors.
The survey is available from now until June 6th on the DCC website at http://www.dcc.ac.uk/jisc/data_projects_questionnaire/
Thursday, 8 May 2008
JISC/CNI meeting, July 2008
This year's JISC/CNI meeting will take place in Belfast from 10 - 11 July. Quoting from the website:
"The meeting will bring together experts from the United States, Europe and the United Kingdom. Parallel sessions will explore and contrast major developments that are happening on both sides of the Atlantic. It should be of interest to all senior management in information systems in the education community and those responsible for delivering digital services and resources for learning, teaching and research."
The programme looks to have a bit of something for everyone, including a session on scientific data in repositories and another on infrastructures to support research and learning. More details, including venue and costs, are available from the conference webpages.
Sunday, 4 May 2008
On hearing the first cuckoo in spring
We heard our first cuckoo a week ago (27 April). It set me thinking: "The Times has been publishing 'first cuckoo' letters for a hundred years or so. Surely by now someone will have turned it into a database, as some kind of climate change proxy?".
A Google search - nay, several Google searches - found little direct evidence of this. Best I could find was Nature's Calendar, a site from the Woodland Trust for the UK Phenology Network; the page pointed to defaults to plants, but you can select birds and then cuckoos, and see reports of first sighting by date on the UK map. You can even compare different years; I compared 2008 with 1998; several times more cuckoo reports in the later year, but maybe this is because 1998 was the earliest year and their network got better entrenched since then. Nature's Calendar also reports average sightings by year, which did show a later tendency, but usually such averages are suspect if you can't see the underlying data. Anyway, a far cry from my hope of a hundred-year proxy database!
But then shock horror, I discovered the underlying premise is false; the Times has not been publishing first cuckoo letters since 1940 or so. They do occasionally publish something, eg:
'On April 21, 1972, Mr Wadham Sutton got away with a letter to the Editor reading: “Sir, Today I heard a performance of Delius’s On Hearing the First Cuckoo in Spring. This is a record.”'
Bah, humbug!
Tranche
Tranche sounds interesting. This possibly over-ambitious sound-bite from its web page:
"This project's goal is to solve the problems commonly associated with sharing scientific data, letting you and your collaborators focus on using the data."
In effect it is an encrypted, highly distributed file sharing system, independent of any central authority, and suitable for "any size of file", but maybe there are still some sharing problems left for the scientists (;-)?
The Science Commons blog reports:
"The National Cancer Institute will soon be using Tranche to store and share mouse proteomic data from its Mouse Proteomic Technologies Initiative (MPTI). Tranche, a free and open source file sharing tool for scientific data, was one of the earliest testers of CC0. Many thanks to Tranche for providing us with such valuable early feedback on CC0."
Tranche uploads are known as projects, and there are apparently 5399 of them so far. The largest number of replicas appears to be 3, and 44 projects are reported as having missing chunks.
Friday, 2 May 2008
Science publishing, workflow, PDF and Text Mining
… or, A Semantic Web for science?
It’s clear that the million or so scientific articles published each year contain lots of science. Most of that science is accessible to scientists in the relevant discipline. Some may be accessible to interested amateurs. Some may also be accessible (perhaps in a different sense) to robots that can extract science facts and data from articles, and record them in databases.
This latter activity is comparatively new, and (not surprisingly) hits some problems. Articles, like web pages, are designed for human consumption, and not for machine processing. We humans have read many like them; we know which parts are abstracts, which parts are text, which headings, which references. We can read the tables, find the intersections and think about what the data points mean. We can look at the graphs, the spectra etc, and relate them to the author’s arguments. Most of these tasks are hard for robots. But with a little bit of help and persistence, plus some added “understanding” of genre and even journal conventions, etc, robots can sometimes do a pretty good job.
However, most science articles are published in PDF. And PDF does not make the robot’s task easy; in fact, PDF often makes it very hard (not necessarily to be deliberately obscure, but perhaps as side-effects of the process leading to the PDF).
Peter Murray-Rust has been leading a number of one-man campaigns (actually they all involve many more than one man, but he is often the vocal point-person). One such campaign, based on attempts to robotically mine chemical literature, can be summed up as “PDF is a hamburger, and we’re trying to turn it back into a cow” (the campaign is really about finding better semantic alternatives to PDF). I've referred to his arguments in the past, and we've been having a discussion about it over the past few days (see here, its comments, and here).
I have a lot of sympathy with this viewpoint, and it’s certainly true that PDF can be a hamburger. But since scientists and publishers (OK, mostly publishers) are not yet interested in abandoning PDF, which has several advantages to counter its problems, I’m also interested in whether, and if so how, PDF could be improved to be more fit for the scientific purpose.
One way might be that PDF could be extended to allow for the incorporation of semantic information, in the same way that HTML web pages can be extended, eg through the use of microformats or RDFa, etc. If references to a gene could be tagged according to the Gene Ontology, references to chemicals tagged according to the agreed chemical names, InChIs etc, then the data mining robots would have a much easier job. Maybe PDF already allows for this possibility?
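To make the "much easier job" concrete, here is a minimal sketch in Python of what a mining robot could do with a page whose terms had already been tagged. It is an illustration only: the `data-vocab` and `data-id` attributes are hypothetical markup invented for this example (they are not microformats or RDFa syntax), and the sample sentence and its identifiers are illustrative.

```python
from html.parser import HTMLParser


class TaggedTermExtractor(HTMLParser):
    """Collect (vocabulary, identifier, text) for tagged <span> elements.

    Assumption: terms are marked with the hypothetical attributes
    data-vocab and data-id. Nested spans are not handled.
    """

    def __init__(self):
        super().__init__()
        self.terms = []
        self._open = None   # (vocab, identifier) of the span being read
        self._text = []     # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "data-vocab" in attrs and "data-id" in attrs:
            self._open = (attrs["data-vocab"], attrs["data-id"])
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._open is not None:
            vocab, identifier = self._open
            self.terms.append((vocab, identifier, "".join(self._text).strip()))
            self._open = None


def extract_terms(html):
    """Return every tagged term found in an HTML fragment, in document order."""
    parser = TaggedTermExtractor()
    parser.feed(html)
    parser.close()
    return parser.terms


# Illustrative input: one chemical tagged with an InChI, one process with a GO id.
SAMPLE = (
    '<p>Cells in <span data-vocab="inchi" data-id="InChI=1S/H2O/h1H2">water</span> '
    'showed increased <span data-vocab="go" data-id="GO:0006915">apoptosis</span>.</p>'
)

if __name__ == "__main__":
    for vocab, identifier, text in extract_terms(SAMPLE):
        print(vocab, identifier, text)
```

The point is how little the robot has to guess: with the identifiers in the markup there is no name recognition or layout analysis to do, which is exactly the work that a page-oriented PDF forces on it.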
PMR argues quite strongly that PDF is by design unfit for our purpose (in this case, holding scientific information such that it can reliably be extracted by text mining robots); that PDF's determined page-orientation and lack of structural and semantic significance doom such attempts to failure. He also argues strongly that the right current solution is to use XML… or perhaps XHTML for science writing.
I don’t know. He might be right. But being right is not necessarily going to persuade thousands of journal editors and hundreds of thousands of scientists to mend their ways and write/publish in XML.
I think we should tackle this in several ways:
- try to persuade publishers to publish their XML (often NLM XML) versions of articles as well as the PDFs
- try to persuade publishers who don’t have an XML format to release HTML versions as well as (or instead of) PDFs
- tackle more domain ontologies to get agreements on semantics
- work on microformats and related approaches to allow semantics to be silently encoded in documents
- try to persuade authors to use semantic authoring tools (where they exist), and publishers to accept these
- try to persuade Adobe to extend PDF to include semantic micro-metadata, and to help provide tools to incorporate it, and to extract it.
Monday, 28 April 2008
PDF: Preserves Data Forever? Hmm…
It’s always good to see new papers coming out of the DPC. Some fantastic work has been undertaken under the DPC banner over the years, and the organisation has done a great job of raising awareness and contributing to approaches addressing digital preservation. But I was somewhat concerned to read the press release announcing their latest Technology Watch – stating that ‘PDF should be used to preserve information for the future’ and ‘the already popular PDF file format adopted by consumers and business alike is one of the most logical formats to preserve today’s electronic information for tomorrow.’
This is a fairly controversial statement to make. Yes, PDF can have its uses in a preservation environment, particularly for capturing the appearance characteristic of a document. New versions of the PDF reader tend to render old files in the same way as old viewers. It has the advantage of being an open standard, despite being proprietary, and conversion tools are freely and widely available. But, it is not a magic bullet and there are several potential shortcomings – for example, PDFs are commonly created by non-Adobe applications which return varying quality or functionality in PDF files; it’s not useful for preserving other types of digital records such as emails, spreadsheets, websites or databases; it’s not great for machine parsing (as Owen pointed out in a previous comment on this blog) and there are several issues with the PDF standard which even led to development of PDF/A – PDF for Archiving.
To be fair, the press release does later say that the report suggests adopting PDF/A as a potential solution to the problem of long term digital preservation. And the report itself also focuses more on ‘electronic documents’ than electronic information per se, a generic ‘catch all’ phrase that includes types of information for which PDF is just not suitable.
So what is the report all about? Well, essentially it’s an introduction to the PDF family – PDF/A, PDF/X, PDF/E, PDF/Healthcare, and PDF/UA, in a fairly lightweight preservation context. There is some discussion of alternative formats – including TIFF, ODF, and the use of XML (particularly with regards to XPS, Microsoft's XML Paper Specification) – and an overview of current PDF standards development activities. It’s good to have such an easily digestible overview of the general PDF/A specs and the PDF family. But what I really missed in the report was an in-depth discussion of the practical issues surrounding use of PDF for preservation. For example, how can you convert standard PDF files to PDF/A? How do you convert onwards from PDF/A into another format? In which contexts may PDF/A be unsuitable, for example, in light of a specific set of preservation requirements? What if you required external content links to remain functional, as PDF/A does not allow external content references – what would you get instead? Would the file contain accessible information to tell you that a given piece of text previously used to link to an external reference? And what exactly is the definition of external reference here – external to the document, or external to the organisation? Should links to an external document in the same records series actually be preserved with functionality intact, and is it even possible?
Speaking of preservation requirements, it would have been particularly useful if the report included a discussion of preservation requirements for formats – this would have informed any subsequent selection or rejection of a format, especially in the section on ‘technologies’. The final section on recommendations hints at this, but does not go into detail. There are also a few choice statements that simply left me wondering – one that really caught my eye was ‘this file format may be less valuable for archival purposes as it may be considered to be a native file format’ (p17), which seems to discount the value of native formats altogether. Perhaps the benefit of submitting native formats alongside PDF representations is a subject which deserves more discussion, particularly for preserving structural and semantic characteristics.
I wholeheartedly agree with perhaps the most pertinent comment in the press release – that PDF ‘should never be viewed as the Holy Grail. It is merely a tool in the armoury of a well thought out records management policy’ (Adrian Brown, National Archives). PDF can have its uses, and the report has certainly encouraged more debate in the organisations I work with as to what those uses are. Time will tell as to whether the debate will become broader still.
Friday, 25 April 2008
A Thousand Open Molecular Biology Databases
In January of each year, Nucleic Acids Research (NAR) publishes a special issue on databases for molecular biology research. To be considered, databases have to be open access (they specifically mean browsable without a username, password or payment, although it is possible there are conditions). The staggering thing is, in the past year the number of such databases passed 1,000!
My institution has a subscription to NAR, but the nice thing about the Database Issue is that it has itself been Open Access for the past 4 years or so. So you can check it out for yourself if you are interested.
I thought it might be interesting to go back and trace how they managed to get to 1,000+ databases in 15 years or so. It turns out to be relatively easy to check, back to about the year 2001; prior to that, as far as I can tell, you have to count them yourself. I did the count for 1999, so here’s a little picture of the growth since then (see Burks, 1999):
It is quite a staggering growth.
In recent years, the compilation article has been written by Michael Y. Galperin of the National Center for Biotechnology Information, US National Library of Medicine. He reports in the 2008 article that the complete list and summaries of 1078 databases are available online at the Nucleic Acids Research web site, http://www.oxfordjournals.org/nar/database/a.
This year’s article has some interesting comments on databases; I particularly liked this one on Deja Vu, which uses a tool called eTBLAST “to find highly similar abstracts in bibliographic databases, including MEDLINE…. Some highly similar publications, however, come from different authors and look extremely suspicious” (Galperin, 2008). Curious! As usual he also reports on databases that appear to be no longer maintained and have been dropped from the list (around two dozen this year). Sometimes this is related to (perhaps because of, or maybe the cause of) the content being available in other databases.
There has in fact been relatively little attrition; they claim not to re-use accession numbers, and the highest accession number so far is 1176, implying that just 98 databases have been dropped from the list! Galperin suggests “that the databases that offer useful content usually manage to survive, even if they have to change their funding scheme or migrate from one host institution to another. This means that the open database movement is here to stay, and more and more people in the community (as well as in the financing bodies) now appreciate the importance of open databases in spreading knowledge. It is worth noting that the majority of database authors and curators receive little or no remuneration for their efforts and that it is still difficult to obtain money for creating and maintaining a biological database. However, disk space is relatively cheap these days and database maintenance tools are fairly straightforward, so that a decent database can be created on a shoestring budget, often by a graduate student or as a result of a postdoctoral project. […] Subsequent maintenance and further development of these databases, however, require a commitment that can only be applauded.” (Galperin, 2005)
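The attrition figure is simple arithmetic on the two numbers quoted above (highest accession number 1176, 1078 databases currently listed). A small sketch in Python, assuming that accession numbers start at 1, are issued consecutively and are never re-used, so that the highest number equals the total ever listed; the function name is mine, not anything from NAR:

```python
def attrition(highest_accession, currently_listed):
    """Return (number dropped, fraction dropped).

    Assumes accession numbers run 1..highest_accession with no gaps and no
    re-use, so highest_accession is the total number ever listed.
    """
    dropped = highest_accession - currently_listed
    return dropped, dropped / highest_accession


if __name__ == "__main__":
    dropped, fraction = attrition(1176, 1078)
    print(f"{dropped} dropped, {fraction:.1%} of all databases ever listed")
```

On those assumptions, about one database in twelve has ever left the list.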
So is this vanity databasing (this from a blog author, mind)? “In the very beginning of the genome sequencing era, Walter Gilbert and colleagues warned of 'database explosion', stemming from the exponentially increasing amount of incoming DNA sequence and the unavoidable errors it contains. Luckily, this threat has not materialized so far, due to the corresponding growth in computational power and storage capacity and the strict requirements for sequence accuracy.” (Galperin, 2004)
It’s not clear from the quote above how worthwhile or well-used the databases are. In the 2006 article, Galperin began looking at measures of impact, using the Science Citation Index. We can see from his reference list that he expects the NAR paper to stand proxy for the database. The highest cited were Pfam, GO, UniProt, SMART and KEGG, all highly used “instant classics” (with >100 citations each in 2 years!). However, he writes: “On the other side of the spectrum are the databases that have never been cited in these 2 years, even by their own authors. This does not mean, of course, that these databases do not offer a useful content but one could always suggest a reason why nobody has used this or that database. Usually these databases were too specific in scope and offered content that could be easily found elsewhere.” (Galperin, 2006)
In the 2007 article, Galperin returned to this issue of how well databases are used. “However, citation data can be biased; e.g. in many articles use of information from publicly available databases is acknowledged by providing their URLs, or not acknowledged at all. Besides, some databases could be cited on the web sites and in new or obscure journals, not covered by the ISI Citation Index.” (Galperin, 2007) He then goes on to describe some alternative measures he has investigated to proxy for this citation problem. This is a real issue, I think; data and dataset citations are not made as often or as consistently as they should be, and advice is often conflicting and itself conflicts with the conflicting standards (of which perhaps the best is the NLM standard). Indeed, the NAR articles describing databases seem to stand proxy for the databases: “the user typically starts by finding a database of interest in PubMed or some other bibliographic database, then proceeds to browse the full text in the HTML format. If the paper is interesting enough, s/he would download its text in the PDF format. Finally, if the database turns to be useful, it might be acknowledged with a formal citation.”
This is probably enough for one blog post, but I’ll return, I think, to have a look at some of these databases in a bit more detail.
BURKS, C. (1999) Molecular Biology Database List. Nucl. Acids Res., 27, 1-9. http://nar.oxfordjournals.org/cgi/content/abstract/27/1/1
GALPERIN, M. Y. (2004) The Molecular Biology Database Collection: 2004 update. Nucl. Acids Res., 32, D3-22. http://nar.oxfordjournals.org/cgi/content/abstract/32/suppl_1/D3
GALPERIN, M. Y. (2005) The Molecular Biology Database Collection: 2005 update. Nucleic Acids Research, 33.
GALPERIN, M. Y. (2006) The Molecular Biology Database Collection: 2006 update. Nucleic Acids Research, 34. http://nar.oxfordjournals.org/cgi/content/full/34/suppl_1/D3
GALPERIN, M. Y. (2007) The Molecular Biology Database Collection: 2007 update. Nucleic Acids Research, 35, D3-D4. http://nar.oxfordjournals.org/cgi/content/abstract/35/suppl_1/D3
GALPERIN, M. Y. (2008) The Molecular Biology Database Collection: 2008 update. Nucleic Acids Research, 36, D2-D4. http://dx.doi.org/10.1093/nar/gkm1037
