Friday, 25 April 2008

A Thousand Open Molecular Biology Databases

In January of each year, Nucleic Acids Research (NAR) publishes a special issue on databases for molecular biology research. To be considered, databases have to be open access (they specifically mean browsable without a username, password or payment, although it is possible there are conditions). The staggering thing is, in the past year the number of such databases passed 1,000!

My institution has a subscription to NAR, but the nice thing about the Database Issue is that it has itself been Open Access for the past 4 years or so. So you can check it out for yourself if you are interested.

I thought it might be interesting to go back and trace how they managed to get to 1,000+ databases in 15 years or so. It turns out to be relatively easy to check, back to about the year 2001; prior to that, as far as I can tell, you have to count them yourself. I did the count for 1999, so here’s a little picture of the growth since then (see Burks, 1999):

It is quite a staggering growth.

In recent years, the compilation article has been written by Michael Y. Galperin of the National Center for Biotechnology Information, US National Library of Medicine. He reports in the 2008 article that the complete list and summaries of 1078 databases are available online at the Nucleic Acids Research web site.

This year’s article has some interesting comments on databases; I particularly liked this one on Deja Vu, which uses a tool called eTBLAST “to find highly similar abstracts in bibliographic databases, including MEDLINE…. Some highly similar publications, however, come from different authors and look extremely suspicious” (Galperin, 2008). Curious! As usual he also reports on databases that appear to be no longer maintained and have been dropped from the list (around two dozen this year). Sometimes this is related to (perhaps because of, or maybe the cause of) the content being available in other databases.

There has in fact been relatively little attrition: the list's maintainers say they do not re-use accession numbers, and the highest accession number so far is 1176 against 1078 current entries, implying that just 98 databases have been dropped from the list! Galperin suggests “that the databases that offer useful content usually manage to survive, even if they have to change their funding scheme or migrate from one host institution to another. This means that the open database movement is here to stay, and more and more people in the community (as well as in the financing bodies) now appreciate the importance of open databases in spreading knowledge. It is worth noting that the majority of database authors and curators receive little or no remuneration for their efforts and that it is still difficult to obtain money for creating and maintaining a biological database. However, disk space is relatively cheap these days and database maintenance tools are fairly straightforward, so that a decent database can be created on a shoestring budget, often by a graduate student or as a result of a postdoctoral project. […] Subsequent maintenance and further development of these databases, however, require a commitment that can only be applauded.” (Galperin, 2005)
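For anyone who wants to check the attrition figure, here is a minimal sketch of the arithmetic. It assumes accession numbers are issued sequentially from 1 with no gaps and are never re-used; the function name `attrition` is mine, purely for illustration, and the two inputs are the figures quoted from the 2008 article.

```python
def attrition(highest_accession: int, current_entries: int) -> tuple[int, float]:
    """Return (number dropped, fraction dropped).

    Assumption: accession numbers run from 1 upward with no gaps and are
    never re-used, so the highest number equals the total ever listed.
    """
    dropped = highest_accession - current_entries
    return dropped, dropped / highest_accession


# Figures quoted from the 2008 compilation article.
dropped, fraction = attrition(highest_accession=1176, current_entries=1078)
print(f"{dropped} dropped, {fraction:.1%} of all accessions issued")
```

On those assumptions, roughly one listed database in twelve has been dropped over the life of the collection.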

So is this vanity databasing (this from a blog author, mind)? “In the very beginning of the genome sequencing era, Walter Gilbert and colleagues warned of 'database explosion', stemming from the exponentially increasing amount of incoming DNA sequence and the unavoidable errors it contains. Luckily, this threat has not materialized so far, due to the corresponding growth in computational power and storage capacity and the strict requirements for sequence accuracy.” (Galperin, 2004)

It’s not clear from the quote above how worthwhile or well-used the databases are. In the 2006 article, Galperin began looking at measures of impact, using the Science Citation Index. We can see from his reference list that he expects the NAR paper to stand proxy for the database. The most highly cited were Pfam, GO, UniProt, SMART and KEGG, all highly used “instant classics” (with >100 citations each in 2 years!). However, he writes: “On the other side of the spectrum are the databases that have never been cited in these 2 years, even by their own authors. This does not mean, of course, that these databases do not offer a useful content but one could always suggest a reason why nobody has used this or that database. Usually these databases were too specific in scope and offered content that could be easily found elsewhere.” (Galperin, 2006)

In the 2007 article, Galperin returned to this issue of how well databases are used. “However, citation data can be biased; e.g. in many articles use of information from publicly available databases is acknowledged by providing their URLs, or not acknowledged at all. Besides, some databases could be cited on the web sites and in new or obscure journals, not covered by the ISI Citation Index.” (Galperin, 2007) He then goes on to describe some alternative measures he has investigated to proxy for this citation problem. This is a real issue, I think; data and dataset citations are not made as often or as consistently as they should be, and advice is often conflicting and itself conflicts with the conflicting standards (of which perhaps the best is the NLM standard). Indeed, the NAR articles describing databases seem to stand proxy for the databases: “the user typically starts by finding a database of interest in PubMed or some other bibliographic database, then proceeds to browse the full text in the HTML format. If the paper is interesting enough, s/he would download its text in the PDF format. Finally, if the database turns to be useful, it might be acknowledged with a formal citation.”

This is probably enough for one blog post, but I’ll return, I think, to have a look at some of these databases in a bit more detail.

BURKS, C. (1999) Molecular Biology Database List. Nucleic Acids Research, 27, 1-9.

GALPERIN, M. Y. (2004) The Molecular Biology Database Collection: 2004 update. Nucleic Acids Research, 32, D3-22.

GALPERIN, M. Y. (2005) The Molecular Biology Database Collection: 2005 update. Nucleic Acids Research, 33.

GALPERIN, M. Y. (2006) The Molecular Biology Database Collection: 2006 update. Nucleic Acids Research, 34.

GALPERIN, M. Y. (2007) The Molecular Biology Database Collection: 2007 update. Nucleic Acids Research, 35, D3-D4.

GALPERIN, M. Y. (2008) The Molecular Biology Database Collection: 2008 update. Nucleic Acids Research, 36, D2-D4.

1 comment:

  1. Hi, nice post! Sorry it's taken me so long to find it! I was just doing some research into the history of the Molecular Biology Database Collection when I found this post. I am working on updating this article in MetaBase:

    Molecular Biology Database Collection - MetaBase

    MetaBase is a "database of biological databases" that is heavily derived from the Molecular Biology Database Collection, but also includes over 100 'user contributed' database articles and over 100 database articles taken from BMC Biology and BMC Bioinformatics.

    I've been developing MetaBase for a couple of years now, and given the nature of this post (and this blog!) I thought you would be interested to take a look at it. The original aims of MetaBase were very ambitious, but I think we are getting there (albeit slowly!).

