Thursday, 11 September 2008

Many citations flow from data...

I've been at the UK e-Science All Hands Meeting in Edinburgh over the past few days (easy, since it's being held in the building in which I work!). Lots of interesting presentations; far too many to go to, let alone blog about. But I can't resist mentioning one short presentation (PPT), from Prof Michael Wilson of STFC. His pitch was simple: publishing data is good for your career, especially now. And he has evidence to back up his claims!

Michael and colleagues put together a Psycholinguistic database many years ago, with funding from the MRC. At first (1981) it was available via postal request, then (1988) via ftp from the Oxford Text Archive, and now via web access from STFC and from UWA (since 1994). Not much change over the years, little effort, free data, no promotion.

The database is now publicly available, eg at the link above. You may see that users are requested to cite the relevant paper:
Wilson, M.D. (1988) The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavioural Research Methods, Instruments and Computers, 20(1), 6-11.
The vital piece of evidence was a plot of citations over the years (data extracted from Thomson ISI, I hope they don't mind my re-using it):

You'll notice that citations flowed very slowly in the early days, picked up a little after ftp access was available (it takes a few years to get the research done and published), and then really started to climb after web access was provided. Now Michael and his colleagues are getting around 80 citations per year!

To ram home his point, Michael did some quick investigations, and found "At least 7 of the top 20 UK-based Authors of High-Impact Papers, 2003-07 ranked by citations ... published data sets".

There was some questioning on whether citing data through papers (rather than the dataset itself) was the right approach. Michael is clear where he is on this question: paper citations get counted in the critical organs by which he and others will be measured, so citations should be of papers.

Summary of the pitch: data publishing is easy, it's cheap, and it will boost your career when the new Research Evaluation Framework comes into effect.


  1. Data publishing is easy and cheap when it is this type of data. I'm not arguing against data publication (I'm in favour), but I don't think it is all going to be as easy and cheap as this. At my institution we see data sets that continue to grow (even over 26 years), take up at least gigabytes of space, and may relate to many publications (so it would not necessarily be clear which paper was the most appropriate to cite, and you would not get the concentration of citations that they have achieved here).

    I think we are going to have to get more sophisticated about what we mean when we talk about 'datasets', more discerning about what we keep and what we discard, and come to better approaches to citing data or measuring its impact.

  2. All valid points, Owen, and I had certain doubts when adding the "easy" bit, since for many datasets it is not so easy to get it adequately documented for re-use.

    The paper, BTW, is one describing the dataset, which is the approach used in the Nucleic Acids Research database publication that I have mentioned before; some of those, including the Pfam one, are amongst the highest cited papers. Personally I favour and would work towards a more accepted system of direct dataset citations (for which most of the standards etc already exist), but the point here is that the current reward system is hinged towards the paper, so citing that way gets the rewards!

  3. I'm not sure that 'a paper that describes the dataset' is always going to work well (or at least, not in the same way).

    If I have a dataset that grows over time, and publish a paper now, do I also publish a paper describing the current state of the dataset? What happens when I publish again, based on the same 'dataset' 5 years later - does this require a new paper describing the dataset? I suppose the question is whether I have a new dataset or the same one (albeit with more data).


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.