Michael and colleagues put together a psycholinguistic database many years ago, with funding from the MRC. At first (1981) it was available via postal request, then (1988) via ftp from the Oxford Text Archive, and now via web access from STFC and from UWA (since 1994). Not much has changed over the years: little effort, free data, no promotion.
The database is now publicly available, e.g. at the link above. Note that users are requested to cite the relevant paper:
Wilson, M.D. (1988) The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavior Research Methods, Instruments and Computers, 20(1), 6-11.

The vital piece of evidence was a plot of citations over the years (data extracted from Thomson ISI; I hope they don't mind my re-using it):
You'll notice that citations flowed very slowly in the early days, picked up a little after ftp access became available (it takes a few years to get the research done and published), and then really started to climb after web access was provided. Now Michael and his colleagues are getting around 80 citations per year!
To ram home his point, Michael did some quick investigations, and found "At least 7 of the top 20 UK-based Authors of High-Impact Papers, 2003-07 ranked by citations ... published data sets".
There was some questioning about whether citing data through papers (rather than the datasets themselves) was the right approach. Michael is clear where he stands on this question: paper citations get counted in the critical organs by which he and others will be measured, so citations should be of papers.
Summary of the pitch: data publishing is easy, it's cheap, and it will boost your career when the new Research Evaluation Framework comes into effect.
Data publishing is easy and cheap when it is this type of data. I'm not arguing against data publication (I'm in favour), but I don't think it is always going to be as easy and cheap as this. At my institution we see datasets that continue to grow (even over 26 years), take up at least gigabytes of space, and may relate to many publications (so it would not necessarily be clear which paper was the most appropriate to cite, and you would not get the concentration of citation that they have achieved here).
I think we are going to have to get more sophisticated about what we mean when we talk about 'datasets', more discerning about what we keep and what we discard, and come to better approaches to citing data and measuring its impact.
All valid points, Owen, and I had certain doubts when adding the "easy" bit, since for many datasets it is not so easy to get them adequately documented for re-use.
The paper, BTW, is one describing the dataset, which is the approach used in the Nucleic Acids Research database publications that I have mentioned before; some of those, including the Pfam one, are among the highest-cited papers. Personally I favour, and would work towards, a more accepted system of direct dataset citation (for which most of the standards etc. already exist), but the point here is that the current reward system hinges on the paper, so citing that way gets the rewards!
I'm not sure that 'a paper that describes the dataset' is always going to work well (or at least, not in the same way).
If I have a dataset that grows over time, and publish a paper now, do I also publish a paper describing the current state of the dataset? What happens when I publish again, based on the same 'dataset' 5 years later - does this require a new paper describing the dataset? I suppose the question is whether I have a new dataset or the same one (albeit with more data).