Tuesday, 19 August 2008

Credit from citing datasets?

Cameron Neylon has a thought-provoking post on his Science in the Open blog, arising from a discussion he had at Scifoo with Michael Eisen of Berkeley:
"Michael [...] felt people got too much credit for datasets already and that making them more widely citeable would actually devalue the contribution. The example he cited was genome sequences. This is a case where, for historical reasons, the publication of a dataset as a paper in a high ranking journal is considered appropriate.

In a sense I agree with this case. The problem here is that for this specific case it is allowable to push a dataset sized peg into a paper sized hole. This has arguably led to an over valuing of the sequence data itself and an undervaluing of the science it enables. Small molecule crystallography is similar in some regards with the publication of crystal structures in paper form bulking out the publication lists of many scientists. There is a real sense in which having a publication stream for data, making the data itself directly citeable, would lead to a devaluation of these contributions. On the other hand it would lead to a situation where you would cite what you used, rather than the paper in which it was, perhaps peripherally described. I think more broadly that the publication of data will lead to greater efficiency in research generally and more diversity in the streams to which people can contribute."
This seems a strong argument to me. A paper describing a research dataset isn't really a research paper, surely? So it is a round peg in a square hole. But if the alternative is that the dataset can only be cited via a research paper (on some other, probably related topic) that mentions it in passing, then that paper is likely to be a rather poor proxy. The dataset creator may get rather less credit, and the research paper authors rather more, than is due.

However, if we move to a situation where more datasets are cited directly, then not only do the dataset creators or providers get the credit that's due to them, but the credit is also of the right kind. That is, the citation is recognisably for creating or providing a dataset and not for a specific research contribution. I suppose that papers in the Nucleic Acids Research special issue on datasets are also recognisable as creation/provision papers, not research papers, but few other disciplines have such easily recognisable distinctions. So overall, simply using data citations looks like a better bet. As Cameron says:
"So to come back around to the original point, the value of different forms of contribution is not due to the fact that they are non-traditional or because of the medium per se, it is because they are different."
Amen to that!

1 comment:

  1. Completely agree with the arguments made here. One of the other problems Cameron discusses in his post is that credit only means anything when it is realized: in this context, by hiring and promotion committees and in the RAE.

    It's a somewhat circular problem - researchers won't be motivated to use data citations unless they're worth some realizable credit, and publishers are unlikely to create mechanisms to enable data citation until there's a demand for it.

    Until then, publishing and citing papers containing datasets kind of works, and there are some areas where the distinction between data and research is clear enough to have dedicated journals for data (small molecule crystallography is a good example here).

    Are dataset citations a case of the best being the enemy of the good? Could the RAE help?

