This should probably be titled “are research datasets composed of facts, and does it matter?”. It certainly does appear to matter whether datasets are composed of facts, as in some legal jurisdictions facts are not copyrightable. If that is so, then without other protection such as the EU Database Right, or perhaps contract law, there is no basis for licences (you cannot control someone else’s use unless you have a right to exercise that control). This is part of the argument that led Science Commons to abandon attempts to find variants of the Creative Commons licences for datasets and databases, in favour of its proposals for putting datasets into the public domain.
This is an appealing solution in some research contexts, but worrying for other kinds of research. This is not for reasons of profit, but of ethics. Many medical, social science, anthropological, financial and other datasets contain data that are private, perhaps personally, culturally or corporately; these data were usually gathered with some kind of informed consent on use, and have to be protected. They cannot be placed into the public domain. If they are to be made available at all for re-use, there must be terms and conditions attached, i.e. some kind of licence.
I’m not attempting to argue the legal angle here. But I am interested in the “factness” of the data, that might inform the legal angle.
One might assume the height of Mount Everest is a fact, but check out the Wikipedia article on the subject to see a range of results. One might assume that the physical properties of chemical substances are facts, but check out ChemSpider’s approach of assembling different measurements with their provenance (see http://www.chemspider.com/blog/there-are-no-facts-in-science-only-measurement-embedded-within-assumptions.html which links to Jean-Claude Bradley’s earlier UsefulChem article). Or think of a geospatial database: some elements must be pretty much “skill and judgment” rather than facts, such as the point where a river debouches into the sea. Finally, one might assume that the names of the winners of horse races are facts, and so they are, but only after a race committee has adjudicated on the photo-finish, or on whether interference took place.
In practice, what goes into datasets is rarely what is directly measured; it is almost always highly derived through various computations, adjustments and combinations. Environmental sciences can be quite explicit on this, see for example the British Atmospheric Data Centre’s description of the UARS (Upper Atmosphere Research Satellite) data levels. Here Level 0 is the raw output data streamed from telemetry and instrumentation, effectively at the level of voltage changes; it is devoid of context. Level 1 data have been converted to the physical properties being measured, but will still be in formats tied to the instrument. Level 2 is post-calibration, and would refer to entities such as calculated geophysical profiles. Level 3 would be gridded and interpolated, and at this level there might be no clear correspondence with any observations (but there should be a clear computational lineage or provenance path linking these steps).
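To make the idea of data levels concrete, here is a minimal sketch of such a chain in Python. This is not the actual UARS or BADC processing code; every function name, scale factor and calibration constant below is invented purely for illustration. The point is only that each level is a computation on the previous one, and that a provenance trail can record the lineage from raw counts to gridded values.

```python
# Illustrative sketch of data levels 0-3, with an explicit provenance trail.
# All names and constants are hypothetical, not taken from any real pipeline.

def to_level1(raw_counts, scale=0.01, offset=-1.0):
    """Level 0 -> 1: convert raw instrument counts to physical units."""
    return [c * scale + offset for c in raw_counts]

def to_level2(values, calibration=1.02):
    """Level 1 -> 2: apply a calibration to get geophysical quantities."""
    return [v * calibration for v in values]

def to_level3(values):
    """Level 2 -> 3: interpolate onto a grid (here, simple midpoints)."""
    return [(a + b) / 2 for a, b in zip(values, values[1:])]

def process(raw_counts):
    """Run the whole chain, recording the provenance of each step."""
    provenance = ["level0: raw telemetry counts"]
    v1 = to_level1(raw_counts)
    provenance.append("level1: counts converted to physical units")
    v2 = to_level2(v1)
    provenance.append("level2: calibration applied")
    v3 = to_level3(v2)
    provenance.append("level3: gridded/interpolated")
    return v3, provenance

data, trail = process([100, 120, 140])
```

Note that the Level 3 values in `data` no longer correspond one-to-one with any raw observation, which is exactly the point: only the `trail` connects them back to what was measured.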
So we seem to be in a situation where datasets contain highly derived data, at some creative distance from direct observations, and where what we think of as facts are (or ought to be) contestable consensus based on potentially conflicting evidence.
In fact (hah!), after a while it becomes hard to think of any good example of real science/research data that are facts. The question is, does this matter enough to make any difference?