Thursday 26 March 2009

Are research data facts and does it matter?

This should probably be titled “are research datasets comprised of facts and does it matter?”. It certainly does appear to matter whether datasets are comprised of facts, as in some legal jurisdictions facts are not copyrightable. If this is so, then without other protection such as the EU Database Right, or perhaps contract law, there is no basis for licences (you can’t control someone else’s use unless you have a right to exercise that control). This is part of the argument that led Science Commons to abandon attempts to find variants of the Creative Commons licences for datasets and databases, in favour of its proposals for putting datasets into the public domain.

This is an appealing solution in some research contexts, but worrying for other kinds of research. This is not for reasons of profit, but of ethics. Many medical, social science, anthropological, financial and other datasets contain data that are private, perhaps personally, culturally or corporately; these data were usually gathered with some kind of informed consent on use, and have to be protected. They cannot be placed into the public domain. If they are to be made available at all for re-use, there must be terms and conditions attached, ie some kind of licence.

I’m not attempting to argue the legal angle here. But I am interested in the “factness” of the data, which might inform the legal angle.

One might assume the height of Mount Everest is a fact. But check out the Wikipedia article on the subject to see a range of results. One might assume that the physical properties of chemical substances are facts, but check out Chemspider’s approach of assembling different measurements with their provenance (see which links to Jean-Claude Bradley’s earlier UsefulChem article). Or think of a geospatial database: some elements must be pretty much “skill and judgment” rather than facts, such as the point where a river debouches into the sea. Finally, one might assume that the names of the winners of horse races are facts, and so they are, but only after a race committee has adjudicated on the photo-finish, or whether interference took place.

In practice, what goes into datasets is rarely what is directly measured; it is almost always highly derived through various computations, adjustments and combinations. Environmental sciences can be quite explicit on this; see, for example, the British Atmospheric Data Centre’s description of the UARS (Upper Atmosphere Research Satellite) data levels. Here level 0 is the raw output data streaming from telemetry and instrumentation, effectively at the level of voltage changes; it is devoid of context. Level 1 data has been converted to the physical properties being measured, but will still be in formats tied to the instrument. Level 2 is post-calibration, and would refer to entities such as calculated geophysical profiles. Level 3 would be gridded and interpolated, and at this level there might be no clear correspondence with any observations (but there should be a clear computational lineage or provenance path linking these steps).
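To make the idea of a lineage or provenance path a little more concrete, here is a minimal sketch in Python. It is purely illustrative: the `DataProduct` class, its fields and the level descriptions are my own hypothetical names paraphrasing the levels above, not anything taken from the BADC or UARS documentation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical record of one processing level and what it was derived from."""
    level: int
    description: str
    derived_from: Optional["DataProduct"] = None

    def lineage(self) -> List[str]:
        """Walk back to the rawest ancestor and return the steps, rawest first."""
        steps = []
        node: Optional["DataProduct"] = self
        while node is not None:
            steps.append(f"Level {node.level}: {node.description}")
            node = node.derived_from
        return list(reversed(steps))


# Assumed descriptions, paraphrasing the four levels described in the text.
raw = DataProduct(0, "raw telemetry and instrument output")
physical = DataProduct(1, "physical properties, instrument-specific format", raw)
calibrated = DataProduct(2, "calibrated geophysical profiles", physical)
gridded = DataProduct(3, "gridded and interpolated fields", calibrated)

for step in gridded.lineage():
    print(step)
```

The point of the sketch is only that each derived product carries a pointer to its source, so that even a gridded level 3 value with no direct counterpart in the observations can still be traced back through every step to level 0.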

So we seem to be in a situation where datasets contain highly derived data, at some creative distance from direct observations, and what we think of as facts are (or ought to be) contestable consensus based on potentially conflicting evidence.

In fact (hah!) after a while it becomes hard to think of any good example of real science/research data that are facts. The question is, does this matter enough to make any difference?


  1. This gets even trickier in the States, because "sweat of the brow" is insufficient to confer a database copyright. The question at hand then becomes whether all the data-twiddling is sufficiently original to qualify the whole for a compilation copyright.

    Even in that case, though, a single datum is uncopyrightable in the US.

  2. Interesting epistemologically, but legally the fact that you apply computations to derive a 'fact' still doesn't make it copyrightable in the US.

    To be copyrightable, it needs to have 'originality' and possess a 'spark of creativity'.

    Now, that can be argued in court for a given putative set of 'facts'.

    But where people often get confused is in thinking that because it took a lot of human work to generate something, it MUST be creative. In fact, the court decision that established that facts aren't copyrightable in the US specifically disavowed this; previously the 'sweat of the brow' doctrine said that the amount of work it took to produce something was relevant to copyright. Not any more. You may have worked long and hard on figuring out (your theory of) the height of Everest, but that doesn't make that number copyrightable by you.

    I don't believe the copyrightability of scientific data has ever actually been tested in court. Maybe it is copyrightable, maybe it's not. Under the current law, it seems unlikely to me that a list of numbers is ever going to be copyrightable, no matter how scientifically creative you had to be to come up with it. But laws change with court decisions, as well as with legislation.

  3. Chris...this was fun to read. There have been many exchanges over "facts" on ChemSpider, especially when it comes to things like "names and identifiers". I am an NMR spectroscopist and while data are the output, it is the interpretation of the data that is where the greatest value comes. The same data can give different outcomes as described here: The "facts" about research data are that they are measured by someone, at some time, in some way and on some specific sample. The conclusions made from those measurements are made up from our training, our experiences and our intuition. Often they are correct but for sure we are all open to making mistakes/misinterpretations. There are so many cyclic arguments about data, facts, interpretations and knowledge. You only have to look at the Open Solubility Challenge work done by JC Bradley and his team to see that data, while facts (i.e. they are measured), can be very different.

  4. If you spend too long thinking about the epistemology it's easy to persuade yourself that nothing is a fact or a truth. Or rather, that there are no facts or truth...

    Surely what's relevant here is not how we can / could interpret the concept of 'fact', but how it was used in the bits of legislation that assert that they're different from non-facts.

    To return to the title of your post; it probably mainly matters to have an argument to make in case you end up being the long awaited 3rd test case for the database rights legislation. When there finally is a test case in this area, the sceptical voice in my head tells me that the epistemology will matter less than the lawyers' hourly rates...

  5. I would like to take this opportunity to explain some of the research we have undertaken in the OAK Law Project and the conclusions we have reached regarding copyright protection of data compilations in Australia. We have two primary publications addressing this area: Building the Infrastructure for Data Access and Reuse in Collaborative Research: An Analysis of the Legal Context, and Practical Data Management: A Legal and Policy Guide.

    s10(1) of the Australian Copyright Act 1968 defines a literary work to include a "compilation". This is where protection for data compilations under Australian law derives from. Any data that is collected, arranged, organised and presented in a logical fashion will usually be regarded as a compilation.

    Chris makes a good point that many data compilations will require a great deal of effort, analysis and creativity. In the US, creativity is a requirement before a data compilation can be protected by copyright. In Australia, creativity is not required; it is enough that the compilation is a result of the exercise of skill, knowledge or judgment in the arrangement of the data, or of the investment of substantial labour or expense in collecting the material (Desktop Marketing v Telstra).

    It can often be difficult to tell whether a compilation is one that would attract copyright protection. In our work, we have tended to err on the side of caution and assume that most compilations will attract copyright protection. This is because the threshold in Australia is so low. The main case in this area, Desktop Marketing v Telstra, involved the copying of a telephone directory. A telephone directory is merely a compilation of names and numbers listed in alphabetical order. If this is a compilation that attracts copyright, then most other compilations are likely to be protected by copyright under Australian law as well.

    Copyright law does not protect mere facts or information. Rather, it protects the expression of facts or information in a material form. This means that generally there would not be a problem with copying some of the basic facts contained in a compilation. For example, if I were to list the names and numbers of a small collection of my colleagues on my website, that would not usually be a problem. I have extracted the data that I need, in a fairly "random" fashion (in that I have not just copied a few pages of names and numbers in alphabetical order directly from the White Pages). I have not copied the way that the data is arranged in the telephone directory (the “expression”).

    With regard to Science Commons’ decision to discontinue advocating the application of Creative Commons licences to data compilations, my understanding is that they came to this decision for two reasons:
    (1) It was not always clear in the US whether the relevant compilation attracted copyright. If it did not but a person had put a CC licence on the compilation in the mistaken belief that it did, then restrictions would have been imposed on that dataset (e.g. that it could only be used non-commercially) which actually had no legal basis for being imposed; and
    (2) CC licences all contain an attribution requirement and Science Commons were concerned about what they call "attribution stacking" - i.e. where a dataset is compiled from data contributed by many different researchers, it would be extremely difficult for a user to attribute all of those researchers.

    At OAK Law, we still believe that CC licences can be applied to datasets in Australia because the concerns noted by Science Commons do not arise to the same degree in Australia. Firstly, we have a lower threshold test for copyright protection, meaning that copyright will more readily attach to datasets in Australia and the first problem noted by Science Commons is less likely to occur. Nevertheless, to be sure, we usually advocate that the widest CC licence - the attribution only licence - be applied to datasets. Secondly, unlike in the US, Australian copyright law includes Moral Rights, meaning that creators have to be attributed anyway, regardless of whether a CC licence is applied or not. We think there are various ways of getting around the "attribution stacking" problem - for example, a group of researchers could agree on a common way to be attributed (e.g. we could be attributed as "the OAK Law Project"), or the data could be attributed using a URL, which an interested party can visit and which can list all the contributors (and this list can be added to over time). The advantage of applying CC licences to data, in our view, is that it provides some certainty to users about what they can and cannot do with that data.

    Kylie Pappalardo
    Research Assistant
    Law School
    Queensland University of Technology

  6. Kylie, Thanks for writing up the detailed conclusions of your research. Very useful information.


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.