Monday, 16 July 2007

Open Data... Open Season?

Peter Murray-Rust is an enthusiastic advocate of Open Data (the discussion runs right through his blog; this link is just to one of his articles close to the subject). I understand him to want to make science data openly accessible for scientific access and re-use. It sounds like a pretty good thing! Are there significant downsides?

Mags McGinley recently posted in the DCC Blawg about the report "Building the Infrastructure for Data Access and Reuse in Collaborative Research" from the Australian OAK Law project. This report includes a substantial section (Chapter 4) on current practices and attitudes to data sharing, with 31 examples, many from genomics and related areas. Peter M-R wants a very strong definition of Open Access (defined by Peter Suber as BBB, for Budapest, Bethesda and Berlin, which effectively requires no restrictions on re-use, even commercial re-use). Although licences were often not clear, what could be inferred in these 31 cases would generally not fit the BBB definition.

However, buried in the middle of the report is a cautionary tale. Towards the end of chapter 4, there is a section on risks of open data in relation to patents, following on from experiences in the Human Genome and related projects.
"Claire Driscoll of the NIH describes the dilemma as follows:

It would be theoretically possible for an unscrupulous company or entity to add on a trivial amount of information to the published…data and then attempt to secure ‘parasitic’ patent claims such that all others would be prohibited from using the original public data."
(The reference given is Claire T Driscoll, ‘NIH data and resource sharing, data release and intellectual property policies for genomics community resource projects’ Expert Opin. Ther. Patents (2005) 15(1), 4)

The report goes on:
"Consequently, subsequent research projects relied on licensing methods in an attempt to restrict the development of intellectual property in downstream discoveries based on the disclosed data, rather than simply releasing the data into the public domain."
They then discuss the HapMap (International Haplotype Map) project, which attempted to make data available while restricting the possibilities for parasitic patenting.
"Individual genotypes were made available on the HapMap website, but anyone seeking to use the research data was first required to register via the website and enter into a click-wrap licence for the use of the data. The licence entered into, the International HapMap Project Public Access Licence, was explicitly modeled on the General Public Licence (GPL) used by open source software developers. A central term of the licence related to patents. It allowed users of the HapMap data to file patent applications on associations they uncovered between particular SNP data and disease or disease susceptibility, but the patent had to allow further use of the HapMap data. The licence specifically prohibited licensees from combining the HapMap data with their own in order to seek product patents..."
Checking HapMap, the project's Data Release Policy describes the process, but the link to the click-wrap agreement says that the data is now open (see also the NIH press release). There were obvious problems, in that the data could not be incorporated into more open databases. The turning point for them seems to be:
"...advances led the consortium to conclude that the patterns of human genetic variation can readily be determined clearly enough from the primary genotype data to constitute prior art. Thus, in the view of the consortium, derivation of haplotypes and 'haplotype tag SNPs' from HapMap data should be considered obvious and thus not patentable. Therefore, the original reasons for imposing the licensing requirement no longer exist and the requirement can be dropped."
So, they don't say the threat does not exist from all such open data releases, but that it was mitigated in this case.

Are there other examples of these kinds of restrictions being imposed? Or of problems ensuing because they have not been imposed, and the data left open? (Note, I'm not at all advocating closed access!)


  1. Hence the need for appropriate licensing?

  2. I have become more and more sceptical about the viability of open licences in scientific data, as the best examples so far have been data in the public domain. While I agree that there is potential for abuse, I think that the perceived danger of parasitic patenting may be overplayed in the Australian report.

    For some outdated musings on this, see this article.

  3. Things get more interesting when you consider countries with database rights.

    At the iCommons Summit in Dubrovnik last month, Rufus Pollock was advocating for CC-style licenses for databases. John Wilbanks, on the other hand, was arguing against them, asking e.g. "What is a 'derivative work' of a database?"

    The patenting concern should be assuaged by the hope that patent offices will find the prior art. Patent offices don't have the best record at this, though. In response, see the more recent efforts of the USPTO to work with the open source software community, to ensure prior art in the software world is found and not patented. Similarly, I imagine that bright individuals like those at the Public Patent Foundation would be glad to weigh in with ideas about how to ensure this doesn't happen.

    Other forms of enclosure -- such as with database rights, copyright, or DRM -- are a bit trickier. But my hope is that BSD-licensing will suffice and the GPL is not necessary here -- I don't think that approach makes sense with data.

  4. Gavin - hence my point about licensing. The current draft of the TCL *is* an expression of Database Rights. We're currently working to express similar protections in jurisdictions that don't have such a right.

  5. Paul, I confess that I have not yet looked at your licence, and I hope that our legal person will be able to do so soon. But if CC with their considerable multi-jurisdiction resources is blowing cold on this, shouldn't we be worried about whether Talis has got it right?

  6. When assessing the novelty of a patent application, patent offices and courts will examine what was ‘available to the public’ at the time of filing. However, just placing raw gene sequence data into the public domain does not necessarily make it ‘available to the public’ in a way that would destroy the novelty of a subsequent patent application based on that sequence data.

    This principle is illustrated by the 1990 ruling from the European Patent Office (EPO) Technical Board of Appeal in the Biogen α-Interferon case (T301/87). It was argued here that Biogen's patent on the gene that encodes for Interferon should be rejected for lack of novelty because the DNA sequence in question had been published in a well-known gene bank. However, the EPO took the view that disclosure within the gene bank did not make the claimed DNA molecules sufficiently accessible to the public to be part of the state of the art. This was due to the lack of knowledge at the time about the function of the sequence and because the sequence was inter-spliced with 'junk' DNA. The Board held that:

    "the situation resembles that prevailing with natural substances ... and is rather like the isolation of a component or bacterium from soil where the same exists in admixture with other useless materials. Thus the idea that the gene bank itself would once and for all anticipate an invention relating to a nucleotide sequence which may be contained therein cannot be sustained".

    We can see from this that public disclosure only destroys the novelty of a later invention if the information it contains, when understood by a person skilled in the art, is sufficient to allow reproduction of the later invention.


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.