Graham Pryor spotted an item on the CARMEN blog, pointing to a Business Week article (from 2007, we later realised) about a commercial pharma (Novartis) making research data from its Type 2 Diabetes studies available on the web. This seemed to me an interesting thing to explore (as a data person, not a genomics scientist), both for what it was, and for how they did it.
I could not find a reference to these data on the Novartis site, but I did find a reference to a similar claim dating back to 2004, made in the Boston Globe and then in some press releases from the Broad Institute in Cambridge, MA, referring to their joint work with Novartis (eg initial announcement, first results and further results). The first press release identified David Altshuler as the PI, and he was kind enough to respond to my emails and point me to their pages that link to the studies and to the results they are making available.
Why make the data available? The Boston Globe article said "Commercially, the open approach adopted by Novartis represents a calculated gamble that it will be better able to capitalize on the identification of specific genes that play a role in Type 2 diabetes. The firm already has a core expertise in diabetes. Collaborating on the research will give its scientists intimate knowledge of the results."
The Business Week article said "...the research conducted by Novartis and its university partners at MIT and Lund University in Sweden merely sets the stage for the more complex and costly drug identification and development process. According to researchers, there are far more leads than any one lab could possibly follow up alone. So by placing its data in the public domain, Novartis hopes to leverage the talents and insights of a global research community to dramatically scale and speed up its early-stage R&D activities."
Thus far, so good. Making data available so that unrealised value can be exploited by others is at the heart of the digital curation concept. There are other comments on these announcements that cynically claim the data will already have been plundered before being made accessible; certainly the PIs will have first advantage, but there is nothing wrong with that. The data availability itself is a splendid move. It would be very interesting to know whether others have drawn conclusions from the data (I did not see any licence terms, conditions, or even requests such as attribution, although maybe this is assumed as scientific good practice in this area).
Business Week go on to draw wider conclusions:
"The Novartis collaboration is just one example of a deep transformation in science and invention. Just as the Enlightenment ushered in a new organizational model of knowledge creation, the same technological and demographic forces that are turning the Web into a massive collaborative work space are helping to transform the realm of science into an increasingly open and collaborative endeavor. Yes, the Web was, in fact, invented as a way for scientists to share information. But advances in storage, bandwidth, software, and computing power are pushing collaboration to the next level. Call it Science 2.0."
I have to say I'm not totally convinced by the latter phrase. Magazines like Business Week do like buzz-words such as Science 2.0, but so far comparatively little science is affected by this kind of "radical sharing". Genomics is definitely one of the poster children in this respect, but the vast majority of science continues to be lab or small-group based, with an orientation towards publishing results as papers, not data.
So what have they made available? There are 3 diabetes projects listed:
- Whole Genome Scan for Type 2 Diabetes in a Scandinavian Cohort
- Family-based linkage scan in three pedigrees with extreme diabetes phenotypes
- A Whole Genome Admixture Scan for Type 2 Diabetes in African Americans
The second of these does not appear to have data available online. The third project has results data in the form of an Excel spreadsheet, with 20 columns and 1294 rows; the data appear relatively simple (a single sheet, with no obvious formulae or Excel-specific issues that I could see), and could probably have been presented just as easily as CSV or another text variant. There is a small amount of header text in row 2 that spans columns, plus some colour coding, which may have justified the use of Excel. Short- to medium-term access to these data should be simple.
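To make the format point concrete, here is a minimal sketch (not anything the Broad Institute provides) of flattening such a single-sheet spreadsheet to CSV with Python and pandas; the file and sheet names are hypothetical placeholders.

```python
# Minimal sketch, assuming a single-sheet Excel file with no formulae:
# flatten it to CSV for longer-term access. File names are hypothetical.
import pandas as pd

# Read raw cell values; the header text spanning row 2 would need a manual
# check, so no header row is assumed here.
df = pd.read_excel("admixture_scan_results.xls", sheet_name=0, header=None)

# Note: any colour coding is lost in this conversion; if it carries meaning,
# it should be documented separately.
df.to_csv("admixture_scan_results.csv", index=False, header=False)
print(f"Wrote {df.shape[0]} rows x {df.shape[1]} columns")
```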
The first project shows two different types of results, with a lot more data: Type 2 Diabetes results and Related Traits results. The Type 2 Diabetes results comprise a figure in JPEG or PDF, plus data in two forms: a HTML table of "top single-marker and multi-marker results", and a tab-delimited text file (suitable for analysis with Haploview 4.0) of "all single-marker and multi-marker results". These data are made available both as the initial release of February 2007 and as an updated release from March 2007. There is a link to Instructions for using the results files, effectively short-hand instructions for feeding the data into Haploview and doing some analyses on them. The HTML table is just that; data in individual cells are numbers or strings, without any XML or other encoding. There are links to entries in NCBI, HapMap and Ensembl, however.
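For a sense of how approachable the tab-delimited files are outside Haploview, the sketch below simply loads one into pandas for inspection; the file name is a placeholder, and the column layout is taken from whatever header the file itself declares, since I am not reproducing it here.

```python
# Minimal sketch: inspect one of the tab-delimited results files.
# The file name is a hypothetical placeholder; the columns come from
# whatever header row the file provides.
import pandas as pd

results = pd.read_csv("all_single_and_multi_marker_results.txt", sep="\t")

print(results.columns.tolist())  # see which fields are present
print(results.head())            # first few markers for a quick look
```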
The Related Traits results also come in an initial release (also February 2007) and an updated release from September 2007. The results again have a summary, this time a table, but still in JPEG or PDF form. The detailed results are more complex: there is a HTML table of traits in 4 groups (Glucose, Obesity, Lipid and Blood Pressure), and for each trait (eg Fasting Glucose) up to 4 columns of data. The first column is a description of the trait as a PDF, the next is a link to a HTML Table of Top Single Marker Results for Association, the next is a link to a text Table of All Single Marker Results for Association, and the last is a link to a text table of Phenotype summary statistics by genotype (both of these have the same format as above, although the latter has different columns).
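Purely as an illustration of that layout, one could index the Related Traits results in a small data structure like the sketch below; the file names are invented placeholders standing in for the links on the page.

```python
# Illustrative sketch only: an index of the Related Traits results as
# described on the page. All file names are invented placeholders.
related_traits = {
    "Glucose": {
        "Fasting Glucose": {
            "trait_description_pdf": "fasting_glucose.pdf",
            "top_single_marker_html": "fasting_glucose_top.html",
            "all_single_marker_txt": "fasting_glucose_all.txt",
            "phenotype_summary_txt": "fasting_glucose_pheno.txt",
        },
        # ... further glucose-related traits
    },
    "Obesity": {},         # traits omitted from this sketch
    "Lipid": {},
    "Blood Pressure": {},
}

# For example, list the files available for one trait:
print(related_traits["Glucose"]["Fasting Glucose"])
```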
It seems clear that there is a lot of data here; how useful they are to other scientists is not for me to judge. Certainly a scientist looking through these pages could form judgments on the usefulness and relevance of these data to his or her work. There's not much to help a robot looking for science data on the Internet, though. I'm not sure what form such information might take, although there are examples in Chemistry. Perhaps the data cells should be automatically encoded according to a relevant ontology, so that the significance of the data travels with them. Possibly microformats or RDFa could have (or come to have) some relevance. However, both the HTML and text formats are very durable (more so than the Excel format for project 3), and should be easily accessible (or transformed into later forms) for at least as long as the Broad Institute wishes to continue to make them available.
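To give a flavour of what RDFa-style annotation might look like, here is a purely hypothetical sketch that wraps a single results value in attributes pointing at an invented vocabulary; neither the vocabulary URI nor the property names correspond to anything the Broad Institute actually publishes.

```python
# Hypothetical sketch: an RDFa-annotated table cell, so that the meaning of
# a value travels with it. The vocabulary URI, type and property names are
# invented placeholders, not terms from any real genomics ontology.
VOCAB = "http://example.org/t2d-results#"

def rdfa_cell(snp_id: str, p_value: float) -> str:
    """Return an HTML table cell carrying machine-readable annotation."""
    return (
        f'<td vocab="{VOCAB}" typeof="AssociationResult">'
        f'<span property="snp">{snp_id}</span>: '
        f'<span property="pValue" datatype="xsd:double">{p_value}</span>'
        f"</td>"
    )

print(rdfa_cell("rs1234567", 3.2e-6))  # placeholder SNP identifier and p-value
```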