Thursday, 3 December 2009

IDCC 09 Keynote Address: Douglas Kell

Professor Douglas Kell provided us with a BBSRC perspective on digital data curation and his own perspective as an academic.

He began with the philosophy of data science, noting that data curation is not an end in itself, but rather a means to an end. He showed a diagram that he referred to as an arc of knowledge, which demonstrated that one would normally start with an idea or hypothesis, then do an experiment that would produce some data that would either be consistent with the hypothesis or not. The other side of the arc addresses where those ideas come from: the data, which is the starting point for many within the audience at IDCC 09. This is data-driven science, in which hypotheses are generated from the data, as opposed to hypothesis-driven science, on which he admitted the biological community is quite reliant. As we move towards more data-driven science we encounter the problem of storing not just the data itself so that we can make use of it, but also the knowledge generated from the data, a theme that would be picked up by Mark Birkin in his later talk.

The digital availability of all sorts of stuff changes the entire epistemology of how we do science, making all sorts of things possible.

Historically physics was seen as the high-data science, whilst biology is now also being recognised as a high-data science. Biology is a short, fat data model, with less data but more people using it, whilst physics could be described as a long, thin data model. He did note that there is a lot of video and image data related to biological studies that is not being shared because people don't yet know how to handle it.

Kell made the point that if you can access and re-use scientific data you will gain a huge advantage over those who cannot, both academically and potentially commercially, which helps to drive funding for this type of work. He used genomics as an example of just one of the fields generating huge data sets, which could be used for data-driven science but first need to be integrated.

Relative to the cost of creating the data, the cost of storing and curating it is minimal, so it is a good idea to store it effectively. But the issue is not just storage. There is also the cost of moving the data. Kell asserted that we need to move to a model where we don't take the data to the computing, but rather take the computing to the data, which will change the way we approach storage and sharing.

Kell moved on to explain that we will need a new breed of curators and tools to deal with these challenges, particularly that of making the data useful. Having the data does not always help, as biologists generally do not have the tools to deal with big data. He expects a type of software to evolve that does not sit locally on the scientist's machine, but sits somewhere else where it gets changed and updated, yet is usable by the scientist without specialist computing knowledge.

Kell pointed out that things do not evolve just from databases getting bigger, but rather from the tools to deal with the data and the curation methods evolving too. To illustrate, he pointed out that five scientific papers are published per minute, so we need a lot of tools to make this vast amount of literature (and the associated data) useful, so that it does not just end up in a data tomb.

Some areas of science have better strategies than others, but BBSRC are now looking at making data curation and sharing part of the funding process, whilst making sure that data-driven projects are not competing with hypothesis-driven bids. He noted that journalists are keen, funding bodies are keen, and the culture will soon change so that NOT managing and sharing data will become distinctly uncool, like smoking.

Finally, Kell emphasised the need to integrate the data and the metadata. He noted that the digital availability of all data has the potential to stop the balkanisation of scientific data, and that it is the responsibility of the people in the room to ensure this happens.

  1. Just curious, did he give his source for the five-papers-published-per-minute statistic?


