Tuesday, 17 February 2009

Data Curation for Integrated Science: the 4th Rumsfeld Class!

I gave a talk today to a workshop of NERC (Natural Environment Research Council) Data Managers, on data curation for integrated science. Integrated science is a term that comes up a lot in the environmental sciences, for good reasons when contemplating global change. However, there doesn’t seem to be a good definition of integrated science on the web, so I came up with my own for the purposes of the talk: “The application of multiple scientific disciplines to one or more core scientific challenges”. The point was that scientists working in integrated science MUST be USING unfamiliar data. The implication for data managers, particularly environmental data managers, is that they must make their data available to unfamiliar users. What does this imply?

By some strange mental process, and a fortuitous Google search, this led me to the Poetry & Philosophy of Donald H Rumsfeld, as exposed to the world by Hart Seely, initially in Slate in April 2003, and now published in a book, “Pieces of Intelligence”, by Seely. These poems are well worth a look for their own particular delight, but the one I was looking for, which you will probably have heard in various guises, is The Unknown:
‘As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.’
—Feb. 12, 2002, Department of Defense news briefing
Now this insightful (!) poem (set to music by Bryant Kong, available at http://www.stuffedpenguin.com/rumsfeld/lyrics.htm) perhaps defines 3 epistemological classes:
Known knowns,
Known unknowns, and
Unknown unknowns
Logically there should be a 4th Rumsfeld class: the unknown knowns. And I think this class is especially important for data management for unfamiliar users.

The problem is that in many research projects, there are too many people who “know too much”; with so much shared knowledge, much goes undocumented. In OAIS terms, we are looking at a small, tight Designated Community with a shared Knowledge Base, and consequently little need for Representation Information. In integrated science, and particularly environmental sciences, the Community broadens and the time span extends, so the need for Representation Information increases. I’m using the terms very broadly, and RepInfo can be interpreted in many different ways. But the requirement is to make the tacit knowledge more explicit: the unknown knowns implicit in the data creation and acquisition process.

Interestingly, subsequent speakers seemed to pick up on the idea of making explicit the unknown knowns, so maybe the 4th Rumsfeld Class is here to stay in NERC!


