In a content-packed keynote talk, Ed Seidel gave us a preview of the types of project that are driving the National Science Foundation's need to think about data, along with the cyberinfrastructure questions, the policy questions and the cultural issues surrounding data that are deeply rooted in the scientific community.
To illustrate this, Seidel began with the example of some visualisation work on colliding black holes that he had conducted whilst working in Germany with data collected in Illinois. This required a lot of work on remote visualisation and high performance networking, but moving the data by network to create the visualisations proved impractical, so the team had to fly to Illinois to do the visualisations, then bring the data back. He also cited projects that are already expecting to generate an exabyte of data – vastly more than is currently being produced – so the problem of moving data is only going to get bigger.
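As a rough illustration of why moving data at this scale by network becomes impractical, the Python sketch below estimates how long a transfer would take over a link with a given sustained throughput. The link speeds and the function name `transfer_days` are assumptions chosen for illustration only; they are not figures from the talk.

```python
def transfer_days(size_bytes: float, link_bits_per_s: float) -> float:
    """Days needed to move size_bytes over a link sustaining link_bits_per_s."""
    seconds = size_bytes * 8 / link_bits_per_s  # 8 bits per byte
    return seconds / 86_400                     # 86,400 seconds per day


EXABYTE = 10**18  # bytes, using the decimal definition

# Hypothetical sustained throughputs, for illustration only.
for label, rate in [("1 Gbit/s", 1e9), ("10 Gbit/s", 1e10), ("100 Gbit/s", 1e11)]:
    print(f"{label}: {transfer_days(EXABYTE, rate):,.0f} days")
```

Under these assumptions, even a perfectly utilised 100 Gbit/s link would need roughly 926 days, about two and a half years, to move a single exabyte.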
Seidel looked first at the cultural issues that influence scientific methods when it comes to the growing data problem. He described the 400-year-old scientific model of collecting data in small groups or as individuals, writing things down in notebooks and using small amounts of data that could be measured in kilobytes in modern terms, with calculations carried out by hand. This did not change from Galileo and Newton through to Stephen Hawking in the 1970s. However, within 20-30 years, the way of doing science changed – with teams of people working on projects using high performance computers to create visualisations of much larger amounts of data. This is a big culture shift, and Seidel pointed out that many senior scientists were trained in the old method. You now need larger collaborative teams to solve problems and manage the data volumes to do true data-driven science. He used the example of the Large Hadron Collider, where scientists are looking at generating tens of petabytes of data, which need to be distributed globally to be analysed – with around 15,000 scientists working on around six experiments.
Seidel then went on to discuss how he sees this trend of data sharing developing, using the example of the recent challenge of predicting the route of a hurricane. This involved the sharing of data between several communities to achieve all the necessary modelling to respond to the problem in a short space of time. Seidel calls the groups solving these complex problems “grand challenge communities”. The scientists involved will have three or four days to share data and create models and simulations to solve these problems, but will not know each other! The old modality of sharing data with people that you know will not work, so these communities will have to find ways to come together dynamically to share data if they are going to solve these sorts of problems. Seidel predicted that these issues are going to drive both technical development and policy change.
To illustrate the types of changes already in the pipeline, Seidel cited colleagues who are experimenting with social networking technologies – particularly Twitter and Facebook – to help scientists collaborate. Specifically, they have set up a system whereby their simulation code tweets its status, and have also been uploading the visualisation images directly to Facebook in order to share them.
Seidel noted that high-dimensional, collaborative environments and tremendous amounts of bandwidth are needed, so technical work will be required. The optical networks often don't exist – with universities viewing such systems as plumbing and funding bodies not looking to support the upgrade of such infrastructure. Seidel argued that we need to find ways to catalyse this sort of investment.
To summarise, Seidel highlighted two big, tightly linked challenges in current science trends: multi-skilled collaborations and the dominance of data. He explained that, by his calculation, compute, data and networks have grown by 9-12 orders of magnitude in 20-30 years after 400 years of remaining essentially unchanged, which shows the scale of the change and of the culture shift that it represents.
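To put the 9-12 orders of magnitude figure into more familiar terms, the sketch below converts it into an equivalent year-on-year growth factor and doubling time. It assumes smooth exponential growth over the whole period, which is a simplification made here rather than a claim from the talk, and the function names are hypothetical.

```python
import math


def annual_factor(orders_of_magnitude: float, years: float) -> float:
    """Constant yearly multiplier that yields 10**orders_of_magnitude over `years`."""
    return 10 ** (orders_of_magnitude / years)


def doubling_time_years(orders_of_magnitude: float, years: float) -> float:
    """Years needed to double at that constant growth rate."""
    return years * math.log10(2) / orders_of_magnitude


# The two extremes of the quoted ranges: slowest and fastest growth.
for orders, years in [(9, 30), (12, 20)]:
    factor = annual_factor(orders, years)
    doubling = doubling_time_years(orders, years)
    print(f"{orders} orders in {years} years: x{factor:.2f} per year, "
          f"doubling every {doubling:.2f} years")
```

On this assumption, the slowest case (9 orders of magnitude in 30 years) corresponds to roughly a doubling every year, and the fastest case (12 orders in 20 years) to roughly a quadrupling every year.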
NSF has a vision document which highlights four main areas – virtual organisations for distributed communities, high performance computing, data visualisation, and learning and work practices. Focusing on the “Data and Visualisation” section, Seidel quoted its dream for data to be routinely deposited in a well-documented form, regularly and easily consulted and analysed by people, and openly accessible, protected and preserved. He admitted this is a dream that is nowhere near being realised yet. He recognised that there need to be incentives for the changes and new tools to deal with the data deluge. The NSF is looking to develop a national data framework, but Seidel emphasised that the scientific community really needs to take the issues to heart.
Taking the role of the scientist, Seidel took us through some of the questions and concerns which a research scientist may raise in the face of this cultural shift. They included concerns about replication of results – which Seidel noted could be a particular problem when services come together in an ad hoc way, but which needs to be addressed if the data produced is to be credible.
Seidel moved on to discuss the types of data that need to be considered, in which he included software. He stressed that software needs to be considered as a type of data and therefore needs to be given the same kind of care in terms of archiving and maintenance as traditional scientific collection or observation data. He also included publications as data, as many of these are now in electronic form.
In discussing the hazards faced, Seidel noted that we are now producing more data each year than we have done in the entirety of human history up to this point – which demonstrates a definite phase change in the amount of data being produced.
The next issue of concern Seidel highlighted was that of provenance – particularly how we collect the metadata related to the data that we are trying to move around. He admitted that we simply don't know how to do this annotation at the moment, but this is being worked on.
Having identified these driving factors, Seidel explained the observations and workgroup structures that NSF has in place to think more deeply and investigate solutions to these problems, including the DataNet project. $100 million is being invested in five different projects as part of this programme. Seidel hopes that this investment will help catalyse the development of a data-intensive science culture. He made some very “apple-pie” idealistic statements about how the NSF sees data, and then used these to explain why the issues are so hard, emphasising the need to engage the library community who have been curating data for centuries, and the need to consider how to enforce data being made available post-award.
Discussions at the NSF are suggesting that each individual project should have a data management policy which is then peer-reviewed. They don't currently have consistency, but this is the goal.
In conclusion, Seidel emphasised that many more difficult cases are coming. However, the benefits of making data available and searchable – potentially with the help of searchable journals and electronic access to data – are great for the progress of science, and the requirement to make many more things available than before is percolating down from the US Government to the funding bodies. Open access to information online is a desirable priority and clarification of policy will be coming soon.