Metadata, in his environment, represents the limiting factor. A critical part of Bryan’s argument on costs relates to the limits on human ability to do tasks, particularly certain types of repetitive tasks. We will never look at all our data, so we must automate, in particular we must process automatically on ingest. Metadata really matters to support this.
Bryan dashed past an interesting classification of metadata, which from his slides is as follows:
- A – Archival (and I don’t think he means PREMIS here: “normally generated from internal metadata”)
- B – Browse: context, generic, semantic (NERC developing a schema here called MOLES: Metadata Objects for Linking Environmental Sciences)
- C – Character and Citation: post-fact annotation and citations, both internal and external
- D – Discovery metadata, suitable for harvesting into catalogues: DC, NASA-DIF, ISO19115/19139 etc
- E – Extra: Discipline-specific metadata
- O – Ontology (RDF)
- Q – Query: defined and supported text, semantic and spatio-temporal queries
- S – Security metadata
Can choose NOT TO TAKE THE DATA (but the act of not taking the data is resource intensive). Bryan showed and developed a cost model based on 6 components; it’s worth looking at his slides for this. But the really interesting stuff was on his conclusions on limits, with 25 FTE:
"• We can support o(10) new TYPES of data stream per year.Interestingly, they charge for current storage costs for 3 years (only) at ingest time; by then, storage will be “small change” provided new data, with new storage requirements, keep arriving. Often the money arrives first and the data very much later, so they may have to be a “banker” for a long time. They have a core budget that covers administration, infrastructure, user support, and access service development and deployment. Everything changes next year however, with their new role supporting the Inter-Governmental Panel on Climate Change, needing Petabytes plus.
• We can support doing something manually if it requires at most:
– o(100) activities of a few hours, or
– o(1000) activities of a few minutes, but even then only when supported by a modicum of automation.
• If we have to [do] o(10K) things, it has to be completely automated and require no human intervention (although it will need quality control).
• Automation takes time and money.
• If we haven’t documented it on ingestion, we might as well not have it …
• If we have documented it on ingestion, it is effectively free to keep it …
• … in most cases it costs more to evaluate for disposal than keeping the data.
• (but … it might be worth re-ingesting some of our data)
• (but … when we have to migrate our information systems, appraisal for disposal becomes more relevant)”