I am very happy that Issue 2 of the International Journal of Digital Curation (IJDC) is now out. This open access journal issue contains 7 peer-reviewed papers and 7 general articles.
I am listed as editor, and did write the editorial, but Richard Waller of UKOLN did all the hard editorial work, in between his day job on Ariadne. Not to mention our authors, of course! My heartfelt thanks to all concerned.
We have a few articles in the pipeline for Issue 3, but I do want to get up a good head of steam. So if you, dear reader, care about digital curation, the care and feeding of science data, the relationship of data to publication, the rights and wrongs of access to data, and/or digital preservation, and have some research results to put forward, please do write us a paper or an article!
We are aware, by the way, that there is a slight problem with our journal in a presentational sense. Take the article by Graham Pryor, for instance: it contains various representations of survey results presented as bar charts etc. in a PDF file (and we know what some people think about PDF and hamburgers). Unfortunately, the data underlying these charts are not accessible!
For various reasons, the platform we are using is an early version of the OJS system from the Public Knowledge Project. It's pretty clunky and limiting, and does tend to restrict what we can do. Now that Issue 2 is out of the way, we will be experimenting with later versions, with the aim of including supplementary data (attached? external?) or embedded data (RDFa? microformats?) in the future. Our aim is to practise what we may preach, but we aren't there yet.
If anyone knows how to do this, do please get in touch!
Tuesday 31 July 2007
Wednesday 25 July 2007
Digital Curation Conference: Remember the Call for Papers
There are fewer than three weeks to go before the Call for Papers for the 3rd International Digital Curation Conference closes on 15 August 2007. I do want to encourage researchers to submit research papers on digital curation to this conference. We had a good set of papers last year in Glasgow, and we are hoping for an even stronger field this year, when the Conference will be held in Washington DC. We are trying to get funding to support some travel, especially for younger speakers.
Tuesday 24 July 2007
Question on approaches to curating textual material
Dave Thompson, Digital Curator at the Wellcome Library, asked a question on the Digital Preservation list (which is not well set up for discussion just now). I've replied, but we agreed I would adapt my reply for the blog for any further discussion that might emerge. His question is quoted below, followed by my reply.
"I'm looking for arguments for and against when, and if, digital material should be normalised. I'm thinking about the long term management of textual material in proprietary formats such as MS Word. I see three basic approaches on which I'm seeking the lists comments and thoughts.Dave, the questions you ask have been rumbling on for years. The answers, reasonably enough, keep changing. Partly depending on who asks and who answers, but also depending on the time and the context. So that's a lot of help isn't it?
The first approach normalises textual material at the point of ingestion, converting all incoming material to a neutral format such as XML immediately. This would create an open format manifestation with the aim of long term sustainable management.
The second approach would be one of 'wait and see', characterised by recognising that if a particular format isn't immediately 'at risk' of obsolescence why touch it until some form of migration becomes necessary at some future point.
The third approach preserves the bitstream as acquired and delivers it in an unmodified form upon request, ie MS Word in – MS Word out.
The first approach requires tools, resources and investment immediately. The second requires these same resources, and possibly more, in the future. The future requirements for the third approach are perhaps unknown aside from that of adequate technical metadata.
I'm interested in ideas about the sustainability of these approaches, the costs of one approach over the other and the perceived risks of moving material to an open format sooner rather than later. I'd be very interested in examples of projects which have taken either approach."
You might want to look at a posting in David Rosenthal's blog on Format Obsolescence as the Prostate Cancer of Preservation (for younger and/or non-male curators, the reference is that many more men die WITH prostate cancer than die because of it.) Lots of food for thought there, and some of the same themes I was addressing in my Ariadne article a year or so ago.
The simplest answer to your question is "it depends". If you've got lots of money, and given the state of flux right now in the word processing market, I would suggest doing both (1) and (3): that is, make sure you preserve your ingested bits unchanged, but also create a "normalised" copy in your favourite open format.
What format should that be? Well, for Word at the moment it might be sticky. PDF (strictly PDF/A if we're into preservation) might be appropriate. However, as far as ever extracting useful science from the document is concerned, the PDF is a hamburger (as Peter Murray Rust says; he reports Mike Kay as the origin: "Converting PDF to XML is a bit like converting hamburgers into cow"). PDF is useful where you want to treat something exactly as page images; it is probably much less useful for documents like spreadsheets (where the formulae are important).
Open Document Format is an international standard (ISO/IEC 26300:2006) supported by Open Source code with a substantial user and developer base, so its long term sustainability should be pretty strong. I've heard that there can be glitches in the conversions, but I have no experience (the Mac does not seem to be quite so well served). Office Open XML has been ratified by ECMA, and is moving (haltingly?) towards an ISO standard. Presumably its conversion process will be excellent, but I don't know of much of an open source code base yet. However the user base is enormous, and MS seems to be getting some messages from its users about sustainability. Nah, right now I would guess ODF wins for preservation.
It may not apply in this case, but often there is a trade-off between the extent of the work you do to ensure preservation (and the complexity and cost of that work), and the amount of stuff you can preserve. Your budget is finite, right? You can't spend the money twice. So if you over-engineer your preservation process you will preserve less stuff. The longevity of the stuff in AHDS, it turns out, was affected much more by a policy change than by any of the excellent work they did preserving it. You need to do a risk analysis to work out what to do (which is not quite the same as a crystal ball; few would have seen the AHRC policy change coming!).
It's also probably true that half or more of the stuff you preserve will not be accessed for a very long time, if ever. Trouble is (as the captains of industry are reported to say about the usefulness of their marketing budgets, or librarians about their acquisitions) you don't know in advance which half.
Greg Janee of the NDIIPP NGDA project gave a presentation at the DCC (PPT) a couple of years ago, in which he introduced Greg's equation:
Item is worth preserving for time duration T if:

(intrinsic value) × Prob_T(usage) > Sum_T(preservation costs) + (cost to use)

i.e. given a low probability of usage in time T, preservation has to be very cheap!
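To see how the inequality bites, here is a tiny worked example in Python; the function name and all the numbers are my own illustrative assumptions, not anything from Greg Janee's presentation.

```python
# A toy reading of Greg's equation: an item is worth preserving for duration T
# only if (intrinsic value) * Prob_T(usage) exceeds Sum_T(preservation costs)
# plus the cost to use. Every number below is invented for illustration.

def worth_preserving(intrinsic_value, prob_usage, annual_cost, years, cost_to_use):
    expected_benefit = intrinsic_value * prob_usage
    total_cost = annual_cost * years + cost_to_use
    return expected_benefit > total_cost

# A modestly valuable item with only a 5% chance of ever being used in 20 years:
# benefit = 1000 * 0.05 = 50; cost = 2 * 20 + 5 = 45, so it only just scrapes in.
print(worth_preserving(intrinsic_value=1000, prob_usage=0.05,
                       annual_cost=2, years=20, cost_to_use=5))   # True
```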
What I'm arguing for is not putting too much of the cost onto ingest, but leaving as much as is reasonable to the eventual end user. After all, YOU pay the ingest cost. Strangely, so, in a way, does the potential end user whose stuff was not preserved because you spent too much on ingest. You do need to do enough to make sure that end use is feasible, and indeed appropriate in relation to comparator archives (you don't want to be the least-used archive in the world). You also must include, in some sense or other, the Representation Information to make end use possible.
But you don't have to constantly migrate your content to current formats to make it point-and-click available; in fact it may be a disservice to your users to do so. Migration on request has always seemed to me a sensible approach (I think it was first demonstrated by Mellor, Wheatley & Sergeant (Mellor 2002 *) from the CAMiLEON project, based on earlier work in the CEDARS project, but it has also been demonstrated by LOCKSS). This seems pretty much your second approach; you just have to ensure you retain a tool that will run in a current (future) environment, able to migrate the information. Unless you have control of the tool, this might suddenly get hard (when the tool vendor drops support for older formats).
I've often thought, for this sort of file type, that something like the OpenOffice.org suite might be the right base for a migration tool. After all, someone's already written the output stage and will keep it up to date. And many input filters have also already been written. If you're missing one, then form a community and write it; presto, the world has support for another defunct word processor format (yeah, I know, it's not quite that easy!).
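For what it's worth, here is a minimal sketch of how these two ideas might fit together: keep the ingested bits untouched, and invoke an office-suite converter only when someone actually asks for a modern format. It assumes a LibreOffice-style headless converter (`soffice --convert-to`, which post-dates the 2007-era OpenOffice.org mentioned above); the paths and function are hypothetical.

```python
# Sketch of "migration on request": the ingested bits are never modified; a
# converted copy is produced only when a user asks for a different format.
# Assumes a LibreOffice-style headless converter ("soffice --convert-to");
# the store layout and function name are hypothetical.
import subprocess
from pathlib import Path

ORIGINALS = Path("store/originals")   # the preserved bitstreams, untouched
CACHE = Path("store/migrated")        # regenerable copies, safe to discard

def deliver(item_name: str, target_format: str = "odt") -> Path:
    original = ORIGINALS / item_name                 # e.g. "report.doc"
    migrated = CACHE / f"{Path(item_name).stem}.{target_format}"
    if not migrated.exists():                        # migrate only on request
        CACHE.mkdir(parents=True, exist_ok=True)
        subprocess.run(["soffice", "--headless", "--convert-to", target_format,
                        "--outdir", str(CACHE), str(original)], check=True)
    return migrated                                  # original stays as ingested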
I was going to argue against your option 3 (although it's what most repositories do just now). But I think I've talked myself round to it being a reasonable possibility. I would add a watching brief, though: you might decide at some point that the stuff was getting too high risk, and that some kind of migration tool should be provided (in which case you're back to option 2, really).
I get annoyed when I hear people say (what I probably also used to say) that institutional repositories are not for preservation. It's like Not-for-Profit companies; they may not be for profit, but they'd better not be for loss (I used to be on the Board of two). Repositories are not for loss. They keep stuff. Cheaply. And to date, as far as I can see, quite as well as expensive preservation services!
* MELLOR, P., WHEATLEY, P. & SERGEANT, D. (2002) Migration on Request, a Practical Technique for Preservation. Research and Advanced Technology for Digital Libraries: 6th European Conference, ECDL 2002, Rome, Italy, September 16-18, 2002. Proceedings.
Friday 20 July 2007
Subject "versus" institutional repositories
There's a concept in maths called "closed but unbounded". I'm not sure it's exactly to the point (I hope that's a pun), but "subjects" seem a bit like that. You can be pretty sure about most of the stuff that's not in a subject (or "domain"), and most of the stuff that is in it, but you can be very puzzled about some of the edges, and can find yourself in some extremely surprising discussions at times about parts of subjects that challenge most of the ideas you had. So subjects turn out to be very un-bounded. (They also tend to fracture, productively.) Perhaps not surprisingly, subjects don't tend to have assets, bank balances, etc. You might say, in those senses, subjects don't exist! They do nevertheless have very real approaches, common standards, ontologies, methods, vocabularies, literatures... and passionate adherents spread across institutions.
Institutions on the other hand, or at least universities, tend to be very material. They do have assets, bank balances, policies, libraries, employees, continuity on a significant scale, even (in the US at least) endowments. They have temporal stability and mass. They collect scholars and scientists in various domains... even if the scientists give their loyalty to their subjects, and are held together only by salaries and a common loathing of the university car parking policy!
Institutions have continuity, and they have libraries, and archives, which in serious ways express that continuity. Libraries are not about print. Libraries are now squarely about knowledge and information expressed in data, whether they know it or not. And the continuity of valuable data is an important reason for libraries to be involved.
But institutions are generic, and libraries are generic, even in more focused institutions like MIT. The library, the archive, the IR, in different ways, are about collecting elements of the scholarly discourse that contribute both globally and locally. So institutional repositories are about generic continuity of data, as libraries are about continuity of collections. IRs create value for the institution, even if it is only a small piece of value (like most other individual "collections" in an institution). If you don't play, you aren't in the game. You know data has value, just not which bits. You need to disclose your scholarly assets, across the spectrum; you can feel proud of doing so, and make a case for local benefit at the same time. You are an institution taking part in a global system; the value may be in the network, but you are part of that network.
But the way an IR treats data is necessarily generic; if you get data from chemistry, engineering, social sciences and performing arts into this under-funded but potentially valuable repository, you will do your best but it will necessarily be variants of generic practice, at best.
So back to the "subject"; if there is a data repository here, it is likely staffed by "domain experts", capable of taking on a "community proxy" role. They know their stuff. They will treat their data in domain-specific ways; they will know where to seek out data to complement their collection, they will know how to make connections between different parts. They can describe it appropriately, they can develop standards with their colleagues. They will know how to help their colleague scientists extract maximum value. Some subject repository managers are seriously concerned about the problems for disciplines if institutional repositories expand into the data "space".
What subject repositories don't usually have is what institutions have: substantial assets, endowments, bank balances, tenured staff. They are usually based around multiple project grants; five-year core funding is a prized goal, won at the price of cheese-paring budgets and mid-term reviews every second year. Subject repositories don't have assured continuity, temporal mass.
The NSB LLDDC (Long-lived digital data collections) report, and now the NSF CyberInfrastructure strategy, are aimed at this area; they have spotted the fragility of these subject data collections. In the UK we have possibly even more of a patchwork of funding mechanisms than was observed in the LLDDC report. JISC used to be a significant funder of subject repositories, but in recent years has been retrenching from them, while building up massive funding in IRs. AHRC, as we have seen, is pulling back from funding the AHDS.
So what would make this better? I'd like to see a substantive discussion about the roles and funding mechanisms of subject and institutional repositories. In the UK, this would have to involve at least the Research Councils, Wellcome Trust and JISC. (Perhaps looks less likely than it did when I first wrote this.)
Secondly, I'd like to see JISC in the final tranche of its capital funding (here's the circular recently closed) explore the bounds of what's possible with the data provider/ service provider combination (maybe OAI/ORE will address this a little? Maybe not!). And what if curation is detached from the repository? What if data continuity/preservation is separated from the curation service? Do these questions even make sense?
Maybe a system or federation of sustainable IRs internally divided into sets on subject lines (and hence externally aggregatable along those lines), with subject-oriented curation activities picking up on "invisible college" volunteerism might work? Splitting curation into generic and domain elements... Or other notions, pushing the skill out into the network, the federation, but retaining the data where the assets and continuity lie?
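As a very rough illustration of the "aggregatable along subject lines" idea, an aggregator could harvest subject-defined OAI-PMH sets from several IRs. The base URLs and set names below are invented; ListRecords, metadataPrefix and set are standard OAI-PMH protocol parameters.

```python
# Sketch of aggregating subject-defined sets from several institutional
# repositories over OAI-PMH. The base URLs and set names are invented;
# "ListRecords", "metadataPrefix" and "set" are standard OAI-PMH parameters.
from urllib.parse import urlencode
from urllib.request import urlopen

REPOSITORIES = {
    "https://ir.example-university-a.ac.uk/oai": "chemistry-data",
    "https://ir.example-university-b.ac.uk/oai": "chem:crystallography",
}

def harvest_subject_sets():
    """Pull Dublin Core records for one subject area from each IR."""
    for base_url, set_spec in REPOSITORIES.items():
        query = urlencode({"verb": "ListRecords",
                           "metadataPrefix": "oai_dc",
                           "set": set_spec})
        with urlopen(f"{base_url}?{query}") as response:
            yield base_url, response.read()   # XML, to be merged downstream
```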
[This posting is based on an email I sent to a closed JISC Repositories advisory group some time ago; it seems even more relevant today...]
Open Data Licensing: is your data safe?
Over on the Nodalities blog, Rob Styles wrote about some of the aspects of open data licensing, and the tricky questions of copyright versus database right. OK, yawn. Let me put that another way… over on the Nodalities blog, Rob Styles writes about whether you can make your data openly accessible on the web without getting totally ripped off in the process. A bit less of a yawn?
One key quote:
“Without appropriate protection of intellectual property we have only two extreme positions available: locked down with passwords and other technical means; or wide open and in the public-domain. Polarising the possibilities for data into these two extremes makes opening up an all or nothing decision for the creator of a database.

With only technical and contractual mechanisms for protecting data, creators of databases can only publish them in situations where the technical barriers can be maintained and contractual obligations can be enforced.”

It’s true: to put any conditions over the use of our data, we have to have an exclusive right to control it. Copyright gives its owner that right for a text. If I own the Copyright for my works, I can (and try to) put a Creative Commons licence on it, to allow others to use it but to ask them to give me attribution if they do so.
The problem is that there is doubt… OK, more than doubt… whether and/or how Copyright applies to databases. And if Copyright does not apply, you don’t get the exclusive control which allows you to apply a conditional licence like Creative Commons. Just to explore a bit further...
Science Commons was set up to look at helping make science data more openly available. But if you look at their FAQ, you can see some real concerns. They pick out several aspects of a database that might be subject to Copyright, including the structure, but also say:
"In the United States, data will be protected by copyright only if they express creativity. Some databases will satisfy this condition, such as a database containing poetry or a wiki containing prose. Many databases, however, contain factual information that may have taken a great deal of effort to gather, such as the results of a series of complicated and creative experiments. Nonetheless, that information is not protected by copyright and cannot be licensed under the terms of a Creative Commons license."In a note to me Mags McGinley, our legal officer, re-inforces this, and adds:
"Copyright definitely applies to certain elements of a database. Copyright exists in the structure of a database if, by reason of the selection and arrangement, it constitutes the authors own intellectual creation. In addition the contents of database, depending on what they are, may attract their own copyright protection (a simple example might be a database of poems)."But is there a glimmer of hope? The Science Commons FAQ goes on to say:
"Note - for databases subject to the laws of members of the European Union and certain other countries, the law supplies a special right for databases. Except in the Netherlands and Belgium Creative Commons Licenses, Creative Commons licenses do not apply to this right..."Rob Styles also reminds us that in Europe we have this other right: “the EU adopted a robust database right in 1996 while the US ruled against such protection in 1991”.
“Database right in the EU is like Copyright. It is a monopoly, but only on that particular aggregation of the data. The underlying facts are still not protected and there is nothing to stop a second entrant from collecting them independently.”

Charlotte Waelde has written a report for the JISC-funded GRADE project on rights that apply to data in geospatial databases. She concluded that Database Copyright does not apply, but the Database Right does apply. She also concluded (my emphasis):
"• Unauthorised taking and making available of substantial parts of the contents of the database will infringe the right of extraction and re-utilisation"and...
"• A lawful user of the database (e.g. the researcher or teacher in an educational institution) may not be prevented from extracting and re-utilising an insubstantial part of the contents of a database for any purposes whatsoever.I am not a lawyer and (try as I might) I couldn't get all the nuances of what she is trying to say, particularly in the last sentence above; however Mags tells me
• A researcher or teacher may not be prevented from extracting a substantial part of the contents of the database for the purposes of non-commercial research or illustration for teaching so long as the source is indicated. Re-utilisation may only be enjoined if the output contains a substantial part of the contents of the protected database"
"The thing there is that there is a difference between extraction and reutilisation which are the two activities that can be prevented by the database right. The fair dealing exceptions for the database right are not as wide as those of copyright and are for some reason limited to the act of extraction."
"So Charlotte is highlighting the maximum you could do in such case where your activities fall within the research/teaching area. This is: extract a substantial part. And then reutilise an insubstantial part (because the database right only limits what you do with substantial parts of the database)."Rob goes on to end his blog entry, saying of rights:
“They allow inventors to disclose their inventions when they might otherwise have had to keep them secret... That's why we've invested in a license to do this, properly, clearly and in a way that stays Open.”

He is referring to the Talis Community Licence, which attempts to base a conditional open licence on the Database Right. Trust me, I REALLY want this sort of thing to work, but I worry that the Database Right may not be sufficient as underlying protection to make this licence firm. And what would be the law applying to access FROM a jurisdiction like the US that did not have a Database Right?
As I’ve said before, I’m not a lawyer. Can a data-oriented lawyer comment?
Wednesday 18 July 2007
Arts and Humanities Data Service... next steps?
In an earlier post, I mentioned the decision by the AHRC (and later JISC) to cease funding the AHDS from March 2008. Since then the AHRC have re-affirmed their decision. On 28 June, the Future Histories of the Moving Image Research Network made public an open letter to the AHRC, to no avail it would appear.
In her response to the announcements, Sheila Anderson (Director of AHDS) wrote:
"In the meantime, and at least until 31st March 2008, the AHDS will continue to give advice and guidance on all matters relating to the creation of digital content arising from or supporting research, teaching and learning across the arts and humanities, including technical and metadata standards and project management. If you have a data creation project, please do not hesitate to contact us for advice.This is a great expression of commitment, and deserves our support. However, the lack of long-term funding must raise questions of sustainability.
The AHDS will continue to work with those creating important digital resources to advise on the best methods for keeping these valuable resources available and accessible for the long-term in a form that encourages their further use for answering new research questions, and their use in teaching and learning. This advice will include exploring with content creators and owners suitable repositories in which they might deposit their materials for long term curation and preservation, and how to ensure that their materials can continue to be discovered and used by the wider community. If you are currently in negotiation with the AHDS to deposit your digital collection, please continue to work with us to ensure the future sustainability and accessibility of your resource.
The AHDS will continue to make available its rich collection of digital content for use in research, teaching and learning, and to preserve those collections in its care. The AHDS intends to discuss with the JISC and the AHRC the long term future of these collections beyond April 2008 with the intention of securing their continued preservation and availability."
Explicit in the AHRC's decision was the view that the community is mature enough to manage its own resources. There is doubt in many people's minds about this, but we are effectively stuck with it. So what are the implications? There are implications both for existing collections and for future arts and humanities resources. I would like to spend a few paragraphs thinking about the existing collections.
AHDS is not monolithic; it comprises several separate services (I suspect in what follows I may be using historical rather than current names). We already know that AHRC is privileging the Archaeology Data Service (ADS), which will continue to receive some funding (and which has a diverse funding base), so their resources are presumably safe. The History Data Service (HDS) resources are embedded within the UK Data Archive (which has recently received an additional five years' funding from ESRC); it would presumably cost more to de-accession those resources than to keep preserving them and making them available, so even if HDS can take no more resources, the existing ones should be safe. Literature, Language and Linguistics is closely related to the Oxford Text Archive; I imagine the same kinds of arguments would apply there.
I have heard suggestions that King's College London might continue to support the AHDS Executive for a period, and it appears there are some discussions with JISC about some kind of support "to ensure the expertise and achievements of the AHDS are not lost to the community".
That leaves Performing Arts and Visual Arts; I can't even surmise what their future might be, since I don't know enough about their funding and local environment.
I appreciate that it's still early days, and no doubt crucial discussions are going on behind the scenes. But if any part of AHDS resources are in danger of loss, the resource owners need to consider plans to deal with those resources in the future. This will take some time, particularly since for more complex resources, it is clear that existing repositories are generally NOT yet adequate for purpose. I guess the picture itself will be complex; I can think of at least these categories:
- Some resources still exist outside of AHDS, and no action may be needed.
- Some resources will not be felt worth re-homing.
- Some resources can be re-homed in the time and funding institutions have available.
- Some resources should be re-homed, provided that time and funding are provided by some external source (this might be for developments on an institutional repository; it might be for work on the resource to fit a new non-AHDS environment).
Some examples from the AHDS collections:
- HOTBED (Handing on Tradition By Electronic Dissemination) pa-1028-1
- Lemba Archaeological Project arch-279-1
- Gateway to the Archives of Scottish Higher Education (GASHE) exec-1003-1
- Avant-Garde/Neo-Avant-Garde Bibliographic Research Database lll-2503-1
- Survey of Scottish Witchcraft, 1563-1736 hist-4667-1
- National Sample from the 1851 Census of Great Britain hist-1316-1
Are we OK? Is there more? Who knows! I think we need much better tools to tell what is "at risk", so that plans can start being made. Of course, this could be happening, maybe I'm just not "in the loop".
Will AHRC consider bids for funding transitional work? I certainly hope so, although I don't know how this might be done. JISC is (I believe) planning one last round of its Capital Programme. Will they include provisions to enhance repositories so as to take these more complex resources? I certainly hope so!
European e-Science Digital Repository Consultation
Philip Lord wrote to tell me that he and Alison Macdonald are conducting a study for the European Commission, “Towards a European e-Infrastructure for e-Science Digital Repositories” (e-SciDR) – see www.e-SciDR.eu. This is a short study to summarise the situation regarding repositories in Europe and to propose policies for the Commission for repository development in Europe. As part of the study process the Commission is hosting a public consultation through a questionnaire... The letter inviting participation follows:
"Dear Sir, Dear Madam,[NB Safari on the Mac appears not to work with this questionnaire, but Firefox does.]
May I invite you, as key stakeholders, to contribute to the development of a knowledge society and digital infrastructure in Europe, by taking part in the Commission’s online public consultation on e-Science Digital Repositories which is available at http://ec.europa.eu/yourvoice/ipm/forms/dispatch?form=eSciDR."
"This consultation forms a key part of the e-SciDR study funded by the Commission into repositories holding digital data and publications for use in the sciences (in the widest sense encompassing disciplines from the humanities and social sciences to the life sciences).
Your answers will help identify needs, priorities and opportunities which the European Union, through the Commission, can help address and drive forward in the FP7 Capacity Programme and will provide an important input to developing future policy initiatives.
I would be grateful if you could respond to the consultation by no later than 30 July 2007.
All answers will be strictly confidential and anonymised.
If you would like to receive a summary of the consultation results, please tick the corresponding box on the questionnaire.
Best regards,
Mário Campolargo
Head of Unit GÉANT & e-Infrastructure"
Tuesday 17 July 2007
A little more on very long term time series
I wrote earlier about a visit to Rothamsted Research, to talk with them about some of their very long term time series of agricultural research data (since 1843... digitised since 1991). Asking around, the prevailing wisdom seems to be to break the time series data when the nature of the data changes. Keep those time series untouched. Then build your overall time series by a set of transformations on the original datasets, where the actual transformations are well documented.
Of course if (as is perhaps the nature of such agricultural experiments) the nature of the data changes pretty well every year, then you have to keep a series of one-year data snapshots. And that sounds pretty much like what Rothamsted's batch "sheets" are doing.
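A minimal sketch of that approach, keeping each run of observations exactly as recorded and building the long series from documented transformations, might look like this in Python; the segments, units and conversion factor are purely illustrative, not Rothamsted's.

```python
# Keep each run of observations exactly as recorded, with its own units and
# conventions, and build the long series by applying documented
# transformations to copies. All values here are invented for illustration.

RAW_SEGMENTS = [
    # (first_year, unit, yields as recorded, provenance note)
    (1843, "cwt/acre", [22.0, 24.5, 23.1], "imperial measures, whole plot"),
    (1970, "t/ha",     [5.1, 5.4],         "metric, sampled area"),
]

TO_T_PER_HA = {"cwt/acre": 0.1255, "t/ha": 1.0}   # the documented conversions

def combined_series():
    """Yield (year, value in t/ha, note) without touching the raw segments."""
    for first_year, unit, values, note in RAW_SEGMENTS:
        factor = TO_T_PER_HA[unit]
        for offset, value in enumerate(values):   # assumes consecutive years
            yield first_year + offset, round(value * factor, 3), note

for row in combined_series():
    print(row)
```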
Meanwhile, I'm continuing to look for something a bit more authoritative than "prevailing wisdom"! Someone has promised me a reference to some Norwegian economic or price time series kept since the 16th century, so I'm hopeful! Any hints from readers welcome...
The National Archives and Microsoft join forces...
On 4 July, The National Archives and Microsoft announced a Memorandum of Understanding "ensuring preservation of the nation's digital records from the past, present and into the future". Partly this relates to the standardisation of the Office Open XML format for Microsoft's products (I note that O'Reilly appear strongly in favour of this activity, seeing no conflict with the standardisation of Open Document Format). I had thought, through the rumour mill, that MS had provided (or was to provide) the specifications of obsolete file formats. This would fit with TNA's PRONOM file format registry, and their Seamless Flow programme. However, the key paragraph instead seems to be:
"Today´s announcement sees Microsoft provide The National Archives with access to previous versions of Microsoft´s Windows operating systems and Office applications powered by Microsoft Virtual PC 2007. Virtual PC 2007 enables people to run multiple operating systems at the same time on the same computer. This allows The National Archives to configure any combination of Windows and Office from one PC, thereby allowing access to practically any document based on legacy Microsoft file formats. It is estimated that The National Archives will have to manage many terabytes of data in these formats."This sounds as if TNA must keep licensed versions of all older MS products, but then can run them under this emulation mode. This is a LOT better than nothing, but not as good as open access to the old file formats, with the ability to build additional tools that implies. But maybe there's more than is apparent in the press release?
Should data people care? Apart from the documentation and other information that many will have stored in obsolete proprietary formats, it turns out that many store their data that way as well. The excellent JISC-funded StORe project did a survey of several disciplines, and got over 350 responses. Although one respondent said “God preserve us from idiots who archive data in proprietary commercial formats (Excel spreadsheets and MS-word documents)!”, 220 said they kept source data in spreadsheets and the same number kept them in word processed documents. These were the largest categories except images (228)!
Let's hope they are keeping them somewhere else as well...
Monday 16 July 2007
Open Data... Open Season?
Peter Murray Rust is an enthusiastic advocate of Open Data (the discussion runs right through his blog; this link is just to one of his articles that is close to the subject). I understand him to want to make science data openly accessible for scientific access and re-use. It sounds a pretty good thing! Are there significant downsides?
Mags McGinley recently posted in the DCC Blawg about the report "Building the Infrastructure for Data Access and Reuse in Collaborative Research" from the Australian OAK Law project. This report includes a substantial section (Chapter 4) on Current Practices and Attitudes to Data Sharing, which includes 31 examples, many from the genomics and related areas. Peter MR wants a very strong definition of Open Access (defined by Peter Suber as BBB, for Budapest, Bethesda and Berlin, which effectively requires no restrictions on reuse, even commercially). Although licences were often not clear, what could be inferred in these 31 cases generally would probably not fit the BBB definition.
However, buried in the middle of the report is a cautionary tale. Towards the end of chapter 4, there is a section on risks of open data in relation to patents, following on from experiences in the Human Genome and related projects.
"Claire Driscoll of the NIH describes the dilemma as follows:(The reference given is Claire T Driscoll, ‘NIH data and resource sharing, data release and intellectual property policies for genomics community resource projects’ Expert Opin. Ther. Patents (2005) 15(1), 4)
It would be theoretically possible for an unscrupulous company or entity to add on a trivial amount of information to the published…data and then attempt to secure ‘parasitic’ patent claims such that all others would be prohibited from using the original public data."
The report goes on:
"Consequently, subsequent research projects relied on licensing methods in an attempt to restrict the development of intellectual property in downstream discoveries based on the disclosed data, rather than simply releasing the data into the public domain."They then discuss the HapMap (International Haplotype) project, which attempted to make data available while restricting the possibilities for parasitic patenting.
"Individual genotypes were made available on the HapMap website, but anyone seeking to use the research data was first required to register via the website and enter into a click-wrap licence for the use of the data. The licence entered into, the International HapMap Project Public Access Licence, was explicitly modeled on the General Public Licence (GPL) used by open source software developers. A central term of the licence related to patents. It allowed users of the HapMap data to file patent applications on associations they uncovered between particular SNP data and disease or disease susceptibility, but the patent had to allow further use of the HapMap data. The licence specifically prohibited licensees from combining the HapMap data with their own in order to seek product patents..."Checking HapMap, the Project's Data Release Policy describes the process, but the link to the Click-Wrap agreement says that the data is now open. See also the NIH press release). There were obvious problems, in that the data could not be incorporated into more open databases. The turning point for them seems to be:
"...advances led the consortium to conclude that the patterns of human genetic variation can readily be determined clearly enough from the primary genotype data to constitute prior art. Thus, in the view of the consortium, derivation of haplotypes and 'haplotype tag SNPs' from HapMap data should be considered obvious and thus not patentable. Therefore, the original reasons for imposing the licensing requirement no longer exist and the requirement can be dropped."So, they don't say the threat does not exist from all such open data releases, but that it was mitigated in this case.
Are there other examples of these kinds of restrictions being imposed? Or of problems ensuing because they have not been imposed, and the data left open? (Note, I'm not at all advocating closed access!)
Government responds to UK Science Funding petition
8,623 people signed the petition to the UK Government on the £68 million funding reduction to science. The Government has just published a response. After some phrases indicating that the science budget continues to rise, the key paragraph is:
"The Department of Trade and Industry had been facing a number of new and historic budgetary pressures which required action to keep within its budgets. Non-ring-fenced budgets had been reduced as far as possible, so the Department then had to consider its ringfenced budgets, including the Science Budget. It was decided to use part of the underspends in the Science budget that had been accumulated in previous years. This decision did not affect either the 2006-07 budget allocations, or the 2007-08 budget allocations , nor did it affect the commitments set out in the 10 Year Science and Investment Framework."I suppose accumulated underspends might be in neither the 2006-07 budget nor in the 2007-08 budget, but the impact of the reduction has definitely been felt in the 2007-08 year!
The original petition asked the Government to "revise research funding via the DTI", or alternatively "I wish the Government to review its recent decision outlined [in the text]...". I guess that's a NO to the first but perhaps a yes to the second (a review doesn't necessarily mean a change).
Thursday 12 July 2007
Very long term data
Rothamsted Research is an agricultural research organisation based near Harpenden in England. There are many interesting features of this organisation (only a few of which I know), including its "classical experiments". One of these, started in 1843, must surely be one of the longest-running experiments with resulting time-series data anywhere. I visited this week, spoke to Chris Rawlings and others handling the data for this "Broadbalk" experiment on wheat yields, and also a couple of scientists working on somewhat younger experiments collecting moths (1930s) and aphids (1960s) on a daily basis (using light traps and vacuum traps respectively).
Digital preservation theory tells us that digital data are at risk from various kinds of changes in the environment. Most often we focus on the media, or the risk of format incompatibility. OAIS rightly asks us to think about contextual metadata of various kinds, but also recognises that there is risk of semantic drift and/or semantic loss rendering once-clear resources incomprehensible (this is the business about the Designated Community and its Knowledge Base that I wrote about recently). The OAIS model seems to me (though some argue against this view) premised on a preservation or perhaps archival view: resources are ingested, preserved within the archive, and then disseminated at a later date. I’ve always had doubts about how easily this fits with a more continuous curation model, where the data are ingested, managed, preserved and disseminated simultaneously.
However, intuitively even in the curation situation I have described, if “long enough” time passes, many of the concerns of digital preservation will apply. And the Broadbalk wheat yield experiment since 1843 is certainly long enough. In that time there has been semantic drift (words mean different things), changes in units (imperial to metric), changes in plot size/granularity and plot labelling more than once, changes in what is measured (eg dropping wet hay weight while keeping dry yield) and how it is measured (the whole plot or a sample), changes in accuracy of measurement, and many changes in treatments. These are very serious changes that could significantly affect interpretation and analysis.
In the case of the aphids experiment, I saw a log book in which changes in interpretation etc were meticulously recorded. Unfortunately I don't have a copy of any pages from that book to describe in more detail the kinds of issues it raised. Although weekly aphid bulletins are made available (explicitly not "published"!), the book itself has, I believe, not been published. My guess is that this book forms a critical part of the provenance information by which the quality of the data can be judged.
Of course, the Broadbalk experiment has not been digital since 1843. I didn’t see the records, but the early ones would have been longhand in ledgers or notebooks, and later perhaps typed up. I get the impression it was quite systematic from the beginning, so converting it to an Oracle database in around 1991 was a feasible if major task. Now they are planning to convert it to a new database system and perhaps make the data more widely available, so they are asking questions about how they should better handle these changes.
By the way, in one respect the experiments continue to be decidedly non-digital. Since the beginning, they have been collecting soil samples from the plots, and now have over 10,000 of them! This makes a fascinating physical collection as a reference comparator for the digital records.
I should say that they have done a lot of hard thinking and good work about some of these issues. In particular, they have arranged all their input data into batches (known as “sheets”), which sounds pretty much self-describing in what is effectively a purpose-built data description language. This means they can roll back and roll forward their databases using different approaches. These sheets are all prepared using the assumptions of their time (remembering that some were prepared more than a hundred years after the data they contain was first recorded), but in theory there is enough information about these assumptions to make decisions. So once they have decided how to deal with some of these changes, they are able to do the best job possible of implementing it.
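I haven't seen the actual Rothamsted sheet format in detail, but a toy version may help make the roll-back/roll-forward idea concrete. The sketch below (in Python, with entirely invented sheet contents, plot labels, field names and an approximate conversion factor) keeps the raw observations verbatim while recording the assumptions of their time, so a later normalisation can be applied, and re-applied, without ever touching the originals.

# A minimal sketch (not the actual Rothamsted format) of a self-describing "sheet":
# the raw observations are kept verbatim, and the assumptions of their time (units,
# plot labelling) are recorded alongside, so the database can be rolled forward
# under whatever normalisation policy is chosen later.

CWT_PER_ACRE_TO_T_PER_HA = 0.125535  # hundredweight/acre -> tonnes/hectare (approx.)

sheet_1852 = {
    "experiment": "Broadbalk wheat yields",          # hypothetical identifiers
    "recorded": 1852,
    "assumptions": {"yield_unit": "cwt/acre", "plot_scheme": "pre-1926 labels"},
    "observations": [("2", 16.0), ("3", 22.5)],      # (plot label, yield)
}

sheet_1995 = {
    "experiment": "Broadbalk wheat yields",
    "recorded": 1995,
    "assumptions": {"yield_unit": "t/ha", "plot_scheme": "modern labels"},
    "observations": [("02", 2.1), ("03", 2.9)],
}

def roll_forward(sheet, plot_map):
    """Re-express a sheet in modern units and plot labels, leaving the original intact."""
    unit = sheet["assumptions"]["yield_unit"]
    factor = CWT_PER_ACRE_TO_T_PER_HA if unit == "cwt/acre" else 1.0
    scheme = sheet["assumptions"]["plot_scheme"]
    for plot, value in sheet["observations"]:
        modern_plot = plot_map.get((scheme, plot), plot)
        yield {"plot": modern_plot, "yield_t_ha": round(value * factor, 3),
               "source_year": sheet["recorded"], "source_unit": unit}

# Hypothetical mapping from an old plot-labelling scheme to the modern one.
plot_map = {("pre-1926 labels", "2"): "02", ("pre-1926 labels", "3"): "03"}

for sheet in (sheet_1852, sheet_1995):
    for row in roll_forward(sheet, plot_map):
        print(row)

The point of the design, as I understand it, is that the normalisation decisions live in the roll-forward step, not in the stored sheets, so a different decision about (say) plot mapping can be implemented simply by re-running the conversion.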
A Swedish forestry scientist once argued forcefully, in the discussion session to a presentation, that the original observations were sacrosanct, must be kept and should not be changed. Do what you like with the analysis, he suggested: re-run it with varied parameters, re-analyse with new models, (implicitly) make the sorts of changes suggested here. This approach would suggest starting a new time series dataset whenever there is a change of the sort we have been describing. That’s not exactly what has been happening with the Broadbalk experiment, but the sheets approach is if anything even finer grained.
I’m interested to collate experience from other long-running experiments that may have faced these and related issues. So far I have heard of the Harvard Forest project, part of the NSF Long Term Ecological Research (LTER) Network (thanks Raj Bose), and some very long-running birth cohort studies in the social sciences. Further suggestions on comparators and literature to look at would be welcome.
Labels:
DCC,
Digital Curation,
Long Term,
Preservation
Open Access everyone? Or not?
I occasionally look at the OpenDOAR service, which lists information about repositories, and check out those which claim to include data (the term they use is datasets, although it is possible that “other” might also be applicable!). Last time I looked there were 7 listed in the UK which claimed to collect datasets; 4 of these were institutional repositories, one is the Southampton eCrystals archive, the 6th is the National Digital Archive of Datasets (NDAD), which collects government datasets on behalf of The National Archive, and the 7th is Nature Precedings.
I used to think that was a creative piece of listing by NDAD, since OpenDOAR was created for institutional repositories, wasn’t it? But as time has gone on, I’ve begun to think NDAD are right; the old definitions of repositories were much too tight, and NDAD and many other data archives ought to be considered as repositories, and listed in these kinds of resources. On that basis perhaps AHDS should have been listed (although they might not bother now) and, I thought, the UK Data Archive, part of the Economic and Social Data Service, should also be listed. After all, they are repositories, and open access is a good thing, isn’t it?
At the UKDA’s 40th birthday celebrations yesterday, it was clear that Kevin Schurer (Director of UKDA) doesn’t quite share that view. The view expressed certainly was not a total rejection of Open Access in favour of a commercial approach. But he was certainly arguing for some barriers between some data and the users. In particular, while access to UKDA data is free for certain categories of users (certainly UK academics and probably some others), ALL users are required to register. Kevin made a strong case for this being an advantage; registration means that he can report to his funders who his users are (including how many of them might be independent or Government researchers as well as academic ones), and can also monitor which datasets they are using. He knows, for example, that the user base has grown from 25 or so after the first 5 years to 45,000 (I don't remember the exact figure, but of that order) after the first 40 years.
The same registration mechanism (and associated authentication and authorisation mechanisms) also allows them to apply greater access controls to more sensitive datasets, including the possibility that the user may have to sign and observe special licences. It’s worth remembering that much of the data they hold is about people, and some is extremely sensitive information, which has been provided (under “informed consent”) for certain specific purposes.
In a very different environment the day before, I met some scientists with very long-term experiments collecting data on insects. For perhaps different reasons, they were happy to make their data available to collaborators, but not openly available on the web. Too many risks of mis-interpretation, which then requires extra effort to refute, was one reason. No doubt an extra paper as co-author from a collaboration was another motive (not unreasonable).
Are these approaches Open? Are they consistent with the OECD Principles and Guidelines for Access to Research Data that I wrote about earlier? Let’s remember that the Openness Principle was carefully worded:
“Openness means access on equal terms for the international research community at the lowest possible cost, preferably at no more than the marginal cost of dissemination. Open access to research data from public funding should be easy, timely, user-friendly and preferably Internet-based.”

So it looks as though the approach is reasonably consistent, provided access is on “equal terms for the international research community”, by that definition. I suspect there are other definitions (who said “the nice thing about standards is there are so many to choose from”?) by which these approaches would fail (eg the Berlin Declaration).
What the UKDA approach might do is make certain kinds of comparative work, including automated data mining, more difficult. There was a plea with respect to the former, from an Australian speaker, for more sophisticated international cross-archive access management (code: Shibboleth-enable). But I suspect with respect to the latter (preventing data mining) Kevin might argue “very right and proper too”!
Wednesday 11 July 2007
UKDA 40th Birthday: back to basics?
There are not many digital data management organisations that can claim 40 years of continuous service. This week the UK Data Archive (UKDA, which holds mainly social science datasets) celebrates 40 years since its founding in 1967, with a party yesterday in the House of Commons followed by a small workshop in the UKDA’s fancy new quarters at the University of Essex. The DCC would like to congratulate the UKDA, Director Kevin Schurer and his 6 predecessor Directors, and the UKDA staff, on achieving this milestone.
[Pause to suppress pangs of envy at such longevity. Sigh!]
There were many very interesting aspects to this workshop, but here I will focus on one particular contribution, during the closing sessions, from Myron Gutmann. Myron is Director of ICPSR, a roughly equivalent organisation in the US, based at Michigan. Myron said he wanted to argue for retaining the basics. A data archive, he said, should do (at least) 5 things:
- Appraise
- Curate
- Preserve
- Train
- Protect
It’s a pretty good summary of the job of an archive, give or take a verb or two. Myron added something like “serve new user bases” and “innovate” in various ways, and it’s hard to argue with any of it.
There were some sober reflections on sustainability at the workshop, partly relating to the difficult funding position for long-term longitudinal surveys, but partly reflecting the recent AHRC decisions. Paraphrasing Kevin Schurer, we never could or should take things for granted, but now we must be doubly persuasive, even if the UKDA's main funder ESRC regards ESDS (of which UKDA is a key part) as “a jewel in its crown”.
Friday 6 July 2007
On blog authorship and (un)certainty
There's a difference between a blog and an article, and it seems to me it's about certainty. Why should I write this blog? It could be to record trivial events, it could be self-aggrandisement, but I think it's about dealing with uncertainty. If I were fully convinced about the detail, I guess I would write an article and submit it to a journal. But generally I'm speculating a bit; trying to focus my mind by writing something out as clearly as I can for an unknown audience (an audience that has the power to answer back). There's also a nice comment in David Rosenthal's blog: "Taking small, measurable steps quickly is vastly more productive than taking large steps slowly, especially when the value of the large step takes even longer to become evident".
I know I'm writing things that colleagues may not agree with. Sometimes, I expect the process of writing to move me closer to their ideas. Sometimes I might hope they move closer to my ideas. Sometimes we may have to learn to live with our differences. The point for me is to generate some strange kind of conversation, not just with close colleagues but to get some feedback from colleagues interested in data curation across the world.
If I can work out what is causing confusion, articulate it and resolve it, perhaps I can stop being "confused of Kenilworth"!
Representation Information: what is it and why is it important?
Representation Information is a key and often misunderstood concept. To understand it, we need to look at some definitions. First of all, OAIS (CCSDS 2002) defines data thus:
“Data: A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing.”

Second, we have Information:
“Information: Any type of knowledge that can be exchanged. In an exchange, it is represented by data. An example is a string of bits (the data) accompanied by a description of how to interpret a string of bits as numbers representing temperature observations measured in degrees Celsius (the representation information).”

Then we have Representation Information (sometimes abbreviated as RI):
“Representation Information: The information that maps a Data Object into more meaningful concepts. An example is the ASCII definition that describes how a sequence of bits (i.e., a Data Object) is mapped into a symbol.”

As an example, we have this paragraph:
"Information is defined as any type of knowledge that can be exchanged, and this information is always expressed (i.e., represented) by some type of data. For example, the information in a hardcopy book is typically expressed by the observable characters (the data) which, when they are combined with a knowledge of the language used (the Knowledge Base), are converted to more meaningful information. If the recipient does not already include English in its Knowledge Base, then the English text (the data) needs to be accompanied by English dictionary and grammar information (i.e., Representation Information) in a form that is understandable using the recipient’s Knowledge Base.”The summary is that “Data interpreted using its Representation Information yields Information”.
Now we have a key complication:
“Since a key purpose of an OAIS is to preserve information for a Designated Community, the OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained. The OAIS should then make a decision between maintaining the minimum Representation Information needed for its Designated Community, or maintaining a larger amount of Representation Information that may allow understanding by a larger Consumer community with a less specialized Knowledge Base. Over time, evolution of the Designated Community’s Knowledge Base may require updates to the Representation Information to ensure continued understanding.”

So now we need another couple of definitions:
“Designated Community: An identified group of potential Consumers who should be able to understand a particular set of information. The Designated Community may be composed of multiple user communities.”
“Knowledge Base: A set of information, incorporated by a person or system, that allows that person or system to understand received information.”

So there are several interesting things here. First is that this obviously enshrines a particular understanding of information; one I couldn’t find in Wikipedia when I last looked (here is the article at that time; maybe it will be there next time!). Floridi suggests there is no commonly accepted definition of information, and that it is polysemantic, and particularly contrasts information in Shannon’s Mathematical Theory of Communication with the “Standard Definition of Information” (Floridi, 2005). If I understand it rightly, the latter refers to factual information (with some controversy on whether it need be true), but not necessarily to instructional information (“how”).
Secondly, the introduction of the Designated Community and its Knowledge Base may be both helpful and problematic. It may be helpful because it can reduce the amount of Representation Information needed to interpret data (or even eliminate it completely): if the Designated Community is defined as having a Knowledge Base that allows it to understand the data, then nothing more is required. This is obviously never entirely true, and in practice even with a Designated Community that is quite strongly familiar with the data, we will expect to need some RI, perhaps to identify the particular meaning of some variables, etc.
The problematic nature arises because we now have two external concepts, the Designated Community and its Knowledge Base that influence what we must create, and which will change and must be monitored. I’ve heard the words “precise definition” used in the context of these two terms, but I am sceptical anyone can define either precisely (although the LOCKSS Statement of Conformance with OAIS has a minimalistic go; it's the only public one I could find, but I would love to see more). My colleague David Giaretta suggests that his huge project CASPAR aims to produce better definitions.
In fact, although they may be useful ideas, both the Designated Community and its Knowledge Base seem to be quite worrying terms. The best we can say is that “chemists” (for example) understand “chemical concepts”, and that the latter have proved pretty stable at least in basic forms. But the community of chemists turns out to include a myriad of sub-disciplines, with their own subtleties of terminology, and not surprisingly introducing new concepts and abandoning old ones all the time. If we have some chemical data in our repository, we have to watch out for these concepts going from current through obsolescent, obsolete to arcane, and in theory we have to add RI at each change, to make up for the increasing gap in understanding.
The third interesting feature is that these definitions say nothing about files or file formats at all, yet “format registries” are the most common response to meeting the need for RI. TNA’s PRONOM, and the Harvard/OCLC Global Digital Format Registry (GDFR) are the two best-known examples.
Clearly files and file formats play a critical role in digital preservation. Sometimes I think this has occurred because the roots of much of digital preservation (although not OAIS) lie in the library and cultural heritage communities, dominated as they are by complex proprietary file formats like Microsoft Word. In science, formats are probably much simpler overall, but other aspects may be more critical to “understand” (ie use in a computation) the data.
The best example I know to illustrate the difference between file format information and RI is to imagine a social science survey dataset encoded with SPSS. We may have all the capabilities required to interpret SPSS files, but still not be able to make sense of the dataset if we do not know the meaning of the variables, or do not have access to the original questionnaires. Both the latter would qualify as RI. Database schemas may provide another example of RI.
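A toy version of that SPSS scenario, with invented variable names, codes and question text, may make the distinction plainer: assume the file format has been parsed perfectly, so we have a table of coded values, and note that without the codebook, which is the Representation Information, the table still tells us nothing.

# A minimal sketch of the SPSS example: suppose the file format has been parsed
# perfectly, giving us variable names and coded values. Without the codebook
# (Representation Information) the table is still not interpretable.
# Variable names, codes and question text here are all invented for illustration.

parsed_dataset = [                       # what format-level knowledge gets us
    {"v241": 2, "v305": 1},
    {"v241": 1, "v305": 9},
]

codebook = {                             # the Representation Information
    "v241": {
        "question": "How satisfied are you with your local health services?",
        "values": {1: "very satisfied", 2: "fairly satisfied", 9: "not answered"},
    },
    "v305": {
        "question": "Did you vote in the last general election?",
        "values": {1: "yes", 2: "no", 9: "not answered"},
    },
}

for respondent in parsed_dataset:
    for var, code in respondent.items():
        entry = codebook[var]
        print(f"{entry['question']}  ->  {entry['values'][code]}")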
Have I shown why or how RI is a useful concept in digital curation? I'm not sure, but at least there's a start. Representation Information, as David Giaretta sometimes says, is useful for interpreting unfamiliar data!
In later posts, I’m going to try to include some specific examples of RI that relates to science data. I also intend to try to justify more strongly the role of RI in curation rather than preservation, ie through life rather than just at the end of it!
* CCSDS (2002) Reference Model for an Open Archival Information System (OAIS). IN CCSDS (Ed.), NASA.
* FLORIDI, L. (2005) Is Semantic Information Meaningful Data? Philosophy and Phenomenological Research, 70, 351-370. http://www.ingentaconnect.com/content/ips/ppr/2005/00000070/00000002/art00004
Authenticity across migrations
I discovered a few days ago that I have 4 digital objects that are (I believe, but am not certain) in some strong senses “the same” (in their information content), but which are also completely different (in their bits). These objects are the result of a chain of “exports” and “imports”, and “save as…” operations, prompted partly by a change of technology (from a Windows PC running Mind Manager to a Macintosh running NovaMind), and partly from a need to make the content of the object more accessible to colleagues who do not use either software package. In case you're interested, these files are now accessible at the URLs below; I have shown the original date for each file:
15/10/2004: http://www.dcc.ac.uk/docs/blog/GWG%20vision%20and%20action.mmp
09/08/2005: http://www.dcc.ac.uk/docs/blog/GWG%20vision%20&%20action.xml
06/10/2005: http://www.dcc.ac.uk/docs/blog/GWG%20vision%20and%20action.pdf
10/01/2006: http://www.dcc.ac.uk/docs/blog/GWG%20vision%20&%20action.nmind
I’m very interested in the question of authenticity across migrations. Whilst migrations for preservation purposes are perhaps likely to be needed much less often than we once thought, they are nevertheless inevitable given long enough time. How can we plausibly assert that these objects represent “the same thing”?
In this case, each of these objects represents a simple document; that is, they can each be transformed into a 2-D representation. Three of the objects have other capabilities: with the appropriate software they can be edited, by adding or removing elements, and by moving elements in relation to one another. This is true to some extent of the fourth object (a PDF, which for most people can be viewed to see a representation of the object), but in a different and much more restricted way. However, this editability, while vital for some kinds of re-use, is not essential for conveying the information essence of the objects.
These documents happen to be mind maps. The PDF should help some of those unfamiliar with this type of document. Mind mapping is a particular approach to organising ideas that originated from Tony Buzan (Buzan 1974). I find the approach particularly useful and productive; however I suspect most of my colleagues groan inwardly when yet another mind map is produced for them to consider. They are a little lightweight as communications devices (and so may have rather limited value as records), yet I persist in thinking they may have value for others. An interesting feature here, however, is that there is no real standardisation in this market. Mind Manager (from MindJet) may perhaps have the market share to force some degree of standardisation (hence the ability of NovaMind and at least one of the Open Source versions that I am aware of, ie FreeMind, to import from Mind Manager XML files). This fractured market led to my having these 4 versions.
I realise that this blog post runs a real risk of being dismissed because of a naïve and undefined approach to “sameness”, and that this idea of sameness, is likely to be linked to the “significant properties” of the object; further that these significant properties will differ for different users. I’ll try to re-visit these ideas later (for now, note a report by Hedstrom and Lee, and a recent ITT from JISC).
One approach to asserting authenticity in this case might be to print out or view each document, and use a visual comparison to assert sameness. It’s worth noting though that I cannot use this technique easily at the moment across the full set, because I no longer have access to the application software that will render the first of the objects (although I believe the test could be done; the object is not obsolete). Furthermore, although the comparison is feasible for such a small number of objects, it is not scalable to the large numbers of objects that one would expect to find in a typical digital repository environment.
One other technique could perhaps be to record provenance data on the processes involved in the migration. In this case the object was created with an old version of Mind Manager, and then exported as a Mind Manager XML file once the conversion to Mac had occurred, and it appeared that no Mac version of Mind Manager was likely (although two years later one exists). This XML version was created explicitly as a migration stage, because it could be read and imported by the Mac application of choice, NovaMind. At some point, a PDF version was created to allow the object to be emailed to colleagues who did not have either application software. Shorn of dates and evidence, that is the provenance chain for these 4 objects… except I have a niggling concern that I might have changed the first NovaMind version after importing it. A true provenance chain would record ALL the changes to the object; in this case, I have found a way of checking, but I’ll leave it as an exercise for the reader to work out whether it was changed or not! The problem is that this provenance chain only tells us what happened at a gross level (what large scale operations were applied), not whether the migrations were successful, nor what artefacts were introduced or features lost.
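One way to record that gross-level chain is sketched below in Python. The structure of the record is my own invention rather than any standard, and the payloads stand in for the real files, but the dates and tools come from the migrations described above; the checksums show exactly the limitation just noted, namely that they prove which bits went in and came out, not that anything meaningful survived.

import hashlib, json
from datetime import date

# A minimal sketch of recording the gross-level provenance chain described above:
# each migration event notes the tool, the date, and checksums of the files before
# and after. The record structure is invented for illustration, not a standard.

def sha256_of(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def migration_event(when, tool, action, source_bytes, result_bytes):
    return {
        "date": when.isoformat(),
        "tool": tool,
        "action": action,
        "source_sha256": sha256_of(source_bytes),
        "result_sha256": sha256_of(result_bytes),
    }

# Stand-in payloads; in reality these would be the files listed above.
mmp, xml, nmind = b"<mmp...>", b"<xml...>", b"<nmind...>"

chain = [
    migration_event(date(2005, 8, 9), "Mind Manager", "export to XML", mmp, xml),
    migration_event(date(2006, 1, 10), "NovaMind", "import from XML", xml, nmind),
]
print(json.dumps(chain, indent=2))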
Perhaps there is a way, maybe linked to particular families or genres of document types, of computing properties from documents that we would strongly wish to see as stable across migrations. In this case, the text labels on the arms of the mind maps might be one such desirable invariant; quite a strong one, but not perfect (for example, the text labels may show as invariant but might have become detached from their branches, destroying the meaning in their relationships, while more adventurous mind mappers use much more in the way of imagery and other techniques to convey part of their message).
However, if we are to plausibly assert authenticity across migrations, we will have to identify some such invariants. They may be extremely important parts of the preservation metadata, just as checksums and signatures are for checking the authenticity of objects that we believe have not been changed. Any suggestions on candidate invariants for some object classes?
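As a sketch of how such an invariant might be computed, the Python below extracts the set of branch text labels from two XML representations of a mind map and compares them. The element and attribute names are invented stand-ins; the real Mind Manager and NovaMind schemas differ, so each format would need its own small extractor feeding the same comparison, and (as noted above) matching label sets would still not guarantee that the relationships between branches survived.

import xml.etree.ElementTree as ET

# A minimal sketch of checking one candidate invariant across two migrated versions
# of a mind map: the set of branch text labels. The element/attribute names below
# are invented stand-ins, not the real Mind Manager or NovaMind schemas.

version_a = """<map><topic text="Vision"><topic text="Curation"/><topic text="Access"/></topic></map>"""
version_b = """<map><topic text="Vision"><topic text="Access"/><topic text="Curation"/></topic></map>"""

def branch_labels(xml_text: str) -> set:
    """Extract the text labels of all branches, ignoring layout, colour and order."""
    root = ET.fromstring(xml_text)
    return {node.get("text") for node in root.iter("topic")}

labels_a, labels_b = branch_labels(version_a), branch_labels(version_b)
if labels_a == labels_b:
    print("Label sets match:", sorted(labels_a))
else:
    print("Lost:", labels_a - labels_b, "Gained:", labels_b - labels_a)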
* BUZAN, T. (1974) Use Your Head, BBC.