Friday, 20 July 2007

Subject "versus" institutional repositories

There's a concept in maths called "closed but unbounded". I'm not sure it's exactly to the point (I hope that's a pun), but "subjects" seem a bit like that. You can be pretty sure about most of the stuff that's not in a subject (or "domain"), and most of the stuff that is in it, but you can be very puzzled about some of the edges, and can find yourself in some extremely surprising discussions at times about parts of subjects that challenge most of the ideas you had. So subjects turn out to be very unbounded. (They also tend to fracture, productively.) Perhaps not surprisingly, subjects don't tend to have assets, bank balances, etc. You might say, in those senses, subjects don't exist! They do nevertheless have very real approaches, common standards, ontologies, methods, vocabularies, literatures... and passionate adherents spread across institutions.

Institutions on the other hand, or at least universities, tend to be very material. They do have assets, bank balances, policies, libraries, employees, continuity on a significant scale, even (in the US at least) endowments. They have temporal stability and mass. They collect scholars and scientists in various domains... even if the scientists give their loyalty to their subjects, and are held together only by salaries and a common loathing of the university car parking policy!

Institutions have continuity, and they have libraries, and archives, which in serious ways express that continuity. Libraries are not about print. Libraries are now squarely about knowledge and information expressed in data, whether they know it or not. And the continuity of valuable data is an important reason for libraries to be involved.

But institutions are generic, and libraries are generic, even in more focused institutions like MIT. The library, the archive, the IR, in different ways, are about collecting elements of the scholarly discourse that contribute both globally and locally. So institutional repositories are about generic continuity of data, as libraries are about continuity of collections. IRs create value for the institution, even if it is only a small piece of value (like most other individual "collections" in an institution). If you don't play, you aren't in the game. You know data has value, just not which bits. You need to disclose your scholarly assets, across the spectrum; you can feel proud of doing so, and make a case for local benefit at the same time. You are an institution taking part in a global system; the value may be in the network, but you are part of that network.

But the way an IR treats data is necessarily generic; if you get data from chemistry, engineering, social sciences and performing arts into this under-funded but potentially valuable repository, you will do your best, but the result will necessarily be some variant of generic practice.

So back to the "subject"; if there is a data repository here, it is likely staffed by "domain experts", capable of taking on a "community proxy" role. They know their stuff. They will treat their data in domain-specific ways; they will know where to seek out data to complement their collection, they will know how to make connections between different parts. They can describe it appropriately, they can develop standards with their colleagues. They will know how to help their colleague scientists extract maximum value. Some subject repository managers are seriously concerned about the problems for disciplines if institutional repositories expand into the data "space".

What subject repositories don't usually have is what institutions have: substantial assets, endowments, bank balances, tenured staff. They are usually funded through multiple project grants; 5-year core funding is a prized goal, won at the price of cheese-paring budgets and mid-term reviews every second year. Subject repositories don't have assured continuity, temporal mass.

The NSB LLDDC (Long-lived digital data collections) report, and now the NSF CyberInfrastructure strategy, are aimed at this area; they have spotted the fragility of these subject data collections. In the UK we have possibly even more of a patchwork of funding mechanisms than was observed in the LLDDC report. JISC used to be a significant funder of subject repositories, but in recent years has been retrenching from them, while building up massive funding in IRs. AHRC, as we have seen, is pulling back from funding the AHDS.

So what would make this better? Firstly, I'd like to see a substantive discussion about the roles and funding mechanisms of subject and institutional repositories. In the UK, this would have to involve at least the Research Councils, Wellcome Trust and JISC. (Perhaps this looks less likely than it did when I first wrote this.)

Secondly, I'd like to see JISC, in the final tranche of its capital funding (here's the circular recently closed), explore the bounds of what's possible with the data provider/service provider combination (maybe OAI-ORE will address this a little? Maybe not!). And what if curation is detached from the repository? What if data continuity/preservation is separated from the curation service? Do these questions even make sense?

Maybe a system or federation of sustainable IRs internally divided into sets on subject lines (and hence externally aggregatable along those lines), with subject-oriented curation activities picking up on "invisible college" volunteerism might work? Splitting curation into generic and domain elements... Or other notions, pushing the skill out into the network, the federation, but retaining the data where the assets and continuity lie?
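
To make that first notion a little more concrete, here is a minimal sketch in Python of what "internally divided into sets on subject lines, and hence externally aggregatable" might look like. Everything in it is an assumption for illustration: the `Record` type, the repository names, the subject labels and the identifiers are all hypothetical, and a real federation would harvest records over some protocol rather than hold them in a dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A hypothetical data record held by an institutional repository."""
    repository: str   # the institution that stores and preserves the data
    subject_set: str  # the subject-line set it belongs to inside that IR
    identifier: str


def aggregate_by_subject(repositories):
    """Build cross-institution subject views without moving the data.

    `repositories` maps a repository name to the list of records it holds.
    Each view only references records; the records stay attributed to the
    institution that holds them.
    """
    views = defaultdict(list)
    for records in repositories.values():
        for record in records:
            views[record.subject_set].append(record)
    return dict(views)


# Hypothetical federation of two IRs, each divided into subject sets.
repositories = {
    "ir-a": [
        Record("ir-a", "chemistry", "a:1"),
        Record("ir-a", "performing-arts", "a:2"),
    ],
    "ir-b": [
        Record("ir-b", "chemistry", "b:1"),
    ],
}

subject_views = aggregate_by_subject(repositories)
```

In this sketch a subject-oriented curation group would work against `subject_views["chemistry"]`, which spans both institutions, while each record still names the repository responsible for its continuity.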

[This posting is based on an email I sent to a closed JISC Repositories advisory group some time ago; it seems even more relevant today...]



Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.