
Friday, 14 August 2009

DCC web site and Linked Data

We at the DCC are in the early stages of refreshing our web site (www.dcc.ac.uk). Nothing you can see yet, but we're talking to a few consultants about what and how we can do better. The ones we have spoken to so far seem pretty clued up on content management systems, and even on web 2.0 approaches. But questions about the role of the Semantic Web or Linked Data get blank looks.

Now our web site is not, and will probably never be, a major source of data in the sense of facts; rather it should contain resources: often documents, sometimes tools, sometimes sharing opportunities. There definitely are facts of various kinds there (which may not be sufficiently explicit yet), such as staff contact details, document metadata, event locations and times, etc. But these are a comparatively small part of the content.

Does this (or anything else) justify investment in building a web site that is based on Linked Data/Semantic Web? What advantages could we get in doing so? What advantages could our users get if we did so?
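To make the question concrete, here is a minimal sketch (in Python, using rdflib) of how one of those small pockets of fact, staff contact details, might be exposed as Linked Data. The URIs, the example person and the FOAF property choices are illustrative assumptions on my part, not a proposal for the actual site.

# Minimal sketch (not the DCC's actual site architecture): exposing staff
# contact details as Linked Data with rdflib. The URIs and the choice of
# FOAF properties are illustrative assumptions only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

DCC = Namespace("http://www.dcc.ac.uk/people/")   # hypothetical URI space

g = Graph()
g.bind("foaf", FOAF)

person = DCC["jane-smith"]                        # hypothetical staff member
g.add((person, RDF.type, FOAF.Person))
g.add((person, FOAF.name, Literal("Jane Smith")))
g.add((person, FOAF.mbox, URIRef("mailto:jane.smith@example.ac.uk")))

# Serialise as Turtle; a Linked Data setup would also serve this at the
# person's URI, with content negotiation between HTML and RDF.
print(g.serialize(format="turtle"))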

I would really like to get some views on this!

Friday, 5 December 2008

Bryan Lawrence on metadata as limit on sustainability

Opening the Sustainability session at the Digital Curation Conference, Bryan Lawrence of the Centre for Environmental Data Archival and the British Atmospheric Data Centre (BADC) spoke trenchantly (as always) on sustainability, with specific reference to the metadata needed for preservation and curation, and for facilitating use now and in the future. Preservation is not enough; active curation is needed. BADC has ~150 real datasets but thousands of virtual datasets, and tens of millions of files.

Metadata, in his environment, represents the limiting factor. A critical part of Bryan’s argument on costs relates to the limits on human ability to do tasks, particularly certain types of repetitive tasks. We will never look at all our data, so we must automate, in particular we must process automatically on ingest. Metadata really matters to support this.

Bryan dashed past an interesting classification of metadata, which from his slides is as follows:
  • A – Archival (and I don’t think he means PREMIS here: “normally generated from internal metadata”)
  • B – Browse: context, generic, semantic (NERC developing a schema here called MOLES: Metadata Objects for Linking Environmental Sciences)
  • C – Character and Citation: post-fact annotation and citations, both internal and external
  • D – Discovery metadata, suitable for harvesting into catalogues: DC, NASA-DIF, ISO19115/19139 etc
  • E – Extra: Discipline-specific metadata
  • O – Ontology (RDF)
  • Q – Query: defined and supported text, semantic and spatio-temporal queries
  • S – Security metadata
The critical path relates to metadata, not content; it is important to minimise the need for human intervention, which means minimising the number of ingestion systems (specific processes for different data types and data streams) and the types of data transformation required (the problem being validation of those transformations). So advice from data scientists TO the scientists, before the data are created, is critical; hence the data scientist needs domain knowledge to support curation.
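As a rough illustration of the kind of automatic processing on ingest Bryan is arguing for, here is a minimal Python sketch that derives basic technical metadata for a file with no human intervention. The field names and the example path are my own assumptions, not BADC's actual ingest system:

# Minimal sketch of automated processing on ingest: derive basic technical
# metadata without human intervention. Field names and the example path
# are assumptions for illustration only.
import hashlib
import mimetypes
import os
from datetime import datetime, timezone

def technical_metadata(path: str) -> dict:
    """Return simple technical metadata for one file in a data stream."""
    stat = os.stat(path)
    mime, _ = mimetypes.guess_type(path)
    sha256 = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha256.update(chunk)
    return {
        "path": path,
        "size_bytes": stat.st_size,
        "mime_type": mime or "application/octet-stream",
        "sha256": sha256.hexdigest(),
        "ingested": datetime.now(timezone.utc).isoformat(),
    }

# Example use at ingest time (hypothetical file):
# record = technical_metadata("/archive/incoming/stream-42/file-0001.nc")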

They can choose NOT TO TAKE THE DATA (but even the act of not taking the data is resource-intensive). Bryan showed and developed a cost model based on six components; it’s worth looking at his slides for this. But the really interesting part was his conclusions on limits, given a staff of 25 FTE:
"• We can support o(10) new TYPES of data stream per year.
• We can support doing something manually if it requires at most:
– o(100) activities of a few hours, or
– o(1000) activities of a few minutes, but even then only when supported by a modicum of automation.
• If we have to [do] o(10K) things, it has to be completely automated and require no human intervention (although it will need quality control).
• Automation takes time and money.
• If we haven’t documented it on ingestion, we might as well not have it …
• If we have documented it on ingestion, it is effectively free to keep it …
• … in most cases it costs more to evaluate for disposal than keeping the data.
• (but … it might be worth re-ingesting some of our data)
• (but … when we have to migrate our information systems, appraisal for disposal becomes more relevant)”
Interestingly, they charge (only) for three years of storage at current costs at ingest time; by then, storage will be “small change”, provided new data, with new storage requirements, keep arriving. Often the money arrives first and the data very much later, so they may have to act as a “banker” for a long time. They have a core budget that covers administration, infrastructure, user support, and access service development and deployment. Everything changes next year, however, with their new role supporting the Intergovernmental Panel on Climate Change, which will need petabytes and more.

Tuesday, 14 October 2008

ARROW Repositories day: 1

I’ve been giving a talk about the Research Repository System ideas at the ARROW repository day in Brisbane, Australia (which is partly why there has been a gap in posting recently). Here are some notes on the other talks.

Kate Blake from ARROW is talking about metadata, which is particularly important for data, since data cannot speak for itself. Metadata is thought of as a compound object that comprises some parts for “library management” issues (things like author, title, keyword) for the whole document and/or its parts, plus university management parts, such as evidence records for research quality management purposes. These link to metadata that applies to the community of practice, eg the appropriate metadata for an X-ray image. So we have the content (maybe a PDF), its rich metadata (Kate used MARC/XML as an example, which surprised me, since she also suggested this group was specific to the content), lightweight descriptive metadata, technical metadata (file size, type etc), administrative metadata, eg rights or other kinds of institutional metadata, preservation metadata such as PREMIS, and both internal and external relationship metadata. METS is one way (there are others) to wrap this complex set of metadata and provide a structural map. (Worrying that this seems like a very large quantity of metadata for one little object…) Aha, she’s pointing out that aggregating these into repositories, and these repositories together across search services, leads to problems of duplication, inconsistency, waste of effort, etc. So there is lots of work trying to unravel this knot, and lots of acronyms: RDF, FRBR, SWAP, DCMI AM, SKOS etc…
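To make the METS idea a bit more concrete, here is a very rough Python sketch of that kind of wrapper: descriptive and administrative sections, a file section, and a structural map binding the parts of one compound object together. The element choices are a simplified assumption of mine, not a validated METS profile:

# Rough sketch of a METS-style wrapper: one structural map tying together
# descriptive, administrative and file-level metadata for a single object.
# Simplified and unvalidated; for illustration only.
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS)

mets = ET.Element(f"{{{METS}}}mets")

# Descriptive metadata section (e.g. lightweight DC or richer MARC/XML)
dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="DMD1")
ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")

# Administrative metadata section (rights, technical, PREMIS preservation)
amd = ET.SubElement(mets, f"{{{METS}}}amdSec", ID="AMD1")

# The content files themselves
filesec = ET.SubElement(mets, f"{{{METS}}}fileSec")
grp = ET.SubElement(filesec, f"{{{METS}}}fileGrp")
ET.SubElement(grp, f"{{{METS}}}file", ID="FILE1", MIMETYPE="application/pdf")

# The structural map that binds the parts of the compound object together
smap = ET.SubElement(mets, f"{{{METS}}}structMap")
div = ET.SubElement(smap, f"{{{METS}}}div", LABEL="Article")
ET.SubElement(div, f"{{{METS}}}fptr", FILEID="FILE1")

print(ET.tostring(mets, encoding="unicode"))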

FRBR makes the distinction between a work, its expressions, manifestations and items. SWAP is a profile for scholarly works; being aimed at text works, it is not much use for data.
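As a toy illustration of the FRBR distinction (the class and field names below are my own, not a formal FRBR binding):

# Toy FRBR chain: Work -> Expression -> Manifestation -> Item.
# Names and fields are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class Item:            # a single held copy
    location: str

@dataclass
class Manifestation:   # a particular publication/format
    form: str
    items: list[Item] = field(default_factory=list)

@dataclass
class Expression:      # a particular realisation (e.g. accepted manuscript)
    version: str
    manifestations: list[Manifestation] = field(default_factory=list)

@dataclass
class Work:            # the abstract intellectual creation
    title: str
    expressions: list[Expression] = field(default_factory=list)

paper = Work("A study of X",
             [Expression("accepted",
                         [Manifestation("PDF", [Item("repository")])])])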

Names appear in multiple places in the metadata, and different parts have different rules. Do we have a name agent service (registries)? We need services to gather metadata automatically; that way you might introduce consistency and interoperability.

Kylie Pappalardo from QUT’s OAK Law project spoke on the legal issues in managing research data so that it can be included in a repository and accessed by others. There are government statements in favour of openness (eg Carr: More than one way to innovate; also the “Venturous Australia” strategy). To implement these policies we need changes to practice and culture, institutional engagement, legal issues being addressed, etc. Data is surrounded by law (!): copyright, contract, patents, policies, confidentiality, privacy, moral rights. There are conflicting legal rights: who can do what with the data? QUT has OAK Law and also the Legal Framework for e-Research project.

An online survey in May 2007 drew responses from 176 participants. 50 were depositing data in a database; of those, 46% said it was available openly and 46% required some or complete restrictions. 54% said their organisation did NOT have a data policy at all; where there was a policy, most were given as guidelines. 55% said they prepared plans for data management, two thirds of these at the time of the proposal and the balance later. It should be early, not least because data management costs and should be part of the proposal; also, disputes can be hard to resolve later. 7% felt that clearer info on sharing and re-use would help, and 90% wanted a “plain English” guide (who wouldn’t?). Lawyer language doesn’t help, so researchers make their own informal agreements… maybe OK if nothing goes wrong.

The group has a report: analysis of Legal Context of Infrastructure for Data Access and Re-use in Collaborative Research. Also Practical Data Management: a Legal and Policy Guide. They have some tools, including a “simple” Data Management (legal) toolkit to fill in to gather information about (eg) copyright ownership etc.

Peter Sefton of USQ talking about OAI-ORE, and what it can do for us. Making the point that we build things from a wide variety of standard components, which (mostly) work pretty well together, eg found bricks in a garden wall… OAI-PMH mostly works, moving metadata from one place to another. But it’s just the messenger, not the message. So a harvest of metadata across multiple repositories shows wide variations in the keywords, subjects etc. Problems with XACML for defining access policies: no standardisation on the names of subgroups, so in the end it’s no use for search. Point being that these standards may appear important but not work well in practice.
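Peter's point about variation across harvests is easy to see for yourself. Here is a minimal Python sketch that harvests oai_dc records over OAI-PMH and tallies the dc:subject values; the endpoint URL is a placeholder, and a real harvest would also need to follow resumption tokens:

# Minimal OAI-PMH harvest sketch: fetch oai_dc records and tally dc:subject
# values to see how inconsistent keywords are. Endpoint is a placeholder;
# resumptionToken handling is omitted for brevity.
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest_subjects(base_url: str) -> Counter:
    params = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    with urlopen(f"{base_url}?{params}") as resp:
        tree = ET.parse(resp)
    subjects = Counter()
    for record in tree.iter(f"{OAI}record"):
        for subj in record.iter(f"{DC}subject"):
            if subj.text:
                subjects[subj.text.strip().lower()] += 1
    return subjects

# e.g. harvest_subjects("https://repository.example.ac.uk/oai")  # placeholder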

So on to ORE (Object Reuse and Exchange)… Pete asks “shouldn’t it be exchange, then re-use?”. In the ORE view of a blog post, a resource map describes the post, but the post is in fact an aggregation (compound object) of HTML text, a couple of images, comments in separate HTML, etc. The aggregation does have a URI, but does not have a fetchable reality (the resource map does). This can get complex very rapidly. See the Repository Challenge at OpenRepositories 2008 in Southampton, and the ORE Challenge at RepoCamp 2008. USQ is participating with Cambridge in the JISC-funded TheOREM project, and also in an internal image project called The Fascinator, based on his ICE system that has been mentioned here before, integrated with ORE tools to push stuff into the repository. Alternatively, the repository can watch, so it can fetch content for itself.
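Here is a sketch of that blog-post example as an ORE aggregation and resource map, built with rdflib in Python. The ORE terms namespace is the real one; the post and part URIs are invented for illustration:

# Sketch of a blog post as an ORE resource map plus aggregation.
# URIs are invented; only the ORE vocabulary namespace is real.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

rem = URIRef("http://example.org/blog/post-1/rem")          # the resource map (fetchable)
agg = URIRef("http://example.org/blog/post-1#aggregation")  # the aggregation (abstract)

g = Graph()
g.bind("ore", ORE)
g.add((rem, RDF.type, ORE.ResourceMap))
g.add((agg, RDF.type, ORE.Aggregation))
g.add((rem, ORE.describes, agg))

# The parts of the compound object: the HTML text, two images, the comments page
for part in ["post.html", "figure1.png", "figure2.png", "comments.html"]:
    g.add((agg, ORE.aggregates, URIRef(f"http://example.org/blog/post-1/{part}")))

print(g.serialize(format="turtle"))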

ORE can: supplement OAI-PMH for moving content around; improve research tools like Zotero; replace the use of METS packages; allow “thesis by publication” more elegantly; and pave the way for a repository architecture that understands content models (no more discussion of atomistic versus compound objects).

Thursday, 7 August 2008

Repositories and the CRIS

As I mentioned in the previous post, there has been some discussion in the JISC Repositories task force about the relationship between repositories and Current Research Information Systems (CRIS). Stuart Lewis asserted, for example, that “Examples of well-populated repositories such as TCD (Dublin) and Imperial College are backed by CRISs.” So it seems worthwhile to look at the CRIS with repositories in mind.

There has been quite a bit of European funding of the CRIS concept; not surprising, perhaps, as research funders would be significant beneficiaries of the standardisation of information that could result. An organisation (EuroCRIS) has been created, and has generated several versions of a data model and interchange standard (which it describes as the EC-recommended standard CERIF: Common European Research Information Format), of which the current public version is known as CERIF 2006 V1.1. CERIF 2008 is under development, and a model summary is accessible; no doubt many details are different, but it does not appear to have radically changed, so perhaps CERIF is stabilising. Parts of the model are openly available at the EuroCRIS web site, but other parts require membership of EuroCRIS before they are made available. This blog post is based only on part of the publicly accessible information.

I decided to have a look at the CERIF 2006 V1.1 Full Data Model (EuroCRIS, 2007) document, with a general aim of seeing how helpful it might be for repositories. Note this is not in any sense a cross-walk between the CERIF standards and those applicable to repository metadata.

Quoting from a summary description of the model:
“The core CERIF entities are Person, OrganisationUnit, ResultPublication and Project. Figure 1 shows the core entities and their recursive and linking relationships in abstract view. Each core entity recursively links to itself and moreover has relationships with other core entities. The core CERIF entities represent scientific actors and their main research activities.
Figure 1: CERIF Core Entities in Abstract View”

The loops on each entity in the diagram above represent the recursive relationships, so a publication – publication relationship might represent a citation, or one of a series, or a revision, etc. Entity identifiers all start with cf, which might help make the following extract more understandable:
“The core CERIF entities represent scientific actors (Persons and Organisations) and their main research activities (Projects and Publications): Scientists collaborate (cfPers_Pers), are involved in projects (cfProj_Pers), are affiliated with organisations (cfOrgUnit_Pers) and publish papers (cfPers_ResPubl). Projects involve people (cfProj_Pers) and organisations (cfProj_OrgUnit). Scientific publications are published by organisations (cfOrgUnit_ResPubl) and refer to projects (cfProj_ResPubl), publications involve people (cfPers_ResPubl), organisations store publications (cfResPubl), support or participate in projects (cfProj_OrgUnit), and employ people (cfPers_OrgUnit). To manage type and role definitions, references to classification schemes (cfClassId; cfClassSchemeId) are employed that will be explained separately…”
The model adds other “second-level” entities:
Figure 5: CERIF 2nd Level Entities in Abstract View

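A rough Python sketch may help to show the pattern described above: core entities plus separate link entities carrying classification references. The names loosely follow the cf* identifiers in the quote, but this is illustrative rather than a faithful rendering of the CERIF 2006 schema:

# Illustrative sketch of the CERIF pattern: core entities and link entities
# with classification references. Not a faithful rendering of the schema.
from dataclasses import dataclass

@dataclass
class cfPers:               # core entity: a person
    cfPersId: str
    name: str

@dataclass
class cfProj:               # core entity: a project
    cfProjId: str
    title: str

@dataclass
class cfResPubl:            # core entity: a result publication
    cfResPublId: str
    title: str

@dataclass
class cfProj_Pers:          # link entity: a person's role on a project
    cfProjId: str
    cfPersId: str
    cfClassId: str          # e.g. a role such as "principal investigator"
    cfClassSchemeId: str    # the classification scheme the role comes from

@dataclass
class cfPers_ResPubl:       # link entity: a person's relation to a publication
    cfPersId: str
    cfResPublId: str
    cfClassId: str          # e.g. "author", "editor"
    cfClassSchemeId: str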
As Ian Stuart wrote in the repositories discussion:
“The current repository architecture we have is a full deposit for each item (though some Repositories jump round this with some clever footwork) - which runs straight into the "keystroke" problem various people have identified.

With a CRIS-like architecture, the user enters small amounts of [meta-]data, relevant to the particular thing - be it a research grant; a presentation at a conference; a work-in-progress; a finished article; whatever.... and links them together. It is these links that then allows fuller metadata records to be assembled.”
It should be clear that having this much information available would make populating a repository a relatively trivial task. It's not entirely clear how easily the metadata would transfer; Dublin Core does appear to have been used in the model, but whether it is mandatory or optional is not clear from this document. I suppose it’s also not clear how much the CERIF standard is used in actual implementations.
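To illustrate Ian Stuart's point about assembling fuller records from small linked fragments, here is a hypothetical Python sketch; the fragment structures and field names are my own assumptions, not any particular CRIS:

# Hypothetical sketch: small linked fragments (person, grant, article) entered
# once, then joined into a fuller Dublin-Core-ish record for deposit.
person = {"id": "p1", "name": "Jane Smith"}                       # entered once
grant = {"id": "g1", "funder": "Example Council", "code": "EX/1"}
article = {"id": "a1", "title": "A study of X", "creator": "p1", "grant": "g1"}

def assemble_record(article: dict, people: dict, grants: dict) -> dict:
    """Join the linked fragments into one deposit-ready metadata record."""
    funding = grants[article["grant"]]
    return {
        "dc:title": article["title"],
        "dc:creator": people[article["creator"]]["name"],
        "dc:description": f"Funded by {funding['funder']} ({funding['code']})",
    }

record = assemble_record(article, {"p1": person}, {"g1": grant})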

I’ll leave the penultimate word to Stuart Lewis, who wrote
“[The CRIS] doesn't replace the repository, but offers a management tool for research. Inputs to the CRIS can include grant record systems, student record systems, MIS systems etc, and outputs can include publication lists and repositories… Now we just need an open source CRIS platform... E.g.:

http://www.symplectic.co.uk/products/publications.html
http://www.atira.dk/en/pure/
http://www.unicris.com/lenya/uniCRIS/live/rims.html”

Note, I have not checked out any of the above products in any detail, although the third one appears to be a non-responsive web site; use at your own risk! And it should be obvious that anything that links into MIS systems in any institution is likely to require a major implementation effort. It may also suffer from institutional silo effects (it links to my institution's management information, but not my collaborators’).

Reference:

EuroCRIS. (2007). CERIF2006-1.1 Full Data Model (FDM) Model Introduction and Specification (Recommended standard). http://www.dfki.de/%7Ebrigitte/CERIF/CERIF2006_1.1FDM/CERIF2006_FDM_1.1.pdf