Showing posts with label supplementary materials. Show all posts

Wednesday, 2 July 2008

Research Repository System data management

This is the sixth of a series of posts aiming to expand on the idea of the negative click, positive value repository, which I'm now calling a Research Repository System. I've suggested it should contain these elements:
Data management support is where this starts to link more strongly back to digital curation. Bear in mind here, this is a Research Repository System; not all of these functions, or the next group, need to be supported by anything that looks like one of the current repository implementations! I’m not quite clear on all of the features you might need here, but we are beginning to talk about a Data Repository.

It is essential that the Data Management elements support current, dynamic data, not just static data. You may need to capture data from instruments, process it through workflow pipelines, or simply sit and edit objects, eg correcting database entries. Data Management also needs to support the opposite: persistent data that you want to keep un-changed (or perhaps append other data to while keeping the first elements un-changed).

One important element could be the ability to check-point dynamic, changing or appending objects at various points in time (eg corresponding to an article). In support of an article, you might have a particular subset available as supplementary data, and other smaller subsets to link to graphs and tables. These checkpoints might often be permanent (though not always), and would require careful disclosure control (for example, unknown reviewers might need access to check your results prior to publication).
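To make the check-pointing idea concrete, here is a minimal sketch (the class and method names are purely illustrative, not any real repository API): an append-only store whose current contents can be frozen into named, immutable checkpoints, eg the data supporting a particular article.

```python
# Illustrative sketch only: an append-only data store with named,
# immutable checkpoints, as the Data Management element might provide.
class AppendOnlyStore:
    def __init__(self):
        self._records = []
        self._checkpoints = {}

    def append(self, record):
        """New data may be appended; existing records are never changed."""
        self._records.append(record)

    def checkpoint(self, label):
        """Freeze the current state, eg the data supporting an article."""
        self._checkpoints[label] = tuple(self._records)

    def get_checkpoint(self, label):
        """Return the frozen snapshot, unaffected by later appends."""
        return self._checkpoints[label]
```

Later appends leave the snapshot untouched, which is exactly the property you want when reviewers (or readers of the published article) need to see the data as they were at submission time. Disclosure control would then be a matter of deciding who may call `get_checkpoint` for which label.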

Some parts of Data Management might support laboratory notebook capabilities, keeping records with time-stamps on what you are doing, and automatically providing contextual metadata for some of the captured datasets. Some of these elements might also provide some Health and Safety support (who was doing what, where, when, with whom and for how long).
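The automatic contextual-metadata idea might look something like the following sketch (again, the function and field names are hypothetical): each captured dataset is wrapped with a time-stamped record of who captured it and from which instrument.

```python
# Illustrative sketch only: attach "lab notebook" context (who, what,
# when) to each captured dataset automatically at capture time.
import datetime
import getpass


def capture(dataset, instrument, operator=None):
    """Wrap a captured dataset with time-stamped contextual metadata."""
    return {
        "data": dataset,
        "metadata": {
            "instrument": instrument,
            "operator": operator or getpass.getuser(),
            "captured_at": datetime.datetime.now(
                datetime.timezone.utc
            ).isoformat(),
        },
    }
```

Records like these would also give you the raw material for the Health and Safety support mentioned above: who was doing what, where and when falls out of the metadata for free.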

Negative Click, Positive Value Research Repository Systems


I promised to be more specific about what I would like to see in repositories that presented more value for less work overall, by offering facilities that allow it to become part of the researcher’s workflow. I’m going to refer to this as “the Research Repository System (RRS)” for convenience.

At the top of this post is a mind map illustrating the RRS. A more complete mind map (in PDF form) is accessible here.

The main elements that I think the RRS should support are (not in any particular order):
Here’s a quick scenario to illustrate some of this. Sam works in a highly cross-disciplinary laboratory, supported by a Research Repository System. Some data comes from instruments in the lab, some from surveys that can be answered in both paper and web form, some from reading current and older publications. All project files are kept in the Persistent Storage system, after the disaster last year when both the PIs lost their laptops from a car overseas, and much precious but un-backed-up data were lost. The data are managed through the RRS Data Management element, and Sam has requested a checkpoint of data in the system because the group is near finalising an article, and they want to make sure that the data that support the article remain available, and are not over-written by later data.

Sam is the principal author, and has contributed a significant chunk of the article, along with a colleague from their partner group in Australia; colleagues from this partner group have the same access as members of Sam’s group. Everyone on the joint author list has access to the article and contributes small sections or changes; the author management and version control system does a pretty good job of ensuring that changes don’t conflict. The article is just about to be submitted to the publisher, after the RRS staff have negotiated the rights appropriately, and Sam is checking out a version to do final edits on the plane to a conference in Chile.

None of the data are public yet, but they are expecting the publisher to request anonymous access to the data for the reviewers they assign. Disclosure control will make selected check-pointed data public once the article is published. Some of the data are primed to flow through to their designated Subject Repository at the same time.

One last synchronisation of her laptop with the Persistent Storage system, and Sam is off to get her taxi downstairs…

This blog post would be too big if I included everything, so I have released separate blog posts for each Research Repository System element, linking them all back to this post, and linked each element above to the corresponding detailed posts.

OK I’m sure there’s more, although I’m not sure a Research Repository System of this kind can be built for general use. Want one? Nothing up my sleeve!

Tuesday, 15 April 2008

Wouldn't it be nice...

... if we had some papers for the 4th International Digital Curation Conference in Edinburgh (dinner in the Castle, anyone?) next December, created in a completely open fashion by a group of people who have perhaps never met, using data that's publicly available, and where the data and data structures in the paper are tagged (semantic web???), attached as supplementary materials, or deposited so as to be publicly available.

No guarantees, folks, since the resulting paper has to get through the Programme Committee rather than just me. But it would have a wonderful feeling of symmetry...

Wednesday, 29 August 2007

IJDC again

At the end of July I reported on the second issue of the International Journal of Digital Curation (IJDC), and asked some questions:
"We are aware, by the way, that there is a slight problem with our journal in a presentational sense. Take the article by Graham Pryor, for instance: it contains various representations of survey results presented as bar charts, etc in a PDF file (and we know what some people think about PDF and hamburgers). Unfortunately, the data underlying these charts are not accessible!

"For various reasons, the platform we are using is an early version of the OJS system from the Public Knowledge Project. It's pretty clunky and limiting, and does tend to restrict what we can do. Now that release 2 is out of the way, we will be experimenting with later versions, with the aim of including supplementary data (attached? External?) or embedded data (RDFa? Microformats?) in the future. Our aim is to practice what we may preach, but we aren't there yet."

I didn't get any responses, but over on the eFoundations blog, Andy Powell was taking us to task for only offering PDF:
"Odd though, for a journal that is only ever (as far as I know) intended to be published online, to offer the articles using PDF rather than HTML. Doing so prevents any use of lightweight 'semantic' markup within the articles, such as microformats, and tends to make re-use of the content less easy."
His blog is more widely read than this one, and he attracted 11 comments! The gist of them was that PDF plus HTML (or preferably XML) was the minimum that we should be offering. For example, Chris Leonard [update, not Tom Wilson! See end comments] wrote:
"People like to read printed-out pdfs (over 90% of accesses to the fulltext are of the pdf version) - but machines like to read marked-up text. We also make the xml versions available for precisely this purpose."
Cornelius Puschmann [update, not Peter Sefton] wrote:
"Yeah, but if you really want semantic markup why not do it right and use XML? The problematic thing with OJS (at least to some extent) is/was that XML article versions are not the basis for the "derived" PDF and HTML, which deal almost purely with visuals. XML is true semantic markup and therefore the best way to store articles in the long term (who knows what formats we'll have 20 years from now?). HTML can clearly never fill that role - it's not its job either. From what I've heard OJS will implement XML (and through it neat things such as OpenOffice editing of articles while they're in the workflow) via Lemon8 in the future."
Bruce D'Arcus [update, not Jeff] says:
"As an academic, I prefer the XHTML + PDF option myself. There are times I just want to quickly view an article in a browser without the hassle of PDF. There are other times I want to print it and read it "on the train."

"With new developments like microformats and RDFa, I'd really like to see a time soon where I can even copy-and-paste content from HTML articles into my manuscripts and have the citation metadata travel with it."
Jeff [update, not Cornelius Puschmann] wrote:
"I was just checking through some OJS-based journals and noticed that several of them are only in PDF. Hmmm, but a few are in HTML and PDF. It has been a couple of years since I've examined OJS but it seems that OJS provides the tools to generate both HTML and PDF, no? Ironically, I was going to do a quick check of the OJS documentation but found that it's mostly only in PDF!

"I suspect if a journal decides not to provide HTML then it has some perceived limitations with HTML. Often, for scholarly journals, that revolves around the lack of pagination. I noticed one OJS-based journal using paragraph numbering but some editors just don't like that and insist on page numbers for citations. Hence, I would bet that's why they chose PDF only."
I think in this case we used only PDF because that was all our (old) version of the OJS platform allowed. I certainly wanted HTML as well. As I said before, we're looking into that, and hope to move to a newer version of the platform soon. I'm not sure it has been an issue, but I believe HTML can be tricky for some kinds of articles (Maths used to be a real difficulty, but maybe they've fixed that now).

I think my preference is for XHTML plus PDF, with the authoritative source article in XML. I guess the workflow should be author-source -> XML -> XHTML plus PDF, where author-source is most likely to be MS Word or LaTeX... Perhaps in the NLM DTD (that seems to be the one people are converging towards, and it's the one adopted by a couple of long term archiving platforms)?
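That workflow can be sketched in miniature. The example below (a sketch only; the element names stand in for a drastically simplified NLM-style structure, and real NLM DTD documents are far richer) takes an article in XML as the authoritative source and derives XHTML from it; a parallel branch would derive the PDF.

```python
# Sketch: derive XHTML from a (much simplified) NLM-style XML source.
# Element names here are illustrative, not the full NLM DTD.
import xml.etree.ElementTree as ET

NLM_SAMPLE = """<article>
  <front><article-meta>
    <title-group><article-title>Curating Data</article-title></title-group>
  </article-meta></front>
  <body>
    <p>Data should travel with the article.</p>
    <p>Supplementary materials need preservation too.</p>
  </body>
</article>"""


def nlm_to_xhtml(xml_text):
    """Transform the authoritative XML source into a minimal XHTML page."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//article-title", default="Untitled")
    html = ET.Element("html", xmlns="http://www.w3.org/1999/xhtml")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = title
    for p in root.iterfind(".//body/p"):
        ET.SubElement(body, "p").text = p.text
    return ET.tostring(html, encoding="unicode")
```

In a production workflow the transformation would more likely be an XSLT stylesheet than hand-written code, but the principle is the same: the XML is the single authoritative source, and XHTML and PDF are both derived renditions.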

But I'm STILL looking for more concrete ideas on how we should co-present data with our articles!

[Update: Peter Sefton pointed out to me in a comment that I had wrongly attributed a quote to him (and by extension, to everyone); the names being below rather than above the comments in Andy's article. My apologies for such a basic error, which also explains why I had such difficulty finding the blog that Peter's actual comment mentions; I was looking in someone else's blog! I have corrected the names above.

In fact Peter's blog entry is very interesting; he mentions the ICE-RS project, which aims to provide a workflow that will generate both PDF and HTML, and also bemoans how inhospitable most repository software is to HTML. He writes:
"It would help for the Open Access community and repository software publishers to help drive the adoption of HTML by making OA repositories first-class web citizens. Why isn't it easy to put HTML into Eprints, DSpace, VITAL and Fez?

"To do our bit, we're planning to integrate ICE with Eprints, DSpace and Fedora later this year building on the outcomes from the SWORD project – when that's done I'll update my papers in the USQ repository, over the Atom Publishing Protocol interface that SWORD is developing."
So thanks again Peter for bringing this basic error to my attention, apologies to you and others I originally mis-quoted, and I look forward to the results of your efforts! End Update]

Tuesday, 31 July 2007

IJDC Issue 2

I am very happy that Issue 2 of the International Journal of Digital Curation (IJDC) is now out. This open access journal issue contains 7 peer-reviewed papers and 7 general articles.

I am listed as editor, and did write the editorial, but Richard Waller of UKOLN did all the hard editorial work, in between his day job on Ariadne. Not to mention our authors, of course! My heartfelt thanks to all concerned.

We have a few articles in the pipeline for Issue 3, but I do want to get up a good head of steam. So if you, dear reader, care about digital curation, the care and feeding of science data, the relationship of data to publication, the rights and wrongs of access to data, and/or digital preservation, and have some research results to put forward, please do write us a paper or an article!

We are aware, by the way, that there is a slight problem with our journal in a presentational sense. Take the article by Graham Pryor, for instance: it contains various representations of survey results presented as bar charts, etc in a PDF file (and we know what some people think about PDF and hamburgers). Unfortunately, the data underlying these charts are not accessible!

For various reasons, the platform we are using is an early version of the OJS system from the Public Knowledge Project. It's pretty clunky and limiting, and does tend to restrict what we can do. Now that release 2 is out of the way, we will be experimenting with later versions, with the aim of including supplementary data (attached? External?) or embedded data (RDFa? Microformats?) in the future. Our aim is to practice what we may preach, but we aren't there yet.

If anyone knows how to do this, do please get in touch!

Wednesday, 30 May 2007

Supplementary Material in journal publishing

I went to a couple of interesting events in March and April. The first was the eJournal Archiving & Preservation Workshop held at the British Library on 27 March, 2007 (sponsored by the Digital Preservation Coalition, BL and JISC). The keynote was Anne Kenney talking about the "Metes and Bounds" report on eJournal preservation, and there were many good talks. But the interesting thing in this context was how often the problem of "supplementary materials" came up in questions. This meant all sorts of things, including multimedia, but in particular it related to associated data. A good example of a journal that specifically aims to exploit the Internet's capabilities in providing access to supplementary materials (rather than just transporting two-column PDFs around for me to print off) is Internet Archaeology. The problem of course is that this kind of material falls outside the bounds of things that can be easily preserved with the developing preservation systems.

The next event was an ALPSP workshop on the transformation of research communication, on 13 April at the IMechE in London. This event was mainly aimed at publishers, but again supplementary materials and specifically data came up several times in questions and panel sessions, although this time data also came up in presentations as well (eg Christine Borgman and Simon Coles).

Of course, while having a journal literature that enables connection to underlying data is a pretty neat idea, not everyone is happy with the publishers taking control of that data and adding it to their lock-in of scholarly intellectual property. Generally I agree with these concerns, but we have yet to establish a sustainable alternative model for providing continuing access to these data either. I think this one will run and run!