Digital Curation Blog: April 2009

Tuesday, 21 April 2009

Lazyweb: Mac backup software associated with Iomega Jaz?

In my data recovery sideline, I thought I might tackle some of my own ancient media. I have a number of Mac backups on ancient Jaz 1 GB disks, written in the period 1997-1999. Andrew Treloar of ANDS has recovered the contents of those files onto CD-ROM, so I'm not looking to read the media any more. There are 2 backups on each of 2 Jaz disks, and the CD has 4 files from each Jaz disk. The earliest two have the following names:

FIGIT 2.971113A #001
FIGIT 2.971113A.FULL

I'm guessing these were written on 13 November, 1997. Definitely from a Mac. Does anyone know what backup software wrote these? I would like to recover the contents if possible!

The .FULL is about 250 KB, while the #001 appears to be around 300 MB and is presumably the actual backup!

Any ideas on what the backup software might be? I thought it might be some standard software on the Iomega install. The install disks I still have don't read properly, but I managed to get sight of a directory for a French DOS install disk, and couldn't see any file names that looked like backup software!
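
One low-tech way to narrow down the candidates is to look at the first few bytes of each file, since many backup formats begin with a recognisable signature. The following is a minimal Python sketch of such a header dump; the function name `describe_header` is invented for illustration, and nothing in it assumes any particular backup product.

```python
import string

# Characters we are happy to show as-is in the text column.
PRINTABLE = set(string.ascii_letters + string.digits + string.punctuation + " ")

def describe_header(data: bytes, length: int = 64) -> str:
    """Return a hex plus printable-ASCII view of the first `length` bytes."""
    head = data[:length]
    lines = []
    for offset in range(0, len(head), 16):
        chunk = head[offset:offset + 16]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as dots, as in a classic hex dump.
        text_part = "".join(chr(b) if chr(b) in PRINTABLE else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<47}  {text_part}")
    return "\n".join(lines)
```

To try it, read the first 64 bytes of a recovered file in binary mode (for example the file named "FIGIT 2.971113A #001") and print the result of `describe_header`. Any readable product name or version string near the start would be a useful clue.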

Sunday, 19 April 2009

Engaging e-Science with Infrastructure

Last Friday I was at the National e-Science Centre (NeSC) for the second day of a workshop on e-Science; unfortunately I wasn’t able to get to the first day. I tweeted most of it with the Twitter hashtag #eresearch, so you might learn all you need to just by searching for that hashtag (although it is used for other things, especially by some of the Australians involved in ANDS). One thing I did like was the use of e-Science quite generally, without a specific focus on any one technology (ie it was not GRID-dominated!). However, that does make it a little harder to define the e-Science bit.

Leif Laaksonen from Helsinki, Chair of the e-Infrastructure Reflection Group, gave the keynote (not sure if, or when, the slides will be made available). The e-IRG prepares white papers and other documents aiming to realise the “vision for the future […] an open e-Infrastructure enabling flexible cooperation and optimal use of all electronically available resources”. He mentioned the European Strategy Forum on Research Infrastructures (ESFRI); this group must be doing something right, as apparently they have 10 billion Euros to play with to build and maintain this infrastructure! He spoke quite a lot about sustainability, but it appeared that this means not paying for infrastructure through a succession of European projects, but rather through sustained funding through national governments. Hmmm. I wondered what happened when a country has a bad budget year and cuts its infrastructure project; how much of the global infrastructure can be damaged by that?

Laaksonen had a number of interesting diagrams in his presentation, which can be seen from the web site. One that alarmed me (slide 17?) had an un-differentiated layer of “data” just above the network layer. I worry that this is a dangerously simplistic summary; the data layer is far more fractured than that, with different disciplinary, sub-disciplinary, even sub-group approaches to curation.

Laaksonen did show a great slide, dating from 2003, showing routes of innovation from academic research to industrial acceptance. Not a monotonic progress!

This was followed by 3 quick presentations. The first was on the National Grid Service (not the one that supplies electricity and gas, but the one that supplies compute and data storage resources, using, yes, GRID technologies). The second was on OMII-UK, aiming to sustain community software developments, but itself, like the DCC, perhaps facing its own sustainability crisis? Finally, I gave a short presentation on the DCC, based on the one I gave to the environmental data managers a month or so ago. Then Bruce Beckles from Cambridge gave a wonderfully enthusiastic talk on being a one-man support service for e-Science within the Cambridge IT infrastructure.

After an hour-long panel session (no notes as I was on the panel; there was quite a focus on data, and on education & training), and lunch, there was an interesting demo session in the afternoon. This was organised so that each demo was given twice, and there were 3 sessions of 20 minutes each, so you could see quite a lot. I took in a Taverna demo (workflows, which I've wanted to understand better for some time), found I was the sole person in the second demo on NaCTeM (which meant I could ask all the text mining questions I wanted), and then saw the e-Science Central demo. Disappointing that the latter invented their own workflow system, although they claim there were good reasons, and they are hoping to backstitch Taverna in later (@lescarr tweeted back to me that their site doesn't use UTF-encodings, so if you hover over their rather nice cartoon images, the captions come up all wonky!).

Finally the wrap-up session, chaired by David de Roure (not as advertised). We were meant to find “just 3 things” that needed to be done to move e-Science more firmly into the national infrastructure. But of course our enthusiasm got the better of us, and we couldn’t stop. I think David and Malcolm Atkinson between them have the job of winnowing it down to the 3 top priorities. Altogether, a very interesting day; it’s good to see data becoming a real priority in e-Science!

Amiga disk data recovery: progress and limitations

You may remember that I have been attempting to recover files and content from various sources from 10 or more years ago. One of these was an Amiga disk. On the label is the note: "Dissertation 17/4/96, CV.asc, CV, 29 September 1996".

I’ve described earlier some attempts to get the Catweasel controller to read the disks. After eventually figuring out how to configure the disk-reading program ImageTool3 for the Catweasel, I tried the Amiga disk. It worked fine, with as far as I can see zero errors. From a cursory scan of Google, I reckon this raw disk format is known as ADF, so I renamed it XXXAmiga.adf (.adf was one of the candidate extension names under the selected "Plain" category for the ImageTool3 program).
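
A quick sanity check on such an image is possible before trying an emulator. As far as I know, a standard double-density Amiga floppy has 80 cylinders, 2 heads and 11 sectors of 512 bytes per track, and an AmigaDOS-formatted disk starts with the three bytes "DOS" followed by a flags byte. The Python sketch below tests both; the function name is made up for illustration, and a disk with a custom bootblock would fail the second test without necessarily being a bad read.

```python
# 80 cylinders * 2 heads * 11 sectors * 512 bytes = 901,120 bytes
ADF_DD_SIZE = 80 * 2 * 11 * 512

def looks_like_adf(image: bytes) -> dict:
    """Report simple plausibility checks for a double-density ADF image."""
    return {
        # A complete, plain sector dump should be exactly this long.
        "size_ok": len(image) == ADF_DD_SIZE,
        # AmigaDOS disks begin with the ASCII letters "DOS".
        "dos_bootblock": image[:3] == b"DOS",
        # Fourth byte holds file system flags (commonly 0 for the original
        # file system and 1 for the Fast File System); returned raw here.
        "flags": image[3] if len(image) > 3 else None,
    }
```

Reading XXXAmiga.adf in binary mode and passing the bytes to `looks_like_adf` should show at a glance whether the image has the expected size and signature.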

Now, of course, we have to work out first how to extract files from the disk image, and then how to convert your particular file format into a modern day format.

Just simply reading the raw disk image with Notepad on Windows or Textedit (on my Mac) shows that there is real text there, that made sense to my colleague (see below)!

A comment from “Euan” on my earlier post suggested that we try the WinUAE Amiga emulator, and my colleague did that. He reported:

“Success. I've not only got the WinUAE Amiga emulator working, but managed to find a copy of the application that I wrote my CV and dissertation in (Final Writer 5) and have been able to read the files off the disk image you sent and display them (screenshots attached).

Not having any luck reading the individual files directly [CR: from his Windows system], though -- other than the odd word related to fonts and colours -- but then they are in native FW5 format.”

Image from the dissertation seen in Final Writer

I asked if he was able to do any "save as" operations in his emulated Final Writer program, to move files from the disk image into the Windows file store. He reported:

“I've tried re-saving my files in another format, if that's what you mean, but the program doesn't do anything -- I can select Save > Save As... from the menu but nothing happens. However, I can see all my individual dissertation files from my PC as the file system is mapped onto a directory.”

Raw image from the same part of the dissertation grabbed from Textedit on the Mac

It was remarkable how much could be read directly in the disk image!
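
What Notepad and Textedit were doing by eye can be approximated in a few lines of code: scan the raw image for runs of printable characters and ignore everything else. This is a minimal Python sketch of that idea (much like the Unix `strings` utility); the function name and the minimum run length of 6 are arbitrary choices, and it only looks for plain ASCII, so accented characters would split a run.

```python
import re

def printable_runs(data: bytes, min_len: int = 6) -> list:
    """Return runs of at least `min_len` printable ASCII characters."""
    # Bytes 0x20 to 0x7e are the printable ASCII range (space to tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]
```

Run over a disk image, this gives the readable fragments in the order they appear on disk, which is not necessarily the order of the original document.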

Now my colleague was able to read his CV.asc file with Notepad on Windows, but so far we have not been able to convert the dissertation to a modern format, nor to connect the Final Writer program inside the emulator to a printer. Frustratingly close, but still not quite where we would like to be. I did find a demo copy of Final Writer for Windows 95 on the WayBack machine (earliest lift of the site, also 1996), but unfortunately this wouldn't open the existing image unless we upgrade to the full-featured version... but the company appears to have gone bust in 1996-7 or thereabouts!

So what have we learned from this?

It is possible to read a 13 year old floppy disk from an obsolete machine with an apparently incompatible disk format, kept under conditions of less than benign neglect, using cheap hardware on a recent Windows PC.
It is possible to access the files from the obsolete operating system using an emulator that appears to have been written by spare time volunteers.
It is possible to run the original application that created some of these files, under the emulator, and to read and process them (but not, so far, to save in another format).
Using the emulator is valuable, but constraining (in being unfamiliar technology, with few manuals etc) and limiting (in not, so far, being able to do much more with the files). We would now like to migrate them to a modern environment; for my colleague, this means Windows or Linux.

Fascinating!

Wednesday, 15 April 2009

5th International Digital Curation Conference: Call for Papers

The Call for Papers for the 5th International Digital Curation Conference has just been published. With the title "Moving to Multi-Scale Science: Managing Complexity and Diversity", the conference will be held in London from 2-4 December, 2009. I believe this is THE conference for papers on advances in digital and data curation! The text of the call follows:

We invite submission of full papers, posters, workshops and demos and welcome contributions and participation from individuals, organisations and institutions across all disciplines and domains that are engaged in the creation, use and management of digital data, especially those involved in the challenge of curating data for e-science and e-research.

Proposals will be considered for short (up to 6 pages) or long (up to 12 pages) papers and also for demonstrations, workshops and posters. The full text of papers will be peer-reviewed; abstracts for all posters, workshops and demos will be reviewed by the co-chairs. Final copy of accepted contributions will be made available to conference delegates, and papers will be published in our International Journal of Digital Curation [external]. Accordingly, we recommend that you download our template and read the advice on its use.

Papers should be original and innovative, probably analytical in approach, and should present or reference significant evidence (whether experimental, observational or textual) to support their conclusions.

Subject matter could be policy, strategic, operational, experimental, infrastructural, tool-based, and so on, in nature, but the key elements are originality and evidence. Layout and structure should be appropriate for the disciplinary area. Papers should not have been published in their current or a very similar form before, other than as a pre-print in a repository.

We seek papers that respond to the main themes of the conference: multi-scale, multi-discipline, multi-skill and multi-sector, and that relate to the creation, curation, management and re-use of research data. Research data should be interpreted broadly to include the digital subjects of all types of research and scholarship (including Arts and Humanities, and all the Sciences). Papers may cover:
Curation practice and data management at the extremes of scale (e.g. interactions between small science and big science, or extremes of object size, numbers of objects, rates of deposit and use)
Challenging content: (e.g. addressing issues of data complexity, diversity and granularity)
Curation and e-research, including contextual, provenance, authenticity and other metadata for curation (e.g. automated systems for acquiring such metadata)
Research data infrastructures, including data repositories and services
Disciplinary and inter-disciplinary curation challenges and data management approaches, standards and norms
Promoting, enabling, demonstrating and characterizing the re-use of data
Semantically rich documents (e.g. the “well-supported article”)
The human infrastructure for curation (e.g. skills, careers, training, curriculum and organisational support structures)
Curation across academia, government, commerce and industry
Legal and policy issues; Creative Commons, special licences, the public domain and other approaches for re-use, and questions of privacy, consent, and embargo
Sustainability and economics: understanding business and financial models; balancing costs, benefits and value of digital curation
Important Dates
Submission of papers for peer-review: 24 July 2009
Submission of abstracts for posters/demos/workshops: 24 July 2009
Notification of authors of papers: 18 September 2009
Notification of authors of posters/demos/workshops: 2 October 2009
Final papers deadline: 13 November 2009
Final posters deadline: 13 November 2009

Heroic data recovery story: 40-year old lunar orbital data

Thanks to Andrew Treloar for the heads up on a wonderful data recovery story. Andrew’s link was to a Slashdot article; there’s not a lot of information there. After a bit of searching around, I found a longer version of the story from last November at CollectSpace, and an interesting discussion on WattsUpWithThat (dated 1 April, which is why I emphasised the November date!).

Roughly the story is that NASA has been sitting on a collection of some 1500 2” analogue tapes which captured high resolution images taken by the Lunar Orbiter missions in 1966-67 that preceded the Moon landings. The aim of this program was to identify sites for the landings. Although NASA only used low resolution versions of the images (including the famous “Earth Rise” image, see Wikipedia article), the high resolution images were recorded onto the tapes by special Ampex FR-900 instrumentation recorders. Many years later, the archivist concerned refused to throw the tapes out, and eventually even found 3 non-functioning drives, which she stored in her garage until news of the story leaked out to some interested people, including at least one retired Ampex engineer. Since then, with a little funding from NASA, and some space in an old McDonalds at NASA Ames, they have been attempting to refurbish the drives, and have managed to get at least two images off the tapes. If they can find more manuals and parts, they hope eventually to read all the tapes.

There’s lots that is interesting here. I particularly liked the response to this question from a reader: “What’s being done about the tape itself? You can’t just pull it out of the tins and thread it up.” Dennis Wingo, the main engineer involved, replied:

“You know, amazingly enough that is exactly what we do. I am continually amazed that this works as I used to run a TV studio in LA where we could not do that even 20 years ago, but we have had an amazingly small number of tapes that even produce head clogs. We have run some tapes as many as 20-30 times in doing testing. Absolutely amazing to us.

I heard a story, that seems plausible but not sure about it really. It turns out that in 1975 all of the vendors changed the formula for the adhesive that holds the iron particles to the tape back. The new formulation had problems with moisture and over time degraded significantly.

However, the tape before 1975 that did not have this problem was made with a different mixture that included WHALE OIL, I kid you not. I don’t know if this is true […].”

One commenter had a major set of suggestions on how to deal with the stream if all they had was the analogue signal. Dennis responded, in part:

“We are using PCIe digitizers on a Mac Pro workstation to capture our data so we are able to get high data rate captures up to 180 megasamples per second.

We can do a lot with that but we would have to do a LOT of digital post processing including doing the demodulation in the digital domain, which is NOT going to be easy. Much easier to do in hardware, especially since we have spent the money doing so.”

Hey, that's a hardware engineer talking! I guess it means he’s optimistic about being able to get the drives working, and keep them alive long enough to digitise the entire set, rather than trying any more radical software solutions.

Data recovery from > 40 years ago, anyone? It is heroic, but let’s not be frightened!

Monday, 6 April 2009

Libraries of the Future: SourceForge as Repository?

In his talk (which he pre-announced on his resumed blog), Peter Murray-Rust (PMR) suggested (as he has done previously) that we might like to think of SourceForge as an alternative model for a scholarly repository (“Sourceforge. A true repository where I store all my code, versioned, preserved, sharable”). I’ve previously put forward a related suggestion, stimulated by PMR’s earlier remarks, on the JISC-Repositories email list, where it got a bit of consideration. But this time the reaction, especially via the Twitter #lotf09 feed, was quite a bit stronger. Reactions really fell into two groups: firstly something like: SourceForge was setting our sights way too low; it’s ugly, cluttered by adverts, and SLOOOOOWWW (I can’t find the actual quote in my Twitter search; maybe it was on Second Life). Secondly, the use cases are different, eg @lescarr: “Repositories will look like sourceforge when researchers look like developers” (it might be worth noting that PMR is a developer as well as a researcher).

Through my JISC-Repositories explorations, I was already beginning to be convinced by the “different use-case” argument, although it was not articulated that way. SourceForge acts as a platform providing a range of services and utilities supporting the development of software by disparate teams of developers. Is there a parallel use case in the repositories area?

The closest I could come was some of my suggestions from my Research Repository System posts, for moving the repository upstream in the researcher’s workflow, and specifically helping authors write multi-author articles. This is a reasonable parallel in several ways, but there are some significant differences.

Software source files are simple text files, completely independent of whatever system is used to edit them. The source of a software system comprises many individual software modules, each of which is usually in a separate file and can be edited independently. There are strong file name conventions, and complex “recipe” systems (eg make files) for converting them from source to actionable programs, combining them with software libraries and system-wide variables. If one developer changes one module, the system “knows” how to rebuild itself.
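
The “knows how to rebuild itself” part is less magical than it sounds. At its simplest, a make-style tool compares modification times: a target is stale if it is missing or if any of its sources is newer. The Python sketch below shows only that single-level rule, with invented function names and made-up file names and timestamps; real build tools also follow chains of dependencies and run the rebuild commands.

```python
def needs_rebuild(target_mtime, source_mtimes):
    """A target is stale if it is missing or older than any of its sources."""
    if target_mtime is None:  # target has never been built
        return True
    return any(mtime > target_mtime for mtime in source_mtimes)

def stale_targets(rules, mtimes):
    """rules maps each target to its sources; mtimes maps file name to time."""
    return sorted(
        target
        for target, sources in rules.items()
        if needs_rebuild(mtimes.get(target), [mtimes[s] for s in sources])
    )

# Illustrative data: parser.c was edited after parser.o was last built.
rules = {"parser.o": ["parser.c"], "lexer.o": ["lexer.c"]}
mtimes = {"parser.c": 200, "parser.o": 100, "lexer.c": 50, "lexer.o": 60}
print(stale_targets(rules, mtimes))
```

Only the module that changed is rebuilt, which is exactly the property that a single monolithic word processor file lacks.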

Journal articles, by contrast, are mostly written using proprietary (sometimes open source) office systems which create and edit highly complex files whose formats are closely linked to the application. Apart from some graphic and tabular elements, an article will usually be one file, and if not, there is no convention on how it might be composed. Workflow for multi-author articles is entirely dependent on the lead author, who will usually work with colleagues to assign sections for authoring, gather these contributions, assemble and edit them into a coherent story, circulating this repeatedly for comments and suggested improvements (sometimes using tracked change features).

Using LaTeX and include files, it might be possible to re-create a similar approach to that used in software, but for the human reader, the absence of a single editorial “voice” would be very apparent. And anyway, most of us just don’t want to work that way. I got out of the habit of writing in LaTeX around 20 years ago, when word processors began to arrive on workable desktop computers (Word on the early Macs), and I’d prefer not to go back.

Oh, and while it’s a “category mistake” to confuse SourceForge with any particular version control system, it looks to me (and at least one respondent on JISC-Repositories) as if distributed systems like git are fitter for purpose (in any kind of distributed authoring system) than their more centralised companions like CVS, SVN etc.

Well, is PMR’s suggestion of SourceForge as a model for scholarly repositories entirely a dead duck? I think it probably is, but I do hope that those who design repository systems, and indeed virtual research environments, do think hard about the capabilities of such systems when thinking through their responses to user requirements. And particularly where those requirements relate to what the researchers and authors need, rather than their institutions, libraries or readers.

Libraries of the Future event

I attended the JISC Libraries of the Future event last Thursday, and in the end quite enjoyed it. I was glad it was late starting, as I managed to get myself lost on the way from the station (RTFM, Chris!). Maybe I was too hot and bothered to pay the right kind of attention, but I’m afraid I found most of the presentations rather too bland for my taste. I enjoyed the Twitter back-chat (Twitter hashtag #lotf09, if you’re interested), but found that participating in it reduced, or perhaps changed my concentration on what was being said (you have to listen hard, but you also have to type your tweet, and concentration dropped off in the latter). This was the first time I have seen the Twitter channel (plus Second Life chat) projected near-live above the speakers; it perhaps helped make Twits like me better-behaved.

Even Peter Murray-Rust’s speech (see his blog where he wrote what he would say), which had plenty of passion, was less strong on content than I was hoping. But 15 minutes is pretty limiting. I’ll comment on one aspect of Peter’s remarks (SourceForge as exemplar for repositories) separately.

Probably the best moments of the day came in responses from Citizen Darnton of Harvard to audience questions (Wikipedia); Darnton had spoken on the citizen’s perspective. Mis-quoting him a bit, he seemed to be suggesting that librarians might exercise their market power by cancelling all expensive journals; apparently the Max Planck Institutes had done just that. Well that would certainly undo the myth that libraries had been doing nothing for scientists! (CR: Perhaps a librarian might allocate say 2/3 of the original licence fee for paid access to articles on a first-come, first-served basis, thus making a real budget saving, proving whether or not scientists really used the journals, and providing incentives to use the OA versions! OK, the politics are a bit tough…)

Questioned after the Google man had explained how excellent it was that Google should digitise books and make them available, Citizen Darnton again made some interesting comments, this time a hint at the monopoly risks inherent in the Google Book Settlement. I won’t attempt to summarise his remarks as I’d be sure to get them wrong, but he certainly had nightmares thinking of the possible down-sides of that settlement, and the effective monopoly concentration of copyright that may ensue; the loss of control of the world’s knowledge from libraries to Google. Here are some tweets, in reverse order (I am a bit unsure of the ethics here):

• tomroper: #lotf09 monopoly. Contradiction between private property and public good. Google's mission (organise world's knowledge)
• cardcc: Comment on serious problem of Google Book Settlement: sounds bad; effectively going to Google control. #lotf09
• scilib: seems to be a common theme where libraries/academics surrender copyright to tech/powerful orgs... and then regret it #lotf09
• Gondul: Hand grenade question. We (libraries) are concerned about Google, yet Google unaware of us. #lotf09
• bethanar: #lotf09 Robert Darnton 'google is obviously a monopoly ... no other company can mount a rival digitisation project'
• tomroper: Robert: Google greatest ever seen; but orphan books, 1923-64 publications,; revenue to go to authors and publishers #lotf09
• cardcc: Q: Control of world's knowledge moving out of control of libraries, should we be worried? Harvard man has nightmares on Google #lotf09

Two other things stick in my memory. First, Chris Gutteridge (main author of the ePrints.org repository software platform) made a passionate interjection (or two) on the iniquities of the current scholarly communications system. It was largely what most people there already knew: “We do the research, we write the articles, we give them to the publishers, we do the peer review, then we have to buy them back at vast cost.” But knowing it, and being reminded with such passionate force, are different.

And finally, there was a consultant whose name I didn’t catch, a serial interjector of the kind who doesn’t believe he needs a microphone even if the event is being recorded, web cast and on Second Life, who asked something like “how come libraries have been sleeping on the job, allowing organisations like Google to steal the initiative and their territory?” At a facile level, maybe not a bad question, but surely it doesn’t take much analysis to notice the critical differences for this kind of innovation? Unlike libraries, Google has massive capital and the ability to invest it speculatively, makes huge surpluses on its day to day activities, can employ the best and most expensive software engineers (and lawyers), and is willing to accept high risk for high return. By contrast, Universities and their libraries mostly have meagre capital, want to invest as much as possible into their research and teaching, make the least possible surpluses consistent with their funding rules (see previous clause), pay their staff on union rates that are (or used to be) well under the market rate, and are required to be risk-averse (else there would be as many failed universities as failed software companies). As an example, when did you last see a library that was able to give its staff one day a week for their own projects? Much more like 6 days work for the price of 5, from what I see. Via Twitter, Lorcan Dempsey also suggested “Why didn't libraries do it? Libraries are institution-scale; difficult to act at other scales. Google is webscale”.

And no doubt there was more that my failing memory has discarded… but all in all, a stimulating afternoon. Thanks very much for organising it, JISC.

Semantically richer PDF?

PDF is very important for the academic world, being the document format of choice for most journal publishers. Not everyone is happy about that, partly because reading page-oriented PDF documents on screen (especially that expletive-deleted double-column layout) can be a nightmare, but also because PDF documents can be a bit of a semantic desert. Yes, you can include links in modern PDFs, and yes, you can include some document or section metadata. But tagging the human-readable text with machine-readable elements remains difficult.

In XHTML there are various ways to do this, including microformats (see Wikipedia). For example, you can use the hcard microformat to encode identifying contact information about a person (confusingly hcard is based on the vcard standard). However, there are relatively few microformats agreed. For example, last time I checked, development of the hcite microformat for encoding citations appeared to be progressing rather slowly, and still some way from agreement.

The alternative, more general approach seems to be to use RDF; this is potentially much more useful for the wide range of vocabularies needed in scholarly documents. RDFa is a mechanism for including RDF in XHTML documents (W3C, 2008).

RDF has advantages in that it is semantically rich, precise, susceptible to reasoning, but syntax-free (or perhaps, realisable with a range of notations, cf N3 vs RDFa vs RDF/XML). With RDF you can distinguish “He” (the chemical element) from “he” (the pronoun), and associate the former with its standard identifier, chemical properties etc. For the citation example, the CLADDIER project made suggestions and gave examples, for example of encoding a citation in RDF (Matthews, Portwin, Jones, & Lawrence, 2007).
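
To make the “He” example concrete, here is a small Python sketch that builds three RDF statements as plain tuples and writes them out in N-Triples, one of the simpler RDF notations. It deliberately avoids any RDF library. The `http://example.org/...` URIs, the `denotes` and `atomicNumber` properties and the `span42` identifier are all invented for illustration; only the `rdfs:label` property is a real, standard term.

```python
# Hypothetical vocabulary, used purely for illustration.
EX = "http://example.org/chem/"
# Standard RDF Schema property for a human-readable label.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

def uri(value: str) -> str:
    """Format a URI reference in N-Triples syntax."""
    return f"<{value}>"

def literal(value: str) -> str:
    """Format a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def to_ntriples(triples) -> str:
    """Serialise (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

helium = uri(EX + "element/He")
triples = [
    (helium, uri(RDFS_LABEL), literal("helium")),
    (helium, uri(EX + "atomicNumber"), literal("2")),
    # The string "He" at some point in an article refers to the element,
    # not the pronoun.
    (uri("http://example.org/article#span42"), uri(EX + "denotes"), helium),
]
print(to_ntriples(triples))
```

The third statement is the interesting one for scholarly documents: it ties a specific piece of text to an unambiguous identifier, from which the other properties can then be followed.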

PDF can include XMP metadata, which is XML-encoded and based on RDF (Adobe, 2005). Job done? Unfortunately not yet, as far as I can see. XMP applies to metadata at the document or major component level. I don’t think it can easily apply to fine-grained elements of the text in the way I’ve been suggesting (in fact the specification says “In general, XMP is not designed to be used with very fine-grained subcomponents, such as words or characters”). Nevertheless, it does show that Adobe is sympathetic towards RDF.
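
For the curious, document-level XMP is easy to look at. An XMP packet is wrapped between a `<?xpacket begin=...?>` header and a `<?xpacket end=...?>` trailer, so a simple byte scan can often find it without parsing the PDF at all. The Python sketch below does just that; the function name is invented, and it assumes the metadata is stored uncompressed and unencrypted, which I understand to be the usual practice but which is not guaranteed for every file.

```python
def extract_xmp_packets(pdf_bytes: bytes) -> list:
    """Return every XMP packet found by scanning for the xpacket wrapper."""
    packets = []
    start = 0
    while True:
        begin = pdf_bytes.find(b"<?xpacket begin=", start)
        if begin == -1:
            break
        # The trailer processing instruction marks the end of the packet.
        end_marker = pdf_bytes.find(b"<?xpacket end=", begin)
        if end_marker == -1:
            break
        close = pdf_bytes.find(b"?>", end_marker)
        if close == -1:
            break
        packets.append(pdf_bytes[begin:close + 2])
        start = close + 2
    return packets
```

Each packet returned is XML containing RDF, and can be handed to any XML or RDF parser for a closer look.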

Can we add RDF tagging associated with arbitrary strings in a PDF document in any other ways? It looks like the right place would be in PDF annotations; this is where links are encoded, along with other options like text callouts. I wonder if it is possible simply to insert some arbitrary RDF in a text annotation? This could look pretty ugly, but I think annotations can be set as hidden, and there may be an alternate text representation possible. It might be possible to devise an appropriate convention for an RDF annotation, or use the extensions/plugin mechanism that PDF allows. A disadvantage of this is that PDF/A (ISO, 2005), which is important for long-term archiving, disallows extensions to PDF as defined in PDF Reference (Adobe, 2007); in other words, such extensions would not be compatible with long-term archiving. I don’t know whether we could persuade Adobe to add this to a later version of the standard. If something like this became useful and successful, time would be on our side!
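
Purely to illustrate what the idea might look like at the object level, here is a Python sketch that formats a hidden text annotation dictionary with some RDF in its /Contents entry. This is an assumption-laden illustration, not a tested convention: the function names are invented, the RDF snippet is a placeholder, only ASCII text is handled, and a real tool would use a PDF library to add the object to a page's /Annots array rather than build strings by hand. The /F value of 2 is the annotation flag that marks an annotation as hidden.

```python
def pdf_literal_string(text: str) -> str:
    """Wrap ASCII text as a PDF literal string, escaping \\ ( and )."""
    escaped = (
        text.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    return f"({escaped})"

def hidden_text_annotation(rdf_text: str, rect=(0, 0, 0, 0)) -> str:
    """Format a text annotation dictionary carrying RDF in /Contents."""
    x1, y1, x2, y2 = rect
    return (
        "<< /Type /Annot /Subtype /Text "
        f"/Rect [{x1} {y1} {x2} {y2}] "  # area of the page being annotated
        "/F 2 "                           # flag value 2 = Hidden
        f"/Contents {pdf_literal_string(rdf_text)} >>"
    )

# Placeholder RDF, in a Turtle-like shorthand, attached to a region of a page.
print(hidden_text_annotation("<#He> a <#Element> .", rect=(100, 700, 120, 720)))
```

Whether readers and text mining tools would agree to look for RDF in such a place is, of course, the real question.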

What RDF syntax or notation should be used? To be honest, I have no idea; I would assume that something compatible with what’s used in XMP would be appropriate; at least the tools that create the PDF should be capable of handling it. However, this is less help in deciding than one might expect, as the XMP specification says “Any valid RDF shorthand may be used”. Nevertheless, in XMP RDF is embedded in XML, which would make both RDF/XML and RDFa possibilities.

So, we have a potential place to encode RDF, now we need a way to get it into the PDF, and then ways to process it when the PDF is read by tools rather than humans (ie text mining tools). In Chemistry, there are beginning to be options for the encoding. We assume that people do NOT author in PDF; they write using Word or OpenOffice (or perhaps LaTeX, but that’s another story).

Of relevance here is the ICE-TheOREM work between Peter Murray-Rust’s group at Cambridge, and Pete Sefton’s group at USQ; this approach is based on either MS Word or OpenOffice for the authors (of theses, in that particular project), and produces XHTML or PDF, so it looks like a good place to start. Peter MR is also beginning to talk about the Chem4Word project they have had with Microsoft, “an Add-In for Word2007 which provides semantic and ontological authoring for chemistry”. And the ChemSpider folk have ChemMantis, a “document markup system for Chemistry-related documents”. In each of these cases, the authors must have some method of indicating their semantic intentions, but in each case, that is the point of the tools. So there’s at least one field where some base semantic generation tools exist that could be extended.

PDFBox seems to be a common tool for processing PDFs once created; I know too little about it to know if it could easily be extended to handle RDF embedded in this way.

So I have two questions. First, is this bonkers? I’ve had some wrong ideas in this area before (eg I thought for a while that Tagged PDF might be a way to achieve this). My second question is: anyone interested in a rapid innovation project under the current JISC call, to prototype RDF in PDF files via annotations?

References:

Adobe. (2005). XMP Specification. San Jose.
Adobe. (2007). PDF Reference and related Documentation.
ISO. (2005). ISO 19005-1:2005 Document management — Electronic document file format for long-term preservation — Part 1: Use of PDF 1.4 (PDF/A-1).
Matthews, B., Portwin, K., Jones, C., & Lawrence, B. (2007). CLADDIER Project Report III: Recommendations for Data/Publication Linkage. STFC, Rutherford Appleton Laboratory.
W3C. (2008). RDFa Primer: Bridging the Human and Data Webs. Retrieved 6 April, 2009, from http://www.w3.org/TR/xhtml-rdfa-primer/

Wednesday, 1 April 2009

An update on my data recovery efforts

You may remember that after our Christmas party late last year, I wrote a blog post offering to have a go at recovering some old files, if anyone was interested. A half dozen or so people got in touch, one with 30 or so old Mac disks, someone with a LaTeX version of a thesis on an old Mac disk, a colleague (who started all this, really) with a dissertation on an old Amiga disk, and someone with a CD from an Acorn RISC PC, plus a few others.

I frightened some off by giving them a little disclaimer to agree to, but others persisted and sent me their media. So, problem number 1: how to read old Amiga and Apple floppy disks? Both of these have a different structure from PC-compatible floppies; for example, the old Mac disks store more information on the outer tracks than the inner ones, thus packing more data onto the disk.
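As a worked example of that outer-track packing, here is a small Python sketch. It assumes the usual geometry of a double-sided 800K Mac floppy: 80 tracks per side in five zones of 16 tracks, with 12, 11, 10, 9 and 8 sectors per track from the outside in, and 512-byte sectors. The names are mine, not from any tool.

```python
# Assumed geometry of a double-sided 800K Macintosh floppy.
SECTOR_BYTES = 512
SIDES = 2
TRACKS_PER_ZONE = 16
# Sectors per track in each zone, outermost zone first.
SECTORS_PER_TRACK_BY_ZONE = [12, 11, 10, 9, 8]


def mac_800k_total_sectors():
    """Total sectors on the disk under the assumed zoned geometry."""
    per_side = sum(TRACKS_PER_ZONE * s for s in SECTORS_PER_TRACK_BY_ZONE)
    return per_side * SIDES


if __name__ == "__main__":
    total = mac_800k_total_sectors()
    print(total, "sectors,", total * SECTOR_BYTES, "bytes")
    # prints "1600 sectors, 819200 bytes"
```

If that geometry is right for these disks, the total agrees with the 1600-sector count reported for the LaTeX disk further down.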

The answer seemed to be a special controller for a Windows computer, that links to a standard floppy drive. The controller is called the Catweasel Mk4, from a German company, Individual Computers. We ordered one and it arrived quite quickly, well before I had managed to borrow a Windows box to experiment with. The card didn't physically fit in the first system, and then there was a long wait while we found a spare monitor (progress was VERY slow when I had to disconnect and re-use my Mac monitor). Then bid-writing intervened.

Eventually we got back to it, but had lots of problems configuring the controller properly; the company was quite good at providing me with advice. A couple of days ago I finally got the config file right.

I wasn't using any of the contributed disks for testing, but instead some old DOS disks that I had from my days in Dundee (1992-4). Early on we did manage to read these, with rather a lot of errors, but lately we have got zero good sectors off these disks. I'm still not sure why; I'm inclined to blame my attempt to read one of the disks using the Windows commands on the same drive (there's a pass-through mode); I never managed to get any DOS disks to read a single track after that!

Well, today I stuck in an old Mac disk I had... and lo and behold, it was reading with a fair proportion of good sectors. So, let's try the Amiga disk: 100% good (well, maybe one bad sector). And the Mac disk with the LaTeX file on it: pretty good, 1481 good sectors out of 1600.

The problems aren't over yet. All you get from the ImageTool3 program that works through this controller is a disk image. So now I'm looking for a Windows (or Mac) utility to mount an Amiga file system, so we can copy the files out of it. And ditto for the Mac file system (written circa 1990 I think; I'm assuming it's HFS, but don't know).
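Before hunting for a mounting utility, it may be worth checking what the image claims to be. The sketch below is a rough Python check, assuming a raw sector-by-sector image: an Amiga disk normally starts with the bytes "DOS" followed by a flags byte, and a Mac volume carries a two-byte signature at offset 1024 ("BD" for HFS, 0xD2D7 for the older MFS). The function name `sniff_disk_image` is hypothetical, and a damaged boot sector could easily defeat it.

```python
import struct


def sniff_disk_image(data):
    """Guess the file system of a raw floppy image from well-known signatures.

    Hypothetical helper; assumes the image is a plain dump of the sectors.
    """
    # AmigaDOS boot block: "DOS" then a flags byte (bit 0 set means FFS).
    if len(data) >= 4 and data[:3] == b"DOS":
        return "Amiga FFS" if data[3] & 1 else "Amiga OFS"

    # Mac volume information starts at byte 1024 with a big-endian signature.
    if len(data) >= 1026:
        (signature,) = struct.unpack(">H", data[1024:1026])
        if signature == 0x4244:  # ASCII "BD"
            return "Mac HFS"
        if signature == 0xD2D7:
            return "Mac MFS"

    return "unknown"


if __name__ == "__main__":
    import sys

    # Usage: python sniff.py disk.img
    with open(sys.argv[1], "rb") as handle:
        print(sniff_disk_image(handle.read(2048)))
```

An "unknown" result proves nothing on a disk with bad sectors, but an "HFS" or "MFS" answer would at least settle my guess about the Mac disk.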

At that point, depending on the amount of corruption, the Mac job should be pretty much done; my contributor still understands LaTeX and can probably sort out his old macros. For the Amiga files, there will be at least one further stage: identifying the file formats (CV.asc is presumably straight ASCII text, but the dissertation may be in a desktop publishing file format), and then finding a utility to read them.
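For the "presumably straight ASCII" guess, a crude test is to count how many bytes are printable ASCII or common whitespace. This is a minimal Python sketch; the 95% threshold and the name `looks_like_plain_ascii` are arbitrary assumptions, and it says nothing about what a non-text file actually is.

```python
def looks_like_plain_ascii(data, threshold=0.95):
    """Return True if most bytes are printable ASCII, tab, LF or CR.

    Hypothetical heuristic; the threshold is an arbitrary choice.
    """
    if not data:
        return False
    allowed = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    printable = sum(1 for byte in data if byte in allowed)
    return printable / len(data) >= threshold


if __name__ == "__main__":
    import sys

    # Usage: python asciicheck.py CV.asc
    with open(sys.argv[1], "rb") as handle:
        print(looks_like_plain_ascii(handle.read()))
```

A file that fails this test, as a desktop publishing file probably would, still needs proper format identification.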

It's been slow but interesting, and I've been quite despondent at times (checking out the data recovery companies, that sort of thing). But now I'm quite excited!

These two contributors have got their disk images back, and may have further ideas and clues. But can you help with any advice?
