Wednesday, 13 January 2010

Scholarly HTML would be nice, but...

I'm quite interested in the idea of Scholarly HTML, as espoused in Pete Sefton's blog, and I've previously commented on some of Peter Murray-Rust's hamburger PDF comments (although I do think a lot of people confuse wild PDF with well-made PDF, or should one say Scholarly PDF). I've always been slightly worried by one thing, though.

A well-known advantage of PDF is that it pretty much assures I can save a document, share it, move it around etc and it will still be intact and readable. That's one of the reasons it's so popular.

Mostly we don't do that with HTML. Mostly we just point to it. But if I see an article these days, I want it on my computer if I'm allowed; this lets me study it at leisure, drop it in my Mendeley system, etc. As pointed out, that works a treat with PDF, and pretty well with Word or OpenOffice documents as well. This applies even where the document is quite heavily compound, with many embedded images, tables etc.

But if I try saving an HTML document to my hard disk, nothing very standard happens. OK, if I use Safari on my Mac, I get a .webarchive file, which is quite nice as I can do all the things with it that I could do with a PDF, a Word document etc, and when I open it later it will be as it was before, with all the images in place. But neither IE nor Firefox seems capable of opening a .webarchive file.

If I try saving the same article from Firefox, I get a .html file with the main article in it, and a directory with associated files in it (eg images). Safari does seem capable of opening this combination, but it's pretty ugly, and hard to move around. I haven't tried IE as I don't have easy access to it.

Is there in existence or development a standard approach to packaging the HTML and associated files that would be as convenient as the .webarchive, but usable across all browsers? If so, Scholarly HTML would be that little bit closer!


  1. Thanks for the link Chris, pity my website is down due to a major failure at my service provider :-(

    I will follow up with a full post, but here are some quick points.

    Note that in our work we promote and enable Scholarly HTML but we also provide PDF. The online HTML version can be much richer, though, once we start sorting out approaches to data visualization and interactivity.

    You are right. This save-as-html thing has been a huge problem for years. OAI-PMH could be used as the technical basis for a solution but I can't see that happening in the mainstream browsers so we are left with solutions that work around the fringes for now.

    I think the best bet for the future might be the HTML5 'save as web app' approach which I have heard works on iPhones and in some other browsers, but not apparently any of the ones I use.

    Right now Zotero works pretty well for grabbing HTML - it saves articles for offline use, and you can sync them between machines, although not yet to phones and the like.

    Another approach that would work if we had the infrastructure would be to use a single-document Atom + an offline reader, an approach I first heard suggested by Tony Hirst for extracting 'pure' content out of WordPress; this obviously depends on repository support.

    Finally (and this doesn't solve your problem) we have been working on ways to package HTML (and PDF, and data etc) for repository deposit. We are about to unveil a new system that uses IMS content packages which are Zip files with a manifest and an 'organizer' for navigation. The idea is to deposit a package which has the same content in multiple formats, then use a repository plugin to display the HTML version embedded in the item page - you would still not be able to save the HTML for local use, but you could grab the Zip file.

  2. Thanks Peter. I hadn't heard of the HTML5 "save as web app" approach; I'll go look it up.

    I don't use Zotero much, but have settled on Mendeley, which wants to see PDF, DOC or ODF. For some web-based documents I've resorted to "Print as PDF" on the Mac, which seems to work but feels a bit wrong.

    We have a related problem with our own preservation strategy. We want to use the University's repository for preservation copies, but many of our documents exist only as web pages (well, there might be original source documents, but they don't necessarily reflect all the editing that's gone on since).

    Re your packaging solution... why IMS? Why not OAI-ORE? Or even bagit?

  3. Hi Chris. I was going to suggest printing web pages as PDFs, but see you beat me to it! I wonder why you think it feels wrong to do this, but not, presumably, for a Word document? If a proper print CSS file is used for the web page (admittedly this is often not the case) then you can have a nice PDF document without the web page clobber.

  4. @paulmilne the context in the HTML -> PDF comment was that Mendeley won't accept HTML, so I have to go to PDF. The feeling wrong is because HTML really does have capabilities to include more functionality than PDF does (at the moment); in particular to include RDFa and participate in the linked data revolution. In practice of course, not many people (yet) do this even for HTML, and it wouldn't take much for PDF to contain RDFa as well.

  5. Chris, just wanted to update you. Mendeley should have the HTML page snapshot feature ready soon.

    It's a good point about PDFs not being full participants in the RDFa/linked data world. It's the same issue with video - you can't easily link to parts of the file like you can with HTML.

  6. Thanks for bringing up the problem of saving HTML, Chris. Personally, I'm not very happy with the solution of printing HTML as PDF--not simply because of the inevitable loss in formatting, but also because of the loss in the capacity to network documents together automatically.

    I mentioned on Peter Sefton's blog one project that exploits HTML's inherent connectivity, ThoughtMesh. This open-source academic publishing tool makes it easy for authors to download their HTML in a standard format.

    More importantly, I think, a downloaded ThoughtMesh document maintains the same connectivity as its online equivalent. ThoughtMesh uses automatically generated keywords to find related essays across the Web, and this feature works even when you load an essay directly from your hard drive into your browser. All you need is an Internet connection--no server is involved. This makes it easy, for example, to present research in progress that hasn't yet been published live on the Web.

    PDF does use JavaScript behind the scenes, but PDF's closed nature has deterred developers from exploiting its scriptability to make it more sociable.

  7. The situation with regard to the convenient saving of Web pages is far from ideal at the moment. The Firefox solution of 'HTML file plus folder of dependencies' is the most widely supported: the saved file should open in any browser, and most browsers can save in this manner.

    There are a number of competing single-file formats for Web pages, but the ones I know of are only supported by one or two browsers. You have already mentioned Safari's .webarchive format, which has good support on Mac OS X but only limited support elsewhere. The main alternative is MIME HTML (RFC 2557), which borrows the attachment mechanism from e-mail; this is fully supported only by IE and Opera, although extensions are available or in progress for other browsers. Konqueror's .war format is essentially the 'HTML page plus dependencies' idea converted to a tarball, so once decompressed it can be read in other browsers.

    Possibly the widest support would be for HTML pages that incorporate their dependencies as data URIs (although external CSS and JavaScript might as well be converted to inline style and script blocks), were it not for the fact that no browser I know of supports saving Web pages in this way. The major obstacle to this is that IE 8 sets a very low maximum size for data URIs.
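
To make the data URI idea from the last comment a little more concrete, here is a minimal sketch in Python of how a page saved as 'HTML file plus folder of dependencies' might be folded back into a single file. It is only an illustration under stated assumptions: the function names `to_data_uri` and `inline_images` are hypothetical, only `<img>` tags with local `src` paths are handled, and the regular expression is a simplification, not a real HTML parser. CSS, JavaScript and URL-encoded file names are ignored.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches the src attribute of an <img> tag, quoted with " or '.
# Deliberately simple; a real tool would use an HTML parser instead.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=)(["\'])(.*?)\2', re.IGNORECASE)


def to_data_uri(path: Path) -> str:
    """Encode a local file as a base64 data URI (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(path.name)
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{payload}"


def inline_images(html: str, base_dir: Path) -> str:
    """Replace local <img> sources with data URIs; leave everything else alone."""

    def replace(match: re.Match) -> str:
        prefix, quote, src = match.groups()
        target = base_dir / src
        # Skip remote or already-inlined images, and files we cannot find.
        if src.startswith(("data:", "http:", "https:", "//")) or not target.is_file():
            return match.group(0)
        return f"{prefix}{quote}{to_data_uri(target)}{quote}"

    return IMG_SRC.sub(replace, html)
```

Assuming the saved page is `article.html` with its images alongside it, usage would be something like `inline_images(Path("article.html").read_text(), Path("."))`. The result is a single file that can be moved around like a PDF, though the size limit on data URIs in IE 8 mentioned above would still apply to larger images.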


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.