
Monday, 6 April 2009

Libraries of the Future: SourceForge as Repository?

In his talk (which he pre-announced on his resumed blog), Peter Murray-Rust (PMR) suggested (as he has done previously) that we might like to think of SourceForge as an alternative model for a scholarly repository (“Sourceforge. A true repository where I store all my code, versioned, preserved, sharable”). I’ve previously put forward a related suggestion, stimulated by PMR’s earlier remarks, on the JISC-Repositories email list, where it got a bit of consideration. But this time the reaction, especially via the Twitter #lotf09 feed, was quite a bit stronger. Reactions really fell into two groups. The first was something like: SourceForge sets our sights way too low; it’s ugly, cluttered with adverts, and SLOOOOOWWW (I can’t find the actual quote in my Twitter search; maybe it was on Second Life). The second was that the use cases are different, eg @lescarr: “Repositories will look like sourceforge when researchers look like developers” (it might be worth noting that PMR is a developer as well as a researcher).

Through my JISC-Repositories explorations, I was already beginning to be convinced by the “different use-case” argument, although it was not articulated that way. SourceForge acts as a platform providing a range of services and utilities supporting the development of software by disparate teams of developers. Is there a parallel use case in the repositories area?

The closest I could come was some of my suggestions from my Research Repository System posts, for moving the repository upstream in the researcher’s workflow, and specifically helping authors write multi-author articles. This is a reasonable parallel in several ways, but there are some significant differences.

Software source files are simple text files, completely independent of whatever system is used to edit them. The source of a software system comprises many individual modules, each of which usually lives in a separate file and can be edited independently. There are strong file-naming conventions, and complex “recipe” systems (eg make files) for converting the sources into executable programs, combining them with software libraries and system-wide variables. If one developer changes one module, the system “knows” how to rebuild itself.

Journal articles, by contrast, are mostly written using proprietary (or sometimes open source) office systems which create and edit highly complex files whose formats are closely tied to the application. Apart from some graphic and tabular elements, an article will usually be a single file, and if not, there is no convention for how it might be composed. Workflow for multi-author articles is entirely dependent on the lead author, who will usually work with colleagues to assign sections for writing, gather the contributions, assemble and edit them into a coherent story, and circulate the draft repeatedly for comments and suggested improvements (sometimes using tracked-changes features).

Using LaTeX and include files, it might be possible to re-create a similar approach to that used in software, but for the human reader, the absence of a single editorial “voice” would be very apparent. And anyway, most of us just don’t want to work that way. I got out of the habit of writing in LaTeX around 20 years ago, when word processors began to arrive on workable desktop computers (Word on the early Macs), and I’d prefer not to go back.
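
(To make the comparison concrete, here is a minimal sketch of the sort of thing I mean. Everything in it is invented for illustration; it assumes each co-author keeps their own introduction.tex, methods.tex and so on alongside this master file.)

    % article.tex: the lead author's master file (all file and section names invented)
    \documentclass{article}
    \title{Our Multi-Author Article}
    \author{Lead Author \and Co-Author One \and Co-Author Two}
    \begin{document}
    \maketitle
    % Each \include pulls in a separate file that one co-author "owns",
    % rather as a make file pulls together separately edited modules.
    % (\input would do the same without forcing a page break between parts.)
    \include{introduction}
    \include{methods}
    \include{results}
    \include{discussion}
    \end{document}

Each co-author could then keep their own file under version control, and rebuilding the article would stitch the latest pieces together, much as make rebuilds a program; but, as I say, the seams would still show to a human reader.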

Oh, and while it’s a “category mistake” to confuse SourceForge with any particular version control system, it looks to me (and to at least one respondent on JISC-Repositories) as if distributed systems like git are fitter for purpose (in any kind of distributed authoring system) than their more centralised counterparts like CVS, SVN etc.

Well, is PMR’s suggestion of SourceForge as a model for scholarly repositories entirely a dead duck? I think it probably is, but I do hope that those who design repository systems, and indeed virtual research environments, think hard about the capabilities of such systems when working through their responses to user requirements, particularly where those requirements relate to what researchers and authors need, rather than what their institutions, libraries or readers need.

Libraries of the Future event

I attended the JISC Libraries of the Future event last Thursday, and in the end quite enjoyed it. I was glad it started late, as I managed to get myself lost on the way from the station (RTFM, Chris!). Maybe I was too hot and bothered to pay the right kind of attention, but I’m afraid I found most of the presentations rather too bland for my taste. I enjoyed the Twitter back-chat (hashtag #lotf09, if you’re interested), but found that participating in it reduced, or perhaps changed, my concentration on what was being said (you have to listen hard, but you also have to type your tweet, and my concentration dropped off while doing the latter). This was the first time I have seen the Twitter channel (plus Second Life chat) projected near-live above the speakers; it perhaps helped make Twits like me better behaved.

Even Peter Murray-Rust’s speech (see his blog where he wrote what he would say), which had plenty of passion, was less strong on content than I was hoping. But 15 minutes is pretty limiting. I’ll comment on one aspect of Peter’s remarks (SourceForge as exemplar for repositories) separately.

Probably the best moments of the day came in Citizen Darnton of Harvard’s responses to audience questions (Wikipedia); Darnton had spoken on the citizen’s perspective. At the risk of mis-quoting him a bit, he seemed to be suggesting that librarians might exercise their market power by cancelling all expensive journals; apparently the Max Planck Institutes had done just that. Well, that would certainly undo the myth that libraries have been doing nothing for scientists! (CR: Perhaps a librarian might allocate, say, 2/3 of the original licence fee for paid access to articles on a first-come, first-served basis, thus making a real budget saving, proving whether or not scientists really used the journals, and providing incentives to use the OA versions! OK, the politics are a bit tough…)

Questioned after the Google man had explained how excellent it was that Google should digitise books and make them available, Citizen Darnton again made some interesting comments, this time hinting at the monopoly risks inherent in the Google Book Settlement. I won’t attempt to summarise his remarks as I’d be sure to get them wrong, but he certainly had nightmares thinking of the possible down-sides of that settlement, and the effective monopoly concentration of copyright that may ensue: the loss of control of the world’s knowledge from libraries to Google. Here are some tweets, in reverse order (I am a bit unsure of the ethics here):

• tomroper: #lotf09 monopoly. Contradiction between private property and public good. Google's mission (organise world's knowledge)
• cardcc: Comment on serious problem of Google Book Settlement: sounds bad; effectively going to Google control. #lotf09
• scilib: seems to be a common theme where libraries/academics surrender copyright to tech/powerful orgs... and then regret it #lotf09
• Gondul: Hand grenade question. We (libraries) are concerned about Google, yet Google unaware of us. #lotf09
• bethanar: #lotf09 Robert Darnton 'google is obviously a monopoly ... no other company can mount a rival digitisation project'
• tomroper: Robert: Google greatest ever seen; but orphan books, 1923-64 publications; revenue to go to authors and publishers #lotf09
• cardcc: Q: Control of world's knowledge moving out of control of libraries, should we be worried? Harvard man has nightmares on Google #lotf09

Two other things stick in my memory. First, Chris Gutteridge (main author of the ePrints.org repository software platform) made a passionate interjection (or two) on the iniquities of the current scholarly communications system. It was largely what most people there already knew: “We do the research, we write the articles, we give them to the publishers, we do the peer review, then we have to buy them back at vast cost.” But knowing it, and being reminded with such passionate force, are different.

And finally, there was a consultant whose name I didn’t catch, a serial interjector of the kind who doesn’t believe he needs a microphone even if the event is being recorded, webcast and relayed to Second Life, who asked something like “how come libraries have been sleeping on the job, allowing organisations like Google to steal the initiative and their territory?” At a facile level, maybe not a bad question, but surely it doesn’t take much analysis to notice the critical differences for this kind of innovation. Unlike libraries, Google has massive capital, the ability to invest it speculatively, huge surpluses from its day-to-day activities, the best and most expensive software engineers (and lawyers), and a willingness to accept high risk for high return. By contrast, universities and their libraries mostly have meagre capital, want to invest as much as possible in their research and teaching, make the smallest surpluses consistent with their funding rules (see previous clause), pay their staff at union rates (which used to be well under the market rate), and are required to be risk-averse (else there would be as many failed universities as failed software companies). As an example, when did you last see a library that was able to give its staff one day a week for their own projects? Much more like six days’ work for the price of five, from what I see. Via Twitter, Lorcan Dempsey also suggested “Why didn't libraries do it? Libraries are institution-scale; difficult to act at other scales. Google is webscale”.

And no doubt there was more that my failing memory has discarded… but all in all, a stimulating afternoon. Thanks very much for organising it, JISC.