In this article, I consider the practical issues with archiving scientific material placed on the web: I describe the motivation for doing so, give some background, and consider the various mechanisms available.

As part of our work on knowledgeblog (http://knowledgeblog.org), we have been investigating ways of injecting the formal technical aspects of the scientific publication process into this form of publication. The reasons for this are myriad: if scientists can control the form, they can innovate in their presentation however they choose; the publication process itself becomes very simple and straightforward (as opposed to the authoring, which is as hard as it ever was). Finally, it means that scientists can publish as they go, as I have done and am doing with my work on Tawny-OWL. This latter point has many potential implications: firstly, it makes science much more interactive — scientists can publish things that they are not yet clear on, early ideas, and can (and I do) get feedback on these early; secondly, it should help to overcome publication bias, as it is much lighter-weight than the current publication process. Scientists are more likely to publish negative results if the process is easy and not expensive. And, lastly, it can help to establish provenance for the work; if every scientist published in this way, scientific fraud would be much harder, as a fraudulent scientist would have to produce a coherent, faked set of data from the early days of the work.

However, to achieve this, posts must still be available. The scientific record needs to be maintained. Now, this should not be an issue. I write this blog in Asciidoc (http://process.knowledgeblog.org/167), and rarely use images, so the source is quite small. In fact, since I moved to WordPress in 2009 (http://www.russet.org.uk/blog/1175), it totals about 725k; so it would fit on a floppy, which is a crushing blow to my ego. So, how easy is it to archive your content?

The difficulty here is that there is no obvious person to do this. Like staff at many universities, I have access to an eprints archive. Unfortunately, this is mainly used for the REF, and has no programmatic interface. The university also has a LOCKSS box. However, this is not generally available for work that University staff have produced, but only for journals that the University has bought; so, to benefit from it, I would have to give my work away to a paywall publisher, or pay handsomely to an open access publisher.

Another possibility would be to use Figshare. Now, I have some qualms about Figshare anyway; it appears to be a walled garden, the Facebook of science. Others, however, do not worry about this and are using Figshare. Carl Boettiger (http://www.carlboettiger.info/), for instance, archives his notebook on Figshare. But there is a problem: consider the 2012 archive; it is a tarball, with Markdown files inside; I know what to do with this, but many people will not. And it is only weakly linked to the original publication. Titus Brown has had the same idea (http://ivory.idyll.org/blog/posting-blog-entries-to-figshare.html), claiming the added value of DOIs, something I find dubious (http://www.russet.org.uk/blog/1849). Again, though, there is the same problem: Figshare archives the source. The most extreme example of this comes from Karthik Ram, who has published a Git repository; unsurprisingly, it is impossible to interact with as a repo.

Figshare likes to make great play of the fact that it is backed by CLOCKSS — this is a set of distributed copies maintained by some research libraries. Now, it might seem sensible for CLOCKSS to offer this service (at a price, of course) to researchers. Perhaps they do. But the website reveals nothing about this. And although I tried, they did not respond to emails either. Rather like DOIs, the infrastructure is built around scale; in short, you need a publisher or some other institution involved. All very well, but this contradicts the desire for a light-weight publication mechanism. There is a second problem with CLOCKSS: it is a dark archive, that is, its content only becomes available to the public after a “trigger event”: the publisher going bust, the website going down, and so on. Now, data which is on the web and, critically, archived by someone other than the author essentially becomes non-repudiable and time-stamped. I can prove (to a lower bound) when I said something. And you can prove that I said something, even if I wish I hadn’t. In a strict sense, this is true if the data is in CLOCKSS; but in a practical sense, it is not, as checking when and what I said becomes too much of a burden to be useful.

So, we move on to web archiving. The idea of web archiving is attractive to me for one main reason: it is not designed for science. It is a general-purpose, commodity solution, rather like a blog engine. If there is one thing scientific publication needs more than anything, it is to move its technology base away from the bespoke and toward the commodity.

One of the most straightforward solutions for web archiving is WebCite; the critical advantage it has is that it provides an on-demand service. I have been using it for a while to archive this site; greycite (http://greycite.knowledgeblog.org) now routinely submits new items here, if we can extract enough metadata from them. The archiving is rapid and effective. The fly in the ointment is that WebCite has funding issues and is threatened with closure at the end of 2013. The irony is that it claims it needs just $25,000 to continue. Set against the millions put aside for APCs (http://www.rcuk.ac.uk/media/news/2012news/Pages/121108.aspx), or the thousands NPG claims is necessary to publish a single paper (http://www.guardian.co.uk/science/2012/jun/08/open-access-research-inevitable-nature-editor), or the millions that the ACM spends supporting its digital library (http://www.russet.org.uk/blog/1924), this is small beer, and it shows the lack of seriousness with which we take web archiving. I hope it survives; if it does, Gunther Eysenbach, who runs it, tells me that they plan to expand the services they offer. It may yet become the archiving option of choice.
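
For the record, on-demand submission is simple enough to script. The sketch below shows roughly what an automated submission looks like; note that the WebCite endpoint and its url/email parameters are quoted from memory of its API notes, so treat them as assumptions rather than tested calls, and the helper name is mine.

    # A minimal sketch of on-demand archiving. The WebCite endpoint and its
    # parameter names are assumptions based on its published API notes, not
    # tested calls; substitute your own email address.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def submit_to_webcite(url, email):
        """Ask WebCite to take a snapshot of `url`; return the raw response."""
        query = urlencode({"url": url, "email": email})
        with urlopen("http://www.webcitation.org/archive?" + query) as response:
            return response.read().decode("utf-8", errors="replace")

    print(submit_to_webcite("http://www.russet.org.uk/blog/", "you@example.org"))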

I have been able to find no on-demand alternative to WebCite. However, there are several other archives available. I have been using the UK Web Archive for a while now. I first heard about this service, irony of ironies, on the radio. Since I first used it to archive knowledgeblog, and later this site, the process has got a lot easier. No longer do I need to send signed, physical copyright permission; that later became electronic (email, I think). It now appears that the law is changing to allow them to archive more widely (the BBC covered this in a story, categorized under “entertainment and arts”, which was largely focused on Stephen Fry’s tweets), although this will be a dark archive. Currently, this journal has been archived only once; from my other sites, it appears that they have a six-month cycle. So, while this provides good digital preservation, it is a less good solution from the perspective of non-repudiability; there is a significant gap before the archive happens, and a slightly longer one until the archive is published.

The UKWA is, as the name suggests, UK-specific. Another solution is, of course, to use archive.org, which might be considered the elephant in the room for web archiving. Unlike the UKWA, they don’t take submissions, but just crawl the web (although I suspect that the UKWA will now start doing this also). Getting onto a crawl can, therefore, be rather hit-and-miss. Frustratingly, they do have an “upload my data” service, which you can access through a logged-in account, but not an “archive my URL” service. Again, a very effective resource from a digital preservation perspective, but with similar problems to the UKWA from the point of view of non-repudiability. The archives take time to appear; in my experience, somewhat longer than with the UKWA. I have also contacted their commercial wing, http://archive-it.org. Their software and the crawls that they offer could easily be configured to do the job, but unfortunately they are currently aimed very much at the institutional level: their smallest package provides around 100GB of storage; this blog can be archived in around 130MB (this is without deduplication, which would save a lot); even a fairly prolific blogger comes in at around 250MB. The price, unfortunately, reflects this, although, again, it is on a par with my yearly publication costs, so is well within an average research budget.
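
In the absence of an “archive my URL” service, the best you can do programmatically is check what archive.org already holds. The sketch below uses the Wayback Machine’s public availability API; the helper name and example URL are mine, while the endpoint and JSON shape are archive.org’s.

    # Check whether archive.org already holds a snapshot of a page, via the
    # Wayback Machine's availability API.
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def closest_snapshot(url):
        """Return (snapshot_url, timestamp) for the closest archived copy, or None."""
        query = urlencode({"url": url})
        with urlopen("https://archive.org/wayback/available?" + query) as response:
            data = json.loads(response.read().decode("utf-8"))
        snapshot = data.get("archived_snapshots", {}).get("closest")
        if snapshot and snapshot.get("available"):
            return snapshot["url"], snapshot["timestamp"]
        return None

    print(closest_snapshot("http://www.russet.org.uk/blog/"))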

Of course, these solutions are not exclusive; with greycite we have started to add tools to support these options. For instance, kblog-metadata (http://knowledgeblog.org/kblog-metadata) now supports an “archives” widget, which is in use on this page; this links directly through to all the archives we know about. For individual pages, these are deep links, so you can see archived versions of each article straightforwardly. The data comes from greycite, which discovers archive locations by probing; we may later move to using Mementos. greycite itself archives metadata about webpages, so we link to this also. As a side effect, this also means that each article is submitted to greycite, which in turn causes archiving of the page through WebCite. Likewise, archive locations are returned within the BibTeX downloads, which is useful for those referencing sites.
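
To give a flavour of what “using Mementos” would mean: the Memento protocol exposes a TimeMap listing every known archived copy of a URL. The sketch below queries the Internet Archive’s TimeMap endpoint; the rough link-format parsing and the function name are mine, and greycite’s own probing is not shown.

    # List known archived copies ("mementos") of a page from the Internet
    # Archive's Memento TimeMap. Parsing is deliberately rough; a real client
    # would use a proper link-format parser.
    import re
    from urllib.request import urlopen

    TIMEMAP = "http://web.archive.org/web/timemap/link/"

    def list_mementos(url):
        """Yield (memento_url, datetime) pairs for archived copies of `url`."""
        with urlopen(TIMEMAP + url) as response:
            timemap = response.read().decode("utf-8", errors="replace")
        for line in timemap.splitlines():
            if re.search(r'rel="[^"]*memento[^"]*"', line):
                link = re.search(r"<([^>]+)>", line)
                when = re.search(r'datetime="([^"]+)"', line)
                if link:
                    yield link.group(1), when.group(1) if when else None

    for memento, when in list_mementos("http://www.russet.org.uk/blog/"):
        print(when, memento)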

Finally, greycite now generates pURLs — these are two-step resolution URLs which work rather like DOIs (or, actually, DOIs operate like pURLs, since, as far as I am aware, pURLs predate the web infrastructure for DOIs). These resolve directly to the website in question. With a little support, greycite can track content as and if it moves around the web; even if this fails, and an article disappears, greycite will redirect to the nearest web archive.
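
Mechanically, two-step resolution is nothing more than an HTTP redirect: the pURL is a stable address whose response carries a Location header pointing at the content’s current home. A minimal sketch, assuming a hypothetical pURL path (a real one would be minted at http://www.purl.org):

    # One hop of two-step resolution: ask the resolver where the content
    # currently lives, without following the redirect. The pURL path below is
    # hypothetical, not one that greycite has actually minted.
    from http.client import HTTPConnection

    def resolve_once(host, path):
        """Return (status, Location header) for a single HEAD request."""
        connection = HTTPConnection(host)
        connection.request("HEAD", path)
        response = connection.getresponse()
        return response.status, response.getheader("Location")

    print(resolve_once("purl.org", "/net/example/my-article"))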

In summary, there is no perfect solution available at the moment, but there are many options; in many cases, archiving will happen somewhat magically. As we have found with many other aspects of author self-publishing on the web, it is possible to architecturally replicate many of the guarantees provided by the scientific publication industry through the simple use of web technology. Tools like greycite and kblog-metadata are useful in uncovering the archives that are already there, and in linking these together with pURLs. Taken together, these give me a reasonable degree of confidence that this material will be available in 10 or 50 years’ time. Whether anyone will still be reading it, well, that is a different issue entirely.

3 Comments

  1. Carl says:

    Phil, thanks for this, it’s a great piece. If you haven’t seen it, you might enjoy this interview with Geoffrey Bilder, from CrossRef: http://blogs.plos.org/mfenner/2009/02/17/interview_with_geoffrey_bilder/

    I think Geoff does a good job pointing out the social contract dimension of successful archiving, and I’d be quite curious what you think about his arguments.

    Thanks for the mention and the feedback about archiving my notebook — it’s something I’ve asked myself. I went with markdown files since they are plain text (I could have given them a .txt extension instead, I suppose), and they seemed the simplest to read even if you had no background or rendering software. I have recently been thinking that I should be archiving my HTML directly instead, not least because it is a more standard format and contains metadata not found in the markdown. On the downside, archiving a whole web directory with its CSS and JS, etc, seems cumbersome. Any thoughts on which is the lesser evil?

    While the figshare online display leaves something to be desired in archiving html files, and more to be desired in archiving git repos (e.g. vs Github, where interaction is easy) it seems those are UI problems more than archival problems. I certainly see the advantage of being able to download and interact with Karthik’s git repo if I wanted to explore the history, even if it would be preferable to use an online-based exploration. (Minor comment, you missed the citation in your link to Karthik’s archive.)

    Anyway, very good points and I agree it feels like we have no perfect solutions at this time. I see having copies of my notebook on Github and figshare more as hedging my bets than an ultimate solution. Ensuring the existence of a copy is hard enough, but I think ensuring it has a persistent identifier is even harder (which gets back to Geoffrey’s arguments I guess).

  2. Phillip Lord says:

    Carl

    I’m going to reply briefly here, as I think a full reply requires more space.

    “I think Geoff does a good job pointing out the social contract dimension of successful archiving”

    In this, Geoff is entirely correct. With Greycite, for instance, we cannot currently offer any form of service guarantee. As Geoff points out, the technological base of CrossRef DOIs is nothing special.

    “I have recently been thinking that I should be archiving my HTML directly instead, not least because it is a more standard format and contains metadata not found in the markdown.”

    In 100 years’ time, if anything of us survives at all, it will be HTML. With any form of archiving, it is not simplicity but sameness that counts. People in the future will be able to recover HTML because it will be worth the effort of doing so: it will give them access to millions of documents.

    “On the downside, archiving a whole web directory with its CSS and JS, etc, seems cumbersome.”

    Not your problem. People have been working on how to archive websites. Let them do this for you.

    “While the figshare online display leaves something to be desired in archiving html files, and more to be desired in archiving git repos (e.g. vs Github, where interaction is easy) it seems those are UI problems more than archival problems.”

    No. The problem is that the link between your Figshare DOI and what you are archiving is completely ad-hoc. You want to archive your website; you have not done so. You have archived something else entirely.

    “I certainly see the advantage of being able to download and interact with Karthik’s git repo if I wanted to explore the history, even if it would be preferable to use an online-based exploration.”

    Same problem. And why have you done this? Because of the “magic” quality of the DOI. Figshare says “make your data citable”. But this is nonsense. Your website was already citable. Giving a DOI to something else does not make it more so.

    Of course, if you had minted a DOI, had it resolve to your website, and had a third-party escrow system which monitored your website and moved the pointers to Figshare if your website went 404, that would be useful. Better still, actually, would be to point to archive.org, or another web archiving solution, so people could view things. This is what greycite does (although with pURLs).

    “(Minor comment, you missed the citation in your link to Karthik’s archive.)”

    Sorry. It was there, but I was being too clever with my hyperlinking. This will fix itself presently.

    “Ensuring the existence of a copy is hard enough, but I think ensuring it has a persistent identifier is even harder (which gets back to Geoffrey’s arguments I guess).”

    I think that misunderstands Geoff’s arguments. Now, I think two-step identifiers have major problems for other reasons (which I will blog about at some point!), but as Geoff says, a two-step identifier is no use at all *unless* someone updates the endpoint; I would add a second rider, which is that the endpoint must be correct in the first place.

    If you want persistent identifiers for your blog, my suggestion: generate a partial redirect pURL, using http://www.purl.org. With a partial redirect, you only need to do this once; if you move domain, you can update the endpoint. Then add a line to your will and testament saying that, in the event of your death, it should be altered to point to http://wayback.archive.org/web/http://www.carlboettiger.info/. As Geoff says, the solution is partly technical, partly social.

  3. Carl says:

    Hi Phil,

    Thanks for the reply, this is very helpful. My only quibble is that you attribute the motivation for my archive, or Karthik’s Github archive, to having a DOI. Personally, I agree entirely that the idea that this somehow makes it ‘citable’ is silly. My motivation, and possibly that of others, is almost entirely based on the CLOCKSS backup. It is too easy to object, when someone provides data or content on either a personal site or on Github, that there is very little to guarantee that the content won’t be lost forever. CLOCKSS isn’t perfect, but it is a lot better.
