In this article, I consider the practical issues with archiving of scientific material placed on the web; I will describe the motivation for doing this, the background and consider the various mechanisms for doing so.
As part of our work on knowledgeblog , we have been investigating ways of injecting the formal technical aspects of the scientific publication process into this form of publication. The reasons for this are myriad: if the scientist can control the form, they can innovate in their presentation how they choose; the publication process itself becomes very simple and straight-forward (as opposed to the authoring, which is as hard as it ever way). Finally, it means that scientists can publish as they go, as I have done and am doing on my work with Tawny-OWL. This latter point has many potential implications: firstly, it makes science much more interactive — scientists can publish things that they are not clear on, early ideas and can (and I do) get feedback on this early; secondly, it should help to overcome publication bias as it is much lighter-weight than the current publication process. Scientists are more likely to publish negative results if the process is easy and not expensive. And, lastly, it can help to establish provenance for the work; if every scientists published in this way, scientific fraud would be much harder, as a fraudulant scientist would have to produce a coherent, faked set of data from the early days of the work.
However to achieve this, posts must still be available. The scientific record needs to be maintained. Now, this should not be an issue. I write this blog in Asciidoc , and rarely use images, so the source is quite small. In fact, since I moved to WordPress in 2009 , it totals about 725k; so it would fit on a floppy, which is a crushing blow to my ego. So, how easy is it to archive your content?
The difficulty here is that there is no obvious person to do this. Like many universities, I have access to an eprints archive. Unfortunately, this is mainly used for REF, and has no programmatic interface. The university also has a LOCKKS box. However, this is not generally available for the work that the University staff has produced, but journals that the University has bought; so I have to give my work away to a paywall publisher, or pay lots to an open access publisher to access this.
Another possibility would be to use Figshare. Now, I have some qualms about Figshare anyway; it appears to be a walled garden, the Facebook of science. Others, however, do not worry about this and are using Figshare. Carl Boettiger , for instance archives his note book on Figshare. But there is a problem: consider the 2012 archive; it is a tarball, with Markdown files inside; I know what to do with this, but many people will not. And it is only weakly linked to the original publication link. Titus Brown had the same idea , and claiming the added value of DOIs, something I find dubious . Again, though, the same problem; Figshare archives the source. The most extreme example of this comes from Karthik Ram who has published an Git repository; unsurprisingly, it is impossible to interact with as a repo.
Figshare likes to make great play of the fact that it is backed by CLOCKKS — this is a set of distributed copies maintained by some research libraries. Now, it might seem sensible that CLOCKKS would offer this service (at a price, of course) to researchers. Perhaps they do. But the website reveals nothing about this. And, although, I tried they did not respond to emails either. Rather like DOIs, the infrastructure is build around scale; in short, you need a publisher or some other institution involved; all very well, but this contradicts the desire for a light-weight publication mechanism. There is a second problem with CLOCKKS; it is a dark archive, that is, its content only becomes available to the public after a “trigger event”; the publisher going bust, the website going down and so on. Now data which is on the web and, critically, archived by someone other than the author essentially becomes non-repudiable and time-stamped. I can prove (to a lower-bound) when I said something. And you can prove that I said something even if I wish I hadn’t. In a strict sense, this is true if the data is in CLOCKKS; but in a practical sense, it is not, as checking when and what I said becomes too much of a burden to be useful.
So, we move onto web archiving. The idea of web archiving is attractive to me for one main reason; it is not designed for science. It is a general purpose, commodity solution, rather like a blog engine. If one thing scientific publication needs more than anything, it is to move the technology base away from bespoke and toward commodity.
One of the most straight-forward solutions for web archiving is WebCite; the critical advantage that this has is that it provides an on-demand service. I have been using it for a while to archive this site; greycite  now routinely submits new items here, if we can extract enough metadata from them. The archiving is quick, rapid and effective. The fly-in-the-ointment is that WebCite has funding issues and is threatened with closure at the end of 2013. The irony is that it claims it needs $25,000 to continue. Set against the millions put aside for APCs , or the thousands NPG claims is necessary to publish a single paper , or the millions that ACM spends supporting its digital library , this is small beer, and it shows the lack of seriousness with which we take web archiving. I hope it survives; if it does, Gunther Eysenbach, who runs it, tells me that the plan to expand the services they offer. It may yet become the archiving option of choice.
I have been able to find no on-demand alternative to WebCite. However, there are several other archives available. I have been using the UK Web Archive for a while now. I first heard about this service, irony or ironies, on the radio. Since I first used it to archive knowledge blog and later used it to archive this site, the process has got a lot easier. No longer do I need to send signed physical copyright permission; first it was electronic (email I think). It now appears that the law is changing to allow them to archive more widely (the BBC covered this in a story, categorized under “entertainment and arts” and which is largely focused on Stephen Fry’s tweets), although this will be a dark archive. Currently, this journal has been archived only once; from my other sites, it appears that they have a six month cycle. So, while this provides good digital preservation, it is a less good solution from the perspective of non-repudiablility; there is a significant gap before the archive happens, and a slightly longer one till the archive is published.
The UKWA is, as the name suggests, is UK specific. Another solution is to use, of course, archive.org, which might be considered to be the elephant in the room for web archiving. Unlike the UKWA, they don’t take submissions, but just crawl the web (although I suspect that the UKWA will start doing this also now). Getting onto a crawl can, therefore, be rather hit-and-miss. Frustratingly, they do have an “upload my data” service, which you can access through a logged in account; but not an “archive my URL” service. Again, a very effective resource from a digital preservation resource, but with similar problems to the UKWA from a point-of-view of non-repudiablilty. The archives take time to appear; in my experience, somewhat longer than the UKWA. I have also contacted their commercial wing, http://archive-it.org. Their software and the crawls that the offer could easily be configured to do the job, but unfortunately, they are currently aimed very much at the institutional level: their smallest package provides over around 100Gb of storage; this blog can be archived in around 130Mb (this is without deduplication which would save a lot); even a fairly prolific blogger comes in at around 250Mb. The price, unfortunately, reflects this. Although, again, it is on a par with my yearly publication costs, so is well within an average research budget.
Of course, these solutions are not exclusive; with greycite we have started to add tools to support these options. For instance, kblog-metadata , now supports an “archives” widget which is in use on this page; this links directly through to all the archives we know about. For individual pages, these are deep links, so you can see archived versions of each article straight-forwardly. The data comes from greycite, which we discover by probing; we may move later to using Mementos. greycite itself archives metadata about webpages, so we link to this also. As a side effect, these also mean that each article is submitted to greycite, which in turn causes archiving of the page through WebCite. Likewise, archive locations are returned within the BibTeX downloads, which is useful for those referencing sites.
Finally, greycite now generates pURLs — these are two-step resolution URLs which work rather like DOIs (or actually DOIs operate like pURLs, since as far as I am aware, pURLs predate the web infrastructure for DOIs). These resolve directly to the website in question. With a little support greycite can track content as and if it moves around the web; even if this fails, and an article disappears, greycite will redirect to the nearest web archive.
In summary, there is no perfect solution available at the moment, but there are many options; in many cases, archiving will happen somewhat magically. As we have found with many other aspects of author self-publishing on the web, it is possible to architecturally replicate many of the guarantees provided by the scientific publication industry through the simple use of web technology. Tools like greycite and kblog-metadata are useful in uncovering the archives that are already there, and linking these together with pURLs. Taken together, I have a reasonable degree of confidence that this material will be available in 10 or 50 years time. Whether anyone will still be reading it, well, that is a different issue entirely.