A while back, I submitted a grant to JISC on digital preservation. The basic idea was to take a set of files that I had as Word docs and post them all on the knowledgeblog platform. The practical upshot of this is that the files, instead of rusting, become accessible to the world at large and also get digitally preserved by the various web preservation engines around. We called this digital preservation by stealth; putting something on the web is useful anyway, and the preservation occurs as a happy by-product. And along the way, we get stats on whether the content was actually used by anyone.

Nice idea, I thought. Still, the grant bounced. There were several reasons for this: the preservation was so stealthy that one of the reviewers could not see it at all; another thought that the format chosen (HTML) was pretty dubious for preservation.

In one way, I found this response to be a bit dubious. After all, in two hundred years' time, which is more likely to be readable: HTML, or the NLM DTD which, as Martin Fenner points out (n.d.a), very few people actually use? However, in another way I can appreciate the problem. Take, for example, my own KCite plugin (n.d.b): in the last year, I have added support for multiple reference formats, but I have done this in JavaScript. While this has some advantages, particularly for page load times, it does reduce the preservability of my web page; JavaScript is complex, and may well not run in five years' time, let alone two hundred. Although JavaScript (or at least the ability to embed scripts) is a feature of HTML, it is not a good feature from a preservation point of view. Not all HTML is created equal, as it were. KCite does make some attempts to cope with this: in the absence of JavaScript, readers see a visible and clickable URI direct to the resource in place of the citation, but lose the reference metadata.
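The fallback behaviour described above, a plain clickable URI standing in for the formatted citation, can be sketched roughly as follows. This is not KCite's code (KCite is a WordPress plugin, so the real thing is PHP); the function name `render_citation_fallback` and the `citation-fallback` class are hypothetical, invented for illustration. The point is only that the link is ordinary HTML that needs no script to be useful.

```python
from html import escape


def render_citation_fallback(uri: str) -> str:
    """Return a plain, clickable link to a cited resource.

    Hypothetical helper: a script could later find these links by class
    and replace them with a formatted reference, but the page remains
    readable if that script never runs.
    """
    safe = escape(uri, quote=True)  # escape &, <, > and quotes for HTML
    return f'<a class="citation-fallback" href="{safe}">{safe}</a>'


if __name__ == "__main__":
    print(render_citation_fallback("http://example.org/paper?id=1&v=2"))
```

What is lost in this form is exactly what the text says: the authors, title and date of the reference are not on the page, only the address of the thing cited.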

I had thought about using ScholarlyHTML, which was announced several years ago; this comes from a similar understanding that HTML can be used well or badly for academic publication. Interestingly, their website uses a similar idea to knowledgeblog: collaborative editing out of band, followed by unchanging, date-stamped publication to a blog engine. At least, that is the idea, although as Peter Sefton says on the front page, “[That is, once I get the site established - I’m editing this page live to get started - ptsefton]”. But the intention of ScholarlyHTML is largely to make content more explicit; no bad thing, but not what I was after.

Instead, I have implemented what I describe as “SimpleHTML”; alternatively, if you are of a more pejorative nature, you can call it “StupidHTML”. The idea is this: it is HTML which is as close to raw content as I can make it. Within WordPress, the “content” is generated by an editor (either internal to WordPress, or external for those using XML-RPC publication), so there is little I can do about this. However, I can stop WordPress from making the situation worse by adding its own complexity.
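To give a flavour of what “as close to raw content as I can make it” might mean in practice, here is a minimal sketch in Python. It is not how kblog-metadata produces SimpleHTML (that happens inside WordPress, in PHP); it assumes, purely for illustration, that “simple” means dropping `script` and `style` elements and inline `on...` event handlers while passing everything else through. The names `SimpleHTMLFilter`, `simplify` and `DROPPED_ELEMENTS` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Elements dropped together with their content (an assumption of this sketch).
DROPPED_ELEMENTS = {"script", "style"}


class SimpleHTMLFilter(HTMLParser):
    """Re-emit HTML without scripts, styles or inline event handlers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def _start(self, tag, attrs, closer=">"):
        parts = [tag]
        for name, value in attrs:
            if name.startswith("on"):  # onclick, onload, ...
                continue
            # The parser hands over unescaped values, so escape them again.
            parts.append(
                name if value is None
                else f'{name}="{escape(value, quote=True)}"'
            )
        return "<" + " ".join(parts) + closer

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_ELEMENTS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self._start(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag not in DROPPED_ELEMENTS and not self.skip_depth:
            self.out.append(self._start(tag, attrs, closer=" />"))

    def handle_endtag(self, tag):
        if tag in DROPPED_ELEMENTS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def simplify(html_text: str) -> str:
    """Return html_text with active content removed."""
    parser = SimpleHTMLFilter()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<p onclick="track()">Hello <a href="http://example.org/">link</a></p><script>alert(1)</script>'
    print(simplify(page))
```

Comments and the doctype are silently dropped here, because the corresponding parser handlers are not overridden; a real implementation would probably want to keep the doctype.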

To enable this, I have made a couple of changes. Kblog-metadata (n.d.c) now generates SimpleHTML on request, and provides a link to it through the Download widget I have created; it should be visible on this page. And I have updated KCite so that I can programmatically disable JavaScript reference generation and fall back to server-side PHP generation. As an example, you can see the SimpleHTML for a recent article here. Metadata added via kblog-metadata (n.d.d) is still present, but otherwise the web page looks remarkably similar to HTML that I wrote in 1994, after seeing Mosaic for the first time.

As a general principle for preservation, as I develop the knowledgeblog platform further, I will follow this rule: I will not be averse to using active technologies such as JavaScript; however, a “fallback” of straightforward HTML, preserving the raw content, will always be available.