Archive for the ‘Communication’ Category

Much has been said about overlay journals. The idea is simple: the journal essentially becomes a selector, a channel, with the paper itself hosted elsewhere, such as on arXiv.

This holds a certain amount of attraction for me; I already post my new papers on arXiv, and I have been posting them here also. This works well, but is hampered by technology. Mostly I write papers in LaTeX, and I have written tools to make these suitable for WordPress; these work well enough to publish an entire thesis. However, the process of doing this is not slick. For instance, when trying to publish one of my own papers, I had problems because I used a theorem environment (10.1186/2041-1480-1-S1-S4). While PlasTeX is a nice tool, the key problem is that it is fundamentally a different interpreter from TeX. Eventually, perhaps, LuaTeX will get an HTML backend, but until this happens the system will always fail in some cases.

So, I wanted to investigate whether it was possible to build overlay functionality into a personal publication framework, such as the WordPress installation on which I host these articles. Well, it turns out that, combined with the tools I have written for manipulating metadata, it is relatively simple to do so; my first attempt at this is now available for my OWLED 2013 paper. The title, authors (just me in this case), date, abstract and PDF link all come directly from arXiv. Full text is not available from arXiv; in any case, it would suffer from all the issues described earlier, and in the end the PDF is probably the best representation of this paper. I have supplemented it with a plain English summary, something that I have wanted to do for years, but have not managed to start. If the reviewers allow me to do so, I will also attach their reviews when they become available.

The code for this is not quite ready to release yet; however, it will potentially work over any eprints repository, and I have connected it up to Greycite also, so it can be used over any source that Greycite can interpret.

All a little clunky, but I think that this is the future. The journal is dead; long live the article.


Fixed DOI.


In this article, I consider the practical issues with archiving scientific material placed on the web; I will describe the motivation and background, and consider the various mechanisms for doing so.

As part of our work on knowledgeblog, we have been investigating ways of injecting the formal technical aspects of the scientific publication process into this form of publication. The reasons for this are myriad: if scientists can control the form, they can innovate in their presentation as they choose; the publication process itself becomes very simple and straight-forward (as opposed to the authoring, which is as hard as it ever was). Finally, it means that scientists can publish as they go, as I have done and am doing with my work on Tawny-OWL. This latter point has many potential implications: firstly, it makes science much more interactive; scientists can publish things that they are not yet clear on, early ideas, and can (and I do) get feedback on these early. Secondly, it should help to overcome publication bias, as it is much lighter-weight than the current publication process; scientists are more likely to publish negative results if the process is easy and not expensive. And, lastly, it can help to establish provenance for the work; if every scientist published in this way, scientific fraud would be much harder, as a fraudulent scientist would have to produce a coherent, faked set of data from the early days of the work.

However, to achieve this, posts must still be available. The scientific record needs to be maintained. Now, this should not be an issue. I write this blog in Asciidoc, and rarely use images, so the source is quite small. In fact, since I moved to WordPress in 2009, it totals about 725k; it would fit on a floppy, which is a crushing blow to my ego. So, how easy is it to archive your content?

The difficulty here is that there is no obvious person to do this. Like many universities, mine gives me access to an eprints archive. Unfortunately, this is mainly used for the REF, and has no programmatic interface. The university also has a LOCKSS box. However, this holds not the work that University staff have produced, but the journals that the University has bought; so to get my work into it, I would have to give my work away to a paywall publisher, or pay a great deal to an open access publisher.

Another possibility would be to use Figshare. Now, I have some qualms about Figshare anyway; it appears to be a walled garden, the Facebook of science. Others, however, do not worry about this and are using Figshare. Carl Boettiger, for instance, archives his notebook there. But there is a problem: consider the 2012 archive; it is a tarball, with Markdown files inside. I know what to do with this, but many people will not. And it is only weakly linked to the original publication. Titus Brown had the same idea, claiming the added value of DOIs, something I find dubious. Again, though, there is the same problem: Figshare archives the source. The most extreme example of this comes from Karthik Ram, who has published a Git repository; unsurprisingly, it is impossible to interact with as a repo.

Figshare likes to make great play of the fact that it is backed by CLOCKSS, a set of distributed copies maintained by some research libraries. Now, it might seem sensible for CLOCKSS to offer this service (at a price, of course) to researchers. Perhaps they do; but the website reveals nothing about this and, although I tried, they did not respond to emails either. Rather like DOIs, the infrastructure is built around scale; in short, you need a publisher or some other institution involved. All very well, but this contradicts the desire for a light-weight publication mechanism. There is a second problem with CLOCKSS: it is a dark archive, that is, its content only becomes available to the public after a “trigger event”, such as the publisher going bust or the website going down. Now, data which is on the web and, critically, archived by someone other than the author essentially becomes non-repudiable and time-stamped. I can prove (to a lower bound) when I said something, and you can prove that I said something even if I wish I hadn’t. In a strict sense, this is true if the data is in CLOCKSS; but in a practical sense, it is not, as checking when and what I said becomes too much of a burden to be useful.

So, we move on to web archiving. The idea of web archiving is attractive to me for one main reason: it is not designed for science. It is a general-purpose, commodity solution, rather like a blog engine. If scientific publication needs one thing more than anything, it is to move the technology base away from bespoke and toward commodity.

One of the most straight-forward solutions for web archiving is WebCite; the critical advantage it has is that it provides an on-demand service. I have been using it for a while to archive this site; Greycite now routinely submits new items here, if we can extract enough metadata from them. The archiving is rapid and effective. The fly in the ointment is that WebCite has funding issues and is threatened with closure at the end of 2013. The irony is that it claims it needs only $25,000 to continue. Set against the millions put aside for APCs, the thousands NPG claims is necessary to publish a single paper, or the millions that the ACM spends supporting its digital library, this is small beer, and it shows the lack of seriousness with which we take web archiving. I hope it survives; if it does, Gunther Eysenbach, who runs it, tells me that they plan to expand the services they offer. It may yet become the archiving option of choice.

I have been able to find no on-demand alternative to WebCite. However, there are several other archives available. I have been using the UK Web Archive for a while now; I first heard about this service, irony of ironies, on the radio. Since I first used it to archive knowledgeblog, and later this site, the process has got a lot easier: no longer do I need to send signed physical copyright permission; latterly it was electronic (email, I think). It now appears that the law is changing to allow them to archive more widely (the BBC covered this in a story, categorised under “entertainment and arts”, which largely focused on Stephen Fry’s tweets), although this will be a dark archive. Currently, this journal has been archived only once; from my other sites, it appears that they have a six-month cycle. So, while this provides good digital preservation, it is a less good solution from the perspective of non-repudiability; there is a significant gap before the archive happens, and a slightly longer one until the archive is published.

The UKWA is, as the name suggests, UK specific. Another solution is, of course, what might be considered the elephant in the room for web archiving. Unlike the UKWA, they don’t take submissions, but just crawl the web (although I suspect that the UKWA will now start doing this also). Getting onto a crawl can, therefore, be rather hit-and-miss. Frustratingly, they do have an “upload my data” service, which you can access through a logged-in account, but not an “archive my URL” service. Again, this is a very effective resource from a digital preservation perspective, but with similar problems to the UKWA from the point of view of non-repudiability: the archives take time to appear; in my experience, somewhat longer than with the UKWA. I have also contacted their commercial wing. Their software and the crawls that they offer could easily be configured to do the job, but unfortunately they are currently aimed very much at the institutional level: their smallest package provides around 100GB of storage, while this blog can be archived in around 130MB (without deduplication, which would save a lot); even a fairly prolific blogger comes in at around 250MB. The price, unfortunately, reflects this; although, again, it is on a par with my yearly publication costs, so is well within an average research budget.

Of course, these solutions are not exclusive; with Greycite we have started to add tools to support these options. For instance, kblog-metadata now supports an “archives” widget, in use on this page, which links directly through to all the archives we know about. For individual pages, these are deep links, so you can see archived versions of each article straight-forwardly. The data comes from Greycite, which discovers the archives by probing; we may later move to using Mementos. Greycite itself archives metadata about webpages, so we link to this also. As a side effect, this also means that each article is submitted to Greycite, which in turn causes archiving of the page through WebCite. Likewise, archive locations are returned within the BibTeX downloads, which is useful for those referencing sites.

Finally, Greycite now generates pURLs: two-step resolution URLs which work rather like DOIs (or, actually, DOIs operate like pURLs, since as far as I am aware pURLs predate the web infrastructure for DOIs). These resolve directly to the website in question. With a little support, Greycite can track content as and if it moves around the web; even if this fails, and an article disappears, Greycite will redirect to the nearest web archive.

In summary, there is no perfect solution available at the moment, but there are many options; in many cases, archiving will happen somewhat magically. As we have found with many other aspects of author self-publishing on the web, it is possible to architecturally replicate many of the guarantees provided by the scientific publication industry through the simple use of web technology. Tools like Greycite and kblog-metadata are useful in uncovering the archives that are already there, and in linking these together with pURLs. Taken together, I have a reasonable degree of confidence that this material will be available in 10 or 50 years’ time. Whether anyone will still be reading it, well, that is a different issue entirely.


HEFCE is currently asking for feedback on the role of Open Access in the next REF. While I have a number of technical suggestions, I think that the biggest and best contribution HEFCE could make to the next REF is to state publicly that all journal/conference/venue metadata will be removed from papers before they are sent for review.

It is time that we stopped judging books by their cover. It would be a fantastic contribution if HEFCE could take a lead on this. This is my full response.

Expectations for Open Access

I feel that one key issue is missing from this document. Scientists in some areas (including mine, computing science) still face the problem that the “high-impact” journals or conferences often provide no open access options, or only prohibitively expensive ones. In the past, I have refused to publish in these journals because I wish my work to remain open access, and have instead published elsewhere. However, this works directly against my own interests in the current REF, as the research will be judged as less good. The use of journals as a primary indicator of quality also works against my ability to choose cheaper venues. Few people believe statements that research will not be judged on publication venue; indeed, as an individual academic, I have even been told to comment directly on the venue in my return.

One simple and yet enormous contribution that the procedures for the next REF could make to Open Access is not to coerce, but to remove this enormous barrier. This could happen simply and straight-forwardly by removing all journal and publication venue metadata from papers when they are presented to reviewers. Of course, reviewers could work around this (the data is a Google search away), but the message sent by such a step would be enormous.

The general expectations for OA publishing seem reasonable. However, I would add a further specific requirement. Currently, it is very hard to find the location of a green OA copy of any article. Making articles available is not enough; they must be discoverable. Therefore, I would suggest a specific requirement that a primary identifier (DOI, ISSN, ISBN or URL) must be present in the institutional repository, and that this must be visible on the web page and present in computational metadata. Finally, making the paper discoverable is also not enough. There must be computational and human-readable metadata making clear that the contents of the paper are Open Access; without this form of explicit statement, the only safe course of action for readers is to assume the default copyright position, namely that they cannot use the material.

Institutional Repositories

Despite the significant investment, our experience is that few people ever retrieve data from institutional repositories. Partly, this is because it is difficult to link between articles on a journal website and articles in institutional repositories. As a second problem, institutional repositories provide an inconsistent experience, both for computational and human access. For instance, the presentation of identifiers such as DOIs is inconsistent; even when present, DOIs are often inaccurate, containing syntactic errors which prevent their use.
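Many of these syntactic errors could be caught before a record is ever published. As an illustration, here is a minimal sketch of the kind of check a repository could run; the "10.&lt;registrant&gt;/&lt;suffix&gt;" pattern is my own heuristic, not the full DOI specification.

```python
# A rough syntactic check of the kind a repository could run before
# publishing a record. The pattern is a heuristic of my own, not the
# full DOI grammar.
import re

DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(s: str) -> bool:
    """Return True if s has the basic shape of a DOI."""
    return bool(DOI_PATTERN.match(s))

print(looks_like_doi("10.1093/bioinformatics/btg153"))      # True
print(looks_like_doi("doi:10.1093/bioinformatics/btg153"))  # False: "doi:" prefix included
print(looks_like_doi("10.1093/ bioinformatics"))            # False: stray space
```

Even a check this crude would flag embedded whitespace and "doi:"-prefixed identifiers, two of the commonest errors.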

Ultimately, institutional repositories would be much better if there were a single infrastructure maintained at a national (or international) level. In fact, a strong exemplar for this already exists in the form of arXiv. The ability to update the repository could be devolved to individual institutions; an authentication framework for this is already in place through Je-S.

Linking between institutional repositories and subject repositories is, unfortunately, likely to be difficult from a social perspective; there are many subject repositories, and institutional repositories are unlikely to link to them well, because they are not experts in those repositories. This might be more plausible with a single national repository.

The better solution is to enable the authors of papers to perform this linking. The scientists, who actually care about the links working and pointing to the correct place, are best placed to do this. This could be supported by making linking to data, software or other subject repositories an explicit criterion in the REF; this already happens in some disciplines (for example, in bioinformatics, reviewers often ask for a clear statement of whether and where software is available, and under what conditions).

Approach to Exceptions

If exceptions are to be for a transitional period, then any exceptions given should be marked with a “sell-by” date, after which they should no longer be considered valid.

It is worth reiterating that embargoes really only benefit the publishers; ensuring that the REF framework allows academics to choose their publication venue more freely, rather than effectively requiring them to publish in selected “high-impact” venues would enable them to choose venues with short, or no embargo period. The most effective mechanism for achieving this would be to remove all publication venue information from future REF returns. The research would be judged on the basis of the research, and not the publication venue.

Open Data

There is more complexity behind the requirement for open data than for open access, particularly where the data needs to remain confidential for reasons of data protection. Having said this, there are many disciplines (again, bioinformatics is an obvious example) where the majority of data is open. Making a decision now to rule this out of scope, for a REF which may be a significant distance in the future, seems premature.

Recently, I was contacted by a Kcite user who had found an interesting problem. They had cut-and-pasted a DOI from an American Society for Microbiology article [webcite], and then used it in a blog post. But it was not working. The user did, in fact, identify the problem: a strange character in the DOI.

So, I decided to investigate a bit further. Looking at the source for the page, the DOI appears mostly fine; it is not formatted according to the CrossRef display guidelines, but they are hardly alone in this.

<span class="slug-doi">10.1128/​AAC.01664-10

However, looking a bit further, at the bytes of the source, we see this:

00006260: 2020 2020 2020 2020 203c 7370 616e 2063           <span c
00006270: 6c61 7373 3d22 736c 7567 2d64 6f69 223e  lass="slug-doi">
00006280: 3130 2e31 3132 382f e280 8b41 4143 2e30  10.1128/...AAC.0
00006290: 3136 3634 2d31 300a 2020 2020 2020 2020  1664-10.

The byte sequence “e2 80 8b” is the UTF-8 encoding of U+200B, “ZERO WIDTH SPACE”. The first time I saw this, my initial inclination was to suggest that the publishers were being a pain and trying to prevent automatic harvesting of DOIs.
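On the consuming side, the practical fix is simply to strip these invisible characters before trying to resolve the DOI. A minimal sketch follows; the helper is a hypothetical example of my own, not part of Kcite, though the code points themselves are standard Unicode.

```python
# Strip invisible "zero width" characters from a DOI copied off a web
# page. The code points are standard Unicode; the helper itself is a
# hypothetical example, not part of Kcite.
ZERO_WIDTH = {
    "\u200b",  # ZERO WIDTH SPACE (e2 80 8b in UTF-8)
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
}

def clean_doi(doi: str) -> str:
    """Remove characters that are invisible but break DOI resolution."""
    return "".join(ch for ch in doi if ch not in ZERO_WIDTH)

# The DOI as it cut-and-pastes from the page, hidden U+200B included:
pasted = "10.1128/\u200bAAC.01664-10"
print(clean_doi(pasted))  # 10.1128/AAC.01664-10
```

The catch, of course, is that a user cutting and pasting has no reason to suspect that any cleaning is needed in the first place.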

Actually, I suspect that this is not the case, as the DOI is in the page metadata:

<meta content="10.1128/AAC.01664-10" name="citation_doi" />

It is also present in multiple other locations, in their social bookmarking widgets, and there it is unmolested by spaces. So, why have they done this? The answer, I think, is that they display their DOI in a widget which is “cleverly” written to appear static on the screen (well, sort of, but this is a different story). And their widget is not wide enough; the space is non-joining, so it allows them to control where the line break will happen. Nonetheless, this piece of insanity prevents cutting and pasting of the DOI, and worse, does so in a way which is very hard for humans, at least, to detect. Errors of this kind even make their way into institutional repositories, significantly hindering their usefulness. A quick check suggests this is ubiquitous across the American Society for Microbiology website.
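The lesson for harvesters is to prefer the citation_doi metadata over the visible page text. A minimal sketch, using only the Python standard library; the class name is mine, and the page is abridged to the single relevant tag from the source shown above.

```python
# Extract the DOI from the citation_doi <meta> tag, which, unlike the
# visible widget, carries the DOI without zero-width spaces. A sketch
# only; a real harvester would fetch the page and parse defensively.
from html.parser import HTMLParser

class DOIExtractor(HTMLParser):
    """Collect the content of <meta name="citation_doi" ...>."""

    def __init__(self):
        super().__init__()
        self.doi = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name") == "citation_doi":
            self.doi = attr.get("content")

# Abridged from the page discussed above:
page = '<meta content="10.1128/AAC.01664-10" name="citation_doi" />'
parser = DOIExtractor()
parser.feed(page)
print(parser.doi)  # 10.1128/AAC.01664-10
```

This works here because HTMLParser routes self-closing tags through handle_starttag; a harvester would also want fallbacks for Dublin Core and the other metadata formats discussed elsewhere on this blog.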

The CrossRef display guidelines are a little bit ambiguous here. Technically, as the zero-width space cannot be seen, it could be considered within the guidelines. I shall write to them to find out.

In case this article sounds overly pious, I have to raise my hand here in shame, as I have used the same technique for different purposes. An article that I published yesterday on inline citations for Kcite uses zero-width joiners to break up a shortcode, so that it is displayed rather than interpreted. If the example is cut-and-pasted from the article into a new WordPress post, it will not work because of this. I will fix this soon, using Unicode entities for the brackets instead.


Thanks to some swift action by Geoff Bilder, CrossRef’s display guidelines have now been updated. While it will take a while, the knock-on effects of this change will be significant.


Adding metadata to an article could be done by many people. It could be the author and, in the ideal world, it would be the author: they know most about the content and are best placed to put the most knowledge into it. But we have to answer the question: why would they do this? We have previously argued that semantic metadata must be useful to the people producing it. For this, we need tools that extract and consume this metadata.

I discovered a nice example of this recently while reading an interesting paper from Yimei Zhu and Rob Procter, investigating how PhD students use various tools to communicate. I was interested in citing this paper. The paper can be found on the web at the Manchester eScholar site [webcite]. What metadata is in this page? Well, our Greycite tool is designed for exactly this purpose; unfortunately, it suggests that there is very little in the way of metadata.

I contacted the eScholar helpdesk, and they confirmed that there really is no embedded metadata; Greycite has not just missed it. The strange thing is that, several days later, I managed to get to a very similar page [webcite]. It has a different layout and colour scheme, but it is clearly the same. The bibliographic metadata fields (not easily extractable, sadly!) appear identical. However, investigating the metadata in this page, we see a very different story: it is full of Dublin Core. It is not ideally laid out, but it is all that we need for citation.

Unfortunately, there is no link between the two, nor do I know why Manchester has these two different pages; perhaps one is designed to replace the other. And, of course, from the point of view of the reader, there is no reason why they would suspect that one contains metadata and the other does not.

The point here is not to criticise Manchester library services. Instead, it is to raise the question, why are the two locations so different in terms of their metadata? My suspicion is that the real answer is simple: very few people have noticed, and no one really cares. It might be argued that metadata must be correct to be useful. The evidence suggests that the inverse is true: metadata must be useful to be correct.

With tools like Greycite and kblog-metadata, making the metadata useful is a key aim. Using Kcite, I can now reference any article here in this journal, or at bio-ontologies. So now Kcite users care about the metadata. From this page you can download a BibTeX file for this article, or even for every article on the site (all 500+). This metadata comes directly from Greycite, which in turn scrapes it from this website. So now the site operator (me!) cares. And I use the bib files to drive the tools that I use to cite my own work. So, now, the author (also me!) cares.

There is a chicken-and-egg situation here: why write tools to operate over metadata when no one is using the metadata? Fortunately, with Kcite we have had a gradual path: first we used DOIs, then PubMed IDs, then arXiv, and now any URI at all. And with Greycite, we have used a lot of heuristics, and quite a few metadata formats. While it has been a significant amount of work, metadata is now making our lives easier. This is the way that it must be.


Typographical correction.


Today, I received an email from a journal, asking me if I would review a paper. The paper in question is by, among others, Iddo Friedberg, and can be read on arXiv (1301.1740). I have known Iddo Friedberg for a while; he was an early user of my semantic similarity work (10.1093/bioinformatics/btg153) for protein function prediction (10.1110/ps.062158406), and was also the editor for our paper on realism in ontology development (10.1371/journal.pone.0012258). I would have liked to review this paper, and I feel a little bad, because I know these things are important for the careers of the scientists involved.

So, why did I decline? Well, nice and simple: the page charges are just too high. There is no real justification for this, as publishing can be done much more cheaply; £200 or so seems reasonable. Moreover, I think high charges are bad for science, because they are one of the factors that cause authors to think very carefully, and often to save up work for “a bigger publication”. This can delay publication for years after the work has happened. Scientists have to think carefully about their research and their work; thinking about whether to publish now or later is one piece of baggage that we could do without.

The real irony of the situation, though, is that the peer review for this paper has already happened. The paper concerns bias in Gene Ontology annotations of protein function. Iddo posted his work to the various Gene Ontology mailing lists; unsurprisingly, the GO annotation team saw the paper, and Rachael Huntley responded. The academic debate has started, and is in full swing; others may see it and contribute. And, frankly, the quality of the discussion going on there, and the depth of the analysis, is higher than I would have given. No journal has been involved; it happened because there is a mailing list which the scientists in question used.

The current peer-review system does not add value; it is my peers, and the scientific debate between them, that do this. And this can, and will, happen regardless of the journals; indeed, in this case, why don’t the journal editors just read the mailing list?

So why do scientists, including myself, continue to publish in this way? It can often be difficult not to, particularly where there are no open access options available. We have to; it is part, indeed the main part, of our assessment. As I have said before, this is now the only reason I publish in this way.

Having said this, I do have my doubts. I feel somewhat guilty toward Iddo Friedberg, for instance. There is also a degree of hypocrisy in this: I will still submit to journals (for my own sake, of course, but also for my PhD students); will people, perhaps, now decline to review my articles? What would happen if everybody thought like this? (Here, I can use the Yossarian defence: then I’d be a damn fool to think any different.) If I set the bar at £200, then who will I review for? Well, I do review for conferences and workshops where I can. Still, I feel that this is not enough; people review my work, so I should review theirs. So, I state here that, subject to some time constraints, I will happily review work that is posted either to the web in this form, or to sites such as arXiv. Reviews will be posted here, on this blog.

I have my doubts; but open access is not enough. Publication must get lighter, faster and much, much cheaper. I would welcome alternative courses of action.

Many thanks to Simon Cockell and James Malone, who peer-reviewed this post and provided helpful comments. I am also grateful to Iddo Friedberg, who gave me permission to use the story about his paper in this way. The opinions expressed here are, however, my own.


In response to feedback from Mike Taylor, it is worth pointing out that I do not review for paywall journals, and have not for quite a while.