Recently, I was contacted by a Kcite (http://knowledgeblog.org/kcite-plugin) user who had found an interesting problem. They had cut and pasted a DOI from an American Society of Microbiology article [webcite] and used it in a blog post, but it did not work. The user had in fact identified the cause themselves: a strange character in the DOI.
So, I decided to investigate a bit further. Looking at the source for the page, the DOI appears mostly fine; it is not formatted according to the CrossRef display guidelines (http://www.crossref.org/02publishers/doi_display_guidelines.html), but they are hardly alone in this.
<span class="slug-doi">10.1128/AAC.01664-10 </span>
However, looking a bit further, at the raw bytes of this source, we see this:
00006260: 2020 2020 2020 2020 203c 7370 616e 2063           <span c
00006270: 6c61 7373 3d22 736c 7567 2d64 6f69 223e  lass="slug-doi">
00006280: 3130 2e31 3132 382f e280 8b41 4143 2e30  10.1128/...AAC.0
00006290: 3136 3634 2d31 300a 2020 2020 2020 2020  1664-10.
The byte sequence “e2 80 8b” is the UTF-8 encoding of U+200B, the “zero width space”. When I first saw this, my initial inclination was to suspect that the publishers were being a pain and trying to prevent automatic harvesting of DOIs.
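For anyone who wants to check a suspicious string without reading a hex dump, here is a minimal sketch in Python. It decodes the three bytes seen above and then lists any invisible "format" characters in a pasted DOI. The helper name invisible_chars is my own invention for illustration, and treating Unicode category Cf as "invisible" is an assumption that covers the zero-width space but is not an exhaustive definition.

```python
import unicodedata

# The three bytes that sit between "10.1128/" and "AAC" in the hex dump.
suspect = bytes.fromhex("e2808b")
char = suspect.decode("utf-8")
print(f"U+{ord(char):04X}", unicodedata.name(char))


def invisible_chars(text):
    """Return (index, code point, name) for each format (Cf) character.

    Hypothetical helper: Cf is used here as a rough proxy for
    "characters a human cannot see".
    """
    return [
        (i, f"U+{ord(c):04X}", unicodedata.name(c, "UNKNOWN"))
        for i, c in enumerate(text)
        if unicodedata.category(c) == "Cf"
    ]


# The DOI as it comes out of the page, with the hidden character in place.
pasted = "10.1128/\u200bAAC.01664-10"
print(invisible_chars(pasted))
```

Run against the DOI from the page, this reports a single hidden character directly after the slash; run against the clean DOI, it reports nothing.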
Actually, I suspect that this is not the case, as the DOI is in the page metadata:
<meta content="10.1128/AAC.01664-10" name="citation_doi" />
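A tool that wants the DOI could read it from this metadata rather than from the visible text. The following is a rough sketch using only Python's standard library; the class and function names are hypothetical, and it assumes the page uses a meta tag named citation_doi exactly as shown above.

```python
from html.parser import HTMLParser


class CitationDoiParser(HTMLParser):
    """Collect the content of every <meta name="citation_doi"> tag."""

    def __init__(self):
        super().__init__()
        self.dois = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("name") == "citation_doi" and attributes.get("content"):
            self.dois.append(attributes["content"])


def citation_dois(html_text):
    """Hypothetical helper: return all citation_doi values in the HTML."""
    parser = CitationDoiParser()
    parser.feed(html_text)
    parser.close()
    return parser.dois


sample = '<head><meta content="10.1128/AAC.01664-10" name="citation_doi" /></head>'
print(citation_dois(sample))
```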
The DOI also appears in several other locations, including their social bookmarking widgets, and there it is unmolested by spaces. So, why have they done this? The answer, I think, is that they display their DOI in a widget which is “cleverly” written to appear static on the screen (well, sort of, but this is a different story). Their widget is not wide enough, and the zero-width space marks a point where the line may break, so it allows them to control where the line break will happen.
Nonetheless, this piece of insanity prevents cutting and pasting of the DOI and, worse, does so in a way which is very hard to detect, for humans at least. This kind of error even gets into institutional repositories, which significantly hinders their usefulness (http://erambler.co.uk/blog/doi2oa-status-update). A quick check suggests this is ubiquitous on the American Society of Microbiology website.
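Software that accepts pasted DOIs can defend itself against this. Below is a minimal sketch, assuming that a DOI never legitimately contains whitespace or invisible format characters (Unicode category Cf); clean_doi is a hypothetical name, and this is not how Kcite or any repository actually handles the problem.

```python
import unicodedata


def clean_doi(raw):
    """Drop invisible format characters (category Cf) and all whitespace.

    Assumption: neither can be a meaningful part of a DOI.
    """
    return "".join(
        c for c in raw
        if unicodedata.category(c) != "Cf" and not c.isspace()
    )


# What a user gets when copying from the page: hidden character plus newline.
pasted = "10.1128/\u200bAAC.01664-10\n"
cleaned = clean_doi(pasted)
print(cleaned == "10.1128/AAC.01664-10")
```

Filtering by character category rather than removing U+200B alone means that similar invisible characters, such as zero-width joiners, are dropped as well.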
The CrossRef display guidelines are a little bit ambiguous here. Technically, as the zero-width space cannot be seen, it could be considered within the guidelines. I shall write to them to find out.
In case this article sounds overly pious, I have to raise my hand here in shame, as I have used the same technique for a different purpose. An article that I published yesterday on inline citations for kcite (http://process.knowledgeblog.org/309) uses zero-width joiners to break up a shortcode, so that it is displayed rather than interpreted. If the example is cut and pasted from the article into a new WordPress post, it will not work because of them. I will fix this soon, using Unicode entities for the brackets instead.
Thanks to some swift action by Geoff Bilder, CrossRef's display guidelines have now been updated. While it will take some time, the knock-on effects of this change will be significant.
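The entity approach can be sketched as follows. This is an illustration in Python rather than the actual fix to the article: the function name display_shortcode and the example shortcode are assumptions, and the idea is simply that numeric character references for the brackets render as brackets in the browser without being parsed as a shortcode.

```python
import html


def display_shortcode(shortcode):
    """Replace square brackets with numeric character references.

    Hypothetical helper: the result is meant to be shown, not interpreted.
    """
    return shortcode.replace("[", "&#91;").replace("]", "&#93;")


# Illustrative shortcode only; the DOI is the one discussed in this post.
example = "[cite]10.1128/AAC.01664-10[/cite]"
escaped = display_shortcode(example)
print(escaped)

# The browser decodes the references, so copied text has real brackets.
print(html.unescape(escaped) == example)
```

Unlike the zero-width joiner trick, nothing hidden survives in the rendered text, so what the reader copies is exactly what they see.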