Recently, I was contacted by a Kcite (n.d.a) user who had found an interesting problem. They had cut-and-paste a DOI from the American Society of Microbiology article [webcite], and then used this in a blog post. But it was not working. The user actually did identify the problem, which was a strange character in the DOI.

So, I decided to investigate a bit futher. Looking at the source for the page, and the DOI appears mostly fine; it is not formatted according to CrossRef display guidelines (n.d.b) but they are hardly alone in this.

<span class="slug-doi">10.1128/​AAC.01664-10

However, looking a bit further into this at the binary of this source and we see this:

00006260: 2020 2020 2020 2020 203c 7370 616e 2063           <span c
00006270: 6c61 7373 3d22 736c 7567 2d64 6f69 223e  lass="slug-doi">
00006280: 3130 2e31 3132 382f e280 8b41 4143 2e30  10.1128/...AAC.0
00006290: 3136 3634 2d31 300a 2020 2020 2020 2020  1664-10.

The character “e2808b” is “zero width space” in UTF-8. The first time I saw this, my initial inclination was to suggest that it is the publishers being a pain and trying to prevent automatic harvesting of DOIs.

Actually, I suspect that this is not the case, as the DOI is in the page metadata:

<meta content="10.1128/AAC.01664-10" name="citation_doi" />

It is also present in multiple other locations, in their social bookmarking widgets. And there it is unmolested by spaces. So, why have they done this? The answer, I think, is that they display their DOI in a widget which is “cleverly” written to appear static on the screen (well, sort of, but this is a different story). And their widget is not wide-enough; the space is non-joining, so it allows them to control where the line break will happen. None the less, this piece of insanity prevents cutting and pasting of the DOI, and worse does so in a way which is very hard to detect for humans at least. To the extent that this kind of error even gets into institutional repositories, which significantly hinder their usefulness (n.d.c) A quick check suggests this is ubiquitous for the American Society of Microbiology website. Consider:

The CrossRef display guidelines are a little bit ambiguous here. Technically, as the zero-width space cannot be seen, it could be considered within the guidelines. I shall write to them to find out.

In case, this article sounds overly pious, I have to raise my hand here in shame, as I have used the same technique for different purposes. An article that I published yesterday on inline citations for kcite (n.d.d) uses zero-width joiners to break up a short-code, so that it is displayed rather than interpreted. If the example is cut-and-paste from the article into a new wordpress post, it will not work because of it. I will fix this soon, using unicode entities for the brackets instead.


Thanks to some swift action by Geoff Bilder, CrossRefs display guidelines have now been updated. While it will take a while, the knock-on effects of this change will be significant.