Abstract

Semantic publishing can enable richer documents with clearer, computationally interpretable properties. For this vision to become reality, however, authors must benefit from this process, so that they are incentivised to add these semantics. Moreover, the publication process that generates final content must allow and enable this semantic content. Here we focus on author-led or "grey" literature, which uses a convenient and simple publication pipeline. We describe how we have used metadata in articles to enable richer referencing of these articles and how we have customised the addition of these semantics to articles. Finally, we describe how we use the same semantics to aid in digital preservation and non-repudiability of research articles.

• Phillip Lord
• Lindsay Marshall

Plain English Summary

Academic literature makes heavy of references; effectively links to other, previous work that supports, or contradicts the current work. This referencing is still largely textual, rather than using a hyperlink as is common on the web. As well as being time consuming for the author, it also difficult to extract the references computationally, as the references are formatted in many different ways.

Previously, we have described a system which works with identifiers such as ArXiv IDs (used to reference this article above!), PubMed IDs and DOIs. With this system, called kcite, the author supplies the ID, and kcite generates the reference list, leaving the ID underneath which is easy to extract computationally. The data used to generate the reference comes from specialised bibliographic servers.

In this paper, we describe two new systems. The first, called Greycite, provides similiar bibliographic data for any URL; it is extracted from the URL itself, using a wide variety of markup and some ad-hoc tricks, which the paper describes. As a result it works on many web pages (we predict about 1% of the total web, or a much higher percentage of “interesting” websites). Our second system, kblog-metadata, provides a flexible system for generating this data. Finally, we discuss ways in which the same metadata can be used for digitial preservation, by helping to track articles as and when they move across the web.

This paper was first written for the Sepublica 2013 workshop.

The evil a space can do

Recently, I was contacted by a Kcite user who had found an interesting problem. They had cut-and-paste a DOI from the American Society of Microbiology article [webcite], and then used this in a blog post. But it was not working. The user actually did identify the problem, which was a strange character in the DOI.

So, I decided to investigate a bit futher. Looking at the source for the page, and the DOI appears mostly fine; it is not formatted according to CrossRef display guidelines , but they are hardly alone in this.

 10.1128/​AAC.01664-10 

However, looking a bit further into this at the binary of this source and we see this:

 00006260: 2020 2020 2020 2020 203c 7370 616e 2063 00006280: 3130 2e31 3132 382f e280 8b41 4143 2e30 10.1128/...AAC.0 00006290: 3136 3634 2d31 300a 2020 2020 2020 2020 1664-10.

The character “e2808b” is “zero width space” in UTF-8. The first time I saw this, my initial inclination was to suggest that it is the publishers being a pain and trying to prevent automatic harvesting of DOIs.

Actually, I suspect that this is not the case, as the DOI is in the page metadata:

 

It is also present in multiple other locations, in their social bookmarking widgets. And there it is unmolested by spaces. So, why have they done this? The answer, I think, is that they display their DOI in a widget which is “cleverly” written to appear static on the screen (well, sort of, but this is a different story). And their widget is not wide-enough; the space is non-joining, so it allows them to control where the line break will happen. None the less, this piece of insanity prevents cutting and pasting of the DOI, and worse does so in a way which is very hard to detect for humans at least. To the extent that this kind of error even gets into institutional repositories, which significantly hinder their usefulness . A quick check suggests this is ubiquitous for the American Society of Microbiology website. Consider:

The CrossRef display guidelines are a little bit ambiguous here. Technically, as the zero-width space cannot be seen, it could be considered within the guidelines. I shall write to them to find out.

In case, this article sounds overly pious, I have to raise my hand here in shame, as I have used the same technique for different purposes. An article that I published yesterday on inline citations for kcite uses zero-width joiners to break up a short-code, so that it is displayed rather than interpreted. If the example is cut-and-paste from the article into a new wordpress post, it will not work because of it. I will fix this soon, using unicode entities for the brackets instead.

Update

Thanks to some swift action by Geoff Bilder, CrossRefs display guidelines have now been updated. While it will take a while, the knock-on effects of this change will be significant.