Semantic publishing can enable richer documents with clearer, computationally interpretable properties. For this vision to become reality, however, authors must benefit from this process, so that they are incentivised to add these semantics. Moreover, the publication process that generates final content must allow and enable this semantic content. Here we focus on author-led or "grey" literature, which uses a convenient and simple publication pipeline. We describe how we have used metadata in articles to enable richer referencing of these articles and how we have customised the addition of these semantics to articles. Finally, we describe how we use the same semantics to aid in digital preservation and non-repudiability of research articles.

  • Phillip Lord
  • Lindsay Marshall

Plain English Summary

Academic literature makes heavy use of references: effectively, links to other, previous work that supports or contradicts the current work. This referencing is still largely textual, rather than using a hyperlink as is common on the web. As well as being time consuming for the author, it is also difficult to extract the references computationally, as they are formatted in many different ways.

Previously, we have described a system which works with identifiers such as ArXiv IDs (used to reference this article above!), PubMed IDs and DOIs. With this system, called kcite, the author supplies the ID, and kcite generates the reference list, leaving the ID underneath, where it is easy to extract computationally. The data used to generate the reference comes from specialised bibliographic servers.
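To make the idea of working from identifiers concrete, here is a minimal sketch of how a tool might decide which kind of identifier an author has supplied before asking the matching bibliographic server. It is an illustration only: the patterns and the function name `classify_identifier` are our assumptions for this example and are not taken from kcite.

```python
import re

# Hypothetical, simplified patterns; kcite's real rules are not described here.
_PATTERNS = [
    ("doi", re.compile(r"^10\.\d{4,9}/\S+$")),          # e.g. 10.1000/182
    ("arxiv", re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")),  # e.g. 1304.7151
    ("pubmed", re.compile(r"^\d{1,8}$")),               # a plain run of digits
]


def classify_identifier(identifier: str) -> str:
    """Return 'doi', 'arxiv', 'pubmed' or 'unknown' for a supplied ID."""
    identifier = identifier.strip()
    for kind, pattern in _PATTERNS:
        if pattern.match(identifier):
            return kind
    return "unknown"
```

Once the kind is known, the ID can be sent to the appropriate server and also left in the page for later extraction.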

In this paper, we describe two new systems. The first, called Greycite, provides similar bibliographic data for any URL; the data is extracted from the page at that URL, using a wide variety of markup and some ad-hoc tricks, which the paper describes. As a result it works on many web pages (we predict about 1% of the total web, or a much higher percentage of “interesting” websites). Our second system, kblog-metadata, provides a flexible system for generating this data. Finally, we discuss ways in which the same metadata can be used for digital preservation, by helping to track articles as and when they move across the web.
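As a rough illustration of extracting bibliographic data from a page's markup, the sketch below reads HTML `meta` tags and picks the first available source for each field. The tag names and their order in `FIELD_SOURCES` are our assumptions, chosen from commonly seen conventions; they are not Greycite's actual rules, and the ad-hoc tricks mentioned above are not covered.

```python
from html.parser import HTMLParser

# Assumed lookup order per field; purely illustrative, not Greycite's own.
FIELD_SOURCES = {
    "title": ["citation_title", "dc.title", "og:title"],
    "author": ["citation_author", "dc.creator", "author"],
    "date": ["citation_publication_date", "dc.date", "article:published_time"],
}


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> and <meta property=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if key and content:
            # Keep the first value seen for each key.
            self.meta.setdefault(key.lower(), content.strip())


def extract_bibliographic_data(html: str) -> dict:
    """Return whichever of title, author and date can be found."""
    collector = MetaCollector()
    collector.feed(html)
    record = {}
    for field, sources in FIELD_SOURCES.items():
        for source in sources:
            if source in collector.meta:
                record[field] = collector.meta[source]
                break
    return record
```

A page that carries none of these tags simply yields an empty record, which is one reason only a fraction of the web can be handled this way.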

This paper was first written for the Sepublica 2013 workshop.
