Plain English Summary
Academic literature makes heavy of references; effectively links to other, previous work that supports, or contradicts the current work. This referencing is still largely textual, rather than using a hyperlink as is common on the web. As well as being time consuming for the author, it also difficult to extract the references computationally, as the references are formatted in many different ways.
Previously, we have described a system which works with identifiers such as ArXiv IDs (used to reference this article above!), PubMed IDs and DOIs. With this system, called kcite, the author supplies the ID, and kcite generates the reference list, leaving the ID underneath which is easy to extract computationally. The data used to generate the reference comes from specialised bibliographic servers.
In this paper, we describe two new systems. The first, called Greycite, provides similiar bibliographic data for any URL; it is extracted from the URL itself, using a wide variety of markup and some ad-hoc tricks, which the paper describes. As a result it works on many web pages (we predict about 1% of the total web, or a much higher percentage of "interesting" websites). Our second system, kblog-metadata, provides a flexible system for generating this data. Finally, we discuss ways in which the same metadata can be used for digitial preservation, by helping to track articles as and when they move across the web.
This paper was first written for the Sepublica 2013 workshop.