In this article, we describe the rationale behind our new service, Greycite, which we have developed to enable more formal citation of URLs in general, and specifically to back up the kcite citation engine.
Phillip Lord and Lindsay Marshall
School of Computing Science
As has been recently announced (http://www.russet.org.uk/blog/2012/05/kcite-greycite-and-kblog-metadata/), the kcite citation engine (http://www.russet.org.uk/blog/2011/12/kcite-the-next-generation/) now supports URLs directly, as can be seen in this sentence. While it can do this trivially, by simply putting a URL in the reference, we wanted something better: where possible, we wanted URLs to be referenced in a similar manner to arXiv (http://arxiv.org/) or PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) IDs, with full bibliographic metadata.
To achieve this, we have created the Greycite service, which captures metadata from a URL and then presents this back to kcite. In this short article, we describe the rationale behind the creation of this service.
The kcite citation engine allows WordPress users to reference an article through the use of a shortcode, of the form [cite]10.1371/journal.pone.0012258[/cite] which is rendered as (10.1371/journal.pone.0012258). The rendering uses metadata from a third party service, in this case provided by CrossRef (http://www.crossref.org/CrossTech/2011/04/content_negotiation_for_crossr.html), to generate the bibliography reference. Other identifiers are handled similarly, using other services.
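The shortcode handling described above can be illustrated with a minimal Python sketch. This is not kcite's actual implementation (kcite is a WordPress plugin); the function names, the regular expressions and the two-way DOI/URL classification are assumptions made purely for illustration.

```python
import re

# Matches [cite]...[/cite]; the identifier is captured without surrounding whitespace.
CITE_RE = re.compile(r"\[cite\]\s*(.+?)\s*\[/cite\]")


def classify(identifier):
    """Guess what kind of identifier a shortcode holds (illustrative heuristics only)."""
    if re.match(r"^https?://", identifier):
        return "url"   # would be resolved through a service such as Greycite
    if re.match(r"^10\.\d{4,9}/\S+$", identifier):
        return "doi"   # would be resolved through a service such as CrossRef
    return "unknown"


def find_citations(text):
    """Return (identifier, kind) pairs for every shortcode, in document order."""
    return [(m.group(1), classify(m.group(1))) for m in CITE_RE.finditer(text)]


def render_inline(text):
    """Replace each shortcode with the bare identifier in parentheses."""
    return CITE_RE.sub(lambda m: "(" + m.group(1) + ")", text)
```

A real renderer would go on to fetch metadata for each identifier from the matching service and build the bibliography from it; the sketch stops at the point where the identifier has been recognised.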
We wished to achieve something similar with an arbitrary URL. However, there is no centralised service where authors are required to lodge their metadata for any URL. We considered the possibility of providing such a service, where content authors could lodge their metadata (author, date, title and so on) about a URL. However, it seems unlikely that this would succeed, for two critical reasons. First, and most importantly, few authors would be likely to go to the extra effort: why would they bother, and if they did, why use our service rather than some other? Second, it would require an authentication step to ensure that metadata genuinely came from the person controlling the URL. We also considered the possibility of deliberately allowing third-party addition of metadata, but this raises the question of how to resolve conflicts in the metadata.
As a result, in practice, we feel that the only sensible course of action is to extract the metadata directly from the resolvable contents of the URL, as this ensures that we have taken metadata from what is (quite literally) the authoritative source. The significant drawback to this is that if the author does not provide this metadata, no one else is able to do so. In a sense, though, this is correct: if authors provide no metadata, then this is how their works should appear, as this is their choice. Moreover, as we have argued previously (http://www.russet.org.uk/blog/2012/04/three-steps-to-heaven/), if authors or their readers are worried by this, it may provide the motivation to add bibliographic metadata to their work, which is a benefit to everyone.
The immediate problem here is the lack of a single standard for bibliographic metadata on the web; however, there are a number of systems currently in use, namely COinS (http://ocoins.info/), Open Graph Protocol (http://ogp.me/) and Google Scholar tags (http://scholar.google.com/intl/en/scholar/inclusion.html). We have also considered a fourth option, RSS/Atom feeds, which, perhaps ironically, are structured enough to provide bibliographic metadata. At the moment, we do not have accurate statistics on the prevalence of each of these types of metadata. We could, of course, crawl the web to gather these statistics, but we are interested not in the web in general but in its academic sector, which is hard to determine a priori. However, our initial experiences suggest the following:
- COinS metadata is not widespread. We suspect that this follows from our experience that the specification is hard to find and incomprehensible when you do (http://www.russet.org.uk/blog/2012/03/kblog-metadata/).
- Google Scholar tags are much more widespread, although there is some variation (the use of name vs property for instance, or multiple authors represented in a single tag vs each author on their own).
- OGP appears reasonably widespread, including in articles which are not academic (or not solely so) but likely to be cited, such as BBC News, or anything hosted on WordPress.com.
- RSS/Atom feeds worked fairly well, but normally only contain metadata for recent articles; we tried to track RSS feeds, but this resulted in thousands of URLs very quickly.
Over time, we should be able to get clearer statistics as to real usage of these systems, based on the data in Greycite.
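To make the variation between these systems concrete, the following Python sketch pulls a title, authors and date out of Google Scholar and OGP meta tags using only the standard library. It is an illustration under stated assumptions, not Greycite's actual extraction code: the order in which tags are preferred, the function names, and the use of ";" to split several authors held in one tag are all choices made for this example.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta> tags keyed by either their name or their property attribute."""

    def __init__(self):
        super().__init__()
        self.tags = {}  # key -> list of content values, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Pages in the wild use name= and property= interchangeably, so accept both.
        key = attrs.get("name") or attrs.get("property")
        content = attrs.get("content")
        if key and content:
            self.tags.setdefault(key.lower(), []).append(content.strip())


def first(tags, *keys):
    """Return the first value found for any of the keys, in preference order."""
    for key in keys:
        if tags.get(key):
            return tags[key][0]
    return None


def extract_metadata(html):
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    # Authors may appear one per tag, or several in one tag (assumed ";"-separated).
    authors = []
    for value in tags.get("citation_author", []) + tags.get("citation_authors", []):
        authors.extend(a.strip() for a in value.split(";") if a.strip())
    return {
        # Prefer the Google Scholar tag, fall back to OGP (an assumed ordering).
        "title": first(tags, "citation_title", "og:title"),
        "authors": authors,
        "date": first(tags, "citation_date", "citation_publication_date",
                      "article:published_time"),
        "site": first(tags, "og:site_name"),
    }
```

Accepting both attribute spellings and both author layouts in one pass keeps the variation contained in a single place; COinS and RSS/Atom would need separate parsers and are left out of this sketch.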
Greycite is currently packaged as a service, rather than embedded within WordPress, which would also have been possible. The reasons for this were several. First, gathering metadata involves a reasonable amount of parsing, and putting this all into a WordPress plugin seemed unnecessarily heavy. This is particularly so, given that server load is already an issue with kcite, and adding further to this did not seem sensible.
Second, we wanted to maintain a database of the metadata gathered from around the web. This allows us to deal with problems of resources changing or disappearing. We want the user to be able to cite a URL and for this citation not to break if the URL disappears and starts returning a 404. We also wish to be able to cite a URL at a specific date, and have the citation show the metadata for that time. Placing this load on the individual WordPress database backend does not really make sense. Moreover, with Greycite, there is a reasonable likelihood that others will have cited a particular article, thereby sharing the load.
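The idea of citing a URL "at a specific date" can be sketched as a store that keeps every metadata snapshot it has seen and answers lookups with the most recent snapshot on or before the requested date. The class below is a hypothetical in-memory illustration in Python; Greycite's real database schema is not described here, and all names are assumptions.

```python
import bisect
from datetime import date


class MetadataHistory:
    """Keep every metadata snapshot seen for a URL, ordered by capture date."""

    def __init__(self):
        self._snapshots = {}  # url -> list of (capture_date, metadata), sorted by date

    def record(self, url, captured, metadata):
        """Store a snapshot; earlier snapshots are never overwritten or deleted."""
        snapshots = self._snapshots.setdefault(url, [])
        dates = [d for d, _ in snapshots]
        snapshots.insert(bisect.bisect_right(dates, captured), (captured, metadata))

    def lookup(self, url, when=None):
        """Return the metadata as it stood on `when` (latest if omitted), or None."""
        snapshots = self._snapshots.get(url, [])
        if when is None:
            return snapshots[-1][1] if snapshots else None
        dates = [d for d, _ in snapshots]
        index = bisect.bisect_right(dates, when)
        return snapshots[index - 1][1] if index else None


history = MetadataHistory()
url = "http://example.org/post"  # placeholder URL
history.record(url, date(2012, 3, 1), {"title": "Draft title"})
history.record(url, date(2012, 5, 1), {"title": "Final title"})
print(history.lookup(url, date(2012, 4, 1)))  # snapshot captured on 2012-03-01
```

Because nothing is removed when a later fetch fails, a citation can still be rendered from the last good snapshot after the original URL has gone.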
Third, Greycite is also useful outside of WordPress. For instance, Greycite also provides BibTeX, so it can be used with a bibliographic manager. This is very useful at authoring time, as we can use this metadata to search over a list of relevant URLs, and then to select between them.
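As a rough illustration of what a BibTeX rendering of URL metadata might look like, the Python sketch below turns a small metadata dictionary into an `@misc` entry. The entry type, the field names, the dictionary layout and the function name are assumptions for this example and do not describe the output Greycite actually produces; special characters are not escaped.

```python
def to_bibtex(key, metadata):
    """Render a metadata dict as a BibTeX @misc entry (no escaping of special characters)."""
    fields = []
    if metadata.get("authors"):
        # BibTeX separates multiple authors with the word "and".
        fields.append(("author", " and ".join(metadata["authors"])))
    for name in ("title", "year", "url"):
        if metadata.get(name):
            fields.append((name, str(metadata[name])))
    body = ",\n".join("  {} = {{{}}}".format(n, v) for n, v in fields)
    return "@misc{{{},\n{}\n}}".format(key, body)


print(to_bibtex("example2012", {
    "authors": ["A. Author", "B. Author"],  # placeholder values
    "title": "An Example Post",
    "year": 2012,
    "url": "http://example.org/post",
}))
```

Fields that are missing from the metadata are simply omitted, which matches the position taken earlier that a work should appear with whatever metadata its author chose to provide.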
Finally, we wanted to be able to add additional functionality, which may require upgrading the database periodically; this is harder to do within a plugin. For example, we have already added links through to the UK Web Archive (http://www.webarchive.org.uk/ukwa/), for those resources which are archived. We will add the Internet Archive (http://www.archive.org/) and WebCite (http://www.webcitation.org/) in time. This means that not only should citations remain displayed correctly if resources disappear or change, it should still be possible to get to their contents in many cases.
The existence of Greycite allows us to turn a blog post into a linked-data academic article. The reader of an article sees, as well as the content directly generated by the author, data gathered from all the outgoing links. The reference list therefore ceases to be a mechanism for finding secondary sources, and becomes a usability tool; readers can understand what sources are being relied on, without having to remember URLs or click through to them. Likewise, authors can use the linked data environment outside of a web browser to help with authoring. Metadata that is useful to readers is, unsurprisingly, also useful to authors (who tend to be the first people to read an article anyway!).
With Greycite, we were interested in adding more formal citation to the web in general, and more specifically supporting kcite (http://www.russet.org.uk/blog/2011/12/kcite-the-next-generation/). We believe that we have achieved this in part with a relatively light-weight service. Greycite is useful for article display, and for authoring.
In addition, we start to address the issues of link breakage, by building on the back of existing archiving services. Articles will still be able to display article metadata if a cited article disappears. Future versions of kcite will also redirect links to the nearest web archive when this happens. We have done this without recourse to secondary identifiers such as a DOI or PURL, which we believe represents a better user experience. Building on the back of existing web archives also addresses a critical scalability issue; the Greycite database needs only to store bibliographic metadata, which is likely to remain tractable. From a legal perspective, we also side-step issues of copyright, as gathering metadata alone is likely to be covered by fair dealing clauses.
By depending only on metadata present in the resource at the URL itself, we can guarantee that metadata is authoritative (not, of course, that it is “correct”, as in reflects the authors' intentions, but it does match what they said). It also means that we do not control the metadata; it has not been entered into Greycite; it is out there, available on the web, free for anyone to gather. We wish to be part of the semantic web, not a walled garden within it.
Finally, we have started to build a linked data environment for academic publishing. Bibliographic metadata is, of course, only the start. It is not a suitable way to present all kinds of information; for instance, Chemicalize (http://www.chemicalize.org/) provides a nice plugin which transforms chemical names into something richer. But by harnessing the power of the web, and building on existing resources, we should be able to build a rich and full-featured environment for presenting scientific knowledge.