Our original intention with Greycite (http://greycite.knowledgeblog.org) was to build a tool which could provide bibliographic metadata for any URL, to support our own kcite referencing tool (http://knowledgeblog.org/kcite-plugin). While it still fulfils this function, it also turns out to be a useful, general-purpose tool for investigating the metadata in various web pages, and this sometimes reveals interesting results. We discovered a nice example of this recently while adding RIS support.

The paper in question comes from EMBO reports (10.1038/embor.2013.11). At first sight, the RIS for this page taken from Greycite looks reasonable.

UR - http://www.nature.com/embor/journal/v14/n3/full/embor201311a.html
Y2 - 2013-04-18 12:54:54
TI - The economics of creative research
JO - EMBO reports
PY - 2012
DA - 2012-02-08
DO - 10.1038/embor.2013.11
AU - Cou|[eacute]|e, Ivan
ER -

However, something strange is going on with the author; poor old Ivan Couée’s name has been rather broken. So, why is this happening? Looking at the underlying HTML the first thing that hits you is a lot of space; there are over 50 empty lines at the beginning of the file; still, this is only a problem for people strange enough to be reading the HTML.

However, eventually we get to the metadata: first Dublin Core, and then what we describe as Google Scholar metadata (since this is where we found it). And there we have it; Greycite is reporting the metadata as it is. The author’s name is represented with |[eacute]| as a letter.

<meta name="dc.language" content="en" />
<meta name="dc.rights" content="&#169; 2012 Nature Publishing Group" />
<meta name="dc.title" content="The economics of creative research" />
<meta name="dc.creator" content="Ivan Cou|[eacute]|e" />
<meta name="dc.identifier" content="doi:10.1038/embor.2013.11" />
<meta name="dc.date" content="2012-02-08" />

<meta name="citation_publisher" content="Nature Publishing Group" />
<meta name="citation_authors" content="Ivan Cou|[eacute]|e" />
<meta name="citation_title" content="The economics of creative research" />
<meta name="citation_date" content="2012-02-08" />
<meta name="citation_volume" content="14" />
<meta name="citation_issue" content="3" />
<meta name="citation_firstpage" content="222" />
<meta name="citation_doi" content="doi:10.1038/embor.2013.11" />
<meta name="citation_journal_title" content="EMBO reports" />
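
Spotting this kind of problem can be automated. The following is a minimal sketch, not Greycite's actual implementation: it collects the name and content of each meta tag using Python's standard HTML parser and flags any value containing something shaped like |[name]|. The names MetaCollector, suspicious_metadata and PSEUDO_ENTITY are hypothetical, and the regular expression is only a guess at the general shape of the broken markup.

```python
import re
from html.parser import HTMLParser

# Assumed shape of the malformed pseudo-entity, e.g. |[eacute]|
PSEUDO_ENTITY = re.compile(r"\|\[(\w+)\]\|")


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            self.pairs.append((name, content))


def suspicious_metadata(html_text):
    """Return the meta pairs whose content contains a pseudo-entity."""
    collector = MetaCollector()
    collector.feed(html_text)
    return [(n, c) for n, c in collector.pairs if PSEUDO_ENTITY.search(c)]


sample = """
<meta name="dc.title" content="The economics of creative research" />
<meta name="dc.creator" content="Ivan Cou|[eacute]|e" />
<meta name="citation_authors" content="Ivan Cou|[eacute]|e" />
"""

for name, content in suspicious_metadata(sample):
    print(name, "->", content)
```

Run against the three sample tags, this reports the two author fields and leaves the title alone.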

As far as we can tell this is an error; HTML entities (such as &eacute;) or extended character sets are entirely valid, but |[eacute]| does not appear to be a valid representation. Interestingly enough, there also appears to be some slightly buggy code behind the PRISM metadata, which we are sure should not look like this.

<meta name="prism.issn" content="ERROR! NO ISSN" />
<meta name="prism.eIssn" content="ERROR! NO EISSN" />
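
Returning to the author name: if we assume that |[eacute]| was meant to be the HTML entity &eacute;, a consumer of this metadata could repair it on the way in. The sketch below does that under exactly this assumption; the function name repair_pseudo_entities is hypothetical, and we do not know whether the publisher's system intended this mapping. Names that are not known HTML entities are left untouched rather than guessed at.

```python
import re
from html.entities import name2codepoint

# Assumed shape of the malformed pseudo-entity, e.g. |[eacute]|
PSEUDO_ENTITY = re.compile(r"\|\[(\w+)\]\|")


def _replace(match):
    """Map |[name]| to the character for the HTML entity &name;, if one exists."""
    name = match.group(1)
    if name in name2codepoint:
        return chr(name2codepoint[name])
    return match.group(0)  # unknown name: leave the text as it was


def repair_pseudo_entities(text):
    """Replace every recognisable pseudo-entity in text with its character."""
    return PSEUDO_ENTITY.sub(_replace, text)


print(repair_pseudo_entities("Cou|[eacute]|e, Ivan"))
```

Whether a tool like Greycite should silently repair metadata or report it as found is a design choice; reporting it as found is what exposed the problem here.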

My guess is that the problem is at the point of website generation rather than deeper in the bowels of the publishing system; grabbing the metadata for this article from CrossRef by content negotiation (http://www.crossref.org/CrossTech/2011/04/content_negotiation_for_crossr.html) shows the correct name.

 "URL":"http://dx.doi.org/10.1038/embor.2013.11",
 "title":"The economics of creative research",
 "container-title":"EMBO reports",
 "publisher":"Nature Publishing Group",

We empathise with the publishers here. Getting character sets correct is the bane of everyone's life; given the state of computing when multi-lingual character sets appeared, we guess it is not an example of premature optimisation, but an example of an optimisation you wish had never happened. The world would be an easier place if everything had used Unicode from the start.

The current metadata for this paper can be seen on Greycite or in detail. Hopefully, this will be updated in time!

This post was written by Phillip Lord and Lindsay Marshall


Spelling mistake corrected, bibliography added. Thanks to Christian Perfect for bug report.

