With the release of Kcite 1.5 (http://wordpress.org/extend/plugins/kcite/), we now support multiple forms of citation (http://www.russet.org.uk/blog/2012/02/kcite-spreads-its-wings/). There have, however, also been some changes at the implementation layer, which I will describe in this article. I have previously written critically about DOIs and their problems (http://www.russet.org.uk/blog/2011/02/the-problem-with-dois/); one of my criticisms was the inability to access metadata about a DOI in a standardised way. Here, I will consider the addition of content negotiation and whether it improves the situation, and from this draw a number of conclusions about the DOI system.


Background

DOIs offer a single point of entry for referring to a paper. A DOI such as “10.1371/journal.pone.0012258” refers to one of my papers (10.1371/journal.pone.0012258). It can be transformed into a URL by the addition of http://dx.doi.org to the front, giving http://dx.doi.org/10.1371/journal.pone.0012258. The DOI proxy service takes this URL and redirects the user to the “real” URL which contains the content in question. DOIs themselves are assigned by a registration agency. The majority of DOIs that refer to academic papers have been assigned by CrossRef (http://www.crossref.org). However, CrossRef is not the only registration agency; DataCite provides a similar service for, intuitively enough, data sets (http://www.datacite.org). The actual content, the papers or the data sets, is stored elsewhere: both DataCite and CrossRef simply forward the user of these URLs to the publisher or data repository.
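The prefixing described above can be made concrete with a trivial sketch. This is purely illustrative and is not Kcite's own code:

```php
<?php
// Turn a bare DOI into a resolvable URL via the DOI proxy.
// Illustrative sketch only; Kcite builds its requests differently.
function doi_to_url($doi) {
    return "http://dx.doi.org/" . $doi;
}

echo doi_to_url("10.1371/journal.pone.0012258");
// http://dx.doi.org/10.1371/journal.pone.0012258
```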

Our previous article (http://www.russet.org.uk/blog/2011/02/the-problem-with-dois/) discussed a number of problems, including the difficulty of accessing metadata about a given DOI. As well as being an issue of general concern, this is also a specific problem for the development of Kcite (http://knowledgeblog.org/kcite-plugin). This WordPress plugin generates reference lists from identifiers, including DOIs; it is active on this article. To do this, it gathers metadata about each reference from a variety of different metadata servers.

CrossRef have recently announced the addition of Content Negotiation to their list of services (http://www.crossref.org/CrossTech/2011/04/content_negotiation_for_crossr.html). This provides a mechanism to access metadata about a DOI, at least for those DOIs where CrossRef is the registration agency. This mechanism became more attractive with the announcement that it is now also supported by DataCite (http://www.crossref.org/CrossTech/2011/10/datacite_supporting_content_ne.html). Finally, partly following a request of mine, CrossRef also now releases its metadata in JSON (http://www.crossref.org/CrossTech/2011/11/turning_dois_into_formatted_ci.html) ready for Citeproc-js (http://bitbucket.org/fbennett/citeproc-js). This format is used internally by Kcite, which previously had to parse CrossRef's unixref XML; retrieving the JSON directly had obvious advantages.
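To see why JSON delivery is attractive, consider that once the citeproc+json response arrives, PHP can turn it into a usable structure with a single json_decode call, with no XML parsing layer at all. The sample record below is hand-written for illustration using standard CSL-JSON field names; it is not actual CrossRef output:

```php
<?php
// Decode a (hand-written) citeproc+json record. Real records come
// from the content-negotiated call to http://dx.doi.org/.
$json = '{"title": "Example paper",
          "container-title": "Example Journal",
          "author": [{"family": "Lord", "given": "Phillip"}],
          "issued": {"date-parts": [[2010]]}}';

$record = json_decode($json, true); // true => associative arrays

echo $record["title"];               // Example paper
echo $record["author"][0]["family"]; // Lord
```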


Accessing the metadata

Here, I describe the implementation of content negotiation for Kcite. The complete source of Kcite is available from our Mercurial repository, although not all of the changes described here have been checked in (http://code.google.com/p/knowledgeblog/source/browse/trunk/plugins/kcite/kcite.php).

My original implementation for gathering CrossRef metadata used the file_get_contents function in PHP. Despite its name, this also works with URLs, providing a simple and straightforward implementation path.

   $url = "http://www.crossref.org/openurl/?noredirect=true&pid="
            .$crossref."&format=unixref&id=doi:".$cite->identifier;
   $xml = file_get_contents($url, 0);

There are a number of issues with this implementation, not least the lack of any significant error handling. Moreover, file_get_contents is not very adaptable; it performs a simple HTTP GET request. So, I decided to use PHP libcurl (http://php.net/manual/en/book.curl.php). The translation from file_get_contents is reasonably straightforward.

$url = "http://dx.doi.org/{$cite->identifier}";
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, $url );
$response = curl_exec ($ch);

curl_close($ch);

Initially, this failed to work. I normally build and test my code on Ubuntu and, unfortunately, PHP libcurl is not installed with WordPress, PHP or Apache. A quick search and an aptitude install solved this problem. Then strange things happened: it turns out that the default behaviour of libcurl is to embed the retrieved content into the output, that is, into the outgoing web page. So, I needed to add an option to the libcurl calls.

      $url = "http://dx.doi.org/{$cite->identifier}";

      // get the metadata with negotiation
      $ch = curl_init();
      curl_setopt ($ch, CURLOPT_URL, $url );
      curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true );

The code was still not working; nothing appeared to be returned. Debugging a black box is never easy, so I needed more information before going further. I added code to dump curl's verbose output to a log file.

      // debug
      $fh = fopen('/tmp/curl.log', 'w');
      curl_setopt($ch, CURLOPT_STDERR, $fh );
      curl_setopt($ch, CURLOPT_VERBOSE, true );

A quick perusal of the HTTP requests shows the problem. By default, a call to http://dx.doi.org returns a 303 See Other response, which libcurl does not follow. Another curl option is required to fix this.

      $ch = curl_init();
      curl_setopt ($ch, CURLOPT_URL, $url );
      curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true );
      curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, true );

Finally, we need to use content negotiation. The PHP libcurl library does not support this directly, so we need to set the HTTP headers ourselves.

        curl_setopt ($ch, CURLOPT_HTTPHEADER,
                   array (
                          "Accept: application/citeproc+json;q=1.0"
                           ));

And I now have a solution. Kcite needed reworking, but mostly this involved removing the XML parsing layer, and all was looking good. Except that, while looking through my regression tests, I found that DataCite support had been broken. I was, at that time, accessing DataCite using a different interface.

    $url = "http://data.datacite.org/application/x-datacite+xml/"
         . $cite->identifier;

The difficulty was that previously I was accessing CrossRef directly to resolve DOIs. Asking CrossRef about a DataCite DOI resulted in an unknown DOI response; Kcite responded to this by trying DataCite next. Unfortunately, there is no way that I know of to distinguish a DataCite from a CrossRef DOI syntactically. With the new method, the content-negotiated call to http://dx.doi.org succeeds, but DataCite does not know the requested citeproc+json MIME type, so it returns HTML. So, again, we need to extend our DOI resolution, checking the returned content type.

      $response = curl_exec ($ch);
      $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
      $contenttype = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

      // it's probably not a DOI at all. Need to check some more here.
      if( $status == 404 ){
          curl_close($ch);
          return $cite;
      }

      curl_close ($ch);


      if( $contenttype == "application/citeproc+json" ){
          // crossref DOI
          //kcite specific logic follows.
      }

I should now be able to achieve a single call to resolve a DOI by modifying the headers once again. Here we request citeproc+json if possible or x-datacite+xml if it is not.

    curl_setopt ($ch, CURLOPT_HTTPHEADER,
               array (
                     "Accept: application/citeproc+json;q=1.0, application/x-datacite+xml;q=0.9"
                          ));

Unfortunately, this also fails. While CrossRef returns citeproc+json, DataCite still returns HTML. Discussions with Karl Ward from CrossRef cleared up the problem: the content negotiation implementations of both CrossRef and DataCite were imperfect. DataCite's implementation always tried to return the first content type listed, and it does not know about citeproc+json, hence the HTML. Meanwhile, CrossRef returned only the type with the highest q value, rather than all acceptable types. Ironically, the problem was solved by doing this:


    curl_setopt ($ch, CURLOPT_HTTPHEADER,
               array (
                     "Accept: application/x-datacite+xml;q=0.9, application/citeproc+json;q=1.0"
                          ));

CrossRef now returns JSON (because it has the highest q value), while DataCite returns XML (because it comes first). The final, complete and functioning method now appears as follows:

      $url = "http://dx.doi.org/{$cite->identifier}";

      // get the metadata with negotiation
      $ch = curl_init();
      curl_setopt ($ch, CURLOPT_URL, $url );
      curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true );
      curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, true );


      // The order here is important, as both DataCite's and CrossRef's
      // content negotiation is imperfect. CrossRef returns only the best
      // match, but does check all content types, so should return JSON.
      // DataCite only considers the first content type, which here is XML.
      curl_setopt ($ch, CURLOPT_HTTPHEADER,
                   array (
                          "Accept: application/x-datacite+xml;q=0.9, application/citeproc+json;q=1.0"
                          ));

      // debug
      //$fh = fopen('/tmp/curl.log', 'w');
      //curl_setopt($ch, CURLOPT_STDERR, $fh );
      //curl_setopt($ch, CURLOPT_VERBOSE, true );

      $response = curl_exec ($ch);
      $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
      $contenttype = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

      // it's probably not a DOI at all. Need to check some more here.
      if( $status == 404 ){
          curl_close($ch);
          return $cite;
      }

      curl_close ($ch);


      if( $contenttype == "application/citeproc+json" ){
           // crossref DOI
           // kcite application logic
      }

      if( $contenttype == "application/x-datacite+xml" ){
          //datacite DOI
          // kcite application logic
      }
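As an aside, the q-value handling that caused all this trouble can be made concrete. A correct negotiator parses every type in the Accept header and picks the supported type with the highest q value, regardless of the order in which the types are listed. The sketch below is my own illustration of that logic, not CrossRef's or DataCite's code, and it ignores wildcards and other Accept parameters:

```php
<?php
// Pick the best type a server supports from an Accept header,
// honouring q values rather than listing order. Simplified sketch:
// no wildcard handling, no other media-type parameters.
function best_type($accept, $supported) {
    $best = null;
    $bestq = -1.0;
    foreach (explode(",", $accept) as $part) {
        $pieces = explode(";", trim($part));
        $type = trim($pieces[0]);
        $q = 1.0; // default quality per HTTP
        foreach (array_slice($pieces, 1) as $param) {
            if (preg_match('/^\s*q=([0-9.]+)/', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        if (in_array($type, $supported) && $q > $bestq) {
            $best = $type;
            $bestq = $q;
        }
    }
    return $best;
}

$accept = "application/x-datacite+xml;q=0.9, application/citeproc+json;q=1.0";
echo best_type($accept, array("application/citeproc+json"));
// application/citeproc+json -- header order does not matter
```

With this logic, a server knowing both types would return citeproc+json whichever way round the header is written, which is exactly the behaviour the header ordering trick above works around.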

Using the metadata

Although we now have a single point of entry for accessing the metadata about a DOI, the metadata itself is still not standardised. CrossRef has returned metadata in (nearly!) the form that we are going to use, but DataCite has returned XML conforming to its own schema, which we still need to parse. Fortunately, this is relatively easy in PHP, using the SimpleXMLElement class and XPath. The full code is available, so here I show just the sections involving XPath; for example, to retrieve the publisher and the title.

    $journalN = $article->xpath( "//publisher");
    $titleN = $article->xpath( "//title" );

Initial testing suggested that this worked, sometimes. Unfortunately, I discovered that it failed for some DataCite DOIs. More careful debugging showed the problem: DataCite returns more than one form of XML. At first sight, the XPath should still work, since the relevant elements are in the same place. However, the default namespaces have changed; DataCite kernel 2.0 XML has no default namespace, while 2.1 and 2.2 do, which breaks the XPath. The situation is resolved by searching for namespaces, then parameterising the XPath queries.

       $namespaceN = $article->getNamespaces();
       $kn = "";
       if( isset( $namespaceN[ "" ] ) ){
           // kernel 2.1 and 2.2 declare a default namespace
           $kn = "kn:";
           $article->registerXPathNamespace( "kn", $namespaceN[ "" ] );
       }
       else {
           // kernel 2.0 -- no default namespace, so do nothing
       }

      $journalN = $article->xpath( "//${kn}publisher");
      $titleN = $article->xpath( "//${kn}title" );
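The namespace problem can be reproduced in isolation. The fragment below is hand-written and abbreviated, not a real DataCite record, but it shows why the prefix registration is needed once a default namespace is declared:

```php
<?php
// Self-contained illustration of the default-namespace problem.
// A kernel-2.2 style fragment declares a default namespace, so the
// XPath query needs a registered prefix to match anything.
$xml22 = '<resource xmlns="http://datacite.org/schema/kernel-2.2">'
       . '<publisher>Example Publisher</publisher></resource>';

$article = new SimpleXMLElement($xml22);
$ns = $article->getNamespaces();
$kn = "";
if (isset($ns[""])) {
    // A default namespace exists: register it under a prefix,
    // since XPath has no way to address the default namespace.
    $kn = "kn:";
    $article->registerXPathNamespace("kn", $ns[""]);
}
$publisherN = $article->xpath("//{$kn}publisher");
echo (string) $publisherN[0]; // Example Publisher
```

Without the registerXPathNamespace call, the query "//publisher" returns nothing against this document, which is precisely the failure mode described above.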

I now have a system capable of gathering bibliographic metadata from a DOI.


Discussion

In our original post (http://www.russet.org.uk/blog/2011/02/the-problem-with-dois/), we compared the situation with bioinformatics identifiers to DOIs. A UniProt ID such as http://www.uniprot.org/uniprot/P08100, for instance, resolves to a protein record, while http://www.uniprot.org/uniprot/P08100.fasta returns the equivalent protein sequence. Content negotiation offers the possibility of achieving something similar with DOIs, at least with respect to the metadata if not the actual content.

My experience in practice shows that content negotiation does work and is useful; however, I am unconvinced that it is an ideal solution. From a theoretical standpoint, the use of Accept headers seems elegant. In practice, it is painful because it is not commonly used. Plain PHP does not support it, while even PHP with libcurl requires me to set the headers by hand, as there are no standard methods for doing so. The same is true of curl on the command line, as shown in this example from CrossRef (http://www.crossref.org/CrossTech/2011/04/content_negotiation_for_crossr.html) which retrieves RDF metadata.

curl -D - -L -H   "Accept: application/rdf+xml" "http://dx.doi.org/10.1126/science.1157784"

I would expect a similar experience in Perl, Python or Java, the tools of choice for a bioinformatician. I cannot email people a link to the metadata for a paper, and I have no idea how you could access the RDF from a desktop browser or a phone. From a personal perspective, I much prefer the approach offered by DataCite, which uses URLs of the form http://data.datacite.org/application/x-datacite+xml/10.5524/100005 (genomic data about Emperor Penguins (10.5524/100005)). Content negotiation is hard work because, although it is standard, being part of the HTTP specification, it is not common. The fact that neither DataCite nor CrossRef got their implementation right suggests to me that these are not my problems alone.

Of course, the DataCite approach is limited to DataCite DOIs, so http://data.datacite.org/application/x-datacite+xml/10.1371/journal.pone.0012258 returns a failure message. However, this mechanism, implemented at http://dx.doi.org, would add a valuable additional interface; it would actually be very easy to implement, with a simple call through to the content-negotiated stack. A form of the PHP described in this post would perform the task well.
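To sketch how such a path-based interface could sit on top of the existing content-negotiated stack: take the MIME type from the URL path and replay it as an Accept header. This is my own hypothetical design, not anything that dx.doi.org actually provides:

```php
<?php
// Hypothetical: split a DataCite-style path such as
// /application/x-datacite+xml/10.5524/100005 into the Accept
// header and proxy URL needed for a content-negotiated request.
function path_to_request($path) {
    $parts = explode("/", ltrim($path, "/"));
    // First two segments form the MIME type, e.g. application/x-datacite+xml
    $type = $parts[0] . "/" . $parts[1];
    // The rest is the DOI, which itself contains a slash
    $doi = implode("/", array_slice($parts, 2));
    return array("Accept: " . $type, "http://dx.doi.org/" . $doi);
}

list($header, $url) = path_to_request("/application/x-datacite+xml/10.5524/100005");
echo $header; // Accept: application/x-datacite+xml
echo $url;    // http://dx.doi.org/10.5524/100005
```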

My original criticisms of DOIs included the enormous variety of entities that DOIs actually resolve to: the article in HTML or PDF, an abstract and a picture, author biographies, or an image of a print-out of the front page (http://www.russet.org.uk/blog/2011/02/the-problem-with-dois/). Unfortunately, the experience is replicated at the metadata level. With two registration agencies, I have to deal with four different types of schema, although I am grateful to CrossRef for adding support for the one that I wanted. If I can manage to do an, admittedly, half-hearted job of integrating this data by black-box resolution of a set of DOIs, it would be nice if the International DOI Foundation could do the job for me. Failing this, a single point of entry to the documentation for the different registration agencies would help.

Finally, the fact that DOIs provide a single, unified identifier turns out to be a disadvantage at the metadata level. There is, in reality, no such thing as a DOI; there are multiple different types of DOI. Kcite supports two of them, CrossRef DOIs and DataCite DOIs; but there are eight registration agencies (http://www.doi.org/registration_agencies.html). It is, therefore, not possible to know beforehand what content types, if any, will be returned.

The more general problem is that, for a given DOI, there is to my knowledge no way of knowing which registration agency is responsible, at least not at the level of a http://dx.doi.org URI (at the Handle level there must be, or the system would not work). For the average user, therefore, there is no way of knowing who is responsible for a given DOI. Strictly, this is true of a URL also. But if http://www.uniprot.org/uniprot/OPSD_HUMAN fails to resolve as I think it should, there are a number of steps I can take: I can email webmaster@uniprot.org; I can browse from http://www.uniprot.org looking for a contact; I can type whois uniprot.org. For a DOI, I have none of these tools (or rather, everything points to the International DOI Foundation).

This problem was exemplified a few days after I completed the work on Kcite described here. I noticed that the PDB has DOIs for its records, which should have worked with Kcite; however, they were failing to resolve. Consider this (elided) output from curl.

> curl -D - "http://dx.doi.org/10.2210/pdb3cap/pdb"
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ca/pdb3cap.ent.gz


> curl -D - -L -H  "Accept: application/citeproc+json" "http://dx.doi.org/10.2210/pdb3cap/pdb"
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: http://data.crossref.org/10.2210%2Fpdb3cap%2Fpdb

HTTP/1.1 404 Not Found
Date: Mon, 27 Feb 2012 13:58:39 GMT

Unknown DOI

The DOI resolves, but the metadata does not. What was more confusing was this result, which shows that some PDB DOIs did resolve.

> curl -D - -L -H   "Accept: application/rdf+xml" "http://dx.doi.org/10.2210/rcsb_pdb/mom_2012_2"
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: http://data.crossref.org/10.2210%2Frcsb_pdb%2Fmom_2012_2

HTTP/1.1 200 OK
Date: Mon, 27 Feb 2012 14:01:18 GMT
Content-Type: application/rdf+xml

In this case, it is possible to guess who the registration agency was (CrossRef) from the location of the RDF metadata, but this is undocumented and may not work for all registration agencies. Too much guesswork, or too much specific knowledge of the DOI, is involved. Thankfully, in this case, Karl Ward of CrossRef fixed the problem rapidly, and I can now cite both the crystal structure of Opsin (10.2210/pdb3cap/pdb) and the Aminoglycoside Antibiotics (10.2210/rcsb_pdb/mom_2012_2).


Conclusions

DOIs are, and remain, problematic. The addition of content negotiation at first sight appears to be a considerable improvement, but its usage is more complex than it should be. I offer here three suggestions based on my experience:

  • An alternative based on simple HTTP GET URIs should be provided.
  • A standardised metadata schema should be defined for all DOIs, or at least a single point of entry to the documentation for all DOIs.
  • For a given DOI, there must be a standard mechanism to discover which registration agency is responsible. Without this, it is hard to discover which documentation and which schema applies.

Despite this, Kcite actively uses content negotiation; with it, I have reduced the number of HTTP requests needed to resolve the metadata for a DOI, and this is a good thing. It is good to see the system becoming more usable; I hope that this trend continues.


11 Comments

  1. Duncan Hull says:

    Hi Phil, interesting post. Have you sent a copy to Geoffrey Bilder? Love ’em or loathe ’em, I see there are now 52,883,768 DOIs in existence

  2. Phil Lord says:

    I am sure this will work its way back to CrossRef.

    Incidentally, your statistics are wrong. There are 52 million CrossRef DOIs, but this does not cover all DOIs because there are many registration agencies. How many are there in total? I don’t know, which is part of the problem. There is no DOI. There are multiple forms of DOI.

    I agree DOIs are a fact of life, and we are where we are. My question is how to make them better.

  3. Alf says:

    Content negotiation in PHP is straightforward: https://gist.github.com/2035480

    Content negotiation in cURL on the command line is straightforward:

    curl -L -H "Accept: application/citeproc+json" "http://dx.doi.org/10.1126/science.1157784"

    CrossRef’s implementation of content negotiation is perfect: it should return the format with the highest q value, not all of them.

    The PDB example that failed earlier now works fine, so must have been temporarily broken or not registered yet:

    curl -L -H "Accept: application/citeproc+json" "http://dx.doi.org/10.2210/pdb3cap/pdb"

    All the problems described here seem due to DataCite not implementing HTTP properly, and failing to return a standard response format.

  4. Phil Lord says:

    One of the perils of writing a post with a theme of “perhaps this is a bit more complex than it needs be”, is that someone will always pop up and say “well, it seems simple enough to me”. Perhaps, you are just cleverer than I?

    My experience, however, is that Content Negotiation is NOT straight-forward, even if I achieved it. The post describes clearly how you do this in PHP, so I already know. Likewise, from the command line -H “Accept: application/citeproc+json”? I have to look this up every time; not straight-forward. The PDB example does now work fine, as my post says quite clearly. It was fixed after I reported the issue. As I said in the post, kudos to Karl Ward for his rapid fix.

    And the fact that different kinds of DOI point to different things, and are capable of returning different kinds of metadata with, as far as I can see, no way of distinguishing between these DOIs is, I feel, not the fault of DataCite.

    My suspicion is that you find it straight-forward because you have spent a lot of time working with bibliographic metadata. I am a bioinformatician, and bibliographic metadata is just one of the main kinds of data that I want.

  5. Alf says:

    The thing is, adding HTTP headers to a request is a standard part of interacting with most resources on the web, and once you’ve worked out how to do it in your favourite language (which admittedly isn’t necessarily easy, as HTTP interfaces are often built for flexibility), it is straightforward to re-use that in any situation. So, finding content negotiation straightforward when resolving DOIs isn’t because of spending a lot of time working with bibliographic metadata, it’s because of spending a lot of time working with HTTP.

    The nice – very important – thing about DOIs is that you can ask any DOI resolver about them and still get redirected to the actual resource: putting “10.5524/100005” or “10.1371/journal.pone.0012258” into “http://dx.doi.org/” or “http://doi.pangaea.de/” or “http://www.medra.org/”, for example, will redirect to the one main publisher of each resource.

    The trouble you’re finding is that not all resources on the web are available in exactly the same formats, which is understandable (from both points of view). It would be nice if all publishers provided their resources in multiple formats, but it’s also helpful that CrossRef stepped in and provided metadata on behalf of the publishers, where CrossRef has metadata for the object resolved (and CrossRef do intend to forward requests on to publishers, when an appropriate response is available from the source).

    The only remaining problem is finding a common format in which to represent all objects, which a universal “data browser” can understand. This is where RDF comes in. Both CrossRef and DataCite are able to return their resources as Turtle:

    curl -L -H "Accept: text/turtle" "http://dx.doi.org/10.1126/science.1157784"

    curl -L -H "Accept: text/turtle" "http://dx.doi.org/10.5524/100005"

    That requires a Turtle parser (e.g. ARC2 ), but again, that’ll be useful for parsing Turtle data for any resource. RDF/JSON is even easier to parse, but not available everywhere yet:

    curl -L -H "Accept: application/rdf+json" "http://dx.doi.org/10.1126/science.1157784"

    I note that DataCite’s content negotiation code is open for anyone to improve; presumably adding RDF/JSON as an output format wouldn’t be too difficult… https://github.com/datacite/conneg/blob/master/conneg.rb

  6. Phil Lord says:

    “The thing is, adding HTTP headers to a request is a standard part”

    Well “standard” in the sense of part of the W3C recommendation, yes. Common,
    no. I’ve never done it before.

    “it’s because of spending a lot of time working with HTTP.”

    Perhaps. From the point of a view of a scientist, trying to get work done,
    what difference does this make though?

    “The nice – very important – thing about DOIs is that you can ask any DOI
    resolver [and it will] redirect to the one main publisher of each resource.”

    I’ve never used another DOI resolver, so this is not useful for me. Besides
    which, the only reason that this is necessary is because all the publishers
    keep their own content, with their own website, to achieve the same job. So,
    this is not a “nice feature” of DOIs. It’s a necessary hack to cope with the
    confused situation in scientific publishing.

    “It would be nice if all publishers provided their resources in multiple
    formats, but it’s also helpful that CrossRef stepped in and provided metadata
    on behalf of the publishers”

    I agree. I say as much in the post. Somewhat standard, somewhat organised and
    somewhat consistent is nicer than total mess. CrossRef is generally improving the situation and I am very glad that they are there.

    “The only remaining problem is finding a common format in which to represent
    all objects, which a universal “data browser” can understand. This is where
    RDF comes in. Both CrossRef and DataCite are able to return their resources as
    Turtle:”

    Probably you have tried parsing this. Maybe the magic of RDF will make this
    all happen, but as far as I can see, the two sets of RDF are unrelated to each
    other. Putting lots of stuff into a bucket isn’t integration. For me, this
    is no easier than my current solution — citeproc+json from Crossref (which I
    don’t parse) and Datacite XML which I do, but being XML is easy to parse.

    “I note that DataCite’s content negotiation code is open for anyone to
    improve; presumably adding RDF/JSON as an output format wouldn’t be too
    difficult”

    Maybe this will give me the motivation to learn ruby.

  7. Alf says:

    “as far as I can see, the two sets of RDF are unrelated to each other”

    https://gist.github.com/2043680

    The DataCite RDF is a bit broken and less rich than the CrossRef RDF, but they share the same ontology (Dublin Core and OWL) for all of the fields that overlap (CrossRef uses FOAF to represent the creator rather than a literal string, but it’s just one more step to get the creator’s name from there).

    Note again that this is not specific to bibliographic metadata – all that’s used is HTTP and an RDF parser.

  8. Karl Ward says:

    I haven’t had time to read through the whole article and comments here so I won’t contribute to the arguments for / against the use of content negotiation just yet, though I would like to point out that CrossRef’s content negotiation code is also open to the public at http://github.com/CrossRef/cn_proxy . We’d be more than happy to see people contribute additional content types, fixes, etc.

