In my previous article, I discussed my ongoing struggles with DOIs and their metadata (http://www.russet.org.uk/blog/2012/03/dois-and-content-negotiation/). The article discussed the difficulties with implementing content negotiation for kcite (http://knowledgeblog.org/kcite-plugin); in particular, getting metadata for a given DOI and understanding that metadata once it had been fetched. Here, I discuss these two issues again!


Accessing the Metadata

I have previously described the implementation of content negotiation for DOIs in Kcite. In the examples given, I used libcurl, which has the flexibility to perform content negotiation. One difficulty with this approach is that libcurl is not a standard requirement for WordPress, at least on Ubuntu, so adding this requirement forces users to take an additional step when installing the plugin.

Following on from the previous article, I decided to update all of the resolution mechanisms in Kcite to use libcurl, as it made little sense to have several different mechanisms in place. Ironically, in the process, I found that the direct use of libcurl was unnecessary anyway. The reason for this is that WordPress has its own HTTP API (http://codex.wordpress.org/HTTP_API), and it is possible to use this instead of libcurl. Underneath, it uses one of several transport mechanisms, including libcurl, which is probably the fastest, but also pure PHP solutions, which make life a bit easier.

Most of my criticisms of content negotiation, however, still remain. Unlike libcurl, WordPress’ API automatically follows the 303 See Other redirects which http://dx.doi.org returns. However, content negotiation is not directly supported, so I still need to set the HTTP headers manually. The format of these did not appear to be documented, but I discovered them through hacking the WordPress core of my test installation. The core of my code for Kcite now looks as follows:

      $url = "http://dx.doi.org/{$cite->identifier}";

      $params = array(
          'headers' => array(
              'Accept' => "application/x-datacite+xml;q=0.9, application/citeproc+json;q=1.0",
          ),
      );

      $wpresponse = wp_remote_get( $url, $params );

      if( is_wp_error( $wpresponse ) ){
          return $cite;
      }

      $response = wp_remote_retrieve_body( $wpresponse );
      $status = wp_remote_retrieve_response_code( $wpresponse );
      $headers = wp_remote_retrieve_headers( $wpresponse );
      $contenttype = $headers["content-type"];

      // it's probably not a DOI at all. Need to check some more here.
      if( $status == 404 ){
          // kcite code
      }

      if( $contenttype == "application/citeproc+json" ){
          // crossref DOI
      }

      if( $contenttype == "application/x-datacite+xml" ){
          //datacite DOI
      }
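
The exact string comparisons on the content type are brittle if the server appends parameters such as a charset to the Content-Type header, or varies the case. A minimal sketch of a more tolerant check might look like the following. The helper kcite_classify_response and its return values are hypothetical, introduced only for illustration; they are not part of Kcite or WordPress. The mapping from media type to agency simply mirrors the comments in the code above.

```php
<?php
// Hypothetical helper, not part of Kcite or WordPress: the name and the
// return values are assumptions made for this sketch.
function kcite_classify_response( $status, $contenttype ) {
    // A 404 from the resolver suggests the identifier is not a DOI at all.
    if ( 404 === (int) $status ) {
        return 'not-found';
    }

    // Drop parameters such as "; charset=utf-8" and normalise case,
    // because media type names are case-insensitive.
    $parts     = explode( ';', (string) $contenttype );
    $mediatype = strtolower( trim( $parts[0] ) );

    switch ( $mediatype ) {
        case 'application/citeproc+json':
            return 'crossref';
        case 'application/x-datacite+xml':
            return 'datacite';
        default:
            // Anything else (typically an HTML landing page) tells us
            // nothing about the registration agency.
            return 'unknown';
    }
}

echo kcite_classify_response( 200, 'application/citeproc+json; charset=utf-8' ), "\n";
echo kcite_classify_response( 200, 'text/html' ), "\n";
```

Returning an explicit 'unknown' keeps "no recognised metadata" separate from "not a DOI", which matters because a response in an unexpected format does not by itself say whether something has gone wrong.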

Which registration agency

Having updated Kcite to use the HTTP API for all of its metadata resolution methods, I thought it would be wise to check all of my test cases to see that they were working correctly. I found one DOI which was, and in fact always had been, behaving incorrectly, returning the wrong metadata.

The reason for this was a slightly dubious piece of logic in Kcite, put in place when we had less understanding of DOIs. Where we could not find metadata from CrossRef about a DOI, we ran a free text search of Pubmed for the DOI and used the Pubmed metadata instead. Unfortunately, the free text search of Pubmed is not fuzzy, and it was this that was resulting in erroneous metadata. I have now removed this code from Kcite, which may result in some DOIs that appeared to work previously now failing.

Once again, we see the difficulty in being unable to determine the registration agency for a given DOI. I can only conclude that the DOI in question was not allocated by either CrossRef or DataCite, as neither returned metadata for this DOI; a conclusion drawn from negative data, however, is not a strong one. It could also be the case (and has been before) that either the CrossRef or the DataCite metadata service is not working properly.

I tried investigating the DOI Registration Agency (http://www.doi.org/registration_agencies.html) page from the International DOI Foundation (IDF). None of the agencies listed seemed an obvious candidate. It appears that the IDF's web page is incorrect.

The DOI in question? This is http://dx.doi.org/10.1000/182. This DOI is the identifier for the “DOI Handbook”. And the registration agency missing from the IDF's page? That would be the International DOI Foundation.


Demonstration

At the time of posting, this blog is using a released version of Kcite, hence a link to the DOI (10.1000/182) actually appears as a link to a Pubmed paper (21785971). The current (unreleased) development version of Kcite now reports the absence of metadata for this DOI, which this post will display (or may already display) once I update.

The underlying HTML source, it should be noted, still contains the correct link (to the DOI handbook), so a computational agent consuming this page would still detect the author's original intention, even with the error from Kcite.


Update

Fixed broken links!

14/05/2012: My blog has been updated, so 10.1000/182 now shows up as metadata missing, which is correct behaviour from kcite.

2 Comments

  1. Alf says:

    The problem here is that the URL (http://www.doi.org/hb.html) to which this DOI (10.1000/182) resolves is not available in the format requested (application/citeproc+json), so the server chooses to return HTML (it could also choose to return a 406 “Not Acceptable” HTTP status code instead).

    It shouldn’t matter which agency the DOI was registered with, as any DOI resolver should resolve the DOI to the same URL.

  2. Phil Lord says:

    The problem here is that I cannot know in advance what kinds of content a DOI can return without knowing which registration agency is responsible.

    The problem is also that having got back a response that I was not expecting, I cannot know whether this is an error or expected, without knowing which registration agency is responsible.

    The final problem is that the list of registration agencies on the IDF's own pages is wrong.

    That any DOI resolver will forward to the same URL is not a big deal. The only thing that the DOI is giving me over and above what HTTP gives me for sure when consumed computationally, is that if I get back 404, this is an error, rather than an expected part of the web.
