An Exercise in Irrelevance - Why Metadata Must be Useful

Metadata could be added to an article by many people. In an ideal world, this would be the author: they know most about the content and are best placed to describe it. But we have to answer the question: why would they do this? We have previously argued that semantic metadata must be useful to the people who are producing it (n.d.a). For this, we need tools that extract and consume this metadata.

I discovered a nice example of this recently while reading an interesting paper from Yimei Zhu and Rob Proctor (n.d.b) investigating how PhD students use various tools to communicate. I was interested in citing this paper, which can be found on the web at the Manchester escholar site [webcite]. What metadata is in this page? Our Greycite (n.d.c) tool is designed to answer exactly this question; unfortunately, it suggests that there is very little in the way of metadata.

I contacted the escholar helpdesk, and they confirmed that there really is no embedded metadata; Greycite has not just missed it. The strange thing is that, several days later, I came across a very similar page [webcite]. It has a different layout and colour scheme, but it’s clearly the same record. The bibliographic metadata fields displayed on the page (not easily extractable, sadly!) appear identical. However, when we investigate the embedded metadata in this page, we see a very different story: it is full of Dublin Core. It’s not ideally laid out, but it is all that we need for citation.
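To make concrete what "embedded Dublin Core" looks like to a tool, here is a minimal sketch in Python. It is not how Greycite works; the class name, the function name, and the sample page are all invented for illustration, and it assumes the common convention of `<meta>` tags whose names start with `DC.` or `DCTERMS.`.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.*" content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content")
        # Assumption: Dublin Core names carry a "DC." or "DCTERMS." prefix.
        if name.startswith(("dc.", "dcterms.")) and content:
            # Fields such as DC.creator may repeat, so keep a list per name.
            self.fields.setdefault(name, []).append(content)


def extract_dublin_core(html):
    """Return a dict mapping lower-cased Dublin Core names to value lists."""
    parser = DublinCoreParser()
    parser.feed(html)
    return parser.fields


# Hypothetical page fragment, invented for illustration only.
SAMPLE = """
<html><head>
<title>Example record</title>
<meta name="DC.title" content="An example title">
<meta name="DC.creator" content="Author, First">
<meta name="DC.creator" content="Author, Second">
<meta name="DC.date" content="2012">
</head><body></body></html>
"""

fields = extract_dublin_core(SAMPLE)
for name, values in sorted(fields.items()):
    print(name, values)
```

A page with no such tags yields an empty dictionary, which is the situation the first escholar page leaves a citation tool in.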

Unfortunately, there is no link between the two, nor do I know why Manchester has these two different pages; perhaps one is designed to replace the other. And, of course, from the point of view of the reader, there is no reason to suspect that one contains metadata and the other does not.

The point here is not to criticise Manchester library services. Instead, it is to raise the question, why are the two locations so different in terms of their metadata? My suspicion is that the real answer is simple: very few people have noticed, and no one really cares. It might be argued that metadata must be correct to be useful. The evidence suggests that the inverse is true: metadata must be useful to be correct.

With tools like Greycite (n.d.c) and kblog-metadata (n.d.d), making the metadata useful is a key aim. Using kcite, I can now reference any article here in this journal, or at bio-ontologies (n.d.e). So now kcite users care about the metadata. From this page you can download a bib file for this article, or even for every article on the site (all 500+). This metadata comes directly from Greycite, which in turn scrapes it from this website. So now the site operator (me!) cares. And I use the bib files to drive the tools that I use to cite my own work. So now the author (also me!) cares.
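As a rough illustration of the last step in that chain, the sketch below turns a small metadata record into a BibTeX entry. It does not reproduce Greycite's actual output; the function name `to_bibtex`, the choice of `@misc`, the field names, and the sample record are all assumptions made for the example.

```python
def to_bibtex(key, meta):
    """Render a metadata dict as a BibTeX @misc entry (illustrative only)."""
    lines = ["@misc{%s," % key]
    for field in ("author", "title", "year", "url"):
        value = meta.get(field)
        if not value:
            # Skip fields the page did not supply rather than inventing them.
            continue
        if isinstance(value, list):
            # BibTeX separates multiple authors with the word "and".
            value = " and ".join(value)
        lines.append("  %s = {%s}," % (field, value))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical record, invented for illustration only.
record = {
    "author": ["Author, First", "Author, Second"],
    "title": "An example title",
    "year": "2012",
    "url": "https://example.org/post",
}

entry = to_bibtex("example2012", record)
print(entry)
```

The point of the sketch is the dependency: every field in the entry exists only if the page exposed it, so whoever consumes the bib file has a reason to care about the metadata upstream.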

There is a chicken-and-egg situation here: why write the tools to operate over metadata when no one is using the metadata? Fortunately, with kcite we have had a gradual path: first we used DOIs, then PubMed IDs, then arXiv, and now any URI at all. And with Greycite, we have used a lot of heuristics, and quite a few metadata formats. While it has been a significant amount of work, metadata is now making our lives easier. This is the way that it must be.
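The gradual path described above amounts to recognising more and more kinds of identifier. The sketch below shows one way such a dispatch could look; it is not kcite's implementation. The regular expressions are simplified heuristics assumed for the example, and the function name `classify` is hypothetical.

```python
import re

# Simplified patterns, assumed for illustration; real identifiers have more
# variants (for example, old-style arXiv IDs such as "hep-th/9901001").
DOI = re.compile(r"^10\.\d{4,9}/\S+$")
ARXIV = re.compile(r"^(arxiv:)?\d{4}\.\d{4,5}(v\d+)?$", re.IGNORECASE)
PMID = re.compile(r"^\d{1,8}$")
URL = re.compile(r"^https?://\S+$", re.IGNORECASE)


def classify(identifier):
    """Guess which kind of identifier a citation string is."""
    s = identifier.strip()
    # Check the more specific patterns first and fall back to a plain URL.
    if DOI.match(s):
        return "doi"
    if ARXIV.match(s):
        return "arxiv"
    if PMID.match(s):
        return "pubmed"
    if URL.match(s):
        return "url"
    return "unknown"


for example in ("10.1000/182", "arXiv:1201.1234v2", "12345678",
                "https://greycite.knowledgeblog.org"):
    print(example, "->", classify(example))
```

The first three cases can rely on a central resolver for their metadata; only the final, general case depends on whatever the page itself embeds, which is where the heuristics come in.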

Update

Typographical correction.

n.d.a. https://www.russet.org.uk/blog/2054.

n.d.b. https://www.escholar.manchester.ac.uk/uk-ac-man-scw:187789.

n.d.c. https://greycite.knowledgeblog.org.

n.d.d. https://knowledgeblog.org/kblog-metadata.

n.d.e. https://bio-ontologies.knowledgeblog.org/table-of-contents.