Archive for May, 2012

Following up on my somewhat combative article of yesterday (http://www.russet.org.uk/blog/2012/05/semantic-web-irony/), my colleagues Michel Dumontier and Bijan Parsia pointed out that my last post was technically wrong. Actually, Bijan said “you’re an idiot who can barely use a computer”. Still, one of the reasons I publish my work and thoughts early in this journal is to get feedback on them, so I can’t complain about this.

The file 978-3-642-30283-1_Book_PrintPDF.pdf is actually not just the table of contents as I had taken it to be, but all 900 pages of the proceedings. It isn’t hyperlinked, but you can search or jump to page with your PDF viewer to get to the relevant article. The files with the form eswc2012_submission_nnn.pdf are for the demonstrations and the posters.

My mistake. I was wrong. Somewhat.

Update

  • Hyperlink to Bijan’s page corrected following his request.
  • Addressed spelling mistake in Latin, by moving to English.

Bibliography

It was interesting to go to ESWC 2012; it has been quite a few years since I have been to ESWC or, indeed, any semantic web conference. While I am not generally a live-blogger, I have already commented on some aspects of the conference (http://www.russet.org.uk/blog/2012/05/semantic-web-irony/). Here I will just consider a few of the talks which leapt out at me for good or bad reasons.

I did enjoy the first keynote from Abraham Bernstein (10.1007/978-3-642-30284-8_1): it was a brave talk, not because it managed to wind Greek mythology into it, but because he started off with the opening credits from Star Trek. At a computing conference, this is setting yourself up with a hard act to follow. If I can over-simplify, the key thesis of the talk was largely that trying things out in practice is the best way to see how things work in theory. This is a theme I shall return to.

As I have suggested previously, a talk on the Music Ontology also interested me (10.1007/978-3-642-30284-8_24). Essentially, the idea here is to define a metric assessing the quality of an ontology by measuring how well it fulfils the user requirements. This looks very useful, although at the moment, it does not appear that all of the measurements are automated. The reason that automation would be ideal is that this form of measure would potentially be very useful in more agile forms of ontology development; essentially, they could take the place of an automated test framework, allowing the developer to ask whether new concepts added had helped to address more user queries or not.
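
To make the test-framework analogy concrete, here is a minimal sketch of a requirement-coverage check. It is not the metric from the paper: the function name, the idea of listing the terms each user query needs, and the example terms are all assumptions made for illustration.

```python
def query_coverage(ontology_terms, user_queries):
    """Return the fraction of user queries whose required terms are all in the ontology."""
    if not user_queries:
        return 0.0
    vocabulary = {term.lower() for term in ontology_terms}
    answered = sum(
        1 for required in user_queries
        if all(term.lower() in vocabulary for term in required)
    )
    return answered / len(user_queries)


# Hypothetical requirements: each query lists the terms needed to express it.
queries = [
    {"artist", "track"},
    {"track", "genre"},
    {"performance", "venue"},
]

# Compare coverage before and after adding a new concept ("genre").
before = query_coverage({"artist", "track"}, queries)
after = query_coverage({"artist", "track", "genre"}, queries)
```

Run on every change, a check like this would tell the developer whether a newly added concept lets the ontology express more of the user queries than before.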

The last paper I was involved with at this conference was about a semantic service matching framework called Feta (10.1007/11431053_2). Since I left Manchester, I have rather lost touch with this work, although the research theme can still be seen in BioCatalogue (10.1093/nar/gkq394). I was interested there to listen to a talk on semantic web services (10.1007/978-3-642-30284-8_40), particularly as it was using the information content measures that I used many years ago over the Gene Ontology (10.1093/bioinformatics/btg153). Unfortunately, not that much appears to have changed since I have left the field. The only advance seems to have been the generation of a “gold standard” dataset; while this is not a bad thing, it is also a reflection that SWS are just not being used in the wild. I also worry about the methodology, though, of testing against a gold standard that was predefined. To me, it seems like a case of cherry-picking. The results just would not have been reported had they not shown some improvement over previous metrics; the risk is that the metric is being tuned to the individual gold standard, rather than the general research problem.
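
For readers unfamiliar with information content measures, the sketch below shows one well-known measure of this kind, Resnik similarity, over a toy hierarchy. The term names and annotation counts are invented, and this is neither the code from the talk nor from the Gene Ontology paper.

```python
import math

# Toy hierarchy (child -> parents) and direct annotation counts; all values are made up.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}
DIRECT_COUNTS = {"root": 0, "binding": 2, "catalysis": 10, "dna_binding": 5, "rna_binding": 3}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC = -log p(term), where p counts annotations to the term or any descendant.

    Assumes the term (or a descendant) has at least one annotation.
    """
    total = sum(DIRECT_COUNTS.values())
    usage = sum(n for t, n in DIRECT_COUNTS.items() if term in ancestors(t))
    return -math.log(usage / total)


def resnik_similarity(a, b):
    """Similarity is the IC of the most informative ancestor shared by both terms."""
    return max(information_content(t) for t in ancestors(a) & ancestors(b))
```

Two terms that share only the root score zero, because a term that covers every annotation carries no information; terms sharing a rarer ancestor score higher.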

One criticism that I cannot make of Maria Keet’s presentation on mereotopological relationships is the lack of testing (10.1007/978-3-642-30284-8_23). While the first part of the paper deals with the theoretical underpinning of part-whole relationships, a significant part of the paper shows their user testing of OntoPartS, a tool they have developed to allow ontology developers to pick the correct type of relationship. While the paper gives good examples, showing that the part-whole relationships used are valuable, in the sense that they allow inferences that could otherwise not happen, I worry about their interpretation of their user testing, which suggests that it takes “a mere 4 minutes to choose the correct relation”. This might be reasonable for an ontology developer, but users will need access to these distinctions if they are to be useful, and 4 minutes is a long time. I think, for this reason, I would much prefer the user-driven approach of the Music Ontology to the extension-from-theory approach when determining what part-whole relationships we need. The Gene Ontology managed to get an awful long way with one relationship (10.1016/j.websem.2003.12.003).

My biggest worry about the conference as a whole, though, is how similar the experience is now to that of five years ago. This makes me worry that the field is not advancing. Perhaps part of the reason for this is the lack of strong application drivers. This seems to be acknowledged by the separation of the conference into “research” and “in-use” tracks. This categorisation seems broken anyway: so a paper on “Evaluating scientific hypotheses using the SPARQL Inferencing Notation” (10.1007/978-3-642-30284-8_50) is apparently not research, but in-use. However, “Curate and storyspace: An ontology and web-based environment for describing curatorial narratives.” (10.1007/978-3-642-30284-8_57) is research; and, as a result, we should conclude not useful?

In the end, the semantic web has been significantly rebranded as the linked data initiative. In the bar, I heard the comment, “ah, but don’t they know they will need semantics eventually”. Well, yes, they will. And “they” probably know this. Google has, or at least is investigating, more semantic representations, rather than the pure statistical approach it started off with. But ultimately heavy-duty semantics is only ever going to be a niche market, for people who care enough, and need the expressivity enough, for it to be worth the hassle. I’ve been working with and on ontologies for many years, and I know the value of a reasoner, and the value of heavy-duty logics. But, if we let ourselves be overwhelmed by the technology, we miss the reality that we can achieve a lot with very little. Perhaps the best indication of this is that the award for most influential paper from 7 years ago went to a paper on SIOC which is relatively light in terms of semantics (10.1007/11431053_34). The semantic web community (of which I have never been more than an interloper) may like to say that a little semantics goes a long way (http://www.cs.rpi.edu/~hendler/LittleSemanticsWeb.html), but I am not sure that it actually believes it.

Bibliography

I am at the Extended Semantic Web Conference (http://2012.eswc-conferences.org/). I haven’t published or been to this conference for quite a while (10.1007/11431053_2), so I was interested to see how things have changed in the meantime.

The first talk that I went to was from Yves Raimond from BBC R&D on the Music Ontology (10.1007/978-3-642-30284-8_24). They are using this to drive parts of their website. He was talking about how to evaluate this ontology. Very interesting. Worth reading the paper, I thought.

So, I decided to look it up. A short piece of Googling later, I got through to the paper on the web. Unfortunately, the conference organisers have decided to publish with Springer, so no access there. Of course, I might have access at my home institution. Fortunately, by virtue of my colleague Jon Dowland, I knew how to deal with this in a quick and easy way, at least on Linux. For general interest, the solution is here:

ssh -D 1080 username@hostname

And then FoxyProxy to reconfigure Firefox to use a local SOCKS proxy. Sadly, that didn’t work either. So, it looks like no access there either. Apparently my University doesn’t subscribe, or Springer uses something more complex than IP authentication.

At this point, I remembered the USB pen attached to my geek badge, which has the conference proceedings on. Great. So, I take a look at this. Hmmm. No “index.html” or similar to that. Let’s try listing the directory contents. The somewhat elided output looks like this.

> ls -lR
-rw-r--r-- 1 phillord phillord 168K 2012-04-30 17:32 eswc2012_submission_301.pdf
-rw-r--r-- 1 phillord phillord 1.2M 2012-04-30 17:33 eswc2012_submission_303.pdf
-rw-r--r-- 1 phillord phillord 468K 2012-04-30 17:35 eswc2012_submission_304.pdf
-rw-r--r-- 1 phillord phillord 343K 2012-04-30 17:36 eswc2012_submission_312.pdf
-rw-r--r-- 1 phillord phillord 217K 2012-04-30 17:36 eswc2012_submission_315.pdf
-rw-r--r-- 1 phillord phillord 730K 2012-04-30 17:37 eswc2012_submission_316.pdf
-rw-r--r-- 1 phillord phillord 527K 2012-04-30 17:37 eswc2012_submission_321.pdf
-rw-r--r-- 1 phillord phillord 651K 2012-04-30 17:38 eswc2012_submission_322.pdf
-rw-r--r-- 1 phillord phillord 2.3M 2012-04-30 17:38 eswc2012_submission_323.pdf
-rw-r--r-- 1 phillord phillord 308K 2012-04-30 17:34 eswc2012_submission_325.pdf
-rw-r--r-- 1 phillord phillord 374K 2012-04-30 17:39 eswc2012_submission_331.pdf
-rw-r--r-- 1 phillord phillord  44K 2012-04-30 17:40 eswc2012_submission_349.pdf
-rw-r--r-- 1 phillord phillord 333K 2012-04-30 17:40 eswc2012_submission_351.pdf
-rw-r--r-- 1 phillord phillord 400K 2012-04-30 17:40 eswc2012_submission_357.pdf
-rw-r--r-- 1 phillord phillord 115K 2012-04-30 17:41 eswc2012_submission_363.pdf

Not very handy for discovery or navigation. But, it’s okay, because there is some organised metadata. I found this in the following files:

-rw-r--r-- 1 phillord phillord  56M 2012-04-23 13:55 978-3-642-30283-1_Book_PrintPDF.pdf
-rw-r--r-- 1 phillord phillord 3.2M 2012-04-23 13:55 978-3-642-30283-1_Cover_PrintPDF.pdf

As far as I can tell, the PDF is not linked, nor does it tell you the submission numbers. Still, I do know it is on page 255. So I can probably buy a copy of the paper proceedings and find the article. 95 euro and a brick to take home. Still, the cover page looks nice.

Epic Fail conference organisers. The proceedings are not open, they are not linked and apparently have no machine computational semantics. Perhaps the organisers should go to a semantic web conference at some point to find out about metadata.

If Springer cannot do the job, and evidently they cannot, then use someone else. I managed all of this with Bio-Ontologies. A conference on semantics, the web, and linked open data should be able to as well.

Bibliography

I am pleased to announce that as part of my work on knowledgeblog (http://www.knowledgeblog.org/), we now have two new tools, Greycite and kblog-metadata, and have extended kcite, our citation engine (http://knowledgeblog.org/kcite-plugin). I will just give a brief overview of the functionality here. Subsequent articles will describe these tools in more detail, explaining the rationale behind them.

The kcite engine, which you can see in use in this article, produces a nicely formatted bibliography list, generated using only identifiers to these articles: DOIs, Pubmed IDs or arXiv IDs. One obvious absence from this list, however, is the ability to directly cite URLs. We have now started to address this, through our two new tools.

Unlike other identifiers, we lack a centralised resource capable of delivering bibliographic metadata about a URL. To enable this, my colleague, Lindsay Marshall (http://www.ncl.ac.uk/computing/staff/profile/lindsay.marshall), has developed Greycite (http://greycite.knowledgeblog.org/), which went live earlier this week. Greycite allows you to search for bibliographic metadata about a given resource. So, for instance, you can view the metadata for my article on realism (http://www.russet.org.uk/blog/2010/07/realism-and-science/). Probably more useful than this view, however, is that you can also retrieve this metadata computationally: currently, we support JSON suitable for citeproc-js (http://bitbucket.org/fbennett/citeproc-js), and bibtex (http://www.bibtex.org/). Obviously, we can support further formats if we choose; fortunately, the metadata for a URL is, in general, very simple (date, title, website or “container” title).
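
To give a feel for how simple this metadata is, the following sketch turns the same four fields into a CSL-JSON style item (the format citeproc-js consumes) and into a bibtex entry. It does not call Greycite and does not claim to reproduce its output; the function names, the choice of `@misc`, and the example values are assumptions.

```python
import json
from datetime import date


def to_csl_json(url, title, container, published):
    """Build a CSL-JSON style item of the kind citeproc-js consumes."""
    return {
        "id": url,
        "type": "webpage",
        "title": title,
        "container-title": container,
        "URL": url,
        "issued": {"date-parts": [[published.year, published.month, published.day]]},
    }


def to_bibtex(key, url, title, container, published):
    """Render the same fields as a BibTeX @misc entry."""
    fields = {
        "title": title,
        "howpublished": container,
        "url": url,
        "year": str(published.year),
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}"


# Placeholder values, not real Greycite output.
example = ("http://example.org/post", "An example article", "An example blog", date(2012, 5, 1))
csl_item = to_csl_json(*example)
bibtex_entry = to_bibtex("example2012", *example)
csl_text = json.dumps(csl_item, indent=2)
```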

Greycite must, however, get its metadata from somewhere. As we wanted greycite to be both an automated and authoritative source, we have decided to take metadata only from the URL being referenced (or referenced from the URL). Anything else would have required an authentication step, to prove that metadata was being provided by the owner of the content. I will describe this in more detail later; we support COinS (http://ocoins.info/), OGP (http://ogp.me/) and Google Scholar Metatags (http://scholar.google.com/intl/en/scholar/inclusion.html). In practice, this combination of sources allows us to provide rich references to many URLs. Where not, we fall back gracefully.
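
One way to picture combining several sources with a graceful fallback is the sketch below. The priority order, the field names and the "use the URL as the title" last resort are assumptions for illustration, not a description of what Greycite actually does.

```python
FIELDS = ("title", "author", "date", "container")


def merge_metadata(url, sources):
    """Combine per-format metadata dicts, earlier sources taking priority."""
    merged = {}
    for field in FIELDS:
        for source in sources:
            value = source.get(field)
            if value:
                merged[field] = value
                break
    # Last resort: a bare reference that shows only the URL itself.
    merged.setdefault("title", url)
    merged["url"] = url
    return merged


# Hypothetical results of parsing one page in three different formats.
scholar = {"title": "An example article", "author": "A. Author"}
ogp = {"title": "An example article | Example Site", "container": "Example Site"}
coins = {}

rich = merge_metadata("http://example.org/post", [scholar, ogp, coins])
bare = merge_metadata("http://example.org/untagged", [{}, {}, {}])
```

Here `rich` takes its title and author from the first source and its container title from the second, while `bare` degrades to a reference carrying nothing but the URL.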

Unfortunately, formal metadata on the web is not heavily controlled or pre-defined. If you are using WordPress to publish your articles, it is largely dependent on your theme as to whether there is any metadata on your articles. I have started to address this with kblog-metadata (http://wordpress.org/extend/plugins/kblog-metadata/). Again, I will describe the functionality in greater detail later, but essentially, this plugin adds metadata in all three of the formats mentioned above in the document headers, and provides a good deal of flexibility about where that metadata comes from.

Finally, I have extended kcite to query for metadata from greycite for each URL cited. The data coming back is used directly for rendering, so this should have reasonable performance; moreover all data is cached in the WordPress database, limiting outgoing network traffic from the webserver for each reference.
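
The caching idea can be sketched as a read-through cache. kcite itself is a WordPress plugin, so the Python below, with an in-memory SQLite table standing in for the WordPress database and a `fetch` callable standing in for the remote lookup, is only an illustration under those assumptions.

```python
import json
import sqlite3


class MetadataCache:
    """Cache metadata per URL so each reference costs at most one remote lookup."""

    def __init__(self, fetch, connection=None):
        self.fetch = fetch  # callable: url -> dict (hypothetical remote lookup)
        self.db = connection or sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS metadata (url TEXT PRIMARY KEY, body TEXT)"
        )

    def get(self, url):
        row = self.db.execute(
            "SELECT body FROM metadata WHERE url = ?", (url,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no network traffic
        metadata = self.fetch(url)
        self.db.execute(
            "INSERT INTO metadata (url, body) VALUES (?, ?)",
            (url, json.dumps(metadata)),
        )
        self.db.commit()
        return metadata


# A stand-in for the remote service that records how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return {"title": "An example article", "url": url}


cache = MetadataCache(fake_fetch)
first = cache.get("http://example.org/post")
second = cache.get("http://example.org/post")
```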

Work is not complete yet, and there is much more to do. However, I have been using development versions of these tools now for a month or so, and the experience is rather good. The metadata is useful during authoring, as it can be used to find the correct reference. While we cannot capture metadata from all sources, a surprisingly large number of them do work. And the development of greycite means that this metadata can be served efficiently and without adding too much complexity to kcite. In short, while it may not be a complete solution, these enhancements represent a substantial step toward making academic URLs formally citable, as others have recently called for (http://michaelnielsen.org/blog/is-scientific-publishing-about-to-be-disrupted/).


Addendum

2012-05-09: I have already published an initial article (http://www.russet.org.uk/blog/2012/03/kblog-metadata/) about kblog-metadata, which should have been referenced here.

Bibliography

In this article, we describe the rationale behind our new service, Greycite, which we have developed to enable more formal citation of URLs in general, and specifically to back up the kcite citation engine.


Authors

Phillip Lord and Lindsay Marshall
School of Computing Science
Newcastle University


Introduction

As has been recently announced (http://www.russet.org.uk/blog/2012/05/kcite-greycite-and-kblog-metadata/), the kcite citation engine (http://www.russet.org.uk/blog/2011/12/kcite-the-next-generation/) now supports URLs directly, as can be seen in this sentence. While it can do this trivially, by simply putting a URL in the reference, we wanted something better; we wanted URLs to be referenced in a similar manner to arXiv (http://arxiv.org/) or PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) IDs, with full bibliographic metadata where possible.

To achieve this, we have created the Greycite service, which captures metadata from a URL and then presents this back to kcite. In this short article, we describe the rationale behind the creation of this service.


Discovering the metadata

The kcite citation engine allows WordPress users to reference an article through the use of a shortcode, of the form [‌cite]10.1371/journal.pone.0012258[/cite‍] which is rendered as (10.1371/journal.pone.0012258). The rendering uses metadata from a third party service, in this case provided by CrossRef (http://www.crossref.org/CrossTech/2011/04/content_negotiation_for_crossr.html), to generate the bibliography reference. Other identifiers are handled similarly, using other services.
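
As an illustration of this flow, the sketch below pulls identifiers out of cite shortcodes and prepares a content-negotiated request for DOI metadata. It is not kcite's code (kcite is a WordPress plugin); the function names are invented, and the request is only constructed, never sent. The `Accept` media type shown is the CSL-JSON type used in DOI content negotiation.

```python
import re
import urllib.request

CITE_PATTERN = re.compile(r"\[cite\](.+?)\[/cite\]")


def find_citations(text):
    """Return the identifiers wrapped in [cite]...[/cite] shortcodes, in order."""
    return [match.strip() for match in CITE_PATTERN.findall(text)]


def doi_metadata_request(doi):
    """Prepare (but do not send) a content-negotiated request for DOI metadata."""
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )


post = "As shown previously [cite]10.1371/journal.pone.0012258[/cite], this works."
identifiers = find_citations(post)
request = doi_metadata_request(identifiers[0])
```

Passing `request` to `urllib.request.urlopen` would perform the actual lookup; the response body could then be handed to a citation formatter.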

We wished to achieve something similar with an arbitrary URL. However, there is no centralised service where authors are required to lodge their metadata for any URL. We considered the possibility of providing such a service where content authors could lodge their metadata (author, date, title and so on) about a URL. However, it seems unlikely that this would succeed for two critical reasons. First, and most importantly, few authors would be likely to go to the extra effort: why would they bother, and if they did, why use our service rather than some other? Second, it would require an authentication step to ensure that metadata genuinely came from the person controlling the URL. We also considered the possibility of deliberately allowing third party addition of metadata, but this raises the question of conflicts in the metadata.

As a result, in practice, we feel that the only sensible course of action is to extract the metadata directly from the resolvable contents of the URL, as this ensures that we have taken metadata from what is (quite literally) the authoritative source. The significant drawback to this is that if the author does not provide this metadata, no one else is able to do so. In a sense, though, this is correct: if authors provide no metadata, then this is how their works should appear, as this is their choice. Moreover, as we have argued previously (http://www.russet.org.uk/blog/2012/04/three-steps-to-heaven/), if authors or their readers are worried by this, it may provide the motivation to add bibliographic metadata to their work, which is a benefit to everyone.

The immediate problem here is the lack of a single standardised form of bibliographic metadata on the web; however, there are a number of systems which are currently in use, namely, COinS (http://ocoins.info/), Open Graph Protocol (http://ogp.me/) and Google Scholar tags (http://scholar.google.com/intl/en/scholar/inclusion.html). We have also considered a fourth option, RSS/Atom feeds, which, perhaps ironically, are structured enough to provide bibliographic metadata. At the moment, we do not have accurate statistics on the prevalence of each of these types of metadata. Of course, we could crawl the web to gather these statistics, but we are not really interested in the web in general, but in the academic sector of it, which is hard to determine a priori. However, our initial experiences suggest the following:

  • COinS metadata is not widespread. We suspect that this follows from our experience that the specification is hard to find and incomprehensible when you do (http://www.russet.org.uk/blog/2012/03/kblog-metadata/).
  • Google Scholar tags are much more widespread, although there is some variation (the use of name vs property for instance, or multiple authors represented in a single tag vs each author on their own).
  • OGP appears reasonably widespread, including in articles which are not academic (or not solely so) but likely to be cited, such as BBC News, or anything hosted on WordPress.com.
  • RSS/Atom feeds work fairly well, but normally contain metadata only for recent articles; we tried to track RSS feeds, but this resulted in 1000s of URLs very quickly.
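
The variation noted above for Google Scholar tags (name vs property, and one tag per author vs several authors in one tag) can be handled along the following lines. This is a minimal sketch rather than Greycite's parser: the class and function names are invented, and splitting a multi-author tag on ";" is an assumption.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta> tags keyed by either their name or property attribute."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if key and content:
            self.tags.setdefault(key.lower(), []).append(content.strip())


def extract_metadata(html):
    """Return a title and author list from Google Scholar or OGP meta tags."""
    collector = MetaTagCollector()
    collector.feed(html)
    tags = collector.tags
    authors = []
    for value in tags.get("citation_author", []):
        # Some pages pack several authors into one tag; splitting on ";" is an assumption.
        authors.extend(name.strip() for name in value.split(";") if name.strip())
    # Prefer the Google Scholar title, then fall back to the OGP title.
    title = (tags.get("citation_title") or tags.get("og:title") or [None])[0]
    return {"title": title, "authors": authors}


sample = """
<head>
<meta property="citation_title" content="An Example Paper">
<meta name="citation_author" content="A. Author; B. Author">
<meta name="citation_author" content="C. Author" />
<meta property="og:title" content="An Example Paper | Example Site">
</head>
"""
metadata = extract_metadata(sample)
```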

Over time, we should be able to get clearer statistics as to real usage of these systems, based on the data in greycite.


Greycite as a service

Greycite is currently packaged as a service, rather than embedded within WordPress, which would also have been possible. The reasons for this were several. First, gathering metadata involves a reasonable amount of parsing, and putting this all into a WordPress plugin seemed unnecessarily heavy. This is particularly so, given that server load is already an issue with kcite, and adding further to this did not seem sensible.

Second, we wanted to maintain a database of the metadata gathered from around the web. This allows us to deal with problems of resources changing or disappearing. We want the user to be able to cite a URL and for this citation not to break if the URL disappears and returns a 404. We also wish to be able to cite a URL at a specific date, and have the citation show the metadata for that time. Placing this load on the individual WordPress database backend does not really make sense. Moreover, with greycite, there is a reasonable likelihood that others will have cited a particular article, thereby sharing the load.
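
Citing a URL "at a specific date" amounts to picking the newest stored snapshot taken on or before that date. The sketch below shows one way to do this; the data layout (a date-sorted list of snapshots per URL) is an assumption, not a description of the Greycite database.

```python
from bisect import bisect_right
from datetime import date


def metadata_as_of(snapshots, when):
    """Return the newest snapshot taken on or before `when`, or None.

    `snapshots` is a list of (date, metadata) pairs sorted by date.
    """
    dates = [taken for taken, _ in snapshots]
    index = bisect_right(dates, when)
    return snapshots[index - 1][1] if index else None


# Hypothetical history for one URL whose title changed after first being cited.
history = [
    (date(2012, 3, 1), {"title": "Draft title"}),
    (date(2012, 5, 1), {"title": "Final title"}),
]
cited_in_april = metadata_as_of(history, date(2012, 4, 15))
cited_in_june = metadata_as_of(history, date(2012, 6, 1))
```

A citation made in April keeps showing the title as it was then, even though the page has since changed.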

Third, Greycite is also useful outside of WordPress. For instance, Greycite also provides bibtex, so it can be used with a bibliographic manager; this is very useful at authoring time, as we can use this metadata to search over a list of relevant URLs, and then to select between them.

Finally, we wanted to be able to add additional functionality, which may require upgrading the database periodically, which is harder to do within a plugin. For example, we have already added links through to the UK Web Archive (http://www.webarchive.org.uk/ukwa/), for those resources which are archived. We will add the Internet Archive (http://www.archive.org/), and Web Cite (http://www.webcitation.org/) in time also. This means that not only should citations remain displayed correctly if resources disappear or change, it should still be possible to get to their contents in many cases.
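
As a sketch of how a link might be redirected to an archived copy, the code below builds an Internet Archive Wayback Machine address of the form `https://web.archive.org/web/<timestamp>/<url>`, which resolves to the capture closest to the timestamp. The function names and the `is_live` flag are assumptions, and this is not how Greycite or kcite is implemented.

```python
from datetime import datetime


def wayback_url(url, when):
    """Build a Wayback Machine address for `url` near the moment `when`."""
    return "https://web.archive.org/web/{}/{}".format(when.strftime("%Y%m%d%H%M%S"), url)


def link_for(url, cited_on, is_live):
    """Prefer the live page; otherwise point at an archived copy."""
    return url if is_live else wayback_url(url, cited_on)


cited_on = datetime(2012, 5, 9, 12, 0, 0)
live_link = link_for("http://example.org/post", cited_on, is_live=True)
archived_link = link_for("http://example.org/post", cited_on, is_live=False)
```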


The article as linked data

The existence of Greycite allows us to turn a blog post into a linked-data academic article. The reader of an article sees, as well as the content directly generated by the author, data gathered from all the outgoing links. The reference list therefore ceases to be a mechanism for finding secondary sources, and becomes a usability tool; readers can understand what sources are being relied on, without having to remember URLs or click through to them. Likewise, the authors can use the linked data environment outside of a web browser to help enable authoring. Metadata that is useful to readers is, unsurprisingly, also useful to authors (who tend to be the first person to read an article anyway!).


Discussion

With Greycite, we were interested in adding more formal citation to the web in general, and more specifically supporting kcite (http://www.russet.org.uk/blog/2011/12/kcite-the-next-generation/). We believe that we have achieved this in part with a relatively light-weight service. Greycite is useful for article display, and for authoring.

In addition, we start to address the issues of link breakage, by building on the back of existing archiving services. Articles will still be able to display metadata for a cited article if that article disappears. Future versions of kcite will also redirect links to the nearest web archive when this happens. We have done this without recourse to secondary identifiers such as a DOI or PURL, which we believe represents a better user experience. Building on the back of existing web archives also addresses a critical scalability issue; the Greycite database needs only to store bibliographic metadata, which is likely to remain tractable. From a legal perspective, we also side-step issues of copyright, as gathering metadata alone is likely to be covered by fair dealing clauses.

By depending only on metadata present in the URL itself, we can guarantee that metadata is authoritative (not, of course, that it is “correct”, as in reflecting the author’s intentions, but it does match what they said). It also means that we do not control the metadata; it has not been entered into greycite; it is out there, available on the web, free for anyone to gather. We wish to be part of the semantic web, not a walled garden within it.

Finally, we have started to build a linked data environment for academic publishing. Bibliographic metadata is, of course, only the start. It is not a suitable way to present all kinds of information; for instance, Chemicalize (http://www.chemicalize.org/) provides a nice plugin which transforms chemical names into something richer. But by harnessing the power of the web, and building on existing resources, we should be able to build a rich and full-featured environment for presenting scientific knowledge.

Bibliography