
I am pleased to announce that, as part of my work on knowledgeblog (http://www.knowledgeblog.org/), we now have two new tools, Greycite and kblog-metadata, and have extended kcite, our citation engine (http://knowledgeblog.org/kcite-plugin). I will give only a brief overview of the functionality here. Subsequent articles will describe these tools in more detail and explain the rationale behind them.

The kcite engine, which you can see in use in this article, produces a nicely formatted bibliography list, generated using only the identifiers of the cited articles: DOIs, PubMed IDs or arXiv IDs. One obvious absence from this list, however, is the ability to directly cite URLs. We have now started to address this, through our two new tools.

Unlike other identifiers, URLs lack a centralised resource capable of delivering bibliographic metadata about them. To enable this, my colleague, Lindsay Marshall (http://www.ncl.ac.uk/computing/staff/profile/lindsay.marshall), has developed Greycite (http://greycite.knowledgeblog.org/), which went live earlier this week. Greycite allows you to search for bibliographic metadata about a given resource. So, for instance, you can view the metadata for my article on realism (http://www.russet.org.uk/blog/2010/07/realism-and-science/). Probably more useful than this view, however, is the ability to retrieve this metadata computationally: currently, we support JSON suitable for citeproc-js (http://bitbucket.org/fbennett/citeproc-js), and bibtex (http://www.bibtex.org/). Obviously, we can support further formats if we choose; fortunately, the metadata for a URL is, in general, very simple (date, title, website or “container” title).
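To illustrate how simple this metadata is, here is a minimal Python sketch that turns a record in the CSL-JSON shape used by citeproc-js into a bibtex entry. The record, the function name csl_to_bibtex and the mapping of the container title onto howpublished are assumptions made for the example; this is not Greycite's own output or code.

```python
# Hypothetical record in the CSL-JSON shape that citeproc-js consumes.
# The values are invented for illustration; they are not Greycite output.
record = {
    "type": "webpage",
    "title": "An example post",
    "container-title": "Example Blog",
    "URL": "http://example.org/2012/05/an-example-post/",
    "issued": {"date-parts": [[2012, 5, 9]]},
    "author": [{"family": "Doe", "given": "Jane"}],
}


def csl_to_bibtex(item, key):
    """Render one CSL-JSON item as a bibtex @misc entry."""
    fields = {}
    authors = [
        f"{a.get('family', '')}, {a.get('given', '')}".strip(", ")
        for a in item.get("author", [])
    ]
    if authors:
        fields["author"] = " and ".join(authors)
    if "title" in item:
        fields["title"] = item["title"]
    if "container-title" in item:
        # Assumption: the website ("container") title maps to howpublished.
        fields["howpublished"] = item["container-title"]
    if "URL" in item:
        fields["url"] = item["URL"]
    date_parts = item.get("issued", {}).get("date-parts", [])
    if date_parts and date_parts[0]:
        fields["year"] = str(date_parts[0][0])
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}"


print(csl_to_bibtex(record, "example2012"))
```

Missing fields are simply left out of the entry, which mirrors the graceful fallback that a URL with sparse metadata requires.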

Greycite must, however, get its metadata from somewhere. As we wanted Greycite to be both an automated and authoritative source, we have decided to take metadata only from the URL being referenced (or referenced from the URL). Anything else would have required an authentication step, to prove that metadata was being provided by the owner of the content. I will describe this in more detail later; we support COinS (http://ocoins.info/), OGP (http://ogp.me/) and Google Scholar Metatags (http://scholar.google.com/intl/en/scholar/inclusion.html). In practice, this combination of sources allows us to provide rich references to many URLs. Where it does not, we fall back gracefully.

Unfortunately, formal metadata on the web is not heavily controlled or pre-defined. If you are using WordPress to publish your articles, whether there is any metadata on your articles is largely dependent on your theme. I have started to address this with kblog-metadata (http://wordpress.org/extend/plugins/kblog-metadata/). Again, I will describe the functionality in greater detail later, but essentially, this plugin adds metadata in all three of the formats mentioned above to the document headers, and provides a good deal of flexibility about where that metadata comes from.

Finally, I have extended kcite to query for metadata from greycite for each URL cited. The data coming back is used directly for rendering, so this should have reasonable performance; moreover all data is cached in the WordPress database, limiting outgoing network traffic from the webserver for each reference.

Work is not complete yet, and there is much more to do. However, I have been using development versions of these tools now for a month or so, and the experience is rather good. The metadata is useful during authoring, as it can be used to find the correct reference. While we cannot capture metadata from all sources, a surprisingly large number of them do work. And the development of greycite means that this metadata can be served efficiently and without adding too much complexity to kcite. In short, while it may not be a complete solution, these enhancements represent a substantial step toward making academic URLs formally citable, as others have recently called for (http://michaelnielsen.org/blog/is-scientific-publishing-about-to-be-disrupted/).


Addendum

2012-05-09: I have already published an initial article (http://www.russet.org.uk/blog/2012/03/kblog-metadata/) about kblog-metadata, which should have been referenced here.

Bibliography

In this article, we describe the rationale behind our new service, Greycite, which we have developed to enable more formal citation of URLs in general, and specifically to back up the kcite citation engine.


Authors

Phillip Lord and Lindsay Marshall
School of Computing Science
Newcastle University


Introduction

As has been recently announced (http://www.russet.org.uk/blog/2012/05/kcite-greycite-and-kblog-metadata/), the kcite citation engine (http://www.russet.org.uk/blog/2011/12/kcite-the-next-generation/) now supports URLs directly, as can be seen in this sentence. While it can do this trivially, by simply putting a URL in the reference, we wanted something better: URLs should be referenced in a similar manner to arXiv (http://arxiv.org/) or PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) IDs, with full bibliographic metadata where possible.

To achieve this, we have created the Greycite service, which captures metadata from a URL and then presents this back to kcite. In this short article, we describe the rationale behind the creation of this service.


Discovering the metadata

The kcite citation engine allows WordPress users to reference an article through the use of a shortcode, of the form [‌cite]10.1371/journal.pone.0012258[/cite‍] which is rendered as (http://dx.doi.org/10.1371/journal.pone.0012258). The rendering uses metadata from a third party service, in this case provided by CrossRef (http://www.crossref.org/CrossTech/2011/04/content_negotiation_for_crossr.html), to generate the bibliography reference. Other identifiers are handled similarly, using other services.
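As a rough illustration of this flow, the Python sketch below finds cite shortcodes in a piece of text, guesses the kind of identifier, and builds (without sending) a content-negotiated request for a DOI. The classification heuristics and the function names are assumptions for the example and are not how kcite itself decides; the Accept media type is the CSL JSON type offered through DOI content negotiation.

```python
import re
import urllib.request

# Matches [cite]...[/cite] shortcodes and captures the identifier inside.
CITE_RE = re.compile(r"\[cite\](.*?)\[/cite\]", re.DOTALL)


def classify(identifier):
    """Guess the identifier type. These heuristics are an assumption."""
    ident = identifier.strip()
    if ident.startswith(("http://", "https://")):
        return "url"
    if ident.startswith("10.") and "/" in ident:
        return "doi"
    if ident.isdigit():
        return "pubmed"
    return "unknown"


def find_citations(text):
    """Return (identifier, kind) pairs in order of appearance."""
    return [(m.strip(), classify(m)) for m in CITE_RE.findall(text)]


def doi_metadata_request(doi):
    """Build, but do not send, a request asking the DOI resolver for CSL JSON."""
    return urllib.request.Request(
        "http://dx.doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )


text = "See [cite]10.1371/journal.pone.0012258[/cite] for details."
for identifier, kind in find_citations(text):
    print(kind, identifier)
```

Passing the returned request to urllib.request.urlopen would perform the lookup; it is left out here so that the sketch runs without network access.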

We wished to achieve something similar with an arbitrary URL. However, there is no centralised service where authors are required to lodge their metadata for any URL. We considered the possibility of providing such a service, where content authors could lodge their metadata (author, date, title and so on) about a URL. However, it seems unlikely that this would succeed, for two critical reasons. First, and most importantly, few authors would be likely to go to the extra effort: why would they bother and, if they did, why use our service rather than some other? Second, it would require an authentication step to ensure that metadata genuinely came from the person controlling the URL. We also considered the possibility of deliberately allowing third-party addition of metadata, but this raises the question of conflicts in the metadata.

As a result, in practice, we feel that the only sensible course of action is to extract the metadata directly from the resolvable contents of the URL, as this ensures that we have taken metadata from what is (quite literally) the authoritative source. The significant drawback to this is that if the author does not provide this metadata, no one else is able to do so. In a sense, though, this is correct: if authors provide no metadata, then this is how their works should appear, as this is their choice. Moreover, as we have argued previously (http://www.russet.org.uk/blog/2012/04/three-steps-to-heaven/), if authors or their readers are worried by this, it may provide the motivation to add bibliographic metadata to their work, which is a benefit to everyone.

The immediate problem here is the lack of a single standardised form of bibliographic metadata on the web; however, there are a number of systems which are currently in use, namely COinS (http://ocoins.info/), Open Graph Protocol (http://ogp.me/) and Google Scholar tags (http://scholar.google.com/intl/en/scholar/inclusion.html). We have also considered a fourth option, RSS/Atom feeds, which, perhaps ironically, are structured enough to provide bibliographic metadata. At the moment, we do not have accurate statistics on the prevalence of each of these types of metadata. We could, of course, crawl the web to gather these statistics, but we are interested not in the web in general but in its academic sector, which is hard to determine a priori. However, our initial experiences suggest the following:

  • COinS metadata is not widespread. We suspect that this follows from our experience that the specification is hard to find and incomprehensible when you do (http://www.russet.org.uk/blog/2012/03/kblog-metadata/).
  • Google Scholar tags are much more widespread, although there is some variation (the use of name vs property for instance, or multiple authors represented in a single tag vs each author on their own).
  • OGP appears reasonably widespread, including in articles which are not academic (or not solely so) but likely to be cited, such as BBC News, or anything hosted on WordPress.com.
  • RSS/Atom worked fairly well; however, feeds normally contain metadata only for recent articles. We tried to track RSS feeds, but this resulted in 1000s of URLs very quickly.

Over time, we should be able to get clearer statistics as to real usage of these systems, based on the data in greycite.
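The variation described above can be made concrete with a short Python sketch that reads meta tags from a page, accepts the key in either the name or the property attribute, and handles authors given one per tag or several in a single tag. The tag names are the commonly used Google Scholar and OGP ones; the order of preference, the ";" separator for multiple authors and all function names are assumptions for the example, not a description of Greycite's parser.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags keyed by their name or property attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Sites differ on whether they use name= or property=, so accept both.
        key = attrs.get("name") or attrs.get("property")
        content = attrs.get("content")
        if key and content:
            self.meta.setdefault(key.lower(), []).append(content)


def extract(html_text):
    """Return a small bibliographic record, preferring Scholar tags over OGP."""
    parser = MetaCollector()
    parser.feed(html_text)
    meta = parser.meta

    def first(*keys):
        for key in keys:
            if meta.get(key):
                return meta[key][0]
        return None

    # Authors may come one per tag or several in one tag (";" is an assumption).
    authors = []
    for value in meta.get("citation_author", []):
        authors.extend(part.strip() for part in value.split(";") if part.strip())

    return {
        "title": first("citation_title", "og:title"),
        "container": first("citation_journal_title", "og:site_name"),
        "date": first("citation_date", "citation_publication_date",
                      "article:published_time"),
        "authors": authors,
    }


page = """<html><head>
<meta property="og:title" content="An example post">
<meta property="og:site_name" content="Example Blog">
<meta name="citation_author" content="Doe, Jane; Roe, Richard">
<meta property="citation_date" content="2012/05/09">
</head><body></body></html>"""
print(extract(page))
```

Fields that no tag supplies come back as None or an empty list, so a caller can still fall back to displaying the bare URL.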


Greycite as a service

Greycite is currently packaged as a service, rather than embedded within WordPress, which would also have been possible. The reasons for this were several. First, gathering metadata involves a reasonable amount of parsing, and putting this all into a WordPress plugin seemed unnecessarily heavy. This is particularly so, given that server load is already an issue with kcite, and adding further to this did not seem sensible.

Second, we wanted to maintain a database of the metadata gathered from around the web. This allows us to deal with problems of resources changing or disappearing. We want the user to be able to cite a URL and for this citation not to break if the URL disappears and starts returning a 404. We also wish to be able to cite a URL at a specific date, and have the citation show the metadata for that time. Placing this load on the individual WordPress database backend does not really make sense. Moreover, with Greycite, there is a reasonable likelihood that others will have cited a particular article, thereby sharing the load.
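One way to picture citing a URL at a specific date is a store of dated metadata snapshots, where a lookup returns the most recent snapshot taken on or before the requested date. The Python sketch below is an in-memory illustration only; the class name SnapshotStore and its methods are hypothetical and say nothing about how the Greycite database is actually organised.

```python
import bisect
from datetime import date


class SnapshotStore:
    """Keep dated metadata snapshots per URL (an illustrative sketch)."""

    def __init__(self):
        self._dates = {}    # url -> sorted list of snapshot dates
        self._records = {}  # (url, date) -> metadata dict

    def add(self, url, when, metadata):
        """Record the metadata seen for url on the given date."""
        dates = self._dates.setdefault(url, [])
        if (url, when) not in self._records:
            bisect.insort(dates, when)
        self._records[(url, when)] = metadata

    def as_of(self, url, when):
        """Return the latest snapshot taken on or before the date, or None."""
        dates = self._dates.get(url, [])
        index = bisect.bisect_right(dates, when)
        if index == 0:
            return None
        return self._records[(url, dates[index - 1])]


store = SnapshotStore()
store.add("http://example.org/post", date(2012, 1, 1), {"title": "Draft title"})
store.add("http://example.org/post", date(2012, 5, 1), {"title": "Final title"})
print(store.as_of("http://example.org/post", date(2012, 3, 1)))
```

Because snapshots are kept rather than overwritten, a citation made in March continues to show the metadata as it was then, even after the page changes or disappears.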

Third, Greycite is also useful outside of WordPress. For instance, Greycite also provides bibtex, so it can be used with a bibliographic manager. This is very useful at authoring time, as we can use this metadata to search over a list of relevant URLs, and then to select between them.

Finally, we wanted to be able to add additional functionality, which may require upgrading the database periodically, which is harder to do within a plugin. For example, we have already added links through to the UK Web Archive (http://www.webarchive.org.uk/ukwa/), for those resources which are archived. We will add the Internet Archive (http://www.archive.org/), and Web Cite (http://www.webcitation.org/) in time also. This means that not only should citations remain displayed correctly if resources disappear or change, it should still be possible to get to their contents in many cases.


The article as linked data

The existence of Greycite allows us to turn a blog post into a linked-data academic article. The reader of an article sees, as well as the content directly generated by the author, data gathered from all the outgoing links. The reference list, therefore, ceases to be a mechanism for finding secondary sources, and becomes a usability tool; readers can understand what sources are being relied on, without having to remember URLs or click through to them. Likewise, the authors can use the linked data environment outside of a web browser to help with authoring. Metadata that is useful to readers is, unsurprisingly, also useful to authors (who tend to be the first people to read an article anyway!).


Discussion

With Greycite, we were interested in adding more formal citation to the web in general, and more specifically supporting kcite (http://www.russet.org.uk/blog/2011/12/kcite-the-next-generation/). We believe that we have achieved this in part with a relatively light-weight service. Greycite is useful for article display, and for authoring.

In addition, we start to address the issues of link breakage, by building on the back of existing archiving services. Articles will still be able to display metadata for a cited article if that article disappears. Future versions of kcite will also redirect links to the nearest web archive when this happens. We have done this without recourse to secondary identifiers such as a DOI or PURL, which we believe represents a better user experience. Building on the back of existing web archives also addresses a critical scalability issue; the Greycite database needs only to store bibliographic metadata, which is likely to remain tractable. From a legal perspective, we also side-step issues of copyright, as gathering metadata alone is likely to be covered by fair dealing clauses.

By depending only on metadata present in the URL itself, we can guarantee that metadata is authoritative (not, of course, that it is “correct”, as in reflecting the authors’ intentions, but it does match what they said). It also means that we do not control the metadata; it has not been entered into Greycite; it is out there, available on the web, free for anyone to gather. We wish to be part of the semantic web, not a walled garden within it.

Finally, we have started to build a linked data environment for academic publishing. Bibliographic metadata is, of course, only the start. It is not a suitable way to present all kinds of information; for instance, Chemicalize (http://www.chemicalize.org/) provide a nice plugin which transforms chemical names into something richer. But by harnessing the power of the web, and building on existing resources, we should be able to build a rich and full featured environment for presenting scientific knowledge.

Bibliography

I have a PhD studentship available for anyone wishing to work on using the Semantic Web and linked data to improve the process of scientific publishing.

I want to expand on the work that we have done with Kcite (http://www.russet.org.uk/blog/2012/02/kcite-spreads-its-wings/), which links between different articles, and consider how we would link to and from both raw data and ontological resources. We will do this in a practical, real-world environment: we will be extending WordPress server-side; all the tools that we generate will be released into the “wild” as we go; and we will be active in supporting users so that we can incorporate feedback. We will be targeting the academic blogosphere, in addition to working with the content on http://knowledgeblog.org.

If you are interested, please feel free to email me directly. The full details of the advert are below.


Advert

http://www.ncl.ac.uk/postgraduate/funding/search/list/cs024

The linked data initiative seeks to increase the machine computability of the web, but it is hard for authors to generate linked-data. We will investigate ways of publishing scientific knowledge where authors, readers and computational agents all gain advantage from additional semantics and machine computability. We will investigate representation of graph data and deep linking to ontological resources.

This project will feed into Knowledgeblog (http://knowledgeblog.org), which is both a high-traffic (100k+ page reads) academic site in its own right and a source of software released for third-party use by the academic blogosphere. Combined with exemplars using real data where possible, this will provide two valuable routes to evaluate and assess the representations in a real-world environment.

Value of the Award and Eligibility

Depending on how you meet the EPSRC criteria (http://www.epsrc.ac.uk/funding/students/pages/eligibility.aspx), you may be entitled to a full or partial award. A full award covers tuition fees at the UK/EU rate and an annual stipend of £14,790 (2012/13). A partial award covers fees at the UK/EU rate only. The studentship is not available for candidates from outside of the EU.

Person Specification

You should have either a First class honours degree in Computing Science, Mathematics, or other relevant science or engineering subject, or a 2.1 in Computing Science, Mathematics or other relevant science or engineering subject and a distinction-level Masters degree in a related subject. Equivalent experience will also be considered.

How to Apply

Apply through the University’s online postgraduate application form: insert the reference CS024 and select ‘PhD COMP’, with programme code 8050F, as programme of study. Mandatory fields need to be completed and a covering letter, CV and (if English is not your first language) a copy of your English language qualifications attached. The letter must state the title of the studentship, quote reference CS024 and describe how your research interests fit with the topic of the research project outlined (max. 2 pages). If you already have published research papers, a list of these providing bibliographic details should be included in the letter.

You should also send your covering letter and CV to the Postgraduate Secretary at cs.pg@ncl.ac.uk.

Further Information

For further details, please contact Phillip Lord (phillip.lord@newcastle.ac.uk), 0191 222 7827

Bibliography

Phillip Lord, Simon Cockell and Robert Stevens
School of Computing Science, Newcastle University,
Newcastle-upon-Tyne, UK
Bioinformatics Support Unit, Newcastle University,
Newcastle-upon-Tyne, UK
School of Computer Science, University of Manchester, UK
phillip.lord@newcastle.ac.uk

Semantic publishing offers the promise of computable papers, enriched visualisation and a realisation of the linked data ideal. In reality, however, the publication process contrives to prevent richer semantics while culminating in a ‘lumpen’ PDF. In this paper, we discuss a web-first approach to publication, and describe a three-tiered approach which integrates with the existing authoring tooling. Critically, although it adds limited semantics, it does provide value to all the participants in the process: the author, the reader and the machine.

License: This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/. It is also available at http://www.russet.org.uk/blog/2012/04/three-steps-to-heaven/. It was written for SePublica 2012.

1 Introduction

The publishing of both data and narratives on those data is changing radically. Linked Open Data and related semantic technologies allow for semantic publishing of data. We still need, however, to publish the narratives on that data and that style of publishing is in the process of change; one of those changes is the incorporation of semantics (http://dx.doi.org/10.1109/MIS.2006.62)(http://dx.doi.org/10.1087/2009202)(http://dx.doi.org/10.1371/journal.pcbi.1000361). The idea of semantic publishing is an attractive one for those who wish to consume papers electronically; it should enhance the richness of the computational component of papers (http://dx.doi.org/10.1087/2009202). It promises a realisation of the vision of a next generation of the web, with papers becoming a critical part of a linked data environment (http://dx.doi.org/10.1109/MIS.2006.62),(http://dx.doi.org/10.4018/jswis.2009081901), where the results and narratives become one.

The reality, however, is somewhat different. There are significant barriers to the acceptance of semantic publishing as a standard mechanism for academic publishing. The web was invented around 1990 as a light-weight mechanism for publication of documents. It has subsequently had a massive impact on society in general. It has, however, barely touched most scientific publishing; while most journals have a website, the publication process still revolves around the generation of papers, moving from Microsoft Word or LaTeX (http://www.latex-project.org), through to a final PDF which looks, feels and is something designed to be printed onto paper (this includes conferences dedicated to the web and the use of web technologies). Adding semantics into this environment is difficult or impossible; the content of the PDF has to be exposed and semantic content retro-fitted or, in all likelihood, a complex process of author and publisher interaction has to be devised and followed. If semantic data publishing and semantic publishing of academic narratives are to work together, then academic publishing needs to change.

In this paper, we describe our attempts to take a commodity publication environment, and modify it to bring in some of the formality required from academic publishing. We illustrate this with three exemplars—different kinds of knowledge that we wish to enhance. In the process, we add a small amount of semantics to the finished articles. Our key constraint is the desire to add value for all the human participants. Both authors and readers should see and recognise additional value, with the semantics a useful or necessary byproduct of the process, rather than the primary motivation. We characterise this process as our “three steps to heaven”, namely:

  • make life better for the machine to

  • make life better for the author to

  • make life better for the reader

While requiring additional value for all of these participants is hard, and places significant limitations on the level of semantics that can be achieved, we believe that it does increase the likelihood that content will be generated in the first place, and represents an attempt to enable semantic publishing in a real-world workflow.

2 Knowledgeblog

The knowledgeblog project stemmed from the desire for a book describing the many aspects of ontology development, from the underlying formal semantics, to the practical technology layer and, finally, through to the knowledge domain (http://www.russet.org.uk/blog/2011/06/ontogenesis-knowledgeblog-lightweight-semantic-publishing/). However, we have found the traditional book publishing process frustrating and unrewarding. While scientific authoring is difficult in its own right, our own experience suggests that the publishing process is extremely hard work. This is particularly so for multi-author collected works, which are often harder for the editor than writing a book “solo”. Finally, the expense and hard-copy nature of academic books means that, again in our experience, few people read them.

This contrasts starkly with the web-first publication process that has become known as blogging. With any of a number of ready-made platforms, it is possible for authors with little or no technical skill to publish content to the web with ease. For knowledgeblog (“kblog”), we have taken one blogging engine, WordPress (http://www.wordpress.org), running on low-end hardware, and used it to develop a multi-author resource describing the use of ontologies in the life sciences (our main field of expertise). There are also kblogs on bioinformatics (http://bioinformatics.knowledgeblog.org) and the Taverna workflow environment (http://taverna.knowledgeblog.org)(http://dx.doi.org/10.1093/nar/gkl320). We have previously described how we addressed some of the social aspects, including attribution, reviewing and immutability of articles (http://www.russet.org.uk/blog/2011/06/ontogenesis-knowledgeblog-lightweight-semantic-publishing/).

As well as delivering content, we are also using this framework to investigate semantic academic publishing, investigating how we can enhance the machine interpretability of the final paper, while living within the key constraint of making life (slightly) better for machine, author and reader without adding complexity for the human participants.

Scientific authors are relatively conservative. Most of them have well-established toolsets and workflows which they are relatively unwilling to change. For instance, within the kblog project, we have used workshops to start the process of content generation. For our initial meeting, we gave little guidance on authoring process to authors, as a result of which most attempted to use WordPress directly for authoring. The WordPress editing environment is, however, web-based, and was originally designed for editing short, non-technical articles. It appeared to not work well for most scientists.

The requirements that authors have for such ‘scientific’ articles are manifold. Many wish to be able to author while offline (particularly on trains or planes). Almost all scientific papers are multi-author, and some degree of collaboration is required. Many scientists in the life sciences wish to author in Word because grant bodies and journals often produce templates as Word documents. Many wish to use LaTeX, because its idiomatic approach to programming documents is unreplicable with anything else. Fortunately, it is possible to induce WordPress to accept content from many different authoring tools, including Word and LaTeX (http://www.russet.org.uk/blog/2011/06/ontogenesis-knowledgeblog-lightweight-semantic-publishing/).

As a result, during the kblog project, we have seen many different workflows in use, often highly idiosyncratic in nature. These include:

Word/Email:

Many authors write using MS Word and collaborate by emailing files around. This method has a low barrier to entry, but requires significant social processes to prevent conflicting versions, particularly as the number of authors increases.

Word/Dropbox:

For the Taverna kblog (http://taverna.knowledgeblog.org), authors wrote in Word and collaborated with Dropbox (http://www.dropbox.com). This method works reasonably well where many authors are involved; Dropbox detects conflicts, although it cannot prevent or merge them.

Asciidoc/Dropbox:

Used by the authors of this paper. Asciidoc (http://www.methods.co.nz/asciidoc) is relatively simple, somewhat programmable and accessible. Unlike LaTeX which can be induced to produce HTML with effort, asciidoc is designed to do so.

Of these three approaches, the Word/Dropbox combination is probably the most generally used.

From the reader’s perspective, a decision that we have made within knowledgeblog is to be “HTML-first”. The initial reasons for this were entirely practical; supporting multiple toolsets is hard, particularly if any degree of consistency is to be maintained; the generation of the HTML is at least partly controlled by the middleware – WordPress in kblog’s case. As well as enabling consistency of presentation, it also, potentially, allows us to add additional knowledge; it makes semantic publication a possibility. However, we are aware that knowledgeblog currently scores rather badly on what we describe as the “bath-tub test”; while exporting to PDF or printing out is possible, the presentation is not as “neat” as would be ideal. In this regard (and we hope only in this regard), the knowledgeblog experience is limited. However, increasingly, readers are happy to interact, and capable of interacting, with material on the web without printouts.

From this background and aim, we have drawn the following requirements:

  1. The author can, as much as possible, remain within familiar authoring environments;

  2. The representation of the published work should remain extensible to, for instance, semantic enhancements;

  3. The author and reader should be able to have the amount of “formal” academic publishing they need;

  4. Support for semantic publishing should be gradual and offer advantages for author and reader at all stages.

We describe how we have achieved this with three exemplars, two of which are relatively general in use, and one more specific to biology. In each case, we have taken a slightly different approach, but have fulfilled our primary aim of making life better for machine, author and reader.

3 Representing Mathematics

The representation of mathematics is a common need in academic literature. Mathematical notation has grown from a requirement for a syntax which is highly expressive and relatively easy to write. It presents specific challenges because of its complexity, the difficulty of authoring and the difficulty of rendering, away from the chalk board that is its natural home.

Support for mathematics has had a significant impact on academic publishing. It was, for example, the original motivation behind the development of TeX (http://en.wikipedia.org/wiki/TeX), and it is still one of the main reasons why authors wish to use it or its derivatives. This is to such an extent that much mathematics rendering on the web is driven by a TeX engine somewhere in the process. So MediaWiki (and therefore Wikipedia), Drupal and, of course, WordPress follow this route. The latter provides plugin support for TeX markup using the wp-latex plugin (http://wordpress.org/extend/plugins/wp-latex/). Within kblog, we have developed a new plugin called mathjax-latex (http://wordpress.org/extend/plugins/mathjax-latex/). From the kblog author’s perspective these two offer a similar interface – differences are, therefore, described later.

Authors write their mathematics directly as TeX using one of the four markup syntaxes. The most explicit (and therefore least likely to happen accidentally) is through the use of “shortcodes” (http://codex.wordpress.org/Shortcode).

These are an HTML-like markup originating from some forum/bulletin board systems. In this form an equation would be entered as [latex]e=mc^2[/latex], which would be rendered as “\(e=mc^2\)”. It is also possible to use three other syntaxes which are closer to math-mode in TeX: $$e=mc^2$$, $latex e=mc^2$, or \[e=mc^2\].
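To show that the four syntaxes carry the same underlying TeX, here is a minimal Python sketch that pulls the TeX source out of text written in any of them. The regular expressions and the function name extract_tex are assumptions for illustration; this is not the mathjax-latex implementation, and it does not attempt to handle nested or escaped delimiters.

```python
import re

# One pattern per authoring syntax described in the text.
PATTERNS = [
    re.compile(r"\[latex\](.+?)\[/latex\]", re.DOTALL),  # shortcode
    re.compile(r"\$\$(.+?)\$\$", re.DOTALL),             # $$...$$
    re.compile(r"\$latex\s+(.+?)\$", re.DOTALL),         # $latex ...$
    re.compile(r"\\\[(.+?)\\\]", re.DOTALL),             # \[...\]
]


def extract_tex(text):
    """Return the TeX source of each equation, in document order."""
    found = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(1).strip()))
    return [tex for _, tex in sorted(found)]


sample = r"Energy: [latex]e=mc^2[/latex], also $$e=mc^2$$ and \[e=mc^2\]."
print(extract_tex(sample))
```

Whichever syntax the author chose, the extracted string is the same, which is what lets a single renderer serve all of them.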

From the authorial perspective, we have added significant value, as it is possible to use a variety of syntaxes, which are independent of the authoring engine. For example, a TeX-loving mathematician working with a Word-using biologist can still set their equations using TeX syntax. Word will not render these at authoring time but, in practice, this causes few problems for such authors, who are experienced at reading TeX. Within a LaTeX workflow, equations will be renderable both locally, with source compiled to PDF, and when published to WordPress.

There is also a W3C recommendation, MathML, for the representation and presentation of mathematics. The kblog environment also supports this. In this case, the equivalent source appears as follows:

 <math>
   <mrow>
     <mi>E</mi>
     <mo>=</mo>
     <mrow>
       <mi>m</mi>
       <msup>
         <mi>c</mi>
         <mn>2</mn>
       </msup>
     </mrow>
   </mrow>
 </math>

One problem with the MathML representation is obvious: it is very long-winded. A second issue, however, is that it is hard to integrate with existing workflows; most of the publication workflows we have seen in use will, on recognising an angle bracket, turn it into the equivalent HTML entity. For some workflows (LaTeX, asciidoc) it is possible, although not easy, to prevent this within the native syntax.

It is also possible to convert from Word’s native OMML (“equation editor”) XML representation to MathML, although this does not integrate with Word’s native blog publication workflow. Ironically, it is because MathML shares an XML based syntax with the final presentation format (HTML) that the problem arises. The shortcode syntax, for example, passes straight-through most of the publication frameworks to be consumed by the middleware. From a pragmatic point of view, therefore, supporting shortcodes and TeX-like syntaxes has considerable advantages.
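The difference is easy to demonstrate. In the Python sketch below, the hypothetical function publish stands in for an authoring tool that escapes markup on export: the MathML arrives at the middleware as inert text, while the shortcode arrives untouched.

```python
import html

mathml = "<math><mi>E</mi></math>"
shortcode = "[latex]e=mc^2[/latex]"


def publish(source):
    """Stand-in for an authoring tool that escapes markup on export."""
    return html.escape(source, quote=False)


print(publish(mathml))     # angle brackets become HTML entities
print(publish(shortcode))  # square brackets pass straight through
```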

For the reader, the use of mathjax-latex has significant advantages. The default mechanism within WordPress uses a math-mode like syntax $latex e=mc^2$. This is rendered using a TeX engine into an image which is then incorporated and linked using normal HTML capabilities. This representation is opaque and non-semantic; it has significant limitations for the reader. The images are not scalable – zooming in causes severe pixelation; the background to the mathematics is coloured inside the image, so does not necessarily reflect the local style.

Kblog, however, uses the MathJax library (http://www.mathjax.org) this has a number of significant advantages for the reader. First, where the browser supports them, MathJax uses webfonts to render the images; these are scalable, attractive and standardized. Where they are not available, MathJax can fall-back to bitmapped fonts. The reader can also access additional functionality: clicking on an equation will raise a zoomed in popup; while the context menu allows access to a textual representation either as TeX or MathML irrespective of the form that the author used. This can be cut-and-paste for further use. Kblog uses the MathJax library (http://www.mathjax.org) to render the underlying TeX directly on the client.

Our use of MathJax provides no significant disadvantages to the middleware layers. It is implemented in JavaScript and runs in most environments. Although, the library is fairly large (>100Mb), but is available on a CDN so need not stress server storage space. Most of this space comes from the bit-mapped fonts which are only downloaded on-demand, so should not stress web clients either. It also obviates the need for a TeX installation which wp-latex may require (although this plugin can use an external server also).

At face value, mathjax-latex necessarily adds very little semantics to the maths embedded within documents. The maths could be represented as $‍$E=mc^2$‍$, \‍(E=mc^2\‍) or

<math>
  <mrow>
    <mi>E</mi>
    <mo>=</mo>
    <mrow>
      <mi>m</mi>
      <msup>
        <mi>c</mi>
        <mn>2</mn>
      </msup>
    </mrow>
  </mrow>
</math>

So, we have a heterogeneous representation for identical knowledge. However, in practice, the situation is much better than this. The author of the work created these equations and has then read them, transformed by MathJax into a rendered form. If MathJax has failed to translate them correctly, in line with the author’s intention, or if it has had some implications for the text in addition to setting the intended equations (if the TeX style markup appears accidentally elsewhere in the document), the author is likely to have seen this and fixed the problem. Someone wishing, for example, to extract all the mathematics as MathML from these documents computationally, therefore, knows:

  • that the document contains maths as it imports MathJax

  • that MathJax is capable of identifying this maths correctly

  • that equations can be transformed to MathML using MathJax (This is assuming MathJax works correctly in general. The authors and readers are checking the rendered representation. It is possible that an equation would render correctly on screen, but be rendered to MathML inaccurately).

So, while our publication environment does not result directly in a lower level of semantic heterogeneity, it does provide the data and the tools to enable the computational agent to make this transformation. While this is imperfect, it should help a bit. In short, we provide a practical mechanism to identify text containing mathematics and a mechanism to transform this to a single, standardised representation.
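
As a rough illustration of the first two points in the list above, the following Python sketch checks whether a page loads MathJax and then pulls out the TeX fragments by their delimiters. The function names, the regular expressions and the assumption that only the $$...$$ and \(...\) delimiters are in use are ours, for illustration only; MathJax's own delimiter handling is configurable and considerably more careful than this.

```python
import re

# Assumed delimiters: $$...$$ for display maths and \(...\) for inline maths.
TEX_PATTERN = re.compile(r"\$\$(.+?)\$\$|\\\((.+?)\\\)", re.DOTALL)


def uses_mathjax(html):
    """Crude check: does the page reference a MathJax script at all?"""
    return re.search(r"<script[^>]+MathJax", html, re.IGNORECASE) is not None


def extract_tex(html):
    """Return the TeX source of every delimited fragment, in document order."""
    return [(m.group(1) or m.group(2)).strip()
            for m in TEX_PATTERN.finditer(html)]
```

Converting the extracted fragments to MathML would still need MathJax itself (or a comparable converter); this sketch only covers the identification step.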

4 Representing References

Unlike mathematics, there is no standard mechanism for reference and in-text citation, but there are a large number of tools for authors such as BibTeX, Mendeley (http://www.mendeley.org) or EndNote. As a result of this, the integration with existing toolsets is of primary importance, while the representation of the in-text citations is not, as it should be handled by the tool layer anyway.

Within kblog, we have developed a plugin called kcite (http://wordpress.org/extend/plugins/kcite/). For the author, citations are inserted using the syntax: [‍cite]10.1371/journal.pone.0012258[‍/cite]. The identifier used here is a DOI, or digital object identifier, which is widely used within the publishing and library industry. Currently, kcite supports DOIs minted by either CrossRef (http://www.crossref.org) or DataCite (http://www.datacite.org) (in practice, this means that we support the majority of DOIs). We also support identifiers from PubMed (http://www.pubmed.org), which covers most biomedical publications, and arXiv (http://www.arxiv.org), the physics (and other domains!) preprints archive, and we now have a system to support arbitrary URLs. Currently, authors are required to select the identifier where it is not a DOI.

We have picked this “shortcode” format for reasons similar to those described for maths; it is relatively unambiguous, it is not XML-based, so passes through the HTML generation layer of most authoring tools unchanged, and it is explicitly supported in WordPress, bypassing the need for regular expressions and later parsing. On its own, it would be a little unwieldy from the perspective of the author. In practice, however, it is relatively easy to integrate this with many reference managers. For example, tools such as Zotero (http://www.zotero.org) and Mendeley use the Citation Style Language, and so can output kcite-compliant citations with the following slightly elided code:

<citation>
  <layout prefix="[‍cite]" suffix="[‍/cite]"
          delimiter="[‍/cite] [‍cite]">
    <text variable="DOI"/>
  </layout>
</citation>
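
To make the effect of this layout concrete, here is a minimal Python sketch that produces the same output for a list of DOIs. The function name is hypothetical; only the prefix, suffix and delimiter strings are taken from the CSL fragment above.

```python
def kcite_citation(dois):
    """Wrap each DOI in a cite shortcode, mirroring the CSL layout element."""
    if not dois:
        return ""
    prefix, suffix = "[cite]", "[/cite]"
    delimiter = "[/cite] [cite]"
    # prefix + items joined by the delimiter + suffix, as CSL does for a cluster
    return prefix + delimiter.join(dois) + suffix
```

A single DOI yields one shortcode; several DOIs cited together yield one shortcode each, separated by a space.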

We do not yet support LaTeX/BibTeX citations, although we see no reason why a similar style file should not be supported (citations in this representation of the article were, rather painfully, converted by hand). We do, however, support BibTeX-formatted files: the first author’s preferred editing/citation environment is based around these with Emacs, RefTeX, and asciidoc. While this is undoubtedly a rather niche authoring environment, the (slightly elided) code for supporting this demonstrates the relative ease with which tool chains can be induced to support kcite:

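;; Intercept RefTeX's citation formatter; when the override flag is set,
;; emit a kcite shortcode instead of the usual citation.
(defadvice reftex-format-citation (around phil-asciidoc-around activate)
  (if phil-reftex-citation-override
      (setq ad-return-value (phil-reftex-format-citation entry format))
    ad-do-it))

;; Build the shortcode from the entry's "doi" field; the asciidoc pass:[...]
;; macro keeps asciidoc from interpreting the square brackets.
(defun phil-reftex-format-citation (entry format)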
  (let ((doi (reftex-get-bib-field "doi" entry)))
    (format "pass:[‍[‍cite source='doi'\\]%s[‍/cite\\]]" doi)))

The key decision with kcite from the authorial perspective is to ignore the reference list itself and focus only on in-text citations, using public identifiers to references. This simplifies the tool integration process enormously, as this is the only data that needs to pass from the author’s bibliographic database onward. The advantage for authors here is two-fold. First, they are not required to populate their reference metadata for themselves, and this metadata will update if it changes. Second, the identifiers are checked; if they are wrong, the authors will see this straightforwardly, as the entire reference will be wrong. Adding DOIs or other identifiers moves from being a burden for the author to being a specific advantage.

While supporting multiple forms of reference identifier (CrossRef DOI, DataCite DOI, arXiv and PubMed ID) provides a clear advantage to the author, it comes at considerable cost. While it is possible to get metadata about papers from all of these sources, there is little commonality between them. Moreover, resolving this metadata requires one outgoing HTTP request per reference (in practice, it is often more; DOI requests, for instance, use 303 redirects), which browser security might or might not allow.
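
For DOIs, one way such a request can be made is HTTP content negotiation against the DOI resolver. The Python sketch below only builds the request and does not send it; the function name is hypothetical, and the choice of resolver URL and of the CSL JSON media type is an assumption for illustration, not a description of what kcite itself sends.

```python
import urllib.request

# Media type for CSL JSON, the format consumed by citeproc-js.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_doi_request(doi, resolver="http://dx.doi.org/"):
    """Build (but do not send) a request asking the resolver for CSL JSON."""
    # Assumption: the DOI needs no URL escaping; real DOIs sometimes do.
    return urllib.request.Request(resolver + doi, headers={"Accept": CSL_JSON})
```

Passing the result to urllib.request.urlopen would follow the redirects mentioned above automatically, which is exactly the step a browser-side script may not be permitted to perform across origins.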

So, while the presentation of mathematics is performed largely on the client, for reference lists the kcite plugin performs metadata resolution and data integration on the server. A caching functionality is provided, storing this metadata in the WordPress database. The bibliographic metadata is finally transferred to the client encoded as JSON, using asynchronous call-backs to the server.
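
The resolve-once, cache and serialise pattern described here can be sketched in a few lines of Python. This is not kcite's code (kcite is a WordPress plugin); the class, its method names and the in-memory dictionary standing in for the WordPress database are all hypothetical.

```python
import json


class MetadataCache:
    """Resolve each identifier at most once, then serve it from the cache."""

    def __init__(self, resolver):
        self._resolver = resolver  # callable: identifier -> metadata dict
        self._store = {}           # stands in for the WordPress database

    def get(self, identifier):
        """Return cached metadata, calling the resolver only on a miss."""
        if identifier not in self._store:
            self._store[identifier] = self._resolver(identifier)
        return self._store[identifier]

    def as_json(self, identifiers):
        """Serialise the metadata for a post's citations for the client."""
        return json.dumps([self.get(i) for i in identifiers])
```

The resolver is passed in as a callable so that the expensive outgoing request is isolated in one place; repeated page views then cost only a cache lookup.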

Finally, this JSON is rendered using the citeproc-js library on the client. In our experience, this performs well, adding to the readers’ experience; in-text citations are initially shown as hyperlinks; rendering is rapid, even on aging hardware, and finally in-text citations are linked both to the bibliography and directly through to the external source. Currently, the format of the reference list is fixed; however, citeproc-js is a generalised reference processor, driven using CSL (http://citationstyles.org/). This makes it straight-forward to change citation format, at the option of the reader, rather than the author or publisher. Both the in-text citation and bibliography support outgoing links direct to the underlying resources (where the identifier allows; PubMed IDs redirect to PubMed). As these links have been used to gather metadata, they are likely to be correct. While these advantages are relatively small currently, we believe that the use of JavaScript rendering over linked references can be used to add further reader value in future.

For the computational agent wishing to consume bibliographic information, we have added significant value compared to the pre-formatted HTML reference list. First, all the information required to render the citation is present in the in-text citation next to the text that the authors intended. A computational agent can, therefore, ignore the bibliography list itself entirely. These primary identifiers are, again, likely to be correct because the authors now need them to be correct for their own benefit.

Should the computational agent wish, the (denormalised) bibliographic data used to render the bibliography is actually available, present in the underlying HTML as a JSON string. This is represented in a homogeneous format, although, of course, it represents our (kcite’s) interpretation of the primary data.

A final, and subtle, advantage of kcite is that the authors can only use public metadata, and not their own. If they use the correct primary identifier, and still get an incorrect reference, it follows that the public metadata must be incorrect (or, we acknowledge, that kcite is broken!). Authors and readers therefore must ask the metadata providers to fix their metadata to the benefit of all. This form of data linking, therefore, can even help those who are not using it.

4.1 Microarray Data

Many publications require that papers discussing microarray experiments lodge their data in a publicly available resource such as ArrayExpress (http://dx.doi.org/10.1093/nar/gkg091). Authors do this by placing in the paper an ArrayExpress identifier, which has a form such as E-MEXP-1551. Currently, adding this identifier to a publication, as with adding the raw data to the repository, is of no direct advantage to the author, other than fulfilment of the publication requirement. Similarly, there is no existing support within most authoring environments for adding this form of reference.

For the knowledgeblog-arrayexpress plugin (http://knowledgeblog.org/knowledgeblog-arrayexpress), therefore, we have again used a shortcode representation, but allowed the author to automatically fill metadata, direct from ArrayExpress. So a tag such as: [‍aexp id="E-MEXP-1551"]species[‍/aexp] will be replaced with Saccharomyces cerevisiae, while: [‍aexp id="E-MEXP-1551"]releasedate[‍/aexp] will be replaced by “2010-02-24”. While the advantage here is small, it is significant. Hyperlinks to ArrayExpress are automatic, and authors no longer need to look up detailed metadata. For metadata which authors are likely to know anyway (such as Species), the automatic lookup operates as a check that their ArrayExpress ID is correct. As with references (see Section 4), the use of an identifier becomes an advantage rather than a burden to the authors.
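
The substitution described above can be sketched as follows. This is an illustration in Python rather than the plugin's own code; the function name and the regular expression are hypothetical, and the hard-coded dictionary stands in for a live lookup against ArrayExpress, using only the two values quoted in the text.

```python
import re

# Matches [aexp id="..."]field[/aexp]; the pattern is an assumption.
AEXP = re.compile(r'\[aexp id="([^"]+)"\](\w+)\[/aexp\]')

# Stand-in for a query to ArrayExpress.
METADATA = {
    "E-MEXP-1551": {
        "species": "Saccharomyces cerevisiae",
        "releasedate": "2010-02-24",
    }
}


def expand_aexp(text, metadata=METADATA):
    """Replace each aexp shortcode with the requested metadata field."""
    def replace(match):
        experiment, field = match.group(1), match.group(2)
        try:
            return metadata[experiment][field]
        except KeyError:
            return match.group(0)  # leave unknown tags visible to the author
    return AEXP.sub(replace, text)
```

Leaving unresolved tags untouched, rather than silently dropping them, is what lets the lookup double as a check on the identifier.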

For the reader, the advantage is currently less significant, although there is some value in the added correctness stemming from the ArrayExpress identifier. However, knowledgeblog-arrayexpress is currently under-developed, and the added semantics that are now present could be used more extensively. The unambiguous knowledge that [‍aexp id="E-MEXP-1551"]species[‍/aexp] represents a species would allow us, for example, to link to the NCBI taxonomy database (http://www.ncbi.nlm.nih.gov/Taxonomy/).

Likewise, advantage for the computational agent from knowledgeblog-arrayexpress is currently limited; the identifiers are clearly marked up, and as the authors now care about them, they are likely to be correct. Again, however, knowledgeblog-arrayexpress is currently under-developed for the computational agent. The knowledge that is extracted from ArrayExpress could be presented within the HTML generated by knowledgeblog-arrayexpress, whether or not it is displayed to the reader, for essentially no cost. By having an underlying shortcode representation, if we choose to add this functionality to knowledgeblog-arrayexpress, any posts written using it would automatically update their HTML. For the text-mining bioinformatician, even the ability to unambiguously determine that a paper described or used a data set relating to a specific species using standardised nomenclature (the standard nomenclature was only invented in 1753 and is still not used universally) would be a considerable boon.

5 Discussion

Our approach to semantic enrichment of articles is measured and evolutionary. We are investigating how we can increase the amount of knowledge in academic articles presented in a computationally accessible form. However, we are doing so in an environment which does not require all the different aspects of authoring and publishing to be overturned. Moreover, we have followed a strong principle of semantic enhancement which offers advantages to both reader and author immediately. So, adding references as a DOI, or other identifier, ‘automagically’ produces an in-text citation and a nicely formatted reference list; that the reference list is no longer present in the article, but is a visualisation over linked data, and that the article itself has become a first-class citizen of this linked data environment, is a happy by-product.

This approach, however, also has disadvantages. There are a number of semantic enhancements which we could make straight-forwardly to the knowledgeblog environment that we have not; the principles that we have adopted require significant compromise. We offer here two examples.

First, there has been significant work by others on CiTO (http://dx.doi.org/10.1186/2041-1480-1-S1-S6), an ontology which helps to describe the relationship between the citations and a paper. Kcite lays the ground-work for an easy and straight-forward addition of CiTO tags surrounding each in-text citation. Doing so would enable increased machine understandability of a reference list. Potentially, we could use this to the advantage of the reader also: we could distinguish between reviews and primary research papers; highlight the authors’ previous work; emphasise older papers which are being refuted. However, to do this requires additional semantics from the author. Although these CiTO semantic enhancements would be easy to insert directly using the shortcode syntax, most authors will want to use their existing reference manager, which will not support this form of semantics; even if it does, the author gains little advantage from adding these semantics. There are advantages for the reader, but in this case not for both author and reader. As a result, we will probably add such support to kcite; but, if we are honest, we find it unlikely that, when acting as content authors, we will find the time to add these additional semantics.

Second, our presentation of mathematics could be modified to automatically generate MathML from any included TeX markup. The transformation could be performed on the server, using MathJax; MathML would still be rendered on the client to webfonts. This would mean that any embedded maths would be discoverable because of the existence of MathML, which is a considerable advantage. However, neither the reader nor the author gains any advantage from doing this, while paying the cost of the slower load times and higher server load that would result from running JavaScript on the server. Moreover, they would pay this cost regardless of whether their content were actually being consumed computationally. As the situation now stands, the computational user needs to identify the inclusion of MathJax in the web page, and then transform the page using this library, none of which is standard. This is clearly a serious compromise, but we feel a necessary one.

Our support for microarrays offers the possibility of the most specific and richest semantics of all of our plugins. Knowledge about a species or a microarray experimental design can be precisely represented. Almost by definition, though, this form of knowledge is fairly niche and only likely to be of relevance to a small community. However, we do note that the knowledgeblog process, based around commodity technology, does offer a publishing process that can be adapted, extended and specialised in this way relatively easily. Ultimately, the many small communities that make up the long tail of scientific publishing add up to one large one.

6 Conclusion

Semantic publishing is a desirable goal, but goals need to be realistic and achievable. To move towards semantic publishing in kblog, we have tried to put in place an approach that gives benefit to readers, authors and computational interpretation. As a result, at this stage, we have light semantic publishing, with small but definite benefits for all.

Semantics give meaning to entities. In kblog, we have sought benefit by “saying” within the kblog environment that entity x is either maths, a citation or a microarray data entity reference. This is sufficient for the kblog infrastructure to “know what to do” with the entity in question. Knowing that some publishable entity is a “lump” of maths tells the infrastructure how to handle that entity: the reader has benefit from it looking like maths; the author has benefit by not having to do very much; and the infrastructure knows what to do. In addition, this approach leaves in hooks for doing more later.

It is not necessarily easy to find compelling examples that give advantages for all steps. Adding in CiTO attributes to citations, for instance, has obvious advantages for the reader, but not the author. However, advantages may be indirect; richer reader semantics may give more readers and thus more citations, the thing authors appreciate as much as the act of publishing itself. It is, however, difficult to imagine how such advantages can be conveyed to the author at the point of writing. It is easy to see the advantages of semantic publishing for readers; as a community, we need to pay attention to advantages to the authors. Without these “carrots”, we will only have “sticks” and authors, particularly technically skilled ones, are highly adept at working around sticks.

Bibliography

In my previous articles, I have talked about general problems with DOIs (http://www.russet.org.uk/blog/2011/02/the-problem-with-dois/), about architectural issues with capturing metadata (http://www.russet.org.uk/blog/2012/03/dois-and-content-negotiation/), and finally, about specific problem DOIs (http://www.russet.org.uk/blog/2012/03/a-problem-doi).

I have also described how part of the difficulty is that it is hard to determine the registration agency associated with a specific DOI; there are actually different kinds of DOI and they respond in different ways.

I have, however, finally found a way to discover who is responsible for a given DOI. One of my own papers (http://www.jbiomedsem.com/content/1/S1/S7) declares its DOI to be 10.1186/2041-1480-1-S1-S7. Unfortunately, referring to this paper using Kcite (http://www.russet.org.uk/blog/2012/02/kcite-spreads-its-wings/) shows, at the time of writing, an error message (http://dx.doi.org/10.1186/2041-1480-1-S1-S7), nor does the DOI resolve. As this may later be fixed, I reproduce the error message here:

The DOI you requested --

10.1186/2041-1480-1-S1-S7

-- cannot be found in the Handle System.

Possible reasons for the error are:

    the DOI has not been created
    the DOI is cited incorrectly in your source
    the DOI does not resolve due to a system problem

On filling in the “this DOI does not work” form, the error page redirects to another URL, http://notfound.doi.org/DoiError/servlet, which says:

The DOI and comments (if provided) have been logged by CrossRef and forwarded
to the publisher to correct the problem. Possible reasons for the error are:

    the DOI has been created but has not been registered by the publisher
    (this could be an error or it could be a timing issue and the DOI will be
    registered in the next few days)
    the DOI is cited incorrectly in the source
    the DOI does not resolve due to a system problem

Maintaining the integrity of DOIs is very important to CrossRef and we
appreciate your help.

Which suggests to me clearly that this DOI has (or should have) been registered with CrossRef. Unfortunately, this only works with DOIs that do not resolve in the first place. Directly accessing the link returns “nothing to see here”.

A little bit of poking around, and I have discovered a few other problematic DOIs, all from the same “special issue”.

I might take this personally, as this includes another paper of mine, although, strangely, two of the DOIs from the same issue do work, and these also include one of mine.

All of this demonstrates an advantage of our Kcite tool (http://knowledgeblog.org/kcite-plugin). By actually using primary identifiers as part of the authoring process, I have discovered five DOIs, several of which have been on my website for a long time, that are broken. Possibly, the Journal of Biomedical Semantics should take a leaf out of my book. From their web page:

Journal of Biomedical Semantics 2010, 1(Suppl 1):S7 doi:10.1186/2041-1480-1-S1-S7

The electronic version of this article is the complete one and can be found
online at: http://www.jbiomedsem.com/content/1/S1/S7

At the time of writing, the DOI is not displayed as http://dx.doi.org/10.1186/2041-1480-1-S1-S7, although this is a hard requirement from CrossRef’s display guidelines (http://www.crossref.org/02publishers/doi_display_guidelines.html). Ironically, CrossRef says “CrossRef DOIs should always be displayed as permanent URLs in the online environment.” This seems to miss the requirement that, in the online environment, DOIs should be hyperlinked, so the Journal of Biomedical Semantics cannot be faulted there. It is a shame, though. If they were hyperlinked, a web crawler would by now have discovered the 404.


Update (14/05/2012)

Since I wrote this post, three out of four of the erroneous DOIs that I reported have been fixed. One of them (http://dx.doi.org/10.1186/2041-1480-1-S1-S2) is still broken. As I am using kcite to refer to these DOIs, the references have now automatically updated.

Bibliography

In this article, I consider the problems of semantics-free identifiers in OWL and suggest another (possible) solution to the problem.

The problems of identifiers and their semantics are not new. I have written about these problems previously in the context of blog permalinks (http://www.russet.org.uk/blog/2011/05/permalink-semantics/) and of conversion between OBO format and Manchester syntax (http://www.russet.org.uk/blog/2009/09/obo-format-and-manchester-syntax/). The basic issue is one of choosing your compromise. Identifiers with semantics in them (which this blog uses although I wish it did not) are considerably more human readable, but are not resilient to change, as the semantics in the identifiers can become out of date with respect to the content they describe. But neither compromise is entirely satisfactory; we need a more pragmatic approach (http://robertdavidstevens.wordpress.com/2011/05/26/unicorns-in-my-ontology).

Recently, I was looking at the move of the OBI ontology (http://dx.doi.org/10.1186/2041-1480-1-S1-S7) from BFO 1.0 to BFO 2.0. I have commented extensively on BFO before (http://dx.doi.org/10.1371/journal.pone.0012258), (http://www.russet.org.uk/blog/2010/07/realism-and-science/) (http://www.russet.org.uk/blog/2010/09/the-status-quo-farewell-tour-on-realism/), and I was interested in what changes have been made for BFO 2.0.

Unfortunately, it is not that easy to work out. While diffs have never been the most human readable of output, the OBI diffs raise this to a new level. Consider this change:

svn diff -r 3424:3425 https://obi.svn.sourceforge.net/svnroot/obi/trunk/src/ontology/branches/obi.owl

@@ -204,7 +197,7 @@
     <owl:ObjectProperty rdf:about="http://purl.obolibrary.org/obo/OBI_0000107">
         <rdfs:label>provides_service_consumer_with</rdfs:label>
         <rdfs:domain rdf:resource="http://purl.obolibrary.org/obo/OBI_0001173"/>
-        <rdfs:subPropertyOf rdf:resource="http://www.obofoundry.org/ro/ro.owl#has_part"/>
+        <rdfs:subPropertyOf rdf:resource="http://purl.obolibrary.org/obo/BFO_0000051"/>
     </owl:ObjectProperty>

Also available here for those without access to a local subversion. The resource previously known as has_part has become the rather more obscure BFO_0000051. In short, BFO has become semantics-free.

In general, I think that this is a good thing. The use of semantics in the identifiers for this blog is generally not helpful, although I have never carried through my year-old threat (http://www.russet.org.uk/blog/2011/05/permalink-semantics/) to change the identifier scheme as I am not sure older links will be maintained. But the total unreadability of the OBI diff demonstrates a problem. One answer is that we should not be reading OWL source in the first place, but using tools. These tools exist (http://www.ebi.ac.uk/efo/bubastis/), in fact, but they are not a replacement for a diff, but a supplement to it. Source code must be in a readable syntax because line-orientated syntax is the lowest common denominator; semantic diffs are nice, but next we would need an OWL aware versioning tool, as versioning depends on diffing. Then OWL aware regexp search and replace tools for when syntactic alterations were needed. Eventually, we would end up replacing an entire software stack and, no doubt, doing it badly, since tools such as versioning software have a long heritage and are now very functional (and incredibly complex!).

My previous, minimal suggestion was to use a denormalisation, by adding a new comment character. So

ObjectProperty http://purl.obolibrary.org/obo/BFO_0000051

would become

ObjectProperty http://purl.obolibrary.org/obo/BFO_0000051[has_part]

The denormalisation here, presenting the same information as an opaque string and as a text string, fulfils both requirements. However, it would require significant effort to keep the two in sync.

My new idea is similar to a Colour Lookup Table (http://en.wikipedia.org/wiki/Colour_look-up_table). These are used to define a palette of colours selected from a much larger colour space. We could use a similar approach here. Essentially the idea is to put semantics-free IDs at the top of the file, then meaningful ones in the middle. The idea is also similar to the use of abbreviations for namespaces in XML; for instance,

<owl:ObjectProperty rdf:about="http://purl.obolibrary.org/obo/OBI_0000107">

the rdf: prefix actually refers to “http://www.w3.org/1999/02/22-rdf-syntax-ns#”. The letters rdf could be replaced by anything at all, so long as we update the namespace declaration without changing semantics.
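
The point about prefixes can be checked with any namespace-aware XML parser. The short Python sketch below uses the standard library's ElementTree; the prefix "anything" and the variable names are of course arbitrary choices for the illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The same element written with two different prefixes for one namespace.
with_rdf = ET.fromstring(f'<rdf:RDF xmlns:rdf="{RDF_NS}"/>')
with_other = ET.fromstring(f'<anything:RDF xmlns:anything="{RDF_NS}"/>')

# Both parse to the same expanded name; the prefix itself carries no meaning.
same_name = with_rdf.tag == with_other.tag
```

After parsing, only the namespace URI and the local name remain, which is the property the alias proposal below relies on.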

In Manchester syntax, we could address this with an addition of an alias keyword. So:

ObjectProperty http://purl.obolibrary.org/obo/OBI_0000107
   Annotations: rdfs:label="provides_service_consumer_with"
   Domain: http://purl.obolibrary.org/obo/OBI_0001173
   SubPropertyOf: http://purl.obolibrary.org/obo/BFO_0000051

would become

Prefix: obo: http://purl.obolibrary.org/obo/
Alias: obo:OBI_0000107 "provides_service_consumer_with"
Alias: obo:OBI_0001173 "service"
Alias: obo:BFO_0000051 "has_part"

ObjectProperty provides_service_consumer_with
   Annotations: rdfs:label="provides_service_consumer_with"
   Domain: service
   SubPropertyOf: has_part

In this case, because we are defining a term and attaching a label we get the same string twice, but there is no formal link between the two. With this system in place, moving the identifiers for BFO would have required an update to only the Alias table at the top. Now an obvious place for the strings to come from would be the source ontology (so “has_part” would come from RO (http://dx.doi.org/10.1186/gb-2005-6-5-r46), or now BFO); this would, in fact, serve as a useful check. If I reference an external ontology and its labels do not match with my Alias definitions, I may wish to check to see whether the concepts I have imported still have the semantics that I intended.
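
A minimal Python sketch of how a tool might read such an alias table and run the label check just described. The Alias: keyword is only the proposal made in this post, not part of Manchester syntax, and the function names and regular expressions are hypothetical.

```python
import re

PREFIX_LINE = re.compile(r'^Prefix:\s+(\w+):\s+(\S+)\s*$')
ALIAS_LINE = re.compile(r'^Alias:\s+(\S+)\s+"([^"]+)"\s*$')


def read_aliases(source):
    """Return {label: full IRI} from the proposed Prefix:/Alias: header lines."""
    prefixes, aliases = {}, {}
    for line in source.splitlines():
        prefix = PREFIX_LINE.match(line)
        alias = ALIAS_LINE.match(line)
        if prefix:
            prefixes[prefix.group(1)] = prefix.group(2)
        elif alias:
            name, local = alias.group(1).split(":", 1)
            aliases[alias.group(2)] = prefixes[name] + local
    return aliases


def mismatched_aliases(aliases, source_labels):
    """List aliases whose label differs from the label in the source ontology."""
    return sorted(label for label, iri in aliases.items()
                  if source_labels.get(iri) != label)
```

Here source_labels is assumed to be a mapping from IRI to rdfs:label taken from the imported ontology; any alias returned by mismatched_aliases is a candidate for the manual check suggested above.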

The same approach could be directly translated into the XML representation without change, I believe, with the use of XML entities which are defined at the start of an XML document. Of course, this is entirely horrible, and changing the OWL schema would make more sense. Extending Manchester syntax is straight-forward as I think I have shown here. Likewise, for OBO format. And the practical upshot would be a significant increase in the readability of many ontologies without eschewing the good practice of semantics free identifiers.

Bibliography