Archive for the ‘Papers’ Category


Abstract

The Tawny-OWL library provides a fully-programmatic environment for ontology building; it enables the use of a rich set of tools for ontology development, by recasting development as a form of programming. It is built in Clojure, a modern Lisp dialect, and is backed by the OWL API. Used simply, it has a similar syntax to OWL Manchester syntax, but it provides arbitrary extensibility and abstraction. It builds on existing facilities for Clojure, which provide a rich and modern programming tool chain for versioning, distributed development, build, testing and continuous integration. In this paper, we describe the library, this environment and its potential implications for the ontology development process.

  • Phillip Lord


Plain English Summary

In this paper, I describe some new software, called Tawny-OWL, that addresses the issue of building ontologies. An ontology is a formal hierarchy, which can be used to describe different parts of the world, including biology which is my main interest.

Building ontologies in any form is hard, but many ontologies are repetitive, having many similar terms. Current ontology building tools tend to require a significant amount of manual intervention. Rather than creating yet more new tools, Tawny-OWL is a library written in a full programming language, which recasts the problem of ontology building as one of programming. Instead of building new ontology tools, the hope is that Tawny-OWL will enable ontology builders to simply use existing tools that are designed for general-purpose programming. As there are many more people involved in general programming, many such tools already exist and are very advanced.

This is the first paper on the topic, although it has been discussed before here.

This paper was written for the OWLED workshop in 2013.


Reviews

Reviews are posted here with the kind permission of the reviewers. Reviewers are identified or remain anonymous (also to myself) at their option. Copyright of the review remains with the reviewer and is not subject to the overall blog license. Reviews may not relate to the latest version of this paper.

Review 1

The given paper is a solid presentation of a system for supporting the development of ontologies – and therefore not really a scientific/research paper.

It describes Tawny OWL in a sufficiently comprehensive and detailed fashion to understand both the rationale behind the system and its functioning. The text itself is well written and also well structured. Further, the combination of the descriptive text with the given (code) examples makes the different functionality highlights of Tawny OWL very easy to grasp and appraise.

As another big plus of this paper, I see the availability of all source code which supports the fact that the system is indeed actually available – instead of being just another description of a “hidden” research system.

The possibility to integrate Tawny OWL in a common (programming) environment, the abstraction level support, the modularity and the testing “framework” along with its straightforward syntax make it indeed very appealing and sophisticated.

But the just said comes with a little warning: my above judgment (especially the last comment) is highly biased by the fact that I am also a software developer, and thus I do not know how much of the above would apply to non-programmers as well.

And along with the above warning, I actually see a (more global) problem with the proposed approach to ontology development: the mentioned “waterfall methodologies” are still most often used for creating ontologies (at least in the field of biomedical ontologies), and thus I wonder how much programmatic approaches, as implemented by Tawny OWL, will be adopted in the future, or in which way they might get somehow integrated into those methodologies.

Review 2

This review is by Bijan Parsia.

This paper presents a toolkit for OWL manipulation based on Clojure. The library is interesting enough, although hardly innovative. The paper definitely oversells it while neglecting details of interest (e.g., size, facilities, etc.). It also neglects relevant related work, Thea-OWL, InfixOWL, even KRSS, KIF, SXML, etc.

I would like to see some discussion of the challenges of making an effective DSL for OWL, esp. when you incorporate higher abstractions. For example, how do I check that a generative function for a set of axioms will always generate an OWL DL ontology? (That seems to be the biggest programming language theoretic challenge.)

Some of the discussion is rather cavalier as well, e.g.,

“Alternatively, the ContentCVS system does support offline concurrent modification. It uses the notion of structural equivalence for comparison and resolution of conflicts [4]; the authors argue that an ontology is a set of axioms. However, as the name suggests, their versioning system mirrors the capabilities of CVS – a client-server based system, which is now considered archaic.”

I mean, the interesting part of ContentCVS is the diffing algorithm (note that there’s a growing literature on diff in OWL). This paper focuses on the inessential aspect (i.e., really riffing off the name) and ignores the essential (i.e., what does diff mean). Worse, to the degree that it does focus on that, it only focuses on the set like nature of OWL according to the structural spec. The challenges of diffing OWL (e.g., if I delete an axiom have I actually deleted it) are ignored.

Finally, the structural specification defines an API for OWL. It would be nice to see a comparison and/or critique.

Phillip Lord, Simon Cockell and Robert Stevens
School of Computing Science, Newcastle University,
Newcastle-upon-Tyne, UK
Bioinformatics Support Unit, Newcastle University,
Newcastle-upon-Tyne, UK
School of Computer Science, University of Manchester, UK
phillip.lord@newcastle.ac.uk

Semantic publishing offers the promise of computable papers, enriched visualisation and a realisation of the linked data ideal. In reality, however, the publication process contrives to prevent richer semantics while culminating in a ‘lumpen’ PDF. In this paper, we discuss a web-first approach to publication, and describe a three-tiered approach which integrates with the existing authoring tooling. Critically, although it adds limited semantics, it does provide value to all the participants in the process: the author, the reader and the machine.

License: This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/. It is also available at http://www.russet.org.uk/blog/2012/04/three-steps-to-heaven/. It was written for SePublica 2012.

1 Introduction

The publishing of both data and narratives on those data is changing radically. Linked Open Data and related semantic technologies allow for semantic publishing of data. We still need, however, to publish the narratives on that data, and that style of publishing is in the process of change; one of those changes is the incorporation of semantics (10.1109/MIS.2006.62)(10.1087/2009202)(10.1371/journal.pcbi.1000361). The idea of semantic publishing is an attractive one for those who wish to consume papers electronically; it should enhance the richness of the computational component of papers (10.1087/2009202). It promises a realisation of the vision of a next generation of the web, with papers becoming a critical part of a linked data environment (10.1109/MIS.2006.62),(10.4018/jswis.2009081901), where the results and narratives become one.

The reality, however, is somewhat different. There are significant barriers to the acceptance of semantic publishing as a standard mechanism for academic publishing. The web was invented around 1990 as a light-weight mechanism for publication of documents. It has subsequently had a massive impact on society in general. It has, however, barely touched most scientific publishing; while most journals have a website, the publication process still revolves around the generation of papers, moving from Microsoft Word or LaTeX (http://www.latex-project.org), through to a final PDF which looks, feels and is something designed to be printed onto paper (this includes conferences dedicated to the web and the use of web technologies). Adding semantics into this environment is difficult or impossible; the content of the PDF has to be exposed and semantic content retro-fitted or, in all likelihood, a complex process of author and publisher interaction has to be devised and followed. If semantic data publishing and semantic publishing of academic narratives are to work together, then academic publishing needs to change.

In this paper, we describe our attempts to take a commodity publication environment, and modify it to bring in some of the formality required from academic publishing. We illustrate this with three exemplars—different kinds of knowledge that we wish to enhance. In the process, we add a small amount of semantics to the finished articles. Our key constraint is the desire to add value for all the human participants. Both authors and readers should see and recognise additional value, with the semantics a useful or necessary byproduct of the process, rather than the primary motivation. We characterise this process as our “three steps to heaven”, namely:

  • make life better for the machine to

  • make life better for the author to

  • make life better for the reader

While requiring additional value for all of these participants is hard, and places significant limitations on the level of semantics that can be achieved, we believe that it does increase the likelihood that content will be generated in the first place, and represents an attempt to enable semantic publishing in a real-world workflow.

2 Knowledgeblog

The knowledgeblog project stemmed from the desire for a book describing the many aspects of ontology development, from the underlying formal semantics, to the practical technology layer and, finally, through to the knowledge domain (http://www.russet.org.uk/blog/2011/06/ontogenesis-knowledgeblog-lightweight-semantic-publishing/). However, we have found the traditional book publishing process frustrating and unrewarding. While scientific authoring is difficult in its own right, our own experience suggests that the publishing process is extremely hard work. This is particularly so for multi-author collected works, which are often harder for the editor than writing a book “solo”. Finally, the expense and hard-copy nature of academic books means that, again in our experience, few people read them.

This contrasts starkly with the web-first publication process that has become known as blogging. With any of a number of ready-made platforms, it is possible for authors with little or no technical skill to publish content to the web with ease. For knowledgeblog (“kblog”), we have taken one blogging engine, WordPress (http://www.wordpress.org), running on low-end hardware, and used it to develop a multi-author resource describing the use of ontologies in the life sciences (our main field of expertise). There are also kblogs on bioinformatics (http://bioinformatics.knowledgeblog.org) and the Taverna workflow environment (http://taverna.knowledgeblog.org)(10.1093/nar/gkl320). We have previously described how we addressed some of the social aspects, including attribution, reviewing and immutability of articles (http://www.russet.org.uk/blog/2011/06/ontogenesis-knowledgeblog-lightweight-semantic-publishing/).

As well as delivering content, we are also using this framework to investigate semantic academic publishing, investigating how we can enhance the machine interpretability of the final paper, while living within the key constraint of making life (slightly) better for machine, author and reader without adding complexity for the human participants.

Scientific authors are relatively conservative. Most of them have well-established toolsets and workflows which they are relatively unwilling to change. For instance, within the kblog project, we have used workshops to start the process of content generation. For our initial meeting, we gave authors little guidance on the authoring process, as a result of which most attempted to use WordPress directly for authoring. The WordPress editing environment is, however, web-based, and was originally designed for editing short, non-technical articles. It appeared not to work well for most scientists.

The requirements that authors have for such ‘scientific’ articles are manifold. Many wish to be able to author while offline (particularly on trains or planes). Almost all scientific papers are multi-author, and some degree of collaboration is required. Many scientists in the life sciences wish to author in Word because grant bodies and journals often produce templates as Word documents. Many wish to use LaTeX, because its idiomatic approach to programming documents is unreplicable with anything else. Fortunately, it is possible to induce WordPress to accept content from many different authoring tools, including Word and LaTeX (http://www.russet.org.uk/blog/2011/06/ontogenesis-knowledgeblog-lightweight-semantic-publishing/).

As a result, during the kblog project, we have seen many different workflows in use, often highly idiosyncratic in nature. These include:

Word/Email:

Many authors write using MS Word and collaborate by emailing files around. This method has a low barrier to entry, but requires significant social processes to prevent conflicting versions, particularly as the number of authors increases.

Word/Dropbox:

For the taverna kblog (http://taverna.knowledgeblog.org), authors wrote in Word and collaborated with Dropbox (http://www.dropbox.com). This method works reasonably well where many authors are involved; Dropbox detects conflicts, although it cannot prevent or merge them.

Asciidoc/Dropbox:

Used by the authors of this paper. Asciidoc (http://www.methods.co.nz/asciidoc) is relatively simple, somewhat programmable and accessible. Unlike LaTeX, which can be induced to produce HTML with effort, asciidoc is designed to do so.

Of these three approaches, the Word/Dropbox combination is probably the most generally used.

From the reader's perspective, a decision that we have made within knowledgeblog is to be “HTML-first”. The initial reasons for this were entirely practical: supporting multiple toolsets is hard, particularly if any degree of consistency is to be maintained, and the generation of the HTML is at least partly controlled by the middleware – WordPress in kblog's case. As well as enabling consistency of presentation, it also, potentially, allows us to add additional knowledge; it makes semantic publication a possibility. However, we are aware that knowledgeblog currently scores rather badly on what we describe as the “bath-tub test”; while exporting to PDF or printing out is possible, the presentation is not as “neat” as would be ideal. In this regard (and we hope only in this regard), the knowledgeblog experience is limited. However, increasingly, readers are happy and capable of interacting with material on the web, without print-outs.

From this background and aim, we have drawn the following requirements:

  1. The author can, as much as possible, remain within familiar authoring environments;

  2. The representation of the published work should remain extensible to, for instance, semantic enhancements;

  3. The author and reader should be able to have the amount of “formal” academic publishing they need;

  4. Support for semantic publishing should be gradual and offer advantages for author and reader at all stages.

We describe how we have achieved this with three exemplars, two of which are relatively general in use, and one more specific to biology. In each case, we have taken a slightly different approach, but have fulfilled our primary aim of making life better for machine, author and reader.

3 Representing Mathematics

The representation of mathematics is a common need in academic literature. Mathematical notation has grown from a requirement for a syntax which is highly expressive and relatively easy to write. It presents specific challenges because of its complexity, the difficulty of authoring and the difficulty of rendering, away from the chalk board that is its natural home.

Support for mathematics has had a significant impact on academic publishing. It was, for example, the original motivation behind the development of TeX (http://en.wikipedia.org/wiki/TeX), and it is still one of the main reasons why authors wish to use it or its derivatives. This is to such an extent that much mathematics rendering on the web is driven by a TeX engine somewhere in the process: MediaWiki (and therefore Wikipedia), Drupal and, of course, WordPress follow this route. The latter provides plugin support for TeX markup using the wp-latex plugin (http://wordpress.org/extend/plugins/wp-latex/). Within kblog, we have developed a new plugin called mathjax-latex (http://wordpress.org/extend/plugins/mathjax-latex/). From the kblog author's perspective these two offer a similar interface – the differences are, therefore, described later.

Authors write their mathematics directly as TeX using one of four markup syntaxes. The most explicit (and therefore least likely to happen accidentally) is the use of “shortcodes” (http://codex.wordpress.org/Shortcode).

These are an HTML-like markup originating from some forum/bulletin board systems. In this form an equation would be entered as [latex]e=mc^2[/latex], which would be rendered as “\(e=mc^2\)”. It is also possible to use three other syntaxes which are closer to math-mode in TeX: $‍$e=mc^2$‍$, $latex e=mc^2$, or \‍[e=mc^2\‍].

From the authorial perspective, we have added significant value, as it is possible to use a variety of syntaxes, which are independent of the authoring engine. For example, a TeX-loving mathematician working with a Word-using biologist can still set their equations using TeX syntax; although Word will not render these at authoring time, in practice this causes few problems for such authors, who are experienced at reading TeX. Within a LaTeX workflow, equations will be renderable both locally, with the source compiled to PDF, and when published to WordPress.

There is also a W3C recommendation, MathML, for the representation and presentation of mathematics. The kblog environment also supports this. In this case, the equivalent source appears as follows:

<math>
  <mrow>
    <mi>E</mi>
    <mo>=</mo>
    <mrow>
      <mi>m</mi>
      <msup>
        <mi>c</mi>
        <mn>2</mn>
      </msup>
    </mrow>
  </mrow>
</math>

One problem with the MathML representation is obvious: it is very long-winded. A second issue, however, is that it is hard to integrate with existing workflows; most of the publication workflows we have seen in use will, on recognising an angle bracket, turn it into the equivalent HTML entity. For some workflows (LaTeX, asciidoc) it is possible, although not easy, to prevent this within the native syntax.

It is also possible to convert from Word's native OMML (“equation editor”) XML representation to MathML, although this does not integrate with Word's native blog publication workflow. Ironically, it is because MathML shares an XML-based syntax with the final presentation format (HTML) that the problem arises. The shortcode syntax, for example, passes straight through most of the publication frameworks to be consumed by the middleware. From a pragmatic point of view, therefore, supporting shortcodes and TeX-like syntaxes has considerable advantages.
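
To make the pass-through concrete, the rewrite the middleware performs is essentially a string substitution over the final post body. The sketch below is a minimal Python approximation of what mathjax-latex does (the real plugin is PHP, built on the WordPress shortcode API); the function name and regular expression are illustrative only.

import re

def rewrite_math_shortcodes(html):
    r"""Minimal sketch: turn [latex]...[/latex] shortcodes into \( ... \)
    delimiters that MathJax will recognise and render on the client."""
    # Shortcodes survive the HTML generation step untouched, so a simple
    # pattern match over the published post body is sufficient.
    return re.sub(r"\[latex\](.+?)\[/latex\]", r"\\(\1\\)", html, flags=re.DOTALL)

print(rewrite_math_shortcodes("Einstein tells us that [latex]e=mc^2[/latex]."))
# Einstein tells us that \(e=mc^2\).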

For the reader, the use of mathjax-latex has significant advantages. The default mechanism within WordPress uses a math-mode-like syntax, $‍latex e=mc^2‍$. This is rendered using a TeX engine into an image, which is then incorporated and linked using normal HTML capabilities. This representation is opaque and non-semantic, and it has significant limitations for the reader: the images are not scalable, so zooming in causes severe pixelation, and the background to the mathematics is coloured inside the image, so does not necessarily reflect the local style.

Kblog, however, uses the MathJax library (http://www.mathjax.org) to render the underlying TeX directly on the client; this has a number of significant advantages for the reader. First, where the browser supports them, MathJax uses webfonts to render the mathematics; these are scalable, attractive and standardized. Where they are not available, MathJax can fall back to bitmapped fonts. The reader can also access additional functionality: clicking on an equation will raise a zoomed-in popup, while the context menu allows access to a textual representation, either as TeX or MathML, irrespective of the form that the author used. This can be cut-and-pasted for further use.

Our use of MathJax presents no significant disadvantages to the middleware layers. It is implemented in JavaScript and runs in most environments. Although the library is fairly large (>100Mb), it is available on a CDN, so need not stress server storage space. Most of this space comes from the bit-mapped fonts, which are only downloaded on demand, so it should not stress web clients either. It also obviates the need for a TeX installation, which wp-latex may require (although that plugin can also use an external server).

At face value, mathjax-latex necessarily adds very little semantics to the maths embedded within documents. The maths could be represented as $‍$E=mc^2$‍$, \‍(E=mc^2\‍) or

<math>
  <mrow>
    <mi>E</mi> <mo>=</mo>
    <mrow>
      <mi>m</mi>
      <msup> <mi>c</mi> <mn>2</mn> </msup>
    </mrow>
  </mrow>
</math>

So, we have a heterogeneous representation for identical knowledge. However, in practice, the situation is much better than this. The author of the work created these equations and has then read them, transformed by MathJax into a rendered form. If MathJax has failed to translate them correctly, in line with the author's intention, or if it has had some implications for the text in addition to setting the intended equations (if the TeX-style markup appears accidentally elsewhere in the document), the author is likely to have seen this and fixed the problem. Someone wishing, for example, to extract all the mathematics as MathML from these documents computationally therefore knows:

  • that the document contains maths as it imports MathJax

  • that MathJax is capable of identifying this maths correctly

  • that equations can be transformed to MathML using MathJax (This is assuming MathJax works correctly in general. The authors and readers are checking the rendered representation. It is possible that an equation would render correctly on screen, but be rendered to MathML inaccurately).

So, while our publication environment does not directly result in a lower level of semantic heterogeneity, it does provide the data and the tools to enable a computational agent to make this transformation. While this is imperfect, it should help. In short, we provide a practical mechanism to identify text containing mathematics and a mechanism to transform this to a single, standardised representation.
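
As an illustration of what such a computational agent might do, the sketch below (Python, purely illustrative; the delimiters and the test for the MathJax import are assumptions about a typical kblog page rather than a guaranteed contract) harvests the TeX fragments that MathJax would render:

import re

def extract_tex_maths(html):
    r"""If the page imports MathJax, harvest the TeX fragments that it
    would render; a later step could hand these to a TeX-to-MathML
    converter to obtain a single, standardised representation."""
    if "mathjax" not in html.lower():
        return []  # no MathJax import, so the page claims no renderable maths
    delimiters = [
        r"\$\$(.+?)\$\$",             # $$ ... $$
        r"\\\((.+?)\\\)",             # \( ... \)
        r"\\\[(.+?)\\\]",             # \[ ... \]
        r"\[latex\](.+?)\[/latex\]",  # shortcodes left in the body
    ]
    found = []
    for pattern in delimiters:
        found.extend(re.findall(pattern, html, flags=re.DOTALL))
    return found

page = '<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>' \
       '<p>We recall that \\(e=mc^2\\).</p>'
print(extract_tex_maths(page))  # ['e=mc^2']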

4 Representing References

Unlike mathematics, references and in-text citations have no standard representation mechanism, but there are a large number of tools for authors, such as BibTeX, Mendeley (http://www.mendeley.org) or EndNote. As a result, integration with existing toolsets is of primary importance, while the representation of the in-text citations is not, as it should be handled by the tool layer anyway.

Within kblog, we have developed a plugin called kcite (http://wordpress.org/extend/plugins/kcite/). For the author, citations are inserted using the syntax [‍cite]10.1371/journal.pone.0012258[‍/cite]. The identifier used here is a DOI, or digital object identifier, which is widely used within the publishing and library industry. Currently, kcite supports DOIs minted by either CrossRef (http://www.crossref.org) or DataCite (http://www.datacite.org) (in practice, this means that we support the majority of DOIs). We also support identifiers from PubMed (http://www.pubmed.org), which covers most biomedical publications, and arXiv (http://www.arxiv.org), the physics (and other domains!) preprint archive, and we now have a system to support arbitrary URLs. Currently, authors are required to state the identifier type where it is not a DOI.

We have picked this “shortcode” format for similar reasons as described for maths: it is relatively unambiguous; it is not XML-based, so it passes through the HTML generation layer of most authoring tools unchanged; and it is explicitly supported in WordPress, bypassing the need for regular expressions and later parsing. It could, however, be a little unwieldy from the perspective of the author. In practice, it is relatively easy to integrate this with many reference managers. For example, tools such as Zotero (http://www.zotero.org) and Mendeley use the Citation Style Language, and so can output kcite-compliant citations with the following slightly elided code:

 <citation>
    <layout prefix="[‍cite]" suffix="[‍/cite]"
         delimiter="[‍/cite] [‍cite]">
      <text variable="DOI"/>
    </layout>
  </citation>

We do not yet support LaTeX/BibTeX citations, although we see no reason why a similar style file should not be supported (citations in this representation of the article were, rather painfully, converted by hand). We do, however, support BibTeX-formatted files: the first author’s preferred editing/citation environment is based around these with Emacs, RefTeX, and asciidoc. While this is undoubtedly a rather niche authoring environment, the (slightly elided) code for supporting this demonstrates the relative ease with which tool chains can be induced to support kcite:

;; Wrap RefTeX's citation formatter: when the override is enabled,
;; emit a kcite shortcode rather than a normal TeX citation.
(defadvice reftex-format-citation (around phil-asciidoc-around activate)
  (if phil-reftex-citation-override
      (setq ad-return-value (phil-reftex-format-citation entry format))
    ad-do-it))

;; Pull the DOI from the BibTeX entry and wrap it in an asciidoc
;; passthrough containing a kcite shortcode.
(defun phil-reftex-format-citation (entry format)
  (let ((doi (reftex-get-bib-field "doi" entry)))
    (format "pass:[‍[‍cite source='doi'\\]%s[‍/cite\\]]" doi)))

The key decision with kcite, from the authorial perspective, is to ignore the reference list itself and focus only on in-text citations, using public identifiers for references. This simplifies the tool integration process enormously, as this is the only data that needs to pass from the author's bibliographic database onward. The key advantage for authors here is two-fold: first, they are not required to populate their reference metadata themselves, and this metadata will update if it changes; second, the identifiers are checked, so if they are wrong, the authors will see this straightforwardly, as the entire reference will be wrong. Adding DOIs or other identifiers moves from being a burden for the author to being a specific advantage.

While supporting multiple forms of reference identifier (CrossRef DOI, DataCite DOI, arXiv and PubMed ID) provides a clear advantage to the author, it comes at considerable cost. While it is possible to get metadata about papers from all of these sources, there is little commonality between them. Moreover, resolving this metadata requires one outgoing HTTP request per reference (in practice, it is often more; DOI requests, for instance, use 303 redirects), which browser security might or might not allow.

So, while the presentation of mathematics is performed largely on the client, for reference lists the kcite plugin performs metadata resolution and data integration on the server. Caching functionality is provided, storing this metadata in the WordPress database. The bibliographic metadata is finally transferred to the client encoded as JSON, using asynchronous call-backs to the server.
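
The server-side resolution step can be illustrated with DOI content negotiation; the sketch below is a simplified Python approximation of what kcite (a PHP plugin) does, assuming the CSL-JSON content type offered by the DOI resolvers, and with the WordPress database replaced by an in-memory cache.

import json
import urllib.request

_cache = {}  # kcite stores this metadata in the WordPress database

def resolve_doi(doi):
    """Fetch CSL-JSON metadata for a DOI via content negotiation,
    caching the result so each reference costs at most one HTTP request."""
    if doi in _cache:
        return _cache[doi]
    request = urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"})
    with urllib.request.urlopen(request) as response:  # follows the 303 redirect
        metadata = json.load(response)
    _cache[doi] = metadata
    return metadata

# The resulting JSON is what is handed, asynchronously, to citeproc-js
# on the client to render the in-text citation and bibliography.
print(resolve_doi("10.1371/journal.pone.0012258")["title"])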

Finally, this JSON is rendered using the citeproc-js library on the client. In our experience, this performs well, adding to the reader's experience: in-text citations are initially shown as hyperlinks; rendering is rapid, even on aging hardware; and, finally, in-text citations are linked both to the bibliography and directly through to the external source. Currently, the format of the reference list is fixed; however, citeproc-js is a generalised reference processor, driven using CSL (http://citationstyles.org/). This makes it straightforward to change citation format at the option of the reader, rather than the author or publisher. Both the in-text citations and the bibliography support outgoing links direct to the underlying resources (where the identifier allows — PubMed IDs redirect to PubMed). As these links have been used to gather metadata, they are likely to be correct. While these advantages are relatively small currently, we believe that the use of JavaScript rendering over linked references can be used to add further reader value in future.

For the computational agent wishing to consume bibliographic information, we have added significant value compared to the pre-formatted HTML reference list. First, all the information required to render the citation is present in the in-text citation next to the text that the authors intended. A computational agent can, therefore, ignore the bibliography list itself entirely. These primary identifiers are, again, likely to be correct because the authors now need them to be correct for their own benefit.

Should the computational agent wish, the (denormalised) bibliographic data used to render the bibliography is also available, present in the underlying HTML as a JSON string. This is represented in a homogeneous format, although it, of course, represents our (kcite's) interpretation of the primary data.

A final, and subtle, advantage of kcite is that the authors can only use public metadata, and not their own. If they use the correct primary identifier, and still get an incorrect reference, it follows that the public metadata must be incorrect (or, we acknowledge, that kcite is broken!). Authors and readers therefore must ask the metadata providers to fix their metadata to the benefit of all. This form of data linking, therefore, can even help those who are not using it.

4.1 Microarray Data

Many publications require that papers discussing microarray experiments lodge their data in a publicly available resource such as ArrayExpress (10.1093/nar/gkg091). Authors do this by placing an ArrayExpress identifier, which has the form E-MEXP-1551, in the paper. Currently, adding this identifier to a publication, like adding the raw data to the repository, offers no direct advantage to the author, other than fulfilment of the publication requirement. Similarly, there is no existing support within most authoring environments for adding this form of reference.

For the knowledgeblog-arrayexpress plugin (http://knowledgeblog.org/knowledgeblog-arrayexpress), therefore, we have again used a shortcode representation, but allowed the author to automatically fill metadata, direct from ArrayExpress. So a tag such as [‍aexp id="E-MEXP-1551"]species[‍/aexp] will be replaced with Saccharomyces cerevisiae, while [‍aexp id="E-MEXP-1551"]releasedate[‍/aexp] will be replaced by “2010-02-24”. While the advantage here is small, it is significant: hyperlinks to ArrayExpress are automatic, and authors no longer need to look up detailed metadata. For metadata which authors are likely to know anyway (such as species), the automatic lookup operates as a check that their ArrayExpress ID is correct. As with references (see Section ), the use of an identifier becomes an advantage rather than a burden to the authors.
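
A sketch of the substitution is below, in Python rather than the plugin's PHP, and with the ArrayExpress lookup reduced to a hard-coded stub, since the exact REST endpoint and response format used by the plugin are not described here; the field names simply mirror the shortcode examples above.

import re

def fetch_arrayexpress_metadata(accession):
    """Stub for the ArrayExpress lookup; the real plugin queries the
    ArrayExpress web services for the given accession."""
    # Hypothetical, hard-coded response for illustration only.
    return {"species": "Saccharomyces cerevisiae", "releasedate": "2010-02-24"}

def expand_aexp_shortcodes(post):
    """Replace [aexp id="..."]field[/aexp] shortcodes with metadata
    fetched from ArrayExpress, as knowledgeblog-arrayexpress does."""
    def substitute(match):
        accession, field = match.group(1), match.group(2)
        return fetch_arrayexpress_metadata(accession).get(field, match.group(0))
    return re.sub(r'\[aexp id="([^"]+)"\](\w+)\[/aexp\]', substitute, post)

print(expand_aexp_shortcodes(
    'Data for [aexp id="E-MEXP-1551"]species[/aexp], '
    'released [aexp id="E-MEXP-1551"]releasedate[/aexp].'))
# Data for Saccharomyces cerevisiae, released 2010-02-24.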

For the reader, there is currently less significant advantage. There is some value from the added correctness stemming from the ArrayExpress identifier; however, knowledgeblog-arrayexpress is currently under-developed, and the added semantics that are now present could be used more extensively. The unambiguous knowledge that [‍aexp id="E-MEXP-1551"]species[‍/aexp] represents a species would allow us, for example, to link to the NCBI taxonomy database (http://www.ncbi.nlm.nih.gov/Taxonomy/).

Likewise, the advantage for the computational agent from knowledgeblog-arrayexpress is currently limited; the identifiers are clearly marked up, and as the authors now care about them, they are likely to be correct. Again, however, knowledgeblog-arrayexpress is currently under-developed for the computational agent. The knowledge that is extracted from ArrayExpress could be presented within the HTML generated by knowledgeblog-arrayexpress, whether or not it is displayed to the reader, for essentially no cost. By having an underlying shortcode representation, if we choose to add this functionality to knowledgeblog-arrayexpress, any posts written using it would automatically update their HTML. For the text-mining bioinformatician, even the ability to unambiguously determine that a paper described or used a data set relating to a specific species using standardised nomenclature (the standard nomenclature was only invented in 1753 and is still not used universally) would be a considerable boon.

5 Discussion

Our approach to the semantic enrichment of articles is a measured and evolutionary one. We are investigating how we can increase the amount of knowledge in academic articles presented in a computationally accessible form. However, we are doing so in an environment which does not require all the different aspects of authoring and publishing to be overturned. Moreover, we have followed a strong principle: semantic enhancement should offer advantages to both reader and author immediately. So, adding references as a DOI, or other identifier, ‘automagically’ produces an in-text citation and a nicely formatted reference list; that the reference list is no longer present in the article, but is a visualisation over linked data, and that the article itself has become a first-class citizen of this linked data environment, is a happy by-product.

This approach, however, also has disadvantages. There are a number of semantic enhancements which we could make straightforwardly to the knowledgeblog environment that we have not; the principles that we have adopted require significant compromise. We offer here two examples.

First, there has been significant work by others on CiTO (10.1186/2041-1480-1-S1-S6), an ontology which helps to describe the relationships between a paper and its citations. Kcite lays the groundwork for an easy and straightforward addition of CiTO tags surrounding each in-text citation. Doing so would enable increased machine understandability of a reference list. Potentially, we could use this to the advantage of the reader also: we could distinguish between reviews and primary research papers; highlight the authors' previous work; or emphasise older papers which are being refuted. However, to do this requires additional semantics from the author. Although these CiTO semantic enhancements would be easy to insert directly using the shortcode syntax, most authors will want to use their existing reference manager, which will not support this form of semantics; even if it does, the authors themselves gain little advantage from adding these semantics. There are advantages for the reader, but in this case not for both author and reader. As a result, we will probably add such support to kcite but, if we are honest, find it unlikely that, when acting as content authors, we will find the time to add these additional semantics.

Second, our presentation of mathematics could be modified to automatically generate MathML from any included TeX markup. The transformation could be performed on the server, using MathJax; MathML would still be rendered on the client to webfonts. This would mean that any embedded maths would be discoverable because of the existence of MathML, which is a considerable advantage. However, neither the reader nor the author gains any advantage from doing this, while paying the cost of the slower load times and higher server load that would result from running JavaScript on the server. Moreover, they would pay this cost regardless of whether their content were actually being consumed computationally. As the situation now stands, the computational user needs to identify the insertion of MathJax into the web page, and then transform the page using this library, none of which is standard. This is clearly a serious compromise, but we feel a necessary one.

Our support for microarrays offers the possibility of the most specific semantics of all of our plugins. Knowledge about a species or a microarray experimental design can be precisely represented. However, almost by definition, this form of knowledge is fairly niche and only likely to be of relevance to a small community. We do note, however, that the knowledgeblog process, based around commodity technology, does offer a publishing process that can be adapted, extended and specialised in this way relatively easily. Ultimately, the many small communities that make up the long tail of scientific publishing add up to one large one.

6 Conclusion

Semantic publishing is a desirable goal, but goals need to be realistic and achievable. To move towards semantic publishing in kblog, we have tried to put in place an approach that gives benefit to readers, authors and computational interpretation. As a result, at this stage, we have light semantic publishing, but with small but definite benefits for all.

Semantics give meaning to entities. In kblog, we have sought benefit by “saying” within the kblog environment that entity x is either maths, a citation or a microarray data entity reference. This is sufficient for the kblog infrastructure to “know what to do” with the entity in question. Knowing that some publishable entity is a “lump” of maths tells the infrastructure how to handle that entity: the reader has benefit from it looking like maths; the author has benefit by not having to do very much; and the infrastructure knows what to do. In addition, this approach leaves hooks in place for doing more later.

It is not necessarily easy to find compelling examples that give advantages at all steps. Adding CiTO attributes to citations, for instance, has obvious advantages for the reader, but not the author. However, advantages may be indirect; richer reader semantics may give more readers and thus more citations—the thing authors appreciate as much as the act of publishing itself. It is, however, difficult to imagine how such advantages can be conveyed to the author at the point of writing. While it is easy to see the advantages of semantic publishing for readers, as a community we need to pay attention to the advantages for authors. Without these “carrots”, we will only have “sticks”, and authors, particularly technically skilled ones, are highly adept at working around sticks.


This is a paper we wrote for STLR 2011, also published directly on Knowledgeblog.

Abstract

The web has moved from a minority interest tool to one of the most heavily used platforms for publication. Despite originally being designed by and for academics, it has left academic publishing largely untouched; most papers are available on-line, but in PDF and are most easily read once printed. Here, we describe our experiments with using commodity web technology to replace the existing publishing process; the resource describing ontologies that we have developed with this platform; and, finally, the implications that this may have for publishing in a semantic web framework.

Authors

Phillip Lord Newcastle University Newcastle-upon-Tyne, UK

Simon Cockell Newcastle University Newcastle-upon-Tyne, UK

Daniel C. Swan Newcastle University Newcastle-upon-Tyne, UK

Robert Stevens University of Manchester Manchester, UK

Introduction

The Web was invented around 1990 as a light-weight mechanism for publication of documents, enabling scientists to share their knowledge in the form of hypertext documents. Although scientists, and later most academics, like the rest of society, have made heavy use of the web, it has not had a significant impact on the academic publication process. While most journals now have websites, the publication process is still based around paper documents, or electronic representations of paper documents in the form of a PDF. Most conferences still handle submissions in the same way. Books on the web, for example, are often limited to a table of contents.

For the authors (certainly from our personal experience), the process is dissatisfying; book writing is time-consuming, tiring and takes a number of years to come to fruition. If the book has one or a few authors, it tends to reflect only a narrow slice of opinion. Multi-author collected works tend to be even harder work for the editor than writing a book solo. Books do not change frequently; they are therefore out-of-date as soon as they are available. Authors feel a greater pressure for correctness, as they will have to live with the consequences of mistakes for the many years it takes to produce a second edition; most scientists welcome feedback, but being asked to justify something you wish you had not said becomes tiresome, especially if you are waiting to update it.

For the consumer of the material (either a human reader, or a computer), the experience is likewise limited. Books on paper are not searchable, not easy to carry around, and are rarely cheap to buy and more commonly very expensive. For the computer, the material is hard to understand or to parse. Even distinguishing basic structure (where do chapters start, who is the author, where is the legend for a given figure) is challenging.

All of this points to a need to exploit the Web for scientists to publish in a different way than simply replicating the old publishing process. Here, we describe our experiment with a new (to academia!) form of publishing: we have used widely-available and heavily used commodity software (WordPress [7]), running on low-end hardware, to develop a multi-author resource describing the use of ontologies in the life sciences (our main field of expertise). From this experience, we have built on and enhanced the basic platform to improve the author experience of publishing in this manner. We are now extending the platform further to enable the addition of light-weight semantics by authors to their own papers, without requiring authors to directly use semantic web technologies, and within their own tool environment. In short, we believe that this platform provides a ‘cheap and cheerful’ framework for semantic publishing.

The requirements

The initial motivation for this work came from our experience within the bio-ontology community. Biomedicine is one of the largest domains for the use of ontology technology, producing large and complex ontologies such as the Gene Ontology [28] or SNOMED [27].

As an ontologist, one of the most common questions that one has is: ‘where is there a book or a tutorial that I can read which describes how to build an ontology?’. Currently, there is some tutorial information on the web and there are some books, but there is not a clear answer to the question. Many of the books are collections of research-level papers, or are technologically biased. Currently, many ontologists have learned their craft through years of reading mailing lists, gathering information from the web and by word of mouth. We wished to develop a resource with short and succinct articles, published in a timely manner and freely available.

We wished, also, however, to retain the core of academic publishing. This was for reasons pragmatic, principled and political. Consider, for example, Wikipedia, which could otherwise serve as a model. Our own experience suggests that referencing Wikipedia can be dangerous: it can and does change over time, meaning critical or supportive comments in other articles can be ‘orphaned’. Wikipedia maintains a ‘neutral point-of-view’ which, many are of the opinion, makes it less suitable for areas where knowledge is uncertain and disagreement frequent. Finally, Wikipedia is relatively anonymous in terms of authorship: whether this affects the quality of articles has been a topic of debate [17], but was not our primary concern; pragmatically, the promotion and career structure for most academics requires a form of professional narcissism; they cannot afford to contribute to a resource for which they cannot claim credit. Of course, our experiences may not be reflective of the body academic overall; there has, for example, been substantial discussion of the issues of expertise on Wikipedia itself [8]. Although the reasons may not be clear, it is clear that academics largely do not contribute to Wikipedia, and that Wikimedia sees this as an issue [16].

We also had an explicit set of non-functional requirements. We needed the resource to be easy to administer and low-cost, as this mirrored our resource availability; authors should be offered an easy-to-use publishing environment with minimal ‘setup’ costs, or they would be unlikely to contribute; readers should see a simple, but reasonably attractive and navigable website, or they would be unlikely to read.

The Ontogenesis experience

Our previous experience with the use of blog software within academia was limited to ‘traditional’ blogging: short pieces about either the process of science (reports about conferences or papers, for example); journalistic articles about other people's research; or personal blogging, that is, articles by people who just happen to be academics. Although we wished to develop different, more formal content, this experience suggests that many academics find blogging software convenient, straightforward enough and useful.

To test this, we decided to hold a small workshop of 17 domain experts over a two-day period, and task them with generating content, conducting peer review of this content and publishing it as articles on a blog.

Terminology and the Process

Like many communities, the blogosphere has developed its own and sometimes confusing terminology. To describe the process we adopted, we first describe some of this terminology. A blog is a collection of web pages, usually with a common theme. These web pages can be divided into: posts, which are published (or posted) on an explicit date and then unchanged; and pages, which are not dated and can change. Posts and pages have permalinks: although they may be accessible via several URLs, they have one permalink that is stable and never changes. Posts and pages can be categorised – grouped under a predefined hierarchy – or tagged – grouped using ad hoc words or phrases defined at the point of use. A blog is usually hosted with a blog engine, such as WordPress, which stores content in a database and combines it with style instructions in themes to generate the pages and posts. Most blog engines support extensions to their core functionality with plugins. Most blogs also support comments: short pieces of content added to a post or page by people other than the original authors. Most blog engines also support trackbacks, which are bidirectional links: normally, a snippet from a linking post will appear as a comment in the linked-to post. Trackbacks work both within a single blog and between different distributed blogs. Many blogs support remote posting: as well as using a web form for adding new content, users can also post from third-party applications, through a programmatic interface using a protocol such as XML-RPC, or even by email. Posts and pages are ultimately written in headless HTML (that part of HTML which appears inside the body element), although the different editing environments can hide this fact from the user.
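
For readers unfamiliar with remote posting, the sketch below shows the general shape of the mechanism, using Python's standard XML-RPC client and the widely implemented metaWeblog API; the endpoint, credentials and post fields are illustrative assumptions rather than kblog-specific details.

import xmlrpc.client

# Connect to the blog engine's XML-RPC endpoint (WordPress exposes one
# at /xmlrpc.php); the URL and credentials here are placeholders.
server = xmlrpc.client.ServerProxy("https://example.org/xmlrpc.php")

post = {
    "title": "What is an ontology?",
    "description": "<p>The headless-HTML body of the article...</p>",
    "categories": ["under review"],
}

# metaWeblog.newPost(blogid, username, password, struct, publish)
post_id = server.metaWeblog.newPost("1", "author", "secret", post, True)
print("Published post with id", post_id)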

Our initial process was designed to replicate the normal peer-review process, with a single adjustment: peer review was open and not blind. Papers would be world-visible once submitted; the identities of reviewers would be known to authors; and all reviews would be public. We adopted this approach for pragmatic reasons: WordPress has little support for authenticated viewing and none for anonymisation. The full process was as follows:

  • Authors write their content and publish it using whichever tooling they find appropriate.

  • The author posts their content, categorising it as under review.

  • An editor assigns two reviewers.

  • Reviewers publish reviews as posts or comments. Reviews link to articles, resulting in a trackback from article to review.

  • The author modifies the post to address reviews.

  • Once done to the editor's satisfaction, the post is recategorised as reviewed.

Our expectation was that following this process, articles would not be changed or updated; this is in stark contrast to common usage for wiki-based websites. New articles could, however, be written updating, extending or refuting old ones.

Reflections on the Ontogenesis K-Blog

Our initial meeting functioned to ‘bootstrap’ the Ontogenesis K-Blog. This was useful to acquire a critical mass of content, but also, on this first outing, to explore the K-Blog process and technology. The setup for the day was the vanilla WordPress installation. The day started with a short presentation on the K-Blog manifesto [22] and an overview of the process, including authoring and reviewing. The guidelines to authors were to write short articles on an ontology subject (a list of suggestions was offered and authors also made their own choices) and to produce the article in whatever manner they felt appropriate. There was a certain level of uncertainty among authors as to the K-Blog process (partly because one of the objectives of the meeting was to ‘force out’ the process) and this, naturally, pointed to the need to document the K-Blog process so that authors could have the typical ‘instructions to authors’.

This first meeting produced a set of 20 completed and partially completed articles. Some even had reviews. Even on the day itself there was some external interest seen from Twitter. The first external blog post (outside of those produced by attendees) happened during the meeting [19] with a second shortly after [18].

We also held a second content provision meeting and together these generated a collection of articles that felt like an academic book in terms of content, but generated with considerably less effort. This experience was also sufficient to gather requirements on how to improve the K-Blog idea. A useful K-Blog on the K-Blog process itself was produced by Sean Bechhofer [13]. There is also a K-Blog looking back on the first year of the Ontogenesis K-Blog [23].

Several requirements emerged with respect to authorship. The principle of the short, more or less self-contained article was attractive (though the audience were somewhat self-selecting). Authoring directly in the editor provided by WordPress was felt to be poor by those that tried it. Authoring in a favourite editing tool and then publishing via WordPress worked reasonably well for most authors. There were, however, a variety of issues with the mechanics of this style of publishing, such as referring to articles that will be, but have not yet been, written. To some extent this was an artefact of the day (many articles being written simultaneously), but authors needed to refer to glossaries and articles in progress.

One stylistic issue was the habit of putting full affiliations at the top of an article. The Ontogenesis theme presents the first few lines when displaying many articles, but in many cases this was simply showing the title and author affiliation, where it would be more useful to have the first sentence or so of the article itself.

For the whole K-Blog, a table of contents was felt to be important. This would give an overview of the contents and a simple place for navigation about the K-Blog. This raised the issue of attribution; the table of contents needed to expose the authors, including multiple, ordered authors. This is not a surprising need, as the authors' scientific reputation is involved. In this vein, making K-Blog articles citable by issuing Digital Object Identifiers (DOIs) was requested.

For scientific credibility, the ability to handle citations easily was an obvious requirement. Natively, WordPress has little or no support for styling citations and references. The ability to cite via DOIs and, in this field, PubMed identifiers, to automatically make links and produce a reference list, was felt to be important. Having the Ontogenesis K-Blog articles in PubMed would also be attractive to authors.

The last authorship issue was the mutability of articles. One aim of K-Blog is to enable articles to change in the light of experience and scientific development, as well as a procedural requirement for updates following review. There was felt to be a conflicting need for articles not to change, so that comments and links from other documents work in the longer term.

The last significant issue was the reviewing of articles. The aim was to have this managed by authors choosing reviewers (with editorial oversight). On the Ontogenesis K-Blog day this could work with authors calling across the room for a review. This is, however, not a sustainable approach. WordPress lacks tracking facilities to manage the reviewing process, whether this is done by an author or an editor. The realisation that such management support is needed is not the greatest insight ever gained, but the requirement is there even in a light-weight publishing mechanism.

Improvements to the technology

Our initial experiment with the Ontogenesis K-Blog suggested a significant number of issues with the use of WordPress for scientific publication. In this section, we describe the extensions that we have made or used, to the publication process, to the documentation, or to WordPress itself. Following our initial experience with Ontogenesis, we have started to trial these improvements, including through another workshop which resulted in a new K-Blog [12], describing the scientific workflow engine Taverna [24]; work is also in progress on the use of a K-Blog for bioinformatics [1], and another for public healthcare [3].

Currently, we have 11 plugins extending the basic WordPress environment. For completeness, all of these are shown in Table 1. Our theme is also extended in some places to support the plugins. In general, the plugins are orthogonal and will work independently of each other. One advantage of using WordPress is that many of these plugins are freely available, written and maintained by other authors; other academic publication environments, such as the Open Journal System [5], exist and are relatively widely used, but WordPress is used to host perhaps 10% of the web, making the plugin ecosystem extremely fertile.

  • Co-Authors Plus: Allows K-Blog posts to have more than one author. http://wordpress.org/extend/plugins/co-authors-plus/

  • COinS Metadata Exposer †: Provides COinS metadata on K-Blog posts (used by Zotero, Mendeley etc). http://code.google.com/p/knowledgeblog/

  • Edit Flow: Gives editorial process management infrastructure. http://editflow.org/

  • ePub Export: Exports K-Blog posts as ePub documents. http://wordpress.org/extend/plugins/epub-export/

  • KCite *: Automatic processing of DOIs and PMIDs into in-text citations and bibliographies. http://knowledgeblog.org/kcite-plugin

  • Knowledgeblog Post Metadata Plugin *: Exposes generic metadata in post headers. http://code.google.com/p/knowledgeblog/

  • Knowledgeblog Table of Contents *: Produces a table of contents based on a category of articles; posts are listed with all authors. http://knowledgeblog.org/knowledgeblog-table-of-contents-plugin

  • Mathjax LaTeX *: Enables use of TeX or MathML in posts, rendered in scalable web fonts. http://knowledgeblog.org/mathjax-latex-wordpress-plugin

  • Post Revision Display: Publicly exposes all revisions of an article after publication. http://wordpress.org/extend/plugins/post-revision-display/

  • SyntaxHighlighter Evolved: Syntax-highlights source code embedded in posts. http://wordpress.org/extend/plugins/syntaxhighlighter/

  • WP Post to PDF: Allows visitors to download posts in PDF format. http://wordpress.org/extend/plugins/wp-post-to-pdf/

Table 1: WordPress plugins employed by K-Blog. Plugins marked with * are written by the authors; plugins marked with † are modified by the authors.

Reviewing: The initial process was self-managed and required two reviews per article; this was found to be cumbersome. We have addressed this in two ways. First, we have defined a number of different peer-review levels (public review, author review, editorial review [15]), including a light-weight process now being used for Ontogenesis; authors now select their own reviewers, and decide for themselves when articles are complete. Second, we have added software support. Initially, we attempted to use RequestTracker – an open-source ticket system – but found the user interface too complex for this purpose. We are now using the EditFlow plugin for WordPress, which was designed for managing a review process—albeit a hierarchical rather than peer-review process.

Authoring Environment: The standard WordPress editor was found impractical by most authors, even for short articles. WordPress does provide 'paste from Word' functionality, but this removes all formatting, which defeats the purpose. While the lack of a good editing environment could have been a significant problem, our subsequent experimentation has shown that it is possible to post directly from a wide variety of tools, including 'office' tools such as Word, Google Docs, LiveWriter and OpenOffice. This is in addition to a variety of blog-specific tools and text formats (such as asciidoc), which are suitable for some users. We have added documentation to a K-Blog (http://process.knowledgeblog.org) to address these. In practice, only LaTeX proved problematic, having no specific support. To address this, we have produced a tool called latextowordpress; this is an adaptation of the plasTeX tool, a Python-based TeX processor, to produce simplified HTML appropriate for WordPress publishing. Our experience with using these tools is that, while none are perfect, sometimes requiring 'tweaking' of HTML in WordPress, most reduce publishing time to seconds.

Citations: We have addressed the lack of support for citations within WordPress with a plugin called kcite. This allows authors to add citations into documents as shortcodes with either a DOI or Pubmed ID (other identifiers can be, and are being, added to kcite). Shortcodes are a commonly used form of markup of the form [tag att="att"]text[/tag]; they are often found where a simplified, HTML-like markup is desired. A bibliography is then generated automatically on the web server. Requiring authors to add markup to otherwise WYSIWYG tools is damaging to the user experience. We believe that this is soluble, however, by extending bibliographic tools, by developing a 'kcite' style-file or template; we have a prototype of this (using CSL [10]) for Zotero and Mendeley, and another for asciidoc with bibtex. It is also possible to just use native tool support in Word or LaTeX, and convert bibliographies to HTML; the disadvantage with this approach is discussed later.

Archiving and Searching: Archiving is primarily a social, rather than technological, problem. A blog engine is fully capable of storing content in the long term, but authors and readers have to believe that it will do so. As a novel form of academic publishing, K-Blog is not automatically archived as a scientific journal would be. However, we have taken advantage of its web publication; the main K-Blog site is now explicitly archived by the UK Web Archive, as well as implicitly by other web archives. We have enhanced the website with an 'easy crawl' plugin: a single web page pointing to all articles classified as reviewed. We now support the (technical) requirements for LOCKSS and Pubmed. Simultaneously, this also enhances the searchability of K-Blog, fulfilling the requirements for Google Scholar.

Non-repudiability: The K-Blog process does not allow authors to make semantically meaningful changes after an article has been reviewed. Unfortunately, it is hard to define 'semantically meaningful' computationally, so we have made no attempt to address this by locking articles; rather, all versions of articles are now accessible to the reader (WordPress provides this facility to the authors by default). This enables community enforcement of a no-change policy.

Multiple Authors: We believe that authoring is best done outside WordPress. This also means that we do not support multiple authorship; we have made no attempt to add collaborative features to WordPress. However, we did need articles to carry a byline attributing the articles to multiple authors; although not critical to the functioning of a K-Blog, it is socially critical to appease the professional narcissism (see Section ) of scientists. Fortunately, this is a common requirement, and a suitable WordPress plugin existed.

Identifiers: WordPress already supports permalinks; although we believe that URLs are entirely fit for purpose technologically, while DOIs do little other than introduce complexity [11], K-Blog required DOIs for professional narcissism. We considered becoming a DOI authority, but this proved impractical. Instead, we have used DataCite [2]. This has required a small extension to WordPress to extract appropriate metadata and to store the DOIs once minted.

Metadata: K-Blog now exposes various parts of its metadata in a number of ways; unfortunately, there appear to be a large number of (non-)standards in use, each with its own application. K-Blog currently provides: COinS, enabling integration with Zotero and Mendeley; meta tags for Google Scholar; and Dublin Core tags for no reason other than completeness. We are in the process of providing bibtex export (for bibtex!), and a JSON representation to support citeproc-js [14] in the second generation of kcite.

Mathematics and Presentation: We have also provided several pieces of technology that did not stem from concrete requirements arising from the initial Ontogenesis meeting. We have improved parts of the presentation system by adding, for example, syntax highlighting to code blocks. Additionally, we have created the MathJax-LaTeX plugin, enabling the use of TeX (or MathML) markup in posts that are then rendered in the browser using scalable fonts. WordPress has native math-mode TeX support, but it renders mathematics as images, which do not scale and have an ugly, pixelated display.

Discussion

We have been motivated by a lack of enthusiasm for traditional book publishing to devise another mechanism by which we can achieve the same ends. We wished to avoid the downsides of an 'all or nothing' approach to creating a 'static' paper document that is read by relatively few people due to price. The K-Blog approach allows authors to publish in a piecemeal fashion, writing only that which they are motivated to write, using a mechanism that avoids a third party making arbitrary decisions on formatting, with peculiar time-scales.

To avoid all this, the K-Blog is a light-weight publishing process based on commodity blogging software. We have taken the approach of writing short articles around a theme of 'ontology in biology': the Ontogenesis K-Blog. At the time of writing we have 26 articles, and page viewing numbers that are pleasing (see Figure 1). These statistics are generated by WordPress directly, and represent (an approximation of) 'real' page reads, with robot and self-viewing removed. This is confirmed by the ten most read articles (Table 2), which reflect our expectations, with 'What is an ontology?' being first. In this sense, we consider the K-Blog process to be a success, especially when considered against the circulation of an equivalent book.

Figure 1: Monthly page view statistics for the Ontogenesis K-Blog.

What is an ontology? (1,737)
OWL Syntaxes (1,246)
Ontology Learning (882)
Table of Contents (740)
What is an upper level ontology? (684)
Reference and Application Ontologies (630)
Protege & Protege-OWL (522)
Semantic Integration in the Life Sciences (517)
Automatic maintenance of multiple inheritance ontologies (469)
Ontologies for Sharing, Ontologies for Use (330)

Table 2: Most viewed articles for the Ontogenesis K-Blog (total page views).

The social processes of K-Blog are largely similar to traditional publishing, with one exception: reviewing is public. While we may have been interested in experimenting with this for principled reasons, in practice we adopted it because we did not know how to support blind, anonymous review with WordPress. Open review is not a new idea: Requests For Comments are common in standards processes; both Nupedia [4] (the fore-runner of Wikipedia) and H2G2 [6] (which predates Nupedia) use public peer-review. It is still, however, unusual in academia. In our experience with Ontogenesis, it raised no concerns among our contributors, except that reviewers often wanted to be more involved in the proofing, a role normally played by authors low down the author list; open review blurs these lines somewhat.

One open area for discussion is the extent to which authors can, should, and wish to change articles after publication. While the ability to update is inherent in the web, the desire for non-repudiability was considered to be important; the contradiction here appears fundamental, and we do not feel we have reached a good compromise yet. In one sense, our use of the post-revision display plugin solves this problem; even if the article changes, it is still possible to refer to a specific version. However, like all automated versioning tools, many versions get recorded, often with very fine-grained changes, which makes selection of the 'right' version hard or impossible. We could replace this with an explicit versioning tool, similar to a source code versioning system; but these systems are hard to use for those unfamiliar with them, as well as being difficult to implement well. An environment like K-Blog, however, does allow rapid publication of articles and bi-directional linking between them; combined with typed linking using CiTO, the ability to publish errata, addenda and second editions may be a better solution.

Our experiences with K-Blog, we think, are useful in understanding how semantic web technology can and will impact on the publication and library process. Both from our initial work with Ontogenesis, and subsequent work with http://taverna.knowledgeblog.org, it has become obvious that good tool support is critical. 'Good' in this sense can be straightforwardly interpreted as 'familiar', which in general means MS Word. Our choice of a blogging engine here was (unexpectedly) well-advised, as this form of publication is already supported by many tools. It is also clear that there are many other tools that could be added; while Ontogenesis has the content, for example, that might be found in an academic book, it does not currently have the presentation of a book. Articles are already available as ePub, and more recent work has used our Table of Contents plugin to provide a single site-wide ePub of all articles [25]. Pre-existing tools such as Anthologize [9] may also be useful for producing organised collections of articles gathered from the whole.

This has direct implications for the addition of further semantics to content. On the positive side, the use of WordPress makes semantic additions plausible in a way that many conventional publishing processes do not. For example, the publication of our (PWL, RS) recent paper [20] required conversion from the LaTeX source to PDF (by latex), to another PDF, to a MS Word file (by hand), to XML, before arriving at the final HTML form. This process took many weeks and required multiple interactions between the authors and publisher. It still failed to preserve the semantic use (to humans) of Courier font to highlight in-text ontology terms, requiring post-publication correction. The equivalent blog post [21] gave us nearly instantaneous feedback on the final form, allowing us to check that the semantics were present and correct.

The requirements for semantics have, however, to be light. We have concentrated throughout K-Blog on the ease of delivery of content; even with this focus, it is hard. In most cases, asking for more work, for more semantics than authors are used to giving in papers, is problematic. For example, I (PWL) attempted to add microformat-based markup to Ontogenesis, again to identify ontology terms. So far, all article authors have ignored this markup (including, embarrassingly, myself).

One solution to this issue is to ensure that authors themselves benefit directly from extra semantics. For example, the MathJax-LaTeX plugin allows WordPress to present mathematics in TeX or MathML markup in the final document, which is more semantically meaningful than the default WordPress behaviour of rendering an image. From the author's perspective, it also enables the use of TeX markup in Word, and the end product scales and looks less ugly on the web page.

With Kcite, we allow the user to embed DOIs or Pubmed IDs; this can be achieved at no cost to the user, if they already use a bibliography tool, as it can transparently produce citations for them using Kcite shortcodes. Development versions of Kcite already allow easy switching of bibliographic style, which we hope will become an option for the author (rather than the website or publisher, as is currently the case), and/or the reader. With this additional information, we can also embed more semantics into the end document at no additional cost to the author, using, for example, the least specific CiTO cites term. However, further use of CiTO will require the author to decide which term to use, with relatively little gain to themselves, and may require extensions to bibliographic tools if we are to maintain the transparency of Kcite shortcodes; even if the tools are present, it is unclear whether authors will use them. We note that semantics useful to domain authors is likely to be domain-specific; mathematicians are more likely to care about maths presentation, but less likely to care about Pubmed IDs. We need to be able to extend the publishing model and environment for different journals to cope.

From a technological perspective, we have found the use of shortcodes to be a good mechanism for authors to add semantics. They are simple and relatively easy to understand. In some cases they can be hidden from the user entirely; forcing users to add markup to otherwise WYSIWYG environments such as MS Word is best avoided. Although the direct use of a more standard XML markup would seem more sensible, in practice it requires tool support, as XML markup will be escaped by helpful remote posting tools. Extension of remote posting tools is hard (for tools like MS Word) or impossible (for cloud tools such as Google Docs or LiveWriter). A blogging engine such as WordPress makes it trivial to replace shortcodes with both a presentation format and a machine interpretable microformat; for example, the development version of Kcite transforms DOI shortcodes ([cite]10.232/43243[/cite]) into in-text citations (Smith et al, (2002)) embedded in a span tag (<span kcite-id="10.232/43243">Smith et al, (2002)</span>) that are subsequently transformed into final presentation form within the browser using Javascript. The presentation form can also support additional semantic markup such as CiTO [26].
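The substitution step described above can be sketched in a few lines. The sketch below is in Python and is purely illustrative: the real Kcite plugin is written in PHP for WordPress, so the function name, the class attribute and the placeholder text used here are assumptions rather than its actual implementation.

    import re

    # Illustrative only: replace [cite]identifier[/cite] shortcodes with a span
    # carrying the identifier. The real plugin resolves the identifier and renders
    # the in-text citation; here a placeholder stands in for that step.
    def cite_shortcodes_to_spans(html):
        return re.sub(
            r"\[cite\](.+?)\[/cite\]",
            r'<span class="kcite" kcite-id="\1">[citation]</span>',
            html)

    print(cite_shortcodes_to_spans("As shown previously [cite]10.232/43243[/cite]."))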

Although we believe that additional semantics are a good thing, we will not enforce a requirement for additional semantics on authors. If authors choose not to use kcite, then this is their choice; we need to show that such additions are useful. Our experience with many (non-)standards such as COinS, DOIs, OAI-ORE and LOCKSS is that they are not simple, speaking primarily to publishers or librarians. For a semantic web approach to work, it must focus on authors and readers, as they produce and consume the content. Extracting even light-weight semantics from authors who are ontology experts is hard. For other domains, the situation may be worse.

Current publishing practices make the use of semantic web technology impractical; semantics added by authors are unlikely to be represented correctly if the end product is a PDF typeset by hand. Moreover, we can see little point in adding semantics to individual articles if this is done in a bespoke way. With K-Blog, we have focused on providing both content and a full process, with review, using existing tools and workflows, adding semantics secondarily or incidentally where we can. As a result, the level of semantics that we have achieved is light-weight. However, we believe that K-Blog and WordPress, combined with associated tooling, provide all the basic requirements for a publishing process, and that this provides an attractive framework on which to build a semantic web.

Acknowledgements

We would like to acknowledge the contribution of the authors of articles for both the Ontogenesis and Taverna K-Blogs, whose feedback was essential for this process. K-Blog is currently funded by JISC.

Bibliography

[1] Bioinformatics. http://bioinformatics.knowledgeblog.org.
[2] Datacite. http://datacite.org/.
[3] Health and Public Health. http://health.knowledgeblog.org.
[4] Nupedia. http://en.wikipedia.org/wiki/Nupedia.
[5] Open Journal System. http://pkp.sfu.ca/?q=ojs.
[6] The Guide to Life, the Universe and Everything. http://www.bbc.co.uk/h2g2/.
[7] WordPress. http://www.wordpress.org.
[8] Wikipedia:Expert retention, 2008. http://en.wikipedia.org/wiki/Wikipedia:Expert_retention.
[9] Anthologize, 2010. http://anthologize.org/.
[10] Citation Style Language, 2010. http://www.citations-styles.org.
[11] The problem with DOIs, 2011. http://www.russet.org.uk/blog/2011/02/the-problem-with-dois/.
[12] The Taverna Knowledgeblog, 2011. http://taverna.knowledgeblog.org.
[13] Sean Bechhofer. Reflections on blogging a book. Ontogenesis, 2011. http://ontogenesis.knowledgeblog.org/647.
[14] Frank Bennett. Citeproc-js. https://bitbucket.org/fbennett/citeproc-js/wiki/Home.
[15] Simon Cockell, Dan Swan, and Phillip Lord. Knowledgeblog types and peer-review levels. Process, 2010. http://process.knowledgeblog.org/archives/19.
[16] Zoe Corbyn. Wikipedia wants more contributions from academics, 2011. http://www.guardian.co.uk/education/2011/mar/29/wikipedia-survey-academic-contributions.
[17] Casper Grathwohl. Wikipedia comes of age. The Chronicle of Higher Education, 2011. http://chronicle.com/article/article-content/125899/.
[18] D. Kell. Metabolomics, food security and blogging a book, 2010. http://blogs.bbsrc.ac.uk/index.php/2010/01/metabolomics-food-security-blogging-book/.
[19] Jim Logan. What is an ontology? | Ontogenesis, 2010. http://ontogoo.blogspot.com/2010/01/what-is-ontology-ontogenesis.html.
[20] Phillip Lord and Robert Stevens. Adding a little reality to building ontologies for biology. PLoS One, 2010.
[21] Phillip Lord and Robert Stevens. Adding a little reality to building ontologies for biology, 2010. http://www.russet.org.uk/blog/2010/07/realism-and-science/.
[22] Phillip Lord and Robert Stevens. The Ontogenesis Manifesto, 2010. http://ontogenesis.knowledgeblog.org/manifesto.
[23] Phillip Lord and Robert Stevens. Ontogenesis: One year on. Ontogenesis, 2011. http://ontogenesis.knowledgeblog.org/1063.
[24] Tom Oinn, Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris, Kevin Glover, Carole Goble, Antoon Goderis, Duncan Hull, Darren Marvin, Peter Li, Phillip Lord, Matthew R. Pocock, Martin Senger, Robert Stevens, Anil Wipat, and Chris Wroe. Taverna: lessons in creating a workflow environment for the life sciences. Concurr. Comput.: Pract. Exper., 18:1067–1100, August 2006.
[25] Peter Sefton. Making ePub from WordPress (and other) web collections, 2011. http://jiscpub.blogs.edina.ac.uk/2011/05/25/making-epub-from-wordpress-and-other-web-collections/.
[26] David Shotton. CiTO, the Citation Typing Ontology. Journal of Biomedical Semantics, 1(Suppl 1):S6, 2010.
[27] M.Q. Stearns, C. Price, K.A. Spackman, and A.Y. Wang. SNOMED clinical terms: overview of the development process and project status. In AMIA Fall Symposium (AMIA-2001), pages 662–666. Hanley & Belfus, 2001.
[28] The Gene Ontology Consortium. Gene Ontology: Tool for the Unification of Biology. Nature Genetics, 25:25–29, 2000.

This post carries the text of a paper accepted for PLoS One (now published). I publish it here as a pre-print because of the recent discussion on OBO Discuss about realism. I have converted this from the original LaTeX, which isn't perfect. Apologies for errors.

The [PDF] is available here.

Adding a little reality to building ontologies for biology
Phillip Lord and Robert Stevens
School of Computing Science
Claremont Road
Newcastle University
Newcastle-upon-Tyne, UK
phillip.lord@newcastle.ac.uk
School of Computer Science
The University of Manchester
Oxford Road
Manchester, UK
robert.stevens@manchester.ac.uk

Abstract

Background: Many areas of biology are open to mathematical and computational modelling. The application of discrete, logical formalisms defines the field of biomedical ontologies. Ontologies have been put to many uses in bioinformatics. The most widespread is for description of entities about which data have been collected, allowing integration and analysis across multiple resources. There are now over 60 ontologies in active use, increasingly developed as large, international collaborations.

There are, however, many opinions on how ontologies should be authored; that is, what is appropriate for representation. Recently, a common opinion has been the “realist” approach that places restrictions upon the style of modelling considered to be appropriate.

Methodology/Principal Findings: Here, we use a number of case studies for describing the results of biological experiments. We investigate the ways in which these could be represented using both realist and non-realist approaches; we consider the limitations and advantages of each of these models.

Conclusions/Significance: From our analysis, we conclude that while realist principles may enable straight-forward modelling for some topics, there are crucial aspects of science and the phenomena it studies that do not fit into this approach; realism appears to be over-simplistic, which, perversely, results in overly complex ontological models. We suggest that it is impossible to avoid compromise when modelling an ontology; a clearer understanding of these compromises will better enable appropriate modelling, fulfilling the many needs for discrete mathematical models within computational biology.

Introduction

Ontologies are now widely used for describing and enhancing biological resources and biological data, largely following on from the success of the Gene Ontology [1]. Ontologies have been used for many purposes, from schema integration to value reconciliation to query interfaces [2]. Ontologies have also become a cornerstone of computational biology and bioinformatics. As computationally amenable artifacts they are, themselves, a direct part of computational biology; many computational biologists are involved in their production and maintenance. Many more use ontologies to summarise their data, often by looking for over-representation [3], as the basis for drawing computational inferences about data [4], or as the basis for determining semantic similarity [5]. Even those not making direct computational use of ontologies are likely to come into contact with them, for example, when preparing annotation as part of their data release [6].

It is, therefore, of vital interest to computational biologists that ontologies for use within biomedicine are fit for purpose. One effort that aims to increase the quality of the ontologies available within biomedicine is the "OBO Foundry" [7]. The main tool that it uses for this is "an evolving set of shared principles governing ontology development". The initial eleven principles of the OBO Foundry [8] were largely concerned with what might be termed 'good engineering practice' (ontologies must, for example, be openly available, with a common syntax, well documented, and used). These principles have since been joined by a further eleven [9]; these include principles such as "textual definitions will use the genus-species form", "Use of Basic Formal Ontology" and, somewhat quixotically, "terms [...] should correspond to instances in reality". These stem not from engineering practice, but from a perspective called realism.

The many different uses for ontologies that we have described are reflected in different understandings and methodologies about how and what to represent in an ontology. Over the last few years, for many uses the paradigm has moved from “a conceptualization of the application domain” toward “a description of the key entities in reality”; it is this latter approach that defines realism [10]. This approach to ontology is typified by the Basic Formal Ontology (BFO); a small upper-ontology for use within science in general and biomedical ontology building in particular [11].

There has been significant discussion regarding the possibility of representing only "real entities" in computational ontologies [12]. Likewise, there has been significant discussion about the philosophy surrounding realism and the role of ontology in its representation [10]. While it is argued by some that it is possible to represent only reality when making a domain description, there has been little discussion of whether it is desirable to do so.

In this paper, we consider the implications that realism has for the choices that are open to the ontologist while they are modelling their domain of interest. In particular, we consider the implications that this has for the computational capabilities of any resultant ontology, in terms of its ability to represent scientific knowledge in a computationally amenable form, as well as the ability to perform automated inference or statistics over this knowledge. We suggest that the application of realism results in ontologies that are over-complex, awkward or limited; as such, realism falls far short of its aim of increasing the fitness-for-purpose of ontologies. This approach, therefore, is unlikely to fulfil the needs of the computational biologists who form a substantial part of both the user and developer community for bio-ontologies.

Methods

In this paper, we take the approach of a number of worked exemplars; this is a complementary approach to an in-depth consideration of the modelling decisions for a particular area or particular ontology, which we have used previously [13], as it allows broader conclusions about the general principles of ontology development. For each section, as well as the main exemplars, a number of related examples are briefly discussed, to reinforce that the issues raised are, indeed, general.

The exemplars have been selected by several criteria. First, the main exemplars are all taken from within biomedicine; this is also true for the majority of the related examples. Second, we have chosen exemplars that provide as wide a coverage of biology as possible. For practical reasons, third, we have chosen exemplars where the underlying science is relatively basic to much of biology and is likely to be immediately clear to the reader without significant explanation.

We have chosen exemplars requiring as little knowledge of specific ontologies as possible. We refer to only three. The first is BFO (see "What is Realism?"), which is a canonical example of a realist ontology. BFO is described as a cross-domain, upper-ontology; as a result, most of its terms fail the criteria given above; they are of poor biomedical relevance, and are not basic science or immediately clear. We have, therefore, also used PATO (see http://obofoundry.org/wiki/index.php/PATO:Main_Page); this defines "qualities" that we might consider attributes of other entities; so, the authors of this paper have a height, weight and shape, all of which are considered to be qualities of the authors. Finally, we use the relationship ontology [14]; this describes the relations between entities. So, for example, the height of the author inheres_in the author.

As discussed in this and other works [15, 16], "realism" is itself poorly defined. Where this lack of definition makes the consequences of realism hard to determine, we have taken the practical course of showing the consequences as they play out in practice; to an extent, therefore, these three ontologies are not only exemplars for realism, but define it, as it is currently practiced. In short, for this paper, when we say "realism", we largely mean "realism as practiced by BFO". We do not claim, in this paper, to address all the philosophical perspectives that have, through time, carried the name "realism".

Results

What is Realism?

Building ontologies based on reality is obviously appealing to most scientists; after all, the study of reality to determine its behaviour and laws is the goal of science. A brief consideration, however, shows that this notion cannot define a methodology for the building of ontologies.

Within the context of science “reality” would normally be taken to mean our experimental or observational data; but the statement that science (ontologies) should be based on experimental or observational data is a truism and, as such, has no explanatory power. The “real” in realism refers, in fact, to the belief that the categories that we can use to divide entities are, themselves, real.

This distinction stems from an old argument in philosophy: realism against conceptualism. Again, both sides of the argument agree that the world we can perceive and, as scientists, experiment on, is mind-independent. The conceptualist, however, argues that the categories, which they term concepts, are a product of social agreement. Conversely, the realist argues that these categories, which they term universals, are themselves real, that is, mind-independent in their own right, like the entities they describe.

This distinction may seem fairly confusing; as Russell [15] says “if I have failed to make Aristotle’s theory of universals clear, that is (I maintain) because it is not clear”. In fact, there is a third possibility that is a more empirical view—that is, if categories (or other models) help in describing and predicting experimental data, then they are useful regardless of whether they are real or otherwise [17]. As an example, the Mendelian notion of segregating units of inheritance was defined and useful many years before a complete mechanistic description of their cause was available. In this context, we note that there is no commonly used term to express this form of category; most commonly, “concept” is used.

For a field with a core activity of providing definitions, there is surprisingly little agreement on the meaning of the word “ontology”; as there have been many papers on the topic, we consider just a few that reflect the distinction between these approaches. Probably the most commonly cited definition [18] describes an ontology as “a specification of a conceptualization”. This definition emphasises the formality (i.e. logical and, therefore, computationally amenable) aspect to ontology development.

This is countered with a realist definition; while the requirements from Gruber’s definition—a formal specification—are necessary, realist ontologies add the requirement that “the nodes and edges correspond not to concepts but, rather, to entities in reality” [19].

What does“reality” in this context actually mean? Definitions such as “that which exists” are strangely circular leaving the question of what “exists” means. Smith [12] adds the priviso that reality is “captured in scientific laws”. Being a scientific law is not strictly enough, as some are later shown to be wrong, but a scientific law is the current best attempt at reality; this possibility does not make an ontology non-realist. For a realist ontology, the nodes are “universals”—entities in reality—rather than concepts; at least one particular must exist for every universal.

This still leaves the difficulty of applying the realist definition in practice. Most scientists will happily accept, for example, that a cell is real, as it is an entity that can be observed, interacted with and manipulated. However, concepts such as "function" [13] have raised more discussion [20]; is this "real", or just a word biologists use as a point of reference? While definitions involving "entities in reality" may be of philosophical interest, they are hard to turn into a specific assay: how do we test whether a particular concept is also a universal? Instead of a clear assay for existence, realism offers direction about which concepts are NOT reality, rather than which are. For example, and perhaps ironically given this negative practical definition of reality, a statement such as:

  Dog is_a not Cat

is not held to be a statement about reality as it is a logically constructed example of subsumption (an is_a relationship); there is no real universal containing particular not Cats in existence. Likewise,

  Dog is_a (Dog or Cat)

as the existence of particular Dogs and Cats does not mean that there are any particular Dog or Cats (examples modified from [12]).

This is not meant to provide a complete introduction to "realism", but to provide a grounding for the discussion that follows; we will consider the issues raised by realism throughout the paper. A more philosophical treatment of realism is given by Merrill [16]. It is useful to note Gruber's [18] statement that "it [a computational ontology] is certainly a different sense of the word than its use in philosophy". In this paper, we are concerned with ontologies as computational artefacts.

To summarise, a realist approach to ontology says that the categories or universals into which objects or particulars fall have an existence in their own right. It is these universals, and only these universals, that a realist approach says should be the nodes within an ontology. In this paper we examine whether this approach is an adequate means of accounting for the data produced by biomedicine.

Models that represent reality

In this section, we suggest that many universals have a range of representations. In some cases, the choice of representation may be obvious, such as length, which has a natural scientific representation in SI units. In many cases, however, there is no clear set of criteria for choosing between representations. We consider the way that one quality, colour, could be represented ontologically.

Colour is a complex phenomenon. The colour of an object or other phenomenon arises, in part, from that object and, in part, from the eye that perceives it.

A representation of the physical reality would be an account of the reflection, transmission and perception of light by an organism. Such an account of the reality of light and its perception might cover the following facts: chlorophyll is green in reflection and red in transmission; a flower petal appears white to a human, but has UV stripes to a bee; a plant leaf and an alga appear green to humans, but have different reflection spectra because their chlorophylls co-ordinate to their Mg2+ ions in different ways.

There have been a number of different attempts to represent the complexities of colour numerically, for a number of different purposes. These are models that allow us to describe colour without having to deal with the underlying physics or reality of colour. Probably the best known of these are RGB (Red, Green, Blue) and HSV (Hue, Saturation, Value), both of which are additive colour models appropriate for describing colour on a display screen. CMYK (Cyan, Magenta, Yellow and Black) is a subtractive colour model, commonly used for printing.

Collectively these representation schemes are known as colour models. That none of these schemes has become predominant reflects both their different uses and the preferences of different user groups.
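The interconvertibility, and the distinctness, of these models can be seen in a few lines of Python. The sketch below uses the standard colorsys module and a naive subtractive conversion; the particular colour values are arbitrary, and the sketch is intended only to illustrate that the same colour has several numerical representations, none of which is the underlying physics.

    import colorsys

    # One colour, three numerical representations (channel values in [0, 1]):
    r, g, b = 0.2, 0.6, 0.3                     # RGB, an additive model
    h, s, v = colorsys.rgb_to_hsv(r, g, b)      # HSV, another additive model
    c, m, y = 1 - r, 1 - g, 1 - b               # a naive CMY subtractive model

    # Transformations exist between the models (and may be lossy in practice,
    # for example when values are quantised to 8 bits per channel); none of
    # them describes the physics of reflection and perception.
    print((h, s, v))
    print(colorsys.hsv_to_rgb(h, s, v))         # approximately (0.2, 0.6, 0.3)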

For the ontology builder, this leaves us with a difficult choice:

  1. We bless one of the colour models, substituting the model for the underlying physics and do not describe the others.

  2. We describe all of the colour models, but do not describe that they are part of a colour model.

  3. We explicitly describe the reality of the physics, biology and the relationship to the different colour models, reflecting the practice of describing colour in much of science.

Currently, considering the PATO ontology, which is documented as being built according to realist principles, the first approach has been taken, using the HSV scheme. So, PATO has a term Color Hue (PATO:15) that is defined as:

“A chromatic scalar-circular quality inhering in an object that manifests in an observer by virtue of the dominant wavelength of the visible light; may be subject to fiat divisions, typically into 7 or 8 spectra.”

Using this model, PATO describes red (PATO:322) as:

“A color hue with high wavelength of the long-wave end of the visible spectrum, evoked in the human observer by radiant energy with wavelengths of approximately 630 to 750 nanometers.”

This modelling approach has a number of limitations.

  • The decision to choose one colour model or another is arbitrary. While there are reasonable justifications for the use of HSV as opposed to, for example, RGB, there is no a priori justification for the use of an additive colour model as opposed to a subtractive one. Both are valid, for different usages; in general, reflective colour is more common in biology (e.g. pigmentation) than emitted colour (e.g. fluorescence), which would suggest that subtractive models are more generally applicable, but a full treatment requires both.

  • There are no terms which can be used to express data described according to other colour models, necessitating a transformation between the different models into the officially “blessed” version during application of the ontology. These transformations may be lossy and not fully reversible.

The second approach is also possible. This would allow expression of data in multiple colour models, however:

  • The ontology would tend to get rather confusing as more colour models are added; colour would have children “Hue”, “Red” and “Cyan” and seven other sibling terms.

  • It is not clear which terms comprise a colour model: do values for “Hue”, “Green” and “Magenta” specify a colour?

  • It is not clear whether terms that occur in other contexts are equivalent. Is "Red as in RGB" the same as, or different from, Red (PATO:322)? Is "Hue as in HSV" the same as, or different from, "Hue as in HSL" (HSL is another additive colour model)?

The third approach does not suffer from the limitations described. We suggest from this analysis that it is necessary, if unfortunate, for some qualities to be explicitly described with multiple representations. To avoid confusion, the universal quality, colour, would need to be explicitly described as having multiple valid models. Yet realism argues that we should not do this, as colour is real and not a model; moreover, the focus on realism means that the documentation does not describe the choices that have been made, nor refer to the relationship between Color Hue (PATO:15) and "Hue as in HSV". In short, realism has limited our ability to represent colour.

Related Examples

There are many different examples of this issue; having two or more models to describe the same part of reality is common. The distance between two markers on a chromosome can be measured using (one of a number of) genetic techniques. Some qualities have a bewildering array of different measurements associated with them; Wikipedia, for example, lists 13 different measurements of concentration such as molarity or \(gm^{-3}\).
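As a sketch of the same point, converting between just two of these measures, molarity and mass concentration in \(g\,m^{-3}\), already requires additional knowledge beyond the measurement itself (the molar mass of the solute); the solute and figures below are illustrative only.

    # Convert molarity (mol/L) to mass concentration (g/m^3).
    # Requires knowledge of the solute: here glucose, molar mass ~180 g/mol.
    def molar_to_g_per_m3(molarity_mol_per_l, molar_mass_g_per_mol):
        # 1 L = 1e-3 m^3, so 1 mol/L = 1000 mol/m^3
        return molarity_mol_per_l * 1000 * molar_mass_g_per_mol

    print(molar_to_g_per_m3(0.1, 180.0))   # 18000.0 g/m^3 for 0.1 M glucose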

This issue has been previously recognised. In computing science, explicitly modelling one model in another is a form of metamodelling. Other, non-realist, upper-ontologies such as DOLCE use the concept of Quale to describe a cognitive abstraction (such as Colour), including those over a physical quality (such as the spectral properties of reflected light) [21].

Sequences and the Central Dogma

The central dogma of molecular biology suggests that all genetic information is encoded in the DNA of a cell, as the ordered nucleotides that comprise the DNA. RNA is transcribed from this DNA. The RNA molecule also has a defined order of nucleotides related to the DNA. Finally the RNA is translated into protein.

Consider an ontology describing these entities. First, the DNA molecule has a number of properties; as well as physical dimensions (discussed further in "The limits of consistency"), including a length expressed in metres, it consists of a number of monomeric units. So, for example, we might say a DNA molecule with a series of nucleotide residues represented as 'GATC' hasMonomericPart 4.

This causes a slight worry from a realist perspective; the number 4 may not be a realist universal, as there are no instances of 4. In this case, the number 4 is being used to describe a part of reality, so this is allowable in a realist ontology. Alternatively, we could describe the same reality using units (traditionally base-pairs or bp). Therefore, the DNA molecule hasPolymerLength 4bp.

Accepting the use of natural numbers in this way also means that we accept the use of sets and sequences to describe reality. One definition of 4 is a sequence. Stating that the DNA molecule represented with the sequence 'GATC' hasPolymerLength 4bp is equivalent, therefore, to stating that it hasSequence 'NNNN', where 'N' is any nucleotide residue.

It should be noted, however, that the usefulness of these statements stems from our implicit knowledge. The number 4 is a natural number, so hasMonomericPart 4.2 is not possible. If a new monomer is attached to our DNA molecule, it will then hasMonomericPart 5, because the natural numbers are additive. We understand the operation of natural numbers as part of our shared, background knowledge, and we can apply this knowledge here.

Having described that the DNA molecule represented as 'GATC' hasPolymerLength 4 (or hasSequence 'NNNN'), we might wish to be more specific about the order of nucleotide residues and state hasSequence 'GATC'. The implicit background knowledge we used previously about the natural numbers still applies here.

Next consider the process of transcription. The previous discussion about DNA likewise applies to RNA. The RNA molecule will, however, hasSequence 'GAUC', as RNA uses a different set of bases to DNA. Mathematically, one sequence can be determined from the other by applying a mapping, though the mapping is a human activity, not a representation of biochemical reality. To describe this, we have two options:

  • Taking the realist approach, we can continue to rely on the implicit knowledge of the biologist, as we have previously relied on an implicit understanding of the natural numbers.

  • We can be explicit about the properties of these sequences (additional to those properties shared with the naturals). We can talk about non-real world concepts such as alphabets, transformations and how these map to the real entities involved.

It should be noted that the former severely limits the ability to describe the central dogma. The transformation of DNA to RNA sequence is simple, but the transformation of RNA to protein is more complex. Again, the choice is between representing reality or representing how we practise science.
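The sequence view used above can be made concrete with a short sketch. The mapping shown is the simple T-to-U substitution used in the text (real transcription produces the complement of the template strand), and the property names in the comments echo the hasSequence and hasPolymerLength statements used here, which are illustrative rather than drawn from any particular ontology.

    # Sequences as abstract data types: a restricted alphabet plus operations.
    DNA_TO_RNA = str.maketrans("GATC", "GAUC")   # the simple T -> U mapping

    def transcribe(dna):
        """Map a DNA sequence representation to its RNA representation."""
        return dna.translate(DNA_TO_RNA)

    dna = "GATC"
    print(len(dna))          # hasPolymerLength 4 (bp)
    print(transcribe(dna))   # hasSequence 'GAUC'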

Related examples

The issues relating to sequences are fairly general. In computer science terms, these are abstract data types. The DNA sequence is a kind of sequence with special properties (a limited alphabet). Many of the physical quantities in science have special properties in this way. Consider:

Temperature:

While these look like positive real numbers, temperatures can only meaningfully be subtracted from each other, which gives information about heat flow between two bodies. Other operations (addition, multiplication), which are useful for real numbers, have little meaning for temperature.

Recombination Distance:

These look like probabilities but are not; they require a transformation before they can be added.
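To make the recombination example concrete: recombination fractions between adjacent markers do not add directly, but become additive once transformed into map distances, for instance by Haldane's mapping function (which assumes no crossover interference). The sketch below, with invented recombination fractions, illustrates this.

    import math

    def haldane_distance(r):
        # Haldane's mapping function: map distance (in Morgans) from a
        # recombination fraction r, assuming no crossover interference.
        return -0.5 * math.log(1 - 2 * r)

    r12, r23 = 0.1, 0.2
    r13 = r12 + r23 - 2 * r12 * r23          # 0.26, not 0.1 + 0.2 = 0.3
    d13 = haldane_distance(r12) + haldane_distance(r23)
    print(haldane_distance(r13), d13)        # the map distances do add: ~0.367 in both cases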

There is a limitation on the ability to use abstract data types within a given ontology language; in most cases, the expressivity of the language will not allow arbitrary mathematical relations. Some languages, such as OWL, provide "concrete domains"; these provide extension points within the ontology language where, for example, the special properties of temperature could be represented; other languages do not. In either case, there are limitations to these capabilities; for example, the constraints and behaviour of a concrete domain need to be interpreted with their own semantics within a reasoner, rather than being expressed explicitly within the ontology. It may make more sense in many circumstances to describe the existence of a mathematical model, as discussed in "To go where science has gone before".

The limitations of computers

Modelling continuous properties is a common problem in ontological engineering. For example, according to statistics, the western world is now facing an obesity epidemic; in short, many or most of us weigh too much. Understanding, however, exactly what "too much" means is not necessarily simple; a common technique is to use the body mass index (BMI): body weight divided by the square of height, which is a continuous value. The BMI range is split into 4 categories: Obese (>30), Overweight (>25), Normal (>18.5) and Underweight (<18.5). These categories represent ranges of the value of BMI.

This data simplification has many justifications. On an individual basis, the BMI is not a particularly accurate measure, so the simplification does not lose much accuracy. It is also easier to describe to patients, for whom a “BMI of 25” will be less comprehensible than being “overweight”.

Modelling some of this is straight-forward. Height and weight are modelled as properties of the individual. The BMI would therefore appear to be a property of the individual as it is a restatement of two existing properties. It would appear, therefore, that the category into which an individual falls should also be a property of the individual.

Consider the values of the property next. These categories are an abstraction over the real-world properties. Although height as an integer value is expressed using a non-real-world entity, it is a description of a part of reality. A range of the BMI, however, does not describe part of reality in the same sense. There are no instances of BMI "Obese". In a realist ontology, therefore, it is unclear what the relationship is between BMI Obese and the individual person.

For the statistician or computer scientist, there is an additional advantage to the simplification; four discrete groups have better computational properties than a continuous measure. Database queries become easier to write, and quicker to run. This is also true for the ontology builder; simplifying the real-world may fulfil the needs of an application for which the ontology is built, while avoiding unnecessary complexity. This is a widely used method for representing partitions of continuous values, the appropriately named value partition [22].

In the case of BMI there is a pre-existing social agreement towards a set of categories; however, even in the absence of such an agreement, the ontology builder might wish to represent a continuous range as a value partition to decrease the complexity of their ontology. The value partition is useful, but many of the concepts involved are not realist universals. The choice, then, is between modelling "reality" and modelling a simplification that is easier to use and has better computational properties.
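A value partition of this kind is trivial to express computationally. The following sketch uses the category boundaries given above; the example weight and height are arbitrary.

    def bmi(weight_kg, height_m):
        """Body mass index: weight divided by the square of height."""
        return weight_kg / height_m ** 2

    def bmi_category(value):
        # The value partition: four discrete ranges over a continuous quality.
        if value > 30:
            return "Obese"
        if value > 25:
            return "Overweight"
        if value > 18.5:
            return "Normal"
        return "Underweight"

    print(bmi_category(bmi(80.0, 1.8)))   # BMI ~24.7 -> "Normal"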

Related Examples

Splitting the two cases, there are many examples of pre-existing simplifications. From medicine, there are so many that they seem to be the norm rather than the exception: hypo- vs hyperthermic; hypo- vs hypertensive; hypo- vs hyperglycemic. In many cases, these ranges have standard interpretations akin to the BMI.

There are likewise a number of constructions or design patterns that reduce complexity, extend the effective capabilities of the language or simply provide standard solutions to common problems [23].

To go where science has gone before

Many experiments in biomedicine require the measurement of some physical property of a biological system. Take, for example, the measurement of heart rate; in standard practice, this is measured in beats per minute, and is calculated simply by counting beats (\(b\)) over a time period (\(t\)) and dividing one by the other (\(b/t\)). However, what time period is appropriate? We might choose 60s, but this raises the question, what is the meaning of heart rate over shorter periods?

Fortunately, there is a standard solution to this problem, which is to define heart rate using differential calculus; so heart rate becomes \(db/dt\).
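In practice, the derivative is estimated by finite differences over measured data; the sketch below, with invented beat timestamps, approximates \(db/dt\) as \(\Delta b/\Delta t\) over successive inter-beat intervals.

    # Approximate an instantaneous heart rate db/dt as delta-b / delta-t,
    # here one beat per inter-beat interval, reported in beats per minute.
    beat_times = [0.0, 0.8, 1.7, 2.5, 3.4]   # seconds; invented data

    def rates(times):
        return [60.0 / (later - earlier)
                for earlier, later in zip(times, times[1:])]

    print(rates(beat_times))   # roughly [75.0, 66.7, 75.0, 66.7] bpm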

The derivative, \(db/dt\), presents some problems from a realist perspective. As noted previously (see "Sequences and the Central Dogma"), it is possible to associate real numbers with entities; however, \(db/dt\) is \(0/0\). It is not clear whether this quantity is a universal; it is certainly the case that the expression \(db/dt\) is not a universal, yet such values, and calculus itself, are a powerful tool within science, and not using them within ontological models is a severe restriction.

We can describe this ontologically in three ways:

  • We can model the real world entities involved – beats, time and describe nothing else.

  • We can describe rate in mathematical terms. In this case, we are defining the heart rate as a mathematical abstraction.

  • We can model the heart rate as a real world entity, \(db/dt\) as a mathematical entity, and explicitly state that \(db/dt\) is a model of heart rate.

These different solutions present different advantages. The first is consistent with realism. The second is consistent with the most common definition used within science. The third is consistent with both, but it is unclear when to use which term (for example, is \(\Delta b/\Delta t\) an approximation of \(db/dt\), a quantification of the real-world quality, or both?).

In most cases for the description of science, the second option makes most sense; conflating the mathematical model with the real entity enables us to use the advantages of two different modelling techniques without introducing the confusion of the third option.

Related Examples

There are many related examples from mechanics, electromagnetics or chemistry; as with value partitions in medicine, so many that they appear to be the norm. All of these subject areas have direct relevance to biology and, perhaps even more so, to the equipment used in the practice of biology.

Mechanical examples would include velocity (\(dr/dt\)) and acceleration (\(d^2r/dt^2\)). Electromagnetic examples would include current (\(dQ/dt\), the rate of change of charge) and differential capacitance (\(dQ/dV\)). Chemistry examples would include rate constants and pH. In biology, population biology, systems biology and the neurosciences make wide use of mathematical models. The lack of a link in realist ontologies to these mathematical models is not free from consequences (described further in the "Discussion").

The more general issue comes not from relating to differential calculus, but from relating to pre-existing non-ontological techniques; for example, taxonomy in the Linnaean sense. There have been many discussions about whether species and higher taxa are reflective of reality; it is certainly the case that a number of higher taxa do not reflect phylogeny [24]. Given that it is of uncertain status, should we represent taxonomy as a quality of an organism, an independent conceptualisation of the biologist, or both?

The limits of consistency

Physical biological entities such as cells and organisms have an extent in the real world. This paper’s first author, for example, has a height of around 1.8m; a similar value cannot be applied meaningfully to the electronic version of this document, although it may apply to the paper that it may be printed on.

There are a number of different, well-understood mechanisms for representing physical space. We can use a dimensional or cartesian model, with three perpendicular axes, each with a linear scale. We can use a polar model, expressing extent using angles and a single distance. Modern physics has told us, however, that all of these are limited models of reality; physics generally uses a four-dimensional Minkowski spacetime model, in which the axes are not independent: motion of the observer along one will change values along the others. Alternatively, at a quantum level, length is a probability distribution.
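The choice among co-ordinate systems is also a choice among interchangeable representations; a two-dimensional sketch of the cartesian and polar descriptions of the same point (the values are arbitrary) makes this plain.

    import math

    def to_polar(x, y):
        """Cartesian -> polar: the same point, a different representation."""
        return math.hypot(x, y), math.atan2(y, x)

    def to_cartesian(r, theta):
        return r * math.cos(theta), r * math.sin(theta)

    x, y = 3.0, 4.0
    r, theta = to_polar(x, y)        # (5.0, ~0.927 radians)
    print(to_cartesian(r, theta))    # back to approximately (3.0, 4.0)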

For the ontology builder, this leaves a difficult choice, and the same choice discussed previously in "Models that represent reality": represent the reality physicists relate; bless one model, ignore the rest; describe their components but not their models; or explicitly describe them.

If the ontology builder is to be consistent, then, they should make the same choice in both cases; if we describe colour models, we should explicitly describe Minkowski spacetime, quantum probability distributions, and cartesian and polar systems.

There are, however, two important differences to the colour models. First, there is a strong social bias toward cartesian systems. Secondly, within the scope of biology and the life sciences, four-dimensional spacetime or quantum models confuse rather than simplify; the relativistic corrections produce such small differences that they are statistically meaningless; similarly, describing a leg as a probability distribution adds little other than complexity.

This leaves the ontology builder with two options:

  1. We can build an ontology with a consistent relationship to reality. So, having decided to explicitly represent colour models, this suggests that we should also explicitly model 3D space, 4D spacetime and the various co-ordinate systems that are used to describe these.

  2. We build an ontology with an inconsistent relationship to reality. So, we might be explicit about colour models, but arbitrarily bless 3 dimensional space, using cartesian co-ordinates.

The compromise here is very straight-forward. The first solution retains its consistency to reality, the second is consistent with usability and usage; for biomedicine, a 3D cartesian co-ordinate system plus time is likely to be enough for the foreseeable future and makes life easier in the meantime.

The Newtonian view of the world is the best model in this case: it is good enough. When building an ontology for biomedicine, it makes most sense to use this view as it will produce the results required. If, in the future, biomedicine advances so that relativistic or quantum representations are necessary, then current ontologies will need refactoring; even then, this future cost is likely to be offset by gains in the present.

Related examples

In the choice of units for measurement for scientific purposes, SI units are to be preferred. It should be noted, here, that there is a domain dependency; for an engineering ontology, the use of American imperial units would be inevitable.

For most of biology it is unnecessary to distinguish between the length of the calendar year and the astronomical year, the latter changing with respect to variability in the motion of the earth. There are occasions when this distinction may be important for data integration in bioinformatics, as leap years and leap seconds show.

An ecologist counting the number of trees in a sampling square 100m by 100m will take the area to be \(10,000m^2\). The surface is, however, neither smooth nor a Euclidean plane, so this area is wrong in reality. For much of ecology, this distinction will not matter. Again, there is a domain dependency here; whale or bird biologists interested in migration patterns may well care about the curvature of the earth.

Discussion

Realism has been held up as a methodology for “good” ontological modelling, and the production of more tightly defined and consistent ontologies. In this paper, we have discussed five different cases, with biological examples, that we might wish to model ontologically; for each, we have presented different models, describing the same underlying science. In each case, a realist solution is possible, but places either limitations or awkwardness on the models produced.

Building an ontology with a consistent relationship to reality may help to enable interoperability [7] under some circumstances. If, however, it disallows modifications for computability (see "The limitations of computers"), or requires the arbitrary blessing of one form of specification over another (see "Models that represent reality"), it may have the opposite effect.

Nor are the issues discussed in this paper free from consequences. In "To go where science has gone before", we discussed interoperability with existing scientific models. Mathematics and physics have produced complex, refined and expressive notation systems, representing a deep understanding of how numbers and the physical world work. These are, however, not being used in current ontologies, and this results in a lack of precision, errors and omissions:

Lack of Precision:

The PATO term speed (PATO:8) is defined as:

“A physical quality inhering in a bearer by virtue of the bearer’s rate of change of position”

with a synonym of velocity; from this definition, we cannot distinguish the vector and scalar quantities of velocity and speed; indeed, it is not clear which of these two speed (PATO:8) is. Meanwhile acceleration (PATO:1028) is defined as:

“… the rate of change of the bearer’s velocity in either speed or direction”

which is implicitly a vector quantity, and contradicts the statement that speed and velocity are synonyms. The mathematical definitions (velocity \(\mathbf{v} = d\mathbf{r}/dt\), speed \(\left|d\mathbf{r}/dt\right|\), acceleration \(\mathbf{a} = d^2\mathbf{r}/dt^2\)) are precise, concise and accurate.
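A standard textbook example (ours, not PATO's) makes the distinction concrete. For uniform circular motion of radius \(R\) and angular frequency \(\omega\),

\[
\mathbf{r}(t) = R(\cos\omega t, \sin\omega t), \qquad
\mathbf{v} = \frac{d\mathbf{r}}{dt} = R\omega(-\sin\omega t, \cos\omega t), \qquad
\mathbf{a} = \frac{d^{2}\mathbf{r}}{dt^{2}} = -\omega^{2}\,\mathbf{r}(t),
\]

so the speed \(\left|\mathbf{v}\right| = R\omega\) is constant while the acceleration has magnitude \(R\omega^{2} \neq 0\); a definition that treats speed and velocity as synonyms cannot express this.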

Errors:

Similarly, length (PATO:122) is defined as a quality; qualities have to inhere in Independent Continuants and, as a Spatial Region is a child of Continuant rather than of Independent Continuant, Spatial Regions cannot bear lengths. In short, in current versions of BFO, there is no intuitive way of modelling the length of a region in space.

Omissions:

BFO is mass-centric; it is currently unclear where many physical entities fit, examples including energy, waves (through a medium) and EM radiation. Likewise, it lacks a natural position for numbers (which have no particulars), patterns and distributions. Yet these entities are key to a physical description of the world.

To our mind, these are indicative of some of the most serious flaws of realism-based ontology building. It makes little sense to replicate the models of physics using English instead of a more precise mathematical notation. If BFO had been built using direct links to a grounded physical model of the world, it seems likely that these problems would not have arisen.

We have discussed a number of concrete examples where building an ontology by considering realist concerns has detrimental consequences for the model. We believe that fidelity to real-world entities and the relationships between them is only one consideration among many: simplicity, usability and fitness for purpose are equally important.

Taken to its most extreme form, realism, it seems to these authors, would produce models unsuitable for use within science. There is a choice between a correct account of reality that does not allow the data of science to be adequately described, and a description of reality that takes into account how science is performed. Fortunately, most “realist” ontologies are not really so: PATO's representation of HSV for modelling colour is not a bad decision; it represents a straightforward, pragmatic approach to ontology building, where the representation has been chosen on the basis of a use case, not the entities as they exist in reality. Similarly, BFO uses a 3D plus time model of reality; it suggests that lengths are properties of the entity alone, without reference to the observer. This is not a true reflection of reality, but one which is a good enough approximation for use within the biomedical sciences; in short, usability and simplicity have been considered more important in the modelling process than the relationship of the model to reality. In accepting these compromises, BFO has placed itself squarely as a computational rather than a philosophical ontology.

Despite these concerns, realism has made a contribution to the field of biomedical ontology engineering. By emphasising the importance of real-world entities, and by encouraging a more specific interpretation than the generalisation of a “conceptualisation”, realism helps to avoid the introduction of unnecessary layers of abstraction. A consideration of the entities in reality may be a part of an ontology engineering process; ontology builders should have careful and considered reasons for diverting from modelling in this way, and ontologies should explicitly describe, through annotations, the terms that do or may divert from this view. Ontology builders should, however, be free to make this decision; the acceptance of compromise with respect to reality will result in simpler and more effective knowledge artefacts.

Johansson [10], when discussing realism, asks the rhetorical question: “would you like to be treated for a physiological illness by a (non-realist) physician who is not sure that there are human bodies?” (our emphasis). As scientists, our reply would be that if their survival and success statistics were the best, we would not care whether they were a realist, a non-realist or a robot which admitted of no philosophical position at all; equally, using a doctor who was strictly realist, and thus cut off from much of the practice of science (such as determining heart rate), would disturb many patients. As bioinformaticians, we build ontologies to provide a descriptive and predictive model of the wealth of experimental data that is now available. In biology, the job of an ontologist is to describe data such that it can be analysed. Naturally this entails a description of entities in reality; it also, however, entails a description of science, and it entails compromise; we overlook this at our peril. The last 200 years of science show the success and strength of this position; it is on this groundwork that we should build for the future.

Bibliography

[1] Ashburner M, Ball C, Blake J, Botstein D, Butler H, et al. (2000) Gene Ontology: a tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.

[2] Stevens R, Lord P (2008) Application of ontologies in bioinformatics. In: Staab S, Studer R, editors, Handbook on Ontologies in Information Systems, Springer. Second edition. URL http://www.cs.man.ac.uk/~stevensr/papers/handbook2.pdf.

[3] Zeeberg B, Feng W, Wang G, Wang M, Fojo A, et al. (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 4: R28.

[4] Wolstencroft K, Lord P, Tabernero L, Brass A, Stevens R (2006) Protein classification using ontology classification. Bioinformatics 22: e530–e538.

[5] Lord PW, Stevens RD, Brass A, Goble CA (2003) Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19: 1275–1283.

[6] Whetzel PL, Parkinson H, Causton HC, Fan L, Fostel J, et al. (2006) The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics 22: 866–873.

[7] Smith B, Ashburner M, Rosse C, Bard J, Bug W, et al. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25: 1251–1255.

[8] OBO Foundry Consortium (2006) OBO Foundry Principles. URL http://obofoundry.org/wiki/index.php/OBO_Foundry_Principles.

[9] OBO Foundry Consortium (2008) OBO Foundry Principles. URL http://obofoundry.org/wiki/index.php/OBO_Foundry_Principles.

[10] Johansson I (2006) Bioinformatics and biological reality. J Biomed Inform 39: 274–287.

[11] Grenon P, Smith B, Goldberg L (2004) Biodynamic ontology: applying BFO in the biomedical domain. Stud Health Technol Inform 102: 20–38.

[12] Smith B (2004) Beyond concepts: ontology as reality representation. In: Formal Ontology in Information Systems: Proceedings of the Third Conference (FOIS-2004). IOS Press, p. 73.

[13] Lord P (2009) An Evolutionary Approach to Function. In: Bio-Ontologies 2009: Knowledge in Biology. URL http://hdl.handle.net/10101/npre.2009.3228.1.

[14] Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, et al. (2005) Relations in biomedical ontologies. Genome Biol 6: R46.

[15] Russell B (1946) A History of Western Philosophy. Routledge.

[16] Merrill G (2010) Ontological realism: methodology or misdirection. Applied Ontology 5: 79–108.

[17] Dumontier M, Hoehndorf R (2010) Realism for scientific ontologies. In: 6th International Conference on Formal Ontology in Information Systems.

[18] Gruber T (1992) What is an ontology? URL http://www-ksl.stanford.edu/kst/what-is-an-ontology.html.

[19] Ceusters W, Smith B (2006) A realism-based approach to the evolution of biomedical ontologies. AMIA Annu Symp Proc: 121–125.

[20] Shrager J (2003) The fiction of function. Bioinformatics 19: 1934–1936.

[21] Seyed AP (2009) BFO/DOLCE Primitive Relation Comparison. In: Bio-Ontologies 2009: Knowledge in Biology.

[22] Rector A (2005) Representing specified values in OWL: “value partitions” and “value sets”. W3C Working Group Note. URL http://www.w3.org/TR/swbp-specified-values/.

[23] Egana M, Rector A, Stevens R, Antezana E (2008) Applying Ontology Design Patterns in Bio-ontologies. Springer Berlin/Heidelberg. pp. 7–16.

[24] Schulz S, Stenzhorn H, Boeker M (2008) The ontology of biological taxa. Bioinformatics 24: i313–i321.