Archive for the ‘Papers’ Category


As the quantity of data being depositing into biological databases continues to increase, it becomes ever more vital to develop methods that enable us to understand this data and ensure that the knowledge is correct. It is widely-held that data percolates between different databases, which causes particular concerns for data correctness; if this percolation occurs, incorrect data in one database may eventually affect many others while, conversely, corrections in one database may fail to percolate to others. In this paper, we test this widely-held belief by directly looking for sentence reuse both within and between databases. Further, we investigate patterns of how sentences are reused over time. Finally, we consider the limitations of this form of analysis and the implications that this may have for bioinformatics database design. We show that reuse of annotation is common within many different databases, and that also there is a detectable level of reuse between databases. In addition, we show that there are patterns of reuse that have previously been shown to be associated with percolation errors.

  • Michael J Bell
  • Phillip Lord

Plain English Summary

Bioinformaticians store large amounts of data about proteins in their databases which we call annotation. This annotation is often repetitive; this happens a database might store information about proteins from different organisms and these organisms have very similar proteins. Additionally, there are many databases which store different but related information and these often have repetitive information.

We have previously look at this repetitiveness within one database, and shown that it can lead to problems where one copy will be updated but another will not. We can detect this by looking for certain patterns of reuse.

In this paper, we explictly study the repetition between databases; in some cases, databases are extremely repetitive containing less than 1% of original sentences. More over, we can detect text that is shared between databases and find the same patterns in these that we previously used to detect errors.

This paper opens up new possibilities using bulk data analysis to help improve the quality of knowledge in these databases.


Bio-medical ontologies can contain a large number of concepts. Often many of these concepts are very similar to each other, and similar or identical to concepts found in other bio-medical databases. This presents both a challenge and opportunity: maintaining many similar concepts is tedious and fastidious work, which could be substantially reduced if the data could be derived from pre-existing knowledge sources. In this paper, we describe how we have achieved this for an ontology of the mitochondria using our novel ontology development environment, the Tawny-OWL library.

  • Jennifer D. Warrender
  • Phillip Lord

Plain English Summary

Ontologies allow complex descriptions of the world in a way that is both precise and computationally amenable — that is, computers can be used to check and query these descriptions. The mitochondria is a critical part of the cells of most organisms, being responsible for energy usage. We wished to build an ontology describing the current research on the mitochondria.

The more traditional approach to this, would have been to build the ontology from scratch; but many parts of the mitochondria, including the genes and proteins have already been described in other databases. Building from scratch on the basis of the data in these databases would be time-consuming, but also sensitive to change — if the database changes, our ontology would need updating too.

Instead we have used our new ontology development methodology to automatically extract this knowledge, and build the ontology for us providing what we describe as the scaffold for an ontology. In future, we will add more knowledge to this ontology, slowing building up the rich description of the mitochondrion that we are aiming for.


Ontology development relates to software development in that they both involve the production of formal computational knowledge. It is possible, therefore, that some of the techniques used in software engineering could also be used for ontologies; for example, in software engineering testing is a well-established process, and part of many different methodologies. The application of testing to ontologies, therefore, seems attractive. The Karyotype Ontology is developed using the novel Tawny-OWL library. This provides a fully programmatic environment for ontology development, which includes a complete test harness. In this paper, we describe how we have used this harness to build an extensive series of tests as well as used a commodity continuous integration system to link testing deeply into our development process; this environment, is applicable to any OWL ontology whether written using Tawny-OWL or not. Moreover, we present a novel analysis of our tests, introducing a new classification of what our different tests are. For each class of test, we describe why we use these tests, also by comparison to software tests. We believe that this systematic comparison between ontology and software development will help us move to a more agile form of ontology development.

  • Jennifer D. Warrender
  • Phillip Lord

Plain English Summary

Ontologies are a mechanism for representing parts of the world computationally. They allow you to describe the world in a complex way, and then query over it repeatable and consistently. However, ontologies are complex and are themselves hard to build consistently and repeatably. If the ontology is built incorrectly, then queries will give the wrong answers also.

Software is also complex and over the years, software engineers have developed many techniques for building software so that it, too, is correct. While these do not always succeed, they have allowed us to produce software that is vastly more complex than in years past. One important technique is automated testing. Here software can be run to ensure that it is behaving correctly automatically and often. To do this, we use one piece of software to test another.

We have borrowed the same technology for use with ontologies; while this has been done before, our use of commodity testing software has allowed us to scale up the tests significantly, and we describe this approach in this paper. However, while they have many similarities, ontologies are not software. The sort of tests that we need for ontologies may be different from those that we need for software. In this paper, we also describe the kinds of tests that we have used for the karyotype ontology (1305.3758), and which are probably relevant to other ontology development efforts too.

Overall, this should increase our understanding of how to build ontology tests and ontologies.



A constant influx of new data poses a challenge in keeping the annotation in biological databases current. Most biological databases contain significant quantities of textual annotation, which often contains the richest source of knowledge. Many databases reuse existing knowledge, during the curation process annotations are often propagated between entries. However, this is often not made explicit. Therefore, it can be hard, potentially impossible, for a reader to identify where an annotation originated from. Within this work we attempt to identify annotation provenance and track its subsequent propagation. Specifically, we exploit annotation reuse within the UniProt Knowledgebase (UniProtKB), at the level of individual sentences. We describe a visualisation approach for the provenance and propagation of sentences in UniProtKB which enables a large-scale statistical analysis. Initially levels of sentence reuse within UniProtKB were analysed, showing that reuse is heavily prevalent, which enables the tracking of provenance and propagation. By analysing sentences throughout UniProtKB, a number of interesting propagation patterns were identified, covering over 100, 000 sentences. Over 8000 sentences remain in the database after they have been removed from the entries where they originally occurred. Analysing a subset of these sentences suggest that approximately 30% are erroneous, whilst 35% appear to be inconsistent. These results suggest that being able to visualise sentence propagation and provenance can aid in the determination of the accuracy and quality of textual annotation. Source code and supplementary data are available from the authors website.

  • Michael J. Bell
  • Matthew Collison
  • Phillip Lord

Plain English Summary

There are many database resources which describe biological entities such as proteins, and genes available to the researcher. These are used by both biologists and medics to understand how biological systems work which has implications for many areas. These databases store information of various sorts, called annotation: some of this is highly organised or structured knowledge; some is free text, written in English.

The quantity of this material available means that having a computation method to check the annotation is desirable. The structured knowledge is easier to check because it is organised. The free text knowledge is much harder.

Most methods of analysing free text are based around “normal” English; biological annotation uses a highly specialised form of English, heavily controlled and with many jargon words. In this paper, we exploit this specialised form to infer provenance, to understand when sentences were first added to the database, and how they change over time. By analysing these patterns of provenance, we were able to identify patterns which are indicative of inconsistency or erroneous annotation.


Semantic publishing can enable richer documents with clearer, computationally interpretable properties. For this vision to become reality, however, authors must benefit from this process, so that they are incentivised to add these semantics. Moreover, the publication process that generates final content must allow and enable this semantic content. Here we focus on author-led or "grey" literature, which uses a convenient and simple publication pipeline. We describe how we have used metadata in articles to enable richer referencing of these articles and how we have customised the addition of these semantics to articles. Finally, we describe how we use the same semantics to aid in digital preservation and non-repudiability of research articles.

  • Phillip Lord
  • Lindsay Marshall

Plain English Summary

Academic literature makes heavy of references; effectively links to other, previous work that supports, or contradicts the current work. This referencing is still largely textual, rather than using a hyperlink as is common on the web. As well as being time consuming for the author, it also difficult to extract the references computationally, as the references are formatted in many different ways.

Previously, we have described a system which works with identifiers such as ArXiv IDs (used to reference this article above!), PubMed IDs and DOIs. With this system, called kcite, the author supplies the ID, and kcite generates the reference list, leaving the ID underneath which is easy to extract computationally. The data used to generate the reference comes from specialised bibliographic servers.

In this paper, we describe two new systems. The first, called Greycite, provides similiar bibliographic data for any URL; it is extracted from the URL itself, using a wide variety of markup and some ad-hoc tricks, which the paper describes. As a result it works on many web pages (we predict about 1% of the total web, or a much higher percentage of “interesting” websites). Our second system, kblog-metadata, provides a flexible system for generating this data. Finally, we discuss ways in which the same metadata can be used for digitial preservation, by helping to track articles as and when they move across the web.

This paper was first written for the Sepublica 2013 workshop.


The Tawny-OWL library provides a fully-programmatic environment for ontology building; it enables the use of a rich set of tools for ontology development, by recasting development as a form of programming. It is built in Clojure - a modern Lisp dialect, and is backed by the OWL API. Used simply, it has a similar syntax to OWL Manchester syntax, but it provides arbitrary extensibility and abstraction. It builds on existing facilities for Clojure, which provides a rich and modern programming tool chain, for versioning, distributed development, build, testing and continuous integration. In this paper, we describe the library, this environment and the its potential implications for the ontology development process.

  • Phillip Lord

Plain English Summary

In this paper, I describe some new software, called Tawny-OWL, that addresses the issue of building ontologies. An ontology is a formal hierarchy, which can be used to describe different parts of the world, including biology which is my main interest.

Building ontologies in any form is hard, but many ontologies are repetitive, having many similar terms. Current ontology building tools tend to require a significant amount of manual intervention. Rather than look to creating new tools, Tawny-OWL is a library written in full programming language, which helps to redefine the problem of ontology building to one of programming. Instead of building new ontology tools, the hope is that Tawny-OWL will enable ontology builders to just use existing tools that are designed for general purpose programming. As there are many more people involved in general programming, many tools already exist and are very advanced.

This is the first paper on the topic, although it has been discussed before here.

This paper was written for the OWLED workshop in 2013.


Reviews are posted here with the kind permission of the reviewers. Reviewers are identified or remain anonymous (also to myself) at their option. Copyright of the review remains with the reviewer and is not subject to the overall blog license. Reviews may not relate to the latest version of this paper.

Review 1

The given paper is a solid presentation of a system for supporting the development of ontologies – and therefore not really a scientific/research paper.

It describes Tawny OWL in a sufficiently comprehensive and detailed fashion to understand both the rationale behind as well as the functioning of that system. The text itself is well written and also well structured. Further, the combination of the descriptive text in conjunction with the given (code) examples make the different functionality highlights of Tawny OWL very easy to grasp and appraise.

As another big plus of this paper, I see the availability of all source code which supports the fact that the system is indeed actually available – instead of being just another description of a “hidden” research system.

The possibility to integrate Tawny OWL in a common (programming) environment, the abstraction level support, the modularity and the testing “framework” along with its straightforward syntax make it indeed very appealing and sophisticated.

But the just said comes with a little warning: My above judgment (especially the last comment) are highly biased by the fact that I am also a software developer. And thus I do not know how much the above would apply to non-programmers as well.

And along with the above warning, I actually see a (more global) problem with the proposed approach to ontology development: The mentioned “waterfall methodologies” are still most often used for creating ontologies (at least in the field of biomedical ontologies) and thus I wonder how much programmatic approaches, as implemented by Tawny OWL, will be adapted in the future. Or in which way they might get somehow integrated in those methodologies.

Review 2

This review is by Bijan Parsia.

This paper presents a toolkit for OWL manipulation based on Clojure. The library is interesting enough, although hardly innovative. The paper definitely oversells it while neglecting details of interest (e.g., size, facilities, etc.). It also neglects relevant related work, Thea-OWL, InfixOWL, even KRSS, KIF, SXML, etc.

I would like to seem some discussion of the challenges of making an effect DSL for OWL esp. when you incorporate higher abstractions. For example, how do I check that a generative function for a set of axioms will always generate an OWL DL ontology? (That seems to be the biggest programming language theoretic challenge.)

Some of the dicussion is rather cavalier as well, e.g.,

“Alternatively, the ContentCVS system does support oine concurrent mod-ication. It uses the notion of structural equivalence for comparison and resolution of conflicts[4]; the authors argue that an ontology is a set of axioms. However, as the named suggests, their versioning system mirrors the capabilitiesof CVS { a client-server based system, which is now considered archaic.”

I mean, the interesting part of ContentCVS is the diffing algorithm (note that there’s a growing literature on diff in OWL). This paper focuses on the inessential aspect (i.e., really riffing off the name) and ignores the essential (i.e., what does diff mean). Worse, to the degree that it does focus on that, it only focuses on the set like nature of OWL according to the structural spec. The challenges of diffing OWL (e.g., if I delete an axiom have I actually deleted it) are ignored.

Finally, the structural specification defines an API for OWL. It would be nice to see a comparison and/or critique.