Archive for the ‘Science’ Category


As the quantity of data being deposited into biological databases continues to increase, it becomes ever more vital to develop methods that enable us to understand this data and ensure that the knowledge is correct. It is widely held that data percolates between different databases, which causes particular concerns for data correctness; if this percolation occurs, incorrect data in one database may eventually affect many others while, conversely, corrections in one database may fail to percolate to others. In this paper, we test this widely-held belief by directly looking for sentence reuse both within and between databases. Further, we investigate patterns of how sentences are reused over time. Finally, we consider the limitations of this form of analysis and the implications that this may have for bioinformatics database design. We show that reuse of annotation is common within many different databases, and that there is also a detectable level of reuse between databases. In addition, we show that there are patterns of reuse that have previously been shown to be associated with percolation errors.

  • Michael J Bell
  • Phillip Lord

Plain English Summary

Bioinformaticians store large amounts of data about proteins in their databases; we call this data annotation. This annotation is often repetitive; this happens because a database might store information about proteins from different organisms, and these organisms have very similar proteins. Additionally, there are many databases which store different but related information, and these often have repetitive information.

We have previously looked at this repetitiveness within one database, and shown that it can lead to problems where one copy will be updated but another will not. We can detect this by looking for certain patterns of reuse.

In this paper, we explicitly study the repetition between databases; in some cases, databases are extremely repetitive, with fewer than 1% of their sentences being original. Moreover, we can detect text that is shared between databases and find the same patterns in it that we previously used to detect errors.

This paper opens up new possibilities using bulk data analysis to help improve the quality of knowledge in these databases.
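As a rough illustration of the sentence-level comparison described in this summary, the following is a minimal sketch, not the method used in the paper. It assumes that each database is a simple mapping from entry identifier to free text, and it uses a deliberately naive sentence splitter; the entry identifiers and annotation text are invented for the example.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def sentence_index(database):
    """Map each distinct sentence to the set of entry ids that contain it."""
    index = defaultdict(set)
    for entry_id, text in database.items():
        for sentence in split_sentences(text):
            index[sentence].add(entry_id)
    return index


def reuse_summary(db_a, db_b):
    """Return (sentences reused within db_a, sentences shared by both databases)."""
    index_a = sentence_index(db_a)
    index_b = sentence_index(db_b)
    reused_within_a = {s for s, ids in index_a.items() if len(ids) > 1}
    shared_between = set(index_a) & set(index_b)
    return reused_within_a, shared_between


# Hypothetical toy data; real annotation would be far larger and messier.
db_a = {
    "A1": "Binds ATP. Involved in cell division.",
    "A2": "Binds ATP. Located in the membrane.",
}
db_b = {
    "B1": "Involved in cell division. Expressed in liver.",
}

within_a, between = reuse_summary(db_a, db_b)
print(sorted(within_a))
print(sorted(between))
```

Exact string matching is the simplest possible notion of reuse; any real analysis would need to decide how to treat near-duplicates, punctuation and case.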


Bio-medical ontologies can contain a large number of concepts. Often many of these concepts are very similar to each other, and similar or identical to concepts found in other bio-medical databases. This presents both a challenge and opportunity: maintaining many similar concepts is tedious and fastidious work, which could be substantially reduced if the data could be derived from pre-existing knowledge sources. In this paper, we describe how we have achieved this for an ontology of the mitochondria using our novel ontology development environment, the Tawny-OWL library.

  • Jennifer D. Warrender
  • Phillip Lord

Plain English Summary

Ontologies allow complex descriptions of the world in a way that is both precise and computationally amenable; that is, computers can be used to check and query these descriptions. The mitochondrion is a critical part of the cells of most organisms, being responsible for energy usage. We wished to build an ontology describing the current research on the mitochondria.

The more traditional approach would have been to build the ontology from scratch, but many parts of the mitochondria, including the genes and proteins, have already been described in other databases. Building from scratch on the basis of the data in these databases would be time-consuming, and also sensitive to change: if a database changes, our ontology would need updating too.

Instead, we have used our new ontology development methodology to automatically extract this knowledge and build the ontology for us, providing what we describe as the scaffold for an ontology. In future, we will add more knowledge to this ontology, slowly building up the rich description of the mitochondrion that we are aiming for.


Ontology development relates to software development in that they both involve the production of formal computational knowledge. It is possible, therefore, that some of the techniques used in software engineering could also be used for ontologies; for example, in software engineering testing is a well-established process, and part of many different methodologies. The application of testing to ontologies, therefore, seems attractive. The Karyotype Ontology is developed using the novel Tawny-OWL library. This provides a fully programmatic environment for ontology development, which includes a complete test harness. In this paper, we describe how we have used this harness to build an extensive series of tests, and how we have used a commodity continuous integration system to link testing deeply into our development process; this environment is applicable to any OWL ontology, whether written using Tawny-OWL or not. Moreover, we present a novel analysis of our tests, introducing a new classification of what our different tests are. For each class of test, we describe why we use these tests, also by comparison to software tests. We believe that this systematic comparison between ontology and software development will help us move to a more agile form of ontology development.

  • Jennifer D. Warrender
  • Phillip Lord

Plain English Summary

Ontologies are a mechanism for representing parts of the world computationally. They allow you to describe the world in a complex way, and then query over it repeatably and consistently. However, ontologies are complex and are themselves hard to build consistently and repeatably. If the ontology is built incorrectly, then queries will also give the wrong answers.

Software is also complex and over the years, software engineers have developed many techniques for building software so that it, too, is correct. While these do not always succeed, they have allowed us to produce software that is vastly more complex than in years past. One important technique is automated testing. Here software can be run to ensure that it is behaving correctly automatically and often. To do this, we use one piece of software to test another.

We have borrowed the same technology for use with ontologies; while this has been done before, our use of commodity testing software has allowed us to scale up the tests significantly, and we describe this approach in this paper. However, while they have many similarities, ontologies are not software. The sort of tests that we need for ontologies may be different from those that we need for software. In this paper, we also describe the kinds of tests that we have used for the karyotype ontology (1305.3758), and which are probably relevant to other ontology development efforts too.

Overall, this should increase our understanding of how to build ontology tests and ontologies.
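To make the idea of using one piece of software to test another concrete, here is a minimal sketch using Python's standard `unittest` module. It is not Tawny-OWL and not the test suite described in the paper: the toy "ontology" is just a dictionary of direct superclasses, and the class names and the disjointness statement are hypothetical, chosen only to resemble the kind of thing a karyotype ontology might contain.

```python
import unittest

# Hypothetical toy "ontology": each class maps to its direct superclasses.
SUPERCLASSES = {
    "Chromosome": set(),
    "HumanChromosome": {"Chromosome"},
    "HumanAutosome": {"HumanChromosome"},
    "HumanSexChromosome": {"HumanChromosome"},
}

# Hypothetical disjointness statements: no class may sit under both members.
DISJOINT = [("HumanAutosome", "HumanSexChromosome")]


def ancestors(cls):
    """All superclasses of cls, direct and inferred by transitivity."""
    found = set()
    stack = list(SUPERCLASSES[cls])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(SUPERCLASSES[parent])
    return found


def is_subclass(child, parent):
    return parent in ancestors(child)


class OntologyTests(unittest.TestCase):
    def test_inferred_subsumption(self):
        # Something we expect to follow from the asserted hierarchy.
        self.assertTrue(is_subclass("HumanAutosome", "Chromosome"))

    def test_no_unexpected_subsumption(self):
        # Something that must not follow from it.
        self.assertFalse(is_subclass("Chromosome", "HumanAutosome"))

    def test_disjoint_classes_share_no_subclass(self):
        for a, b in DISJOINT:
            for cls in SUPERCLASSES:
                self.assertFalse(is_subclass(cls, a) and is_subclass(cls, b))


# Run the tests programmatically so the result can be inspected afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(OntologyTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A real OWL ontology would be checked with a reasoner rather than a hand-written traversal; the point here is only that expectations about the ontology can be written down once and re-run automatically, for instance by a continuous integration system.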


I was entertained to see the recent publication of a new paper on the definition of function (10.1186/2041-1480-5-27). I met one of the authors at a meeting a few years back in Durham, and had a very nice discussion about my own contribution to this definition which I published previously (1309.5984).

I do not want to discuss the paper in full, which is a nice paper and worth a read. I do however want to comment more specifically about the parts that explicitly and implicitly address my own paper.

At the start of the paper, the authors discuss the criteria for their definition which includes this:

Avoidance of epiphenomenalism: Functions should be determined by current performance of its bearer, not mainly by causally inert historical facts like its (evolutionary or cultural) history or a mere ascription by its producers, users, or observers

I found this a fairly strange criterion; it’s not clear to me why historical facts are inert; especially in biology, the evolutionary history of an organism is surely one of the most important features. Originally, this criterion comes from another paper by Artiga who says:

We want to find out what is the lung’s function, we would probably look at what lungs actually do in our body. We would see that they enable respiration, so we would conclude that this is their function. Why they came to be here seems completely irrelevant for function attribution.

Obviously, this means “most people’s” bodies rather than just one, given that lungs do (somewhat) different things in different people. But I do not think that why they came to be here is irrelevant, at least not if we wish to distinguish a function from a role. My fingers are currently engaged in typing, but few people would describe this as a function (although most would say that precise and controlled manipulation of the world is). Or, to take a more extreme example, after Robert Hoehndorf, the heart actually does produce loud thumping noises. Surely not a function?

I am also slightly disappointed that what I think is one of the key points of my own function paper has been missed from their list of criteria. In it, I say:

I consider whether these definitions are applicable; for a given set of entities how do we decide whether we have a function (of either subclass) or a role.

Given a definition, I should be able to produce at least one practical test that I can use to determine whether that definition holds; I think that this notion of applicability needs to be more widely considered.

Now, my actual definition of biological function was:

A biological function is a realizable entity that inheres in a continuant which is realized in an activity, and where the homologous structure(s) of individuals of closely related and the same species bear this same biological function.

The language has been chosen to mirror BFO since it was in this context that the paper was addressed; I think it could be simplified and made more readable, but I was constrained by the language of BFO. Now, the first criticism of my definition is on technical grounds, namely:

Lord claims that his definition is recursive rather than circular, despite the occurrence of the word “function” in the definiens.

My use of this form of definition was, of course, deliberate and partly provocative; perhaps, it is something that I should not have done, since it has muddied the water somewhat as this comment shows. In fact, it is very easy to work around this criticism by simply removing the recursion:

A biological function is a …. same species bear this same realizable entity.

The technical criticism has now gone. But I do not like the definition as much because “the same realizable entity” would in fact be a biological function. I think we avoid recursive definitions because they can be circular, but this is like avoiding recursive function calls because they may not terminate. And that is a shame, because, as with recursive function calls, I think this form of definition can be quite succinct. Consider:

A spouse is a person who is married to their spouse.


A brother is a man with the same parents as their brother.

If we unwind the recursion, then we get

A brother is a man with the same parents as another man.

Again, we are hiding the reality that both men in this definition are brothers.

Of course, some recursive definitions might actually be circular, and that is less good. But if the applicability of a function is also considered then this issue goes away. I can determine if someone is a spouse or a brother given these definitions, so I see no problem.
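The applicability argument can be sketched in code. The following is a minimal illustration, assuming a small, invented table of people with their sex and parents; the names and the data layout are hypothetical. The test implements the unwound form of the brother definition, so it terminates even though the definition it stands for mentions "brother" in its own definiens.

```python
# Hypothetical records: name -> (sex, frozenset of parents). Illustrative only.
PEOPLE = {
    "Alan": ("male", frozenset({"Mary", "John"})),
    "Bob": ("male", frozenset({"Mary", "John"})),
    "Carol": ("female", frozenset({"Mary", "John"})),
    "Dave": ("male", frozenset({"Sue", "Tom"})),
}


def is_brother(name):
    """Practical test: a man who has the same parents as some other man."""
    sex, parents = PEOPLE[name]
    if sex != "male":
        return False
    for other, (other_sex, other_parents) in PEOPLE.items():
        if other != name and other_sex == "male" and other_parents == parents:
            return True
    return False


brothers = sorted(name for name in PEOPLE if is_brother(name))
print(brothers)  # ['Alan', 'Bob']
```

Every man for whom the test succeeds is found together with a second man who also passes it, which is exactly the fact that the unwound wording hides.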

A second criticism comes from my statement that:

Hence he concludes that among the instances of realizables that are realizables for the same type of process can be both roles and functions depending on the species the realizable’s bearer belongs to. This presents a problem for the distinction between functions and roles.

I do not think that this is a problem at all, because I say quite clearly that we can distinguish between roles and functions, but that we do this for the individual role or function not at a class level:

My definition distinguishes between the two based on the nature of the relationship to the independent continuant in which they inhere. I suggest that it is very hard to make the distinction at the class level[…]. For an individual continuant bearing a realizable entity, this distinction appears to be much more straightforward.

In other words, “for walking on” is either a role or a function. But in human hands it is a role, while for chimps it is a function. I see no reason why the distinction at the level of the individual should be considered to be less relevant than at the class, nor why this should be problematic. Actually, it reduces the need for duplication between the role and function hierarchies; while tools like Tawny-OWL (1303.0213) may ease the maintenance of duplication, avoiding it altogether still seems sensible.

The final criticism is, I think, the least worrisome. The authors say:

Had evolution stopped after the first species, according to Lord’s definition, there would not have been any biological function at all.

The slightly flippant but nonetheless entirely valid response to this is, “but it didn’t”. We could equally argue against a definition of human as having two hands on the basis that they might have evolved a third.

More importantly, though, in most definitions of life the ability to adapt or evolve is part of the definition. Without this, we have a chemical process. So, without evolution, we have no life. Given this, we can rewrite the last statement as:

Had life stopped after the first species, there would not have been any biological function at all.

Which is an entirely true statement; that it drops so nicely out of my definition for biological function is a strength of my definition and not a weakness.

I feel that my definition is still a good one. Rereading my function paper now, the argument still seems coherent, and the examples clear. Although I put an entire section on applicability into the function paper, I do rather regret that I did not introduce it explicitly as a general criterion for all ontology definitions; that this criterion has been missed is surely my fault and not the readers’. Perhaps I should have spent more time on that than on my recursive definition, which was not critical to the paper.

At the same time, the fact that discussions on definitions are still going on, for a term that biologists have been using for many years, again leads me back to the conclusion that the definitions of such generic terms are not nearly as important as some make out. So long as they are useful, biologists will carry on describing things as functions if it fits the ad-hoc, informal definitions that have been developed over time within their community. I cannot help but think that this is a good thing.


I’ve noted in the past some of the strange beliefs about DOIs. One of these is that DOIs provide some magic archiving capability. The other is the strange one that “DOIs make things citable”. This was one of the selling points for Figshare, for instance.

I’m interested to see that GitHub have now joined the party, again using the justification that “DOIs make things citable”. I am lost in attempting to understand this.

First, GitHub have stable URIs for repositories. It’s in their business interests to keep these and if they change them they will break every single repository that has checked things out using the URI.

Second, if I have a github URI I actually know that I have a link to a repository, and it is fairly clear that I can clone from this repository. With a DOI I do not. Paper, datacite item, git repo: it is not possible to tell.

Third, with a github URI I have a URI that I can compare against other URIs and work out whether it is the same or different. If I have a DOI, I now have two identifiers, the DOI and the URI, both of which identify the same thing. Surely, this makes the situation worse, and not better.

Am I being a little cynical in wondering why some publishers require them? Do they, perhaps, have a vested interest in making things more convoluted and not just using standard web technology?

It seems to me like a clear case of DOIs are magical fairy dust. We sprinkle them on a github repository and now it is better, when actually we have made the situation worse.

The only justification that we have is “DOIs make it citable”. Is there a better one? Answers on a post-card please.


I totally missed a post by Carl Boettiger which makes some of the same points.

On the general issue of metadata, a DOI will give some harvestable metadata from the DOI, although Greycite can give much of the same metadata direct from GitHub (see for instance here). Having GitHub fix their metadata would seem to me to have been an easier win. And, of course, github URIs can be used to clone from and extract all the repository metadata using, well, git.



A constant influx of new data poses a challenge in keeping the annotation in biological databases current. Most biological databases contain significant quantities of textual annotation, which often contains the richest source of knowledge. Many databases reuse existing knowledge; during the curation process, annotations are often propagated between entries. However, this is often not made explicit. Therefore, it can be hard, potentially impossible, for a reader to identify where an annotation originated from. Within this work we attempt to identify annotation provenance and track its subsequent propagation. Specifically, we exploit annotation reuse within the UniProt Knowledgebase (UniProtKB), at the level of individual sentences. We describe a visualisation approach for the provenance and propagation of sentences in UniProtKB which enables a large-scale statistical analysis. Initially, levels of sentence reuse within UniProtKB were analysed, showing that reuse is heavily prevalent, which enables the tracking of provenance and propagation. By analysing sentences throughout UniProtKB, a number of interesting propagation patterns were identified, covering over 100,000 sentences. Over 8000 sentences remain in the database after they have been removed from the entries where they originally occurred. Analysing a subset of these sentences suggests that approximately 30% are erroneous, whilst 35% appear to be inconsistent. These results suggest that being able to visualise sentence propagation and provenance can aid in the determination of the accuracy and quality of textual annotation. Source code and supplementary data are available from the authors’ website.

  • Michael J. Bell
  • Matthew Collison
  • Phillip Lord

Plain English Summary

Many database resources describing biological entities, such as proteins and genes, are available to the researcher. These are used by both biologists and medics to understand how biological systems work, which has implications for many areas. These databases store information of various sorts, called annotation: some of this is highly organised or structured knowledge; some is free text, written in English.

The quantity of this material available means that having a computational method to check the annotation is desirable. The structured knowledge is easier to check because it is organised. The free text knowledge is much harder.

Most methods of analysing free text are based around “normal” English; biological annotation uses a highly specialised form of English, heavily controlled and with many jargon words. In this paper, we exploit this specialised form to infer provenance, to understand when sentences were first added to the database, and how they change over time. By analysing these patterns of provenance, we were able to identify patterns which are indicative of inconsistency or erroneous annotation.
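One of the patterns mentioned in the abstract, a sentence that remains in the database after it has been removed from the entry where it originally occurred, can be illustrated with a minimal sketch. This is not the analysis code from the paper: it assumes that each database release is a mapping from entry identifier to a set of sentences, and the entries and sentences below are invented.

```python
def first_appearances(versions):
    """For each sentence, record (version index, entries) where it first appeared."""
    origin = {}
    for version_number, snapshot in enumerate(versions):
        seen_now = {}
        for entry_id, sentences in snapshot.items():
            for sentence in sentences:
                seen_now.setdefault(sentence, set()).add(entry_id)
        for sentence, entries in seen_now.items():
            # setdefault keeps the earliest record and ignores later ones.
            origin.setdefault(sentence, (version_number, entries))
    return origin


def orphaned_sentences(versions):
    """Sentences in the latest version that none of their originating entries still hold."""
    origin = first_appearances(versions)
    latest = versions[-1]
    orphans = {}
    for entry_id, sentences in latest.items():
        for sentence in sentences:
            _, origin_entries = origin[sentence]
            still_at_origin = any(
                sentence in latest.get(o, set()) for o in origin_entries
            )
            if not still_at_origin:
                orphans.setdefault(sentence, set()).add(entry_id)
    return orphans


# Hypothetical release history: entry id -> set of annotation sentences.
versions = [
    {"P1": {"Binds zinc."}},
    {"P1": {"Binds zinc."}, "P2": {"Binds zinc."}},
    {"P1": {"Binds iron."}, "P2": {"Binds zinc."}},
]

orphans = orphaned_sentences(versions)
print(orphans)  # {'Binds zinc.': {'P2'}}
```

In this toy history the sentence was copied from P1 to P2 and later replaced in P1 only, so P2 may be carrying a stale copy; whether it is actually wrong still needs a human judgement.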