Archive for the ‘Ontology’ Category

In this post, I will describe what I call connection points and explain how they can be used to enable modularity and overcome problems with scalability of reasoning in OWL.

One of the recurrent problems with building ontologies is mission creep; what starts simple rapidly expands until many different areas of the world are described.

I faced this problem recently, when I was asked about the axiomatisation that I described in my paper about function (1309.5984). Well, the axiomatisation exists, but it was never very complete; so, I thought I should redo it, probably with Tawny-OWL (http://www.russet.org.uk/blog/2366).

To start off with a simple declaration of function, we might choose something like this:

(defclass Function
  :subclass (only realisedIn nProcess))

Or, in rough English, a function is something that displays itself only when involved in a process (the n in nProcess is to avoid a name clash). Now, immediately, we hit the mission-creep problem. Traditionally, functions have been considered to be some strain of continuant, and so it might be expected that we would only need to describe classes that are continuants to define a function. And, yet, straight away, we have a process. To make this definition meaningful, we need to distinguish between processes and everything else, and pretty quickly, our ontology of function requires most of an upper ontology.

This has important consequences. First, if the upper ontology in use is any size at all, or alternatively has a complex axiomatisation, then immediately a lot of axioms have to be reasoned over, and this can take considerable time.

Second, and probably more importantly, the choice of an upper ontology can be quite divisive. We have argued that a single representation for knowledge is neither plausible nor desirable (http://www.russet.org.uk/blog/1713) — this limits the ability to abstract, meaning that all of the complexity has to be dealt with all of the time; in essence, an extreme example of mission creep. If, for example, BFO is used, then the representation of entities whose existence we are unsure about becomes difficult. Conversely, if SIO is used, uncertain objects come regardless.

In the rest of this post, I will describe how we can use the OWL import mechanism to define what I term connection points to work around this problem.


Identifiers and Imports

One of the interesting things about OWL is that, as a web based system, it uses global identifiers in the form of IRIs (or URIs, or URLs, as you wish); I can make statements about your concepts, you can make statements about mine. However, not all OWL ontologies share the same axiom space; this is controlled explicitly, through the OWL import mechanism. In short, while you can make statements about my ontology, I do not have to listen. The practical upshot of this is that it is possible to share identifiers between two ontologies without sharing any axioms, or to share axioms in one direction only.
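The distinction between sharing identifiers and sharing axioms can be made concrete with a toy model. The sketch below is plain Clojure and uses no OWL library: ontologies are maps, axioms are vectors, and the ontology names and terms (:yours, :ex/Cat and so on) are invented for illustration. It also assumes that imports contain no cycles.

```clojure
(require '[clojure.set :as set])

;; Toy model, not an OWL API: each ontology has a set of axioms and a set
;; of imports. All names here are hypothetical.
(def ontologies
  {:yours     {:axioms  #{[:subclass-of :ex/Cat :ex/Animal]}
               :imports #{}}
   ;; Uses the identifier :ex/Cat, but does not import :yours.
   :mine      {:axioms  #{[:subclass-of :ex/Kitten :ex/Cat]}
               :imports #{}}
   ;; Same axiom, but now the upstream axioms are imported as well.
   :importing {:axioms  #{[:subclass-of :ex/Kitten :ex/Cat]}
               :imports #{:yours}}})

(defn visible-axioms
  "Axioms of the ontology `ont` plus those of everything it imports,
  transitively. Assumes the import graph has no cycles."
  [ontologies ont]
  (let [{:keys [axioms imports]} (get ontologies ont)]
    (reduce set/union
            axioms
            (map #(visible-axioms ontologies %) imports))))

(count (visible-axioms ontologies :mine))      ; one axiom: nothing is known about :ex/Cat
(count (visible-axioms ontologies :importing)) ; two axioms: :ex/Cat is now a kind of :ex/Animal
```

Both :mine and :importing talk about the same :ex/Cat identifier; only the import decides whether the upstream statement about it is in scope.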

One nice use of this is with a little upper ontology that I built mostly to try out Tawny, called tawny.upper. This comes in two forms, one in EL profile, and one in DL; the latter has more semantics but is slower to reason over. The DL version imports the EL version but, unusually, introduces no new identifiers at all, it just refines the terms in the EL version with the desired additional semantics. Downstream users can switch between EL and DL semantics by simply adding or removing an OWL import statement.


Alternative forms of import

The ability to share identifiers but not axioms has been used by others, as it provides a partial solution to the problem of big imports. MIREOT (http://precedings.nature.com/documents/3576/version/1), for example, defines an alternative import mechanism. MIREOT is described as a minimal information standard (http://precedings.nature.com/documents/3574/version/1); in this it is rather simple, as the minimal information required to reference (identify) an ontology term is its identifier and that of its ontology. In practice MIREOT is a set of tools that, at its simplest, involves sharing just the identifier and not the semantics. This can help to reduce the size of an ontology significantly.

An extreme use-case for this would be in our karyotype ontology (1305.3758); if we wished “human” to refer to the NCBI taxonomy, we could import 100,000s of classes to use one, increasing the size of the ontology by several orders of magnitude in the process. One solution is to just use the identifier and not owl import the NCBI taxonomy.

However, this causes two problems. First, following our example we can no longer infer that, for example, a Human karyotype is a Mammalian karyotype; these semantics are present only in the NCBI taxonomy, and we must import its semantics if we wish to know this; similarly, we would be free to state that, for example, a human karyotype was also a fly karyotype. The second problem is that, in tools like Protege, the terms become unidentifiable, because the rdfs:label for the term has not been imported, and the NCBI taxonomy uses numeric identifiers.

The MIREOT solution is to extract a subset of the axioms in the upstream ontology, and then import these; obvious subsets would be all the labels of terms used in a downstream ontology, although MIREOT uses a slightly more complex system (http://precedings.nature.com/documents/3576/version/1). This would solve the problem of terms being unidentifiable; still, though, human would not be known to be mammalian. Another subset would be all terms from mammal downwards (with their labels). Now, human would be known to be a mammal, but not known to not be a fly. As you increase the size of the subset, you increase the inferences that you can make, but the reasoning process will get slower.
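The two subsets described above can be sketched with a toy taxonomy. This is plain Clojure, not MIREOT itself or any of its tools; the identifiers T1 to T4, their labels and the function names are all invented for illustration.

```clojure
;; A toy stand-in for an upstream taxonomy. The identifiers and labels are
;; invented; they are not real NCBI taxonomy identifiers.
(def upstream
  {:parents {"T2" "T1", "T3" "T2", "T4" "T1"}   ; child -> parent
   :labels  {"T1" "animal", "T2" "mammal", "T3" "human", "T4" "fly"}})

(defn label-subset
  "Labels only, for the terms a downstream ontology actually uses."
  [upstream used-terms]
  (select-keys (:labels upstream) used-terms))

(defn lineage
  "The term followed by its parent, grandparent and so on."
  [upstream term]
  (take-while some? (iterate (:parents upstream) term)))

(defn subtree-subset
  "The child -> parent links for every term strictly below `root`."
  [upstream root]
  (into {}
        (filter (fn [[child _]]
                  (some #{root} (rest (lineage upstream child))))
                (:parents upstream))))

(label-subset upstream ["T3"])   ; labels, but no hierarchy
(subtree-subset upstream "T2")   ; human is a mammal; nothing is said about flies
```

The label subset makes the term readable but supports no inference; the subtree subset supports the mammal inference but, as noted, still cannot rule out a human being a fly.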

From my perspective, the second of these seems sensible; large ontologies reason slowly and there is no way around this, until reasoner technology gets better. For this reason, I will probably implement something similar in tawny (with an improvement suggested later). The first, however, seems less justified. We are effectively duplicating all the labels in the upstream ontology, with all this entails, for the purpose of display; we can minimise these problems by regularly regenerating the imported subset from the source ontology, but this is another task that needs to be done.

Tawny is less affected by this from the start, since the name that a developer uses can exist only in Clojure space; moreover, when displaying documentation, tawny can use data from any ontology, not only those imported into the current ontology. We do not need to duplicate the MIREOT subset, we just need to know about it.


Connection Points

While MIREOT is a sensible idea, it is nonetheless seen as a workaround, a compromise solution to a difficult problem (http://precedings.nature.com/documents/3574/version/1). However, in this section, I will discuss a simpler, and more general solution that helps to address the problem of modularity.

Consider a reworked version of the definition above, with one critical change. The nProcess term is now referencing an independent Clojure namespace. The generated OWL from this ontology will include nProcess simply as a reference.

(defclass Function
  :subclass (only realisedIn
                  connection.upper/nProcess))

This is different from the MIREOT approach which maintains that the minimal information is the identifier for the term and the identifier for the ontology. In this case, we only have the former. This difference is important, as I will describe later.

In one sense, we have achieved something negative. We now have a term in our function ontology, with no semantics or annotations. Oops (http://oeg-lia3.dia.fi.upm.es/oops/index-content.jsp) has this in their catalogue of ontology errors:

P8. Missing annotations: ontology terms lack annotations properties. This kind of properties improves the ontology understanding and usability from a user point of view.

— OOPS

However, this problem can be fixed by the editing environment; and, indeed, using Tawny it is. We have a meaningful name, despite a meaningless identifier, and we can see the definition of nProcess should we choose. I call this form of reference a connector, and connectors have interesting properties. In this case, nProcess is a required connector: the function ontology needs it to have its full semantic meaning, but does not provide it.

So, let us consider how we might use these connection points. First, for this example, we need a small upper ontology; in this case, I use the simplest possible ontology to demonstrate my point.

(defontology upper)

(as-disjoint
 (defclass nProcess)
 (defclass NotProcess))

Now, consider our function definition from earlier; imagine that we wish to use this in a downstream ontology to define some functions. In this case, we define a child of Function which is realisedIn something which is NotProcess. The simplest possible way of doing this is to use all three of the entities (Function, realisedIn and NotProcess) as required connection points. We import no other ontologies here, so we can infer nothing that is not already stated.

(defontology use-one)

(defclass FunctionChild
  :subclass connection.function/Function
  (owl-some connection.function/realisedIn
            connection.upper/NotProcess))

In our second use, we now import our function ontology. At this point, the shared identifier space starts to show its value; we now understand the semantics of our Function term because it uses the same identifier as the term in the function ontology.

This does, now, allow us to draw an additional inference; any individual of FunctionChild must be realisedIn an instance of NotProcess which, itself, we can infer to be a child of Process because the function ontology claims this. Or, in short, NotProcess and Process cannot be disjoint, if our ontology is to remain coherent. This ontology remains coherent, however, because we have not imported the upper ontology.

(defontology use-two)
(owl-import connection.function/function)

;; this ontology looks much the same as use-one
(defclass FunctionChild
  :subclass connection.function/Function
  (owl-some connection.function/realisedIn
            connection.upper/NotProcess))

In the final use, we import both ontologies. The function import allows us to conclude that NotProcess and Process cannot be disjoint, while our upper ontology tells us that they are, and at this point, our ontology becomes incoherent. The required connection point in the function ontology has now been provided by a term in our upper ontology.

(defontology use-three)
(owl-import connection.function/function)
(owl-import connection.upper/upper)

(defclass FunctionChild
  :subclass connection.function/Function
  (owl-some connection.function/realisedIn
            connection.upper/NotProcess))

The critical point is that while the function ontology references some term in its definition, the exact semantics of that term are not specified. These semantics are at the option of the downstream user of the function ontology; in use-three, we have decided to fully specify these semantics. But we could have imported a totally different upper ontology had we chosen, either using the same identifiers, or through a bridge ontology making judicious use of equivalent/sameAs declarations. In short, the semantics have become late binding.

We can use this technique to improve on MIREOT. Instead of importing our derived ontology, we can now use connection points instead. The karyotype ontology can reference the NCBI taxonomy, and leave the end user to choose the semantics they need; if the user wants the whole taxonomy, and is prepared to deal with the reasoning speed, then they have this option. This choice can even be made contextually; for example, an OWL import could be added on a continuous integration platform (http://www.russet.org.uk/blog/2324) when reasoning time is less important, but not during development or interactive testing.


Future Work

While the idea of connection points seems sound, it has some difficulties; one obvious problem is that the developer of an ontology must choose the modules, and the connection points between them, for themselves. We plan to test this using SIO; we have already been working on a tawnyified version of this, to enable investigation of pattern-driven ontology development. We will build on this work by attempting to modularise the ontology, with connection points between the modules.

Currently, the use of this form of connection points adds some load to the downstream ontology developer. It would be relatively easy for a developer to build an ontology like use-one or use-two above by mistake, accidentally forgetting to add an OWL import. Originally, when I built tawny, I wanted to automate this process, so that a Clojure import would mean an OWL import, but I decided against it; obviously this was a good thing, as it allows the use of connection points. I think we can work around this by adding formal support for connection points so that, for example, the function ontology can declare that nProcess needs to be defined somewhere, and issue warnings if it is not.
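As a rough illustration of what such formal support might check, the sketch below models each ontology as a map of the terms it declares and the connection points it requires. This is plain Clojure with invented names (:declares, :requires, missing-connectors, warn-missing!); it is not an existing Tawny-OWL feature.

```clojure
(require '[clojure.set :as set])

;; Hypothetical descriptions of the two ontologies used in this post.
(def function-ontology
  {:declares #{:Function :realisedIn}
   :requires #{:nProcess}})

(def upper-ontology
  {:declares #{:nProcess :NotProcess}
   :requires #{}})

(defn missing-connectors
  "Required connection points that none of the given ontologies declares."
  [ontologies]
  (set/difference (apply set/union (map :requires ontologies))
                  (apply set/union (map :declares ontologies))))

(defn warn-missing!
  "Print one warning per unprovided connection point."
  [ontologies]
  (doseq [term (sort (missing-connectors ontologies))]
    (println "WARNING: required connection point not provided:" term)))

;; Importing only the function ontology leaves :nProcess unprovided.
(warn-missing! [function-ontology])

;; Importing both provides it, so nothing is printed.
(warn-missing! [function-ontology upper-ontology])
```

A warning rather than an error seems the right default here, since leaving a connection point unprovided is sometimes exactly what the downstream developer intends.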


Conclusions

In this post, I have addressed the problem of ontology modularity and described the use of connection points, enabling a form of late binding. In essence, we achieve this by building on OWL's web nature: shared identifiers in different ontologies do not presuppose shared semantics. While further investigation is needed, this could change the nature of ontology engineering, allowing a more modular, more scalable and more pragmatic form of development.


Acknowledgements

Thanks to Allyson Lister and James Malone for reviewing this article.


Abstract

The Tawny-OWL library provides a fully-programmatic environment for ontology building; it enables the use of a rich set of tools for ontology development, by recasting development as a form of programming. It is built in Clojure, a modern Lisp dialect, and is backed by the OWL API. Used simply, it has a similar syntax to OWL Manchester syntax, but it provides arbitrary extensibility and abstraction. It builds on existing facilities for Clojure, which provides a rich and modern programming tool chain, for versioning, distributed development, build, testing and continuous integration. In this paper, we describe the library, this environment and its potential implications for the ontology development process.

  • Phillip Lord

Plain English Summary

In this paper, I describe some new software, called Tawny-OWL, that addresses the issue of building ontologies. An ontology is a formal hierarchy, which can be used to describe different parts of the world, including biology which is my main interest.

Building ontologies in any form is hard, but many ontologies are repetitive, having many similar terms. Current ontology building tools tend to require a significant amount of manual intervention. Rather than look to creating new tools, Tawny-OWL is a library written in a full programming language, which helps to redefine the problem of ontology building to one of programming. Instead of building new ontology tools, the hope is that Tawny-OWL will enable ontology builders to just use existing tools that are designed for general purpose programming. As there are many more people involved in general programming, many tools already exist and are very advanced.

This is the first paper on the topic, although it has been discussed before here.

This paper was written for the OWLED workshop in 2013.


Reviews

Reviews are posted here with the kind permission of the reviewers. Reviewers are identified or remain anonymous (also to myself) at their option. Copyright of the review remains with the reviewer and is not subject to the overall blog license. Reviews may not relate to the latest version of this paper.

Review 1

The given paper is a solid presentation of a system for supporting the development of ontologies – and therefore not really a scientific/research paper.

It describes Tawny OWL in a sufficiently comprehensive and detailed fashion to understand both the rationale behind as well as the functioning of that system. The text itself is well written and also well structured. Further, the combination of the descriptive text in conjunction with the given (code) examples make the different functionality highlights of Tawny OWL very easy to grasp and appraise.

As another big plus of this paper, I see the availability of all source code which supports the fact that the system is indeed actually available – instead of being just another description of a “hidden” research system.

The possibility to integrate Tawny OWL in a common (programming) environment, the abstraction level support, the modularity and the testing “framework” along with its straightforward syntax make it indeed very appealing and sophisticated.

But the just said comes with a little warning: My above judgment (especially the last comment) are highly biased by the fact that I am also a software developer. And thus I do not know how much the above would apply to non-programmers as well.

And along with the above warning, I actually see a (more global) problem with the proposed approach to ontology development: The mentioned “waterfall methodologies” are still most often used for creating ontologies (at least in the field of biomedical ontologies) and thus I wonder how much programmatic approaches, as implemented by Tawny OWL, will be adapted in the future. Or in which way they might get somehow integrated in those methodologies.

Review 2

This review is by Bijan Parsia.

This paper presents a toolkit for OWL manipulation based on Clojure. The library is interesting enough, although hardly innovative. The paper definitely oversells it while neglecting details of interest (e.g., size, facilities, etc.). It also neglects relevant related work, Thea-OWL, InfixOWL, even KRSS, KIF, SXML, etc.

I would like to see some discussion of the challenges of making an effective DSL for OWL esp. when you incorporate higher abstractions. For example, how do I check that a generative function for a set of axioms will always generate an OWL DL ontology? (That seems to be the biggest programming language theoretic challenge.)

Some of the discussion is rather cavalier as well, e.g.,

“Alternatively, the ContentCVS system does support offline concurrent modification. It uses the notion of structural equivalence for comparison and resolution of conflicts[4]; the authors argue that an ontology is a set of axioms. However, as the named suggests, their versioning system mirrors the capabilities of CVS – a client-server based system, which is now considered archaic.”

I mean, the interesting part of ContentCVS is the diffing algorithm (note that there’s a growing literature on diff in OWL). This paper focuses on the inessential aspect (i.e., really riffing off the name) and ignores the essential (i.e., what does diff mean). Worse, to the degree that it does focus on that, it only focuses on the set like nature of OWL according to the structural spec. The challenges of diffing OWL (e.g., if I delete an axiom have I actually deleted it) are ignored.

Finally, the structural specification defines an API for OWL. It would be nice to see a comparison and/or critique.

When I started work on Clojure-owl the original intention was to provide myself with a more programmatic environment for writing ontologies, where I could work with a full programming language to define the classes I wanted (http://www.russet.org.uk/blog/2214). After some initial work with functions taking strings, I have moved to an approach where classes (and other ontological entities) are each assigned to a Lisp symbol (http://www.russet.org.uk/blog/2254). I’m using “symbol” rather than “atom” because it’s a bit more accurate, especially as Clojure uses “atom” with a different meaning.

This means that I now have something which allows me to write ontological terms looking something like this:

(defclass a)
(defclass b :subclass a)

(defoproperty r)
(defclass d
     :subclass (some r b))

While this is quite nice, and looks fairly close to Manchester syntax (http://www.w3.org/TR/owl2-manchester-syntax/), ultimately, so far all this really provides me with is a slightly complex mechanism for achieving what I could already do; which raises the question: why not just use Manchester syntax? Why bother with the Lisp if this is all I am to achieve?

I think I have now got to the point where the advantages are starting to show through, as I have started to create useful macros, which operate at a slightly higher level of abstraction from Manchester syntax. I will explain this using examples, perhaps inevitably, based around pizza (http://robertdavidstevens.wordpress.com/2010/01/22/why-the-pizza-ontology-tutorial/), which I have started to develop using Clojure-owl.

First I wanted to be able to define several classes at once, rather than having to use a somewhat long-winded defclass form for each; for this I have written a macro called declare-classes — perhaps a slight misnomer, as it also adds the classes to the ontology. This example shows the purpose:

  (declare-classes
   GoatsCheeseTopping
   GorgonzolaTopping
   MozzarellaTopping
   ParmesanTopping)

In practice, this may not be that useful for an ontology builder, as it creates a bare class; no documentation, nothing else. It may be useful for forward-declaration (like Clojure declare).

One slightly unfortunate consequence of the decision to use lisp symbols is that I now find myself writing a lot of macros. For those who have not used lisp before, most work is done with functions. Macros are only necessary when you wish to extend the language itself. They tend to be more complex to write and to debug, although fortunately are easy to use. Compare, for example, the definition of declare-classes to that of the functional equivalent which uses strings.

(defmacro declare-classes
  [& names]
  `(do ~@(map
          (fn [x#]
            `(defclass ~x#))
          names)))

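;; functional equivalent: Clojure uses defn (not defun), and % names
;; the argument of an anonymous #(...) function
(defn f-declare-classes
  [& names]
  (dorun
   (map #(owlclass %) names)))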

Even in this case, there are more hieroglyphics in the macro (two backticks, one unquote splice and some gensym symbols), although Clojure’s slightly irritating lazy sequences and the resultant dorun mean that the two are nearly as long as each other. I suspect that the macros are going to get more complex, however. In most cases, it should not be the user of the library that has to cope, though.

While this provided a useful convenience, I also wanted a cleaner method for declaring disjoints. Consider this example:

(defclass a)
(defclass b)
(defclass c)

(disjointclasses a b c)

This is reasonably effective, but a pain if there are many classes, as they all need to be listed in the disjointclasses list. Worse, this is error prone; it is all too easy to miss a single class out, particularly if a new class is added. So, I have now implemented an as-disjoint macro which gives this code:

(as-disjoints
   (defclass a)
   (defclass b)
   (defclass c))

This should avoid both the risk of dropping a disjoint, as well as avoiding the duplication. An even more common form is to wish to declare a set of classes as disjoint children. Again, I have provided a macro for this, which looks like this:

 (defclass CheeseTopping)

 (as-disjoint-subclasses
  CheeseTopping

  (declare-classes
   GoatsCheeseTopping
   GorgonzolaTopping
   MozzarellaTopping
   ParmesanTopping))

Although this was not my original intention, these are actually nestable. This gives the interesting side effect that the ontology hierarchy is now represented in the structure of the lisp. The example below is an elided hierarchy from pizza. Lisp programmers will notice I have rather exaggerated the indentation to make the point.

(as-disjoint-subclasses
 PizzaTopping

    (defclass CheeseTopping)

    (as-disjoint-subclasses
         CheeseTopping

        (declare-classes
            GoatsCheeseTopping))

    (defclass FishTopping)
    (as-disjoint-subclasses
        FishTopping

        (declare-classes AnchoviesTopping))

    (defclass FruitTopping)
    (as-disjoint-subclasses
         FruitTopping

         (declare-classes PineappleTopping)))

Of course, it is not essential to do this. The nested use of as-disjoint-subclasses confers no semantics; but it does allow juxtaposition of a class and its children.

Being able to build up macros in this way was the main reason I wanted a real programming language; those described here are, I think, fairly general purpose; so, this form of declaration could also be supported in any of the various syntaxes, although it would require updates to the tools. However, some ontologies will benefit from less general purpose extensions. These are never going to be supported in a syntax specification.

Still, it is not all advantage. Using a programming language means embedding within this language. And this means that some of the names I would like to use are gone; some (http://clojuredocs.org/clojure_core/clojure.core/some) is the obvious example. While Clojure has good namespace support, functions in clojure.core are available in all other namespaces; like all lisps, Clojure lacks types which would have avoided the problem. There are other ways around this, but ultimately clashing with these names is likely to bring pain; for example, I could always explicitly reference clojure-owl functions; but writing owl.owl.defclass rather than defclass seems a poor option; hence, some has become owlsome, and comment has become owlcomment. I have decided to accept the lack of consistency and kept only and label; the alternative taken by the OWL API, of appending OWL to everything, seems too unwieldy.
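For comparison, one of the other ways around the clash is to exclude the core name from a namespace. The sketch below is not what Clojure-owl does (it renames to owlsome); the namespace sketch.restrictions and the map returned by some are hypothetical, while :refer-clojure with :exclude is standard Clojure.

```clojure
(ns sketch.restrictions
  ;; Do not refer clojure.core/some into this namespace.
  (:refer-clojure :exclude [some]))

(defn some
  "Hypothetical stand-in for an existential restriction constructor."
  [property cls]
  {:restriction :some, :property property, :class cls})

(some :isPartOf :Human)            ; the ontology-building function
(clojure.core/some even? [1 2 3])  ; the core function is still reachable by its full name
```

The cost is that any other namespace wanting to use this some unqualified needs the same exclusion to avoid a clash, which is part of the pain that renaming sidesteps.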

With my initial work on developing a Clojure environment for OWL (http://www.russet.org.uk/blog/2214), I was focused on producing something similar to Manchester syntax (http://www.w3.org/TR/owl2-manchester-syntax/). Here, I describe my latest extensions which make more extensive use of Lisp atoms. The practical upshot of this should be to reduce errors due to spelling mistakes, as well as enabling me to add simple checks for correctness.

The desire for a simple syntax is an important one. I would like my library to be usable by people not experienced with Lisp, although I am clearly aware that this sort of environment is likely to be aimed at those with some programming skills. I have managed to produce a syntax which, I think, is reasonably straightforward. It has more parentheses than Manchester syntax, but is easier in other ways, especially now that I have learnt a little more about how Clojure namespaces work. For example, this defines a class in OWL.

(owlclass "HumanArm"
          :subclass "Arm" (some "isPartOf" "Human")
          :annotation (comment "The Human arm is an Arm which is part of a human"))

One of my initial desires for the Clojure mode was to enable the use of standard tools that we have come to expect from a modern programming language, which should enable us to build a more pragmatic ontology building methodology (http://robertdavidstevens.wordpress.com/2011/05/26/unicorns-in-my-ontology/). The first of these is a unit testing environment. Clojure already has one of these integrated. So far, I have only used this for testing my own code; so, for example, this is the current unit test for the owlclass function used above.

(deftest owlclass
  (is (= 1
         (do (o/owlclass "test")
             (.size (.getClassesInSignature
                     (#'o/get-current-jontology))))))
  (is (instance? org.semanticweb.owlapi.model.OWLClass
                 (o/owlclass "test"))))

There are, however, some limitations to the approach that I have taken so far. Consider this statement:

(owlclass "HumanArm"
          :subclass (some "isPartOf" "Humn") "Arm"
)

This is broken because I have referred to the class Humn which I probably do not want to exist because I have spelt it wrongly. Unfortunately, as it stands my code does not know this and so will create the class “Humn”. Now, this form of error is not that likely to happen; tools such as Kudu (http://robertdavidstevens.wordpress.com/2010/04/24/my-own-ontology-projects) enforce this correctness in the Editor, while pabbrev.el (http://www.russet.org.uk/blog/2161) provides “correctness-by-completion”. None the less, these errors will happen and I do not want them to. There are a variety of ways that I could build this form of checking in — generally, this would involve introspecting over the ontology to see if classes already exist.

However, I have taken a different approach, so that I can use the Lisp itself to prevent the problem. To do this, for each class created, I generate a new Lisp symbol; likewise for each object property and the ontology itself. The practical upshot of this is that I can write code like so:

(defclass a)
(defclass b :subclass a)

(defoproperty r)
(defclass d
     :subclass (some r b))

;; will fail as f does not exist
(defclass e
     :subclass f)

;; will fail as r and b are the wrong way around
(defclass e
     :subclass (some b r))

The advantages are three-fold. Firstly, it’s slightly shorter, and there is no need to use quotes all over the place. Secondly, it is no longer possible to refer to a class that has not yet been defined; Clojure will pick this up immediately; from the user perspective, you can test your statements as you go, as soon as you have written them, by evaluating them. Finally, because the atoms carry values which are typed, we can also detect errors such as using a property when a class is necessary.
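The difference between the two styles can be reproduced without any OWL machinery. The sketch below is plain Clojure with invented names (string-class, def-symbol-class, evaluates?); it only mimics the behaviour described above and is not the Clojure-owl implementation.

```clojure
;; Toy registry standing in for an ontology; not the Clojure-owl API.
(def string-classes (atom #{}))

(defn string-class
  "String style: any name is accepted, so a typo silently creates a class."
  [class-name]
  (swap! string-classes conj class-name)
  class-name)

(defmacro def-symbol-class
  "Symbol style: binds `sym` to a value tagged with its kind."
  [sym]
  `(def ~sym {:kind :class, :name ~(name sym)}))

(defn evaluates?
  "True if `form` can be evaluated, false if evaluation throws."
  [form]
  (try (eval form) true
       (catch Throwable _ false)))

(def-symbol-class Human)

(string-class "Humn")   ; accepted: the misspelt class now exists
(evaluates? 'Human)     ; true: the symbol is bound
(evaluates? 'Humn)      ; false: the compiler cannot resolve the symbol
(:kind Human)           ; :class, so a property could be told apart from a class
```

The misspelling is caught by Clojure itself in the symbol style, and the :kind tag is the sort of information that makes it possible to reject a property where a class is required.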

Of course, the original functions are all still in place; there would be no point defining symbols if the intention was to use the API entirely programmatically. But, my intention for Clojure-OWL is to have an environment for humans (well, programmers anyway) to develop ontologies with.

There is a final advantage to this, that I have not yet exploited. Currently, I have generated the name of the OWL class directly from the symbol name. So, in the above example the class a will have a name “a”. There are some problems with this. Not all characters are legal in Clojure symbol names nor in OWL class names, and the set of characters is not the same. So, while this is a useful default, I will formally separate these. At the same time, I think that this will allow me to address a second problem, that of semantics vs semantics free identifiers (http://www.russet.org.uk/blog/2040). I can call a class, ontology or object property anything at all, and refer to it with an easy-to-remember identifier. I might use something like this:

(defoproperty has_part
   :name "BFO_0000051")

There is still a significant amount of work to do; I haven’t made a complete coverage of OWL yet, just the most important parts (i.e. the bits that I use most often). Next, I need to start building some predicates so I can test (asserted) subclass relationships. So far, however, this approach is showing significant promise.

I have been struggling for a while with OWL development environments. While Protege provides a nice GUI based system, this has the limitations of many such systems; it allows you to do what the authors intended, but not all of the things that you might wish.

It is partly for this reason that I have been developing my own OWL Manchester syntax mode for Emacs (http://www.russet.org.uk/blog/2161); I lose a lot from Protege, but then I also gain the ability to manipulate large numbers of classes at once, as well as easy access to versioning. These things are useful.

Still, the environment is lacking in many ways; recently, while building an ontology for karyotypes (http://www.russet.org.uk/blog/2202), I wanted a more programmatic environment. A trivial example, for instance, comes from the human chromosomes; there are 22 autosomes in all. These can easily be expressed in OWL with 22 classes (plus X and Y). The problem is that all of these classes are likely to be very similar, which produces a code duplication problem. Of course, this is not a new problem; OPPL — the ontology pre-processor language was created at least in part for this purpose (10.1038/npre.2009.4006.1).

The main problem with OPPL, however, is that it is a Domain Specific Language; while this makes it well adapted to its task, it also means that it lacks many basic features of a “real” programming language. Another possibility is to use the OWL API (10.1007/978-3-540-39718-2_42) (I am actually on this paper, but I publicly acknowledge that this was a rather generous attribution from Sean Bechhofer; I did do some work on the API, but not much, and I suspect none of my work remains). However, a brief look at the OWL API tutorial shows a problem. This code creates two classes and makes one a subclass of another.

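// pizza_iri is assumed to be an IRI defined elsewhere, as in the tutorial
OWLOntologyManager m = OWLManager.createOWLOntologyManager();
OWLDataFactory df = m.getOWLDataFactory();
OWLOntology o = m.createOntology(pizza_iri);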
// class A and class B
OWLClass clsA = df.getOWLClass(IRI.create(pizza_iri + "#A"));
OWLClass clsB = df.getOWLClass(IRI.create(pizza_iri + "#B"));
// Now create the axiom
OWLAxiom axiom = df.getOWLSubClassOfAxiom(clsA, clsB);
// add the axiom to the ontology.
AddAxiom addAxiom = new AddAxiom(o, axiom);
// We now use the manager to apply the change
m.applyChange(addAxiom);
// remove the axiom from the ontology
RemoveAxiom removeAxiom = new RemoveAxiom(o, axiom);
m.applyChange(removeAxiom);

Aside from the intrinsic problems of Java (the compile, run, test cycle is rather clunky for this sort of work), this amount of code to achieve something straightforward makes this a little untenable. Compare the same subclass statement in Manchester syntax:

Class: piz:A
    SubClassOf:
        piz:B

However, while Java and the OWL API do not seem a good choice for manipulating OWL directly, rewriting everything from first principles would also be a bad idea.

One solution to this problem came to my attention recently, in the shape of Clojure; essentially, this is a Lisp implemented on the JVM. I will not describe the virtues or otherwise of Lisp in great detail; for some reason it is one of those languages that tends to generate fanaticism, and there are lots of descriptions of Lisp elsewhere. For my purposes, there were three advantages. The first was personal, which is that I know Lisp reasonably well, being an Emacs hacker. The other two are more general: first, Clojure has good integration with Java, and can manipulate Java objects, meaning I can make direct use of the OWL API; and, second, Lisp has a good degree of syntactic plasticity, which is important as, after all, I am looking for a convenient representation.

Initially, I have aimed at producing a representation which is fairly similar to Manchester syntax (http://www.w3.org/TR/owl2-manchester-syntax/). My initial attempts used the various features of Clojure directly. Consider, for instance, the following two statements:

(owl/owlclass
 "Arm" {:subclass "Limb"})

(owl/owlclass
 "HumanArm" {:subclass ["Limb" "HumanBodyPart"]})

Lisp, in general, uses a prefix notation. There is no obvious and easy way around this; in this case, it actually fits rather well with Manchester syntax, which looks similar. The use of frame keywords such as SubClassOf: in Manchester syntax is also fortuitous, as Lisp keywords such as :subclass look much the same. However, this syntax is rather too difficult. Even in this simple example we have a statement terminator which looks like ]}) (representing the end of a vector, a hash and a sequence respectively). Lisps are often criticised for having too many parentheses; Clojure is unusual in using lots of different styles of parens. In Emacs-Lisp, I just keep hitting ) till I am finished. In Clojure, you have to close the brackets in the right order. All rather painful.

Fixing this turned out to be quite difficult, and needed a particularly nasty function I have called groupify. It is heavily recursive, which is, apparently, a poor idea in Clojure, as it lacks some recursion optimisations present in many Lisps; however, without mutable local variables, I could see no other option. The syntax now looks much simpler.

(owl/owlclass
 "Arm" :subclass "Limb")

(owl/owlclass
 "HumanArm" :subclass "Limb" "HumanBodyPart")

(owl/owlclass
 "Hand" :subclass (owl/some "isPartOf" "Arm"))

Both the :subclass and :equivalent frames support any number of class expressions; so far I have only implemented some and only, but the rest are not hard. Currently, it is only possible to save the ontology in Manchester syntax, but fixing this is trivial; the OWL API is doing all of the work.
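
For readers curious about the grouping step, here is a minimal sketch of the idea. It is not the actual groupify from my code base; the name group-frames is made up for this illustration, and it assumes the arguments arrive as a flat sequence in which each keyword starts a new frame. It uses loop/recur rather than open recursion, so it does not grow the stack.

```clojure
;; Hypothetical illustration only: collect a flat argument list such as
;;   (:subclass "Limb" "HumanBodyPart" :equivalent "Arm")
;; into a map from each frame keyword to a vector of its values.
(defn group-frames
  [args]
  (loop [remaining args   ; arguments still to be processed
         frame     nil    ; the frame keyword we are currently filling
         acc       {}]    ; the map built so far
    (if (empty? remaining)
      acc
      (let [x (first remaining)]
        (cond
          ;; A keyword starts (or resumes) a frame.
          (keyword? x)
          (recur (rest remaining) x (update acc x #(or % [])))

          ;; A value with no preceding keyword cannot be placed anywhere.
          (nil? frame)
          (throw (IllegalArgumentException.
                  (str "Value before any frame keyword: " x)))

          ;; Any other value belongs to the current frame.
          :else
          (recur (rest remaining) frame (update acc frame conj x)))))))
```

Calling (group-frames [:subclass "Limb" "HumanBodyPart"]) gives a map with a single :subclass entry holding both class names, which is the shape the earlier, bracket-heavy syntax asked me to type by hand.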

Of course, this would not be much help if all I had managed to achieve was Manchester syntax with more parens. However, the big advantage of this becomes clearer with the next example:

(dorun
 (map
  (fn [x]
    (owl/owlclass
     (str "HumanChromosome" x)
     :subclass "HumanChromosome"))
  (concat '("X" "Y") (range 1 23))))

This creates a class for each human chromosome. In this case, I have hard coded the list of classes in, but I could be parsing a CSV or accessing a database. Or accessing an existing ontology; this could be very useful in avoiding maintenance of duplicate hierarchies.

Still, as it stands, this is just an (under-functional) version of OPPL. To make this worthwhile, I need to build on the language features that Clojure brings. I want to be able to interact with a reasoner, performing tasks in batch. In particular, the next step is to hook into Clojure’s test framework; something I have sorely missed when ontology building as opposed to programming. My experiences so far with combining Clojure and the OWL API suggest this should not be too hard.

These would not be minor advances; in the same way that test-driven programming has had a significant impact on the way we code, having a good test framework for OWL would mean that we could define our use cases up-front, formally and programmatically, and then fiddle with the logical representation till they work. As with test-driven programming, the test cases would themselves start to form part of the documentation for the code. When combined with a literate framework (http://www.russet.org.uk/blog/1213), to link between the ontology, the test cases and the experimental data that we are attempting to represent and model, this would provide a strong environment indeed. It would be a good step in moving from the craft-based approach we are taking at the moment, toward the pragmatic environment that I and others (http://robertdavidstevens.wordpress.com/2011/05/26/unicorns-in-my-ontology/) feel we need.
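
To give a flavour of what such a test might look like, here is a rough sketch using clojure.test. Everything ontology-related in it is an assumption made for illustration: the hierarchy is a plain Clojure map standing in for a real ontology, and asserted-subclass? is a hypothetical predicate, not part of my library or of the OWL API.

```clojure
(require '[clojure.test :refer [deftest is run-tests]])

;; Hypothetical stand-in for an ontology:
;; class name -> set of directly asserted superclasses.
(def asserted-superclasses
  {"HumanChromosome1" #{"HumanChromosome"}
   "HumanChromosomeX" #{"HumanChromosome"}
   "HumanChromosome"  #{"Chromosome"}})

(defn asserted-subclass?
  "True if super can be reached from sub by following asserted
   superclass links, directly or transitively. No reasoner is involved."
  [hierarchy sub super]
  (loop [todo [sub]
         seen #{}]
    (if-let [c (first todo)]
      (if (seen c)
        ;; Already visited; skip it so that cycles terminate.
        (recur (rest todo) seen)
        (let [parents (get hierarchy c #{})]
          (if (contains? parents super)
            true
            (recur (into (vec (rest todo)) parents) (conj seen c)))))
      false)))

;; The use cases, stated up-front as tests.
(deftest chromosome-hierarchy
  (is (asserted-subclass? asserted-superclasses
                          "HumanChromosome1" "HumanChromosome"))
  (is (asserted-subclass? asserted-superclasses
                          "HumanChromosomeX" "Chromosome"))
  (is (not (asserted-subclass? asserted-superclasses
                               "HumanChromosome" "HumanChromosome1"))))
```

Evaluating (run-tests) at the REPL runs the test and prints a summary. In a real setting the map would be replaced by queries against the ontology, or against a reasoner for inferred rather than asserted relationships.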

My code is available on Google code at http://code.google.com/p/clojure-owl/, and will be developed further there.


Recently, I and my PhD student, Jennifer Warrender, have become interested in the representation of karyotypes. These are descriptions of the chromosome complement of an individual. In essence, they are a bird's-eye view of the genome. Normally, they are described using a karyotype string, so my karyotype would be 46,XY (probably!) which is normal male. When describing abnormalities, these can get very complex; take, for example, 46,XX,der(9),t(9;11)(p22;q23)t(11;12)(p13;q22)c,der(11)t(9;11)t(11;12)c,der(12)t(11;12)c[20] which describes a patient with multiple translocations.

There are a couple of reasons why we thought that it would be interesting to turn this into an ontological representation. The karyotype strings are not very parsable, and lack a computable specification, which makes it very hard to check that they are correct, or to search and query over them. Having an ontological representation should help with the specification; additionally, using OWL we should be able to reason over both the specification and individual karyotypes.
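
To show where the difficulty lies, here is a deliberately naive sketch in Clojure. It assumes that a comma only ever separates top-level fields, which holds for the two strings quoted above but is not something we have checked against ISCN as a whole. The function name and the keys of the result are invented for this illustration.

```clojure
(require '[clojure.string :as string])

;; Hypothetical, naive parser: splits a karyotype string on commas into
;; a chromosome count, the sex chromosomes, and any remaining fields.
;; The abnormality fields are kept as uninterpreted strings.
(defn parse-simple-karyotype
  [s]
  (let [[n sex & abnormalities] (string/split s #",")]
    (when-not (and n sex (re-matches #"\d+" n))
      (throw (IllegalArgumentException.
              (str "Not a simple karyotype: " s))))
    {:chromosome-count (Integer/parseInt n)
     :sex-chromosomes  sex
     :abnormalities    (vec abnormalities)}))
```

This copes with 46,XY, but everything interesting in the longer example ends up as opaque strings in :abnormalities; giving those strings a computable meaning is exactly the part that needs a proper specification.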

So far, it is turning out to be quite an interesting experience. The definitive resource for these strings is called ISCN (http://www.worldcat.org/title/iscn-2009/oclc/496946115). One of the things that we thought would happen is that we would detect some inconsistencies in this specification; this often happens when producing a computable specification from a human-readable one, and it is not a reflection on the authors of the original. Sure enough, this is turning out to be the case. On the second page of the specification (after front matter), for instance, we find this statement:

Group G (21-22,Y): Short acrocentric chromosomes with satellites. The Y chromosome bears no satellites

pg7
— ISCN

The two sentences here are contradictory in themselves; either, the Y chromosome should not be in Group G, or Group G should not be defined to bear satellites.

There is an apparently similar statement on the page before, which says:

Not all the chromosomes in the D and G groups show satellites on their short arms in a single cell

pg6
— ISCN

However, there is a significant difference here; to fill in the cytogenetic background, a satellite is a differentially staining visible body on the chromosome, found near the centromere of some chromosomes. The name “satellite” actually comes from a different property of satellite DNA, that it is often a different density from bulk genomic DNA, so when spun in a density gradient, it appears as a smaller band segregated from the main genomic band. Nowadays, satellite DNA is known to be highly repetitive sequence that we are no longer allowed to call Junk DNA (http://www.bbc.co.uk/news/health-19202141). In humans, the most common sequence is known as the alpha satellite. The different densities sometimes seen in gradients occur where the GC content is different from bulk. The differential staining patterns seen cytogenetically occur because the repetitive DNA is normally packed as heterochromatin.

This leads us to understand the difference between our two quotes from ISCN. Although repetitive DNA is variable in detail, it is not that variable, and will remain constant within an individual; if one Chromosome 22 has alpha satellite at the centromere, then so will another. The key here is the caveat “in a single cell”. In a chromosome spread from a single cell, an individual chromosome may or may not be at the right stage of condensation for the satellite to be visible.

So, we have two usages of the word chromosome. In the first quote, we are referring to a canonical chromosome; so, canonically, it is true that all human chromosome 22s contain a satellite. In the second statement, we are referring to a single chromosome, in a single cell, from a single human. There is no contradiction here because the existence of a single chromosome 22 without satellites is not inconsistent with the canonical chromosome 22 having satellites.

Actually, ontologically, the situation is slightly more complex still. The second quote says “Not all..show satellites” (our emphasis). It is not an issue of whether the chromosomes have or do not have the relevant DNA, it is whether we happen to be able to see this after appropriate staining.

We will consider both of these issues, canonicalisation and visualisation, in future posts.


Authors

This post was authored by Phillip Lord and Jennifer Warrender.
