I have been working on a Clojure library for developing OWL ontologies (http://www.russet.org.uk/blog/2214). There have been two significant advances with this library recently. First, I have changed its name from clojure-owl to tawny-owl. I was never really happy with the original name; I think it is bad practice to name something after the language it uses (even partly, as the many jlibraries attest), and there were several other libraries around for manipulating OWL in Clojure, albeit in different ways. “Tawny” is simple, straightforward and memorable, I think. At the same time, I moved to Github because I can now just update readme.md, rather than having to update a separate website.

Perhaps more importantly, I have added new code for handling changes to external ontologies, which is particularly important for external libraries.

Throughout the development of tawny-owl, I have focused on providing an environment that is easy to use for the developer; so, classes, properties and other entities are represented as Lisp symbols (http://www.russet.org.uk/blog/2254). This works well and produces very attractive looking code in, for example, my version of the Pizza ontology. I have also written code so that ontologies only available as OWL files can be treated as first-class citizens: very easy in a highly dynamic language like Lisp.

However, this approach causes problems when combined with an ontology such as OBI (10.1186/2041-1480-1-S1-S7). The difficulty here is that OBI uses semantics-free identifiers (http://www.russet.org.uk/blog/2040). While there are some good reasons for this, it would result in Clojure of the form:

(defclass OBI:0034322
     :subclass OBI:0034321)

Clearly, this is not good, and something that I want to avoid. So, instead, we apply a transform function to OBI when importing it; basically, this munges the rdfs:label annotation, turning it into something that is a legal Clojure symbol.

  :transform
  ;; fix the space problem
  (fn [e]
    (clojure.string/replace
     ;; with luck these will always be literals, so we can do this
     ;; although not true in general
     (.getLiteral
      ;; get the value of the annotation
      (.getValue
       (first
        ;; filter for annotations which are labels
        ;; is lazy, so doesn't eval all
        (filter
         #(.. % (getProperty) (isLabel))
         ;; get the annotations
         (.getAnnotations e
                          (owl.owl/get-current-jontology))))))
     #"[ /]" "_"))
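
The string handling at the heart of this transform can be tried on its own, without the OWL API. The following is a minimal sketch: `label->symbol-name` is a hypothetical helper name rather than part of tawny-owl, and the labels are only illustrative.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of tawny-owl. It applies the same regular
;; expression as the transform above, replacing each space or slash with an
;; underscore so that the label can be used as a Clojure symbol name.
(defn label->symbol-name
  [label]
  (str/replace label #"[ /]" "_"))

(label->symbol-name "planned process")      ; => "planned_process"
(label->symbol-name "input/output device")  ; => "input_output_device"
```

Other characters that are awkward in symbols (parentheses or quotes, for instance) are left alone by this expression, so a real import may well need a broader character class.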

All well and good. However, there is a problem. The label in OBI has two characteristics. First, it is human readable, which is good, and the reason why we are using it. Second, however, it does not carry formal semantics; the developers are free to change these labels whenever they like. Of course, any ontology that I build against my tawnyized version of OBI will break when a label changes. This is not a problem for a GUI like Protege because, perhaps ironically, GUIs are not WYSIWYG: what you see is actually a view of the underlying data model. So, Protege shows you the label, but actually you are manipulating the URI. A dependency can change its labels, and when Protege reloads it, the new labels are what the developer will see.

With code, on the other hand, there is no separation at all. If the label changes, I will have to update anything that refers to it, which seems a substantial problem. However, I have now managed to work around this. My new library memorise saves all the mappings into a file, then restores them when OBI is loaded. Any old labels that no longer exist but which point to an IRI that still does exist are generated as duplicate symbols pointing to the same OWL object; however, I have done this in such a way that they emit warnings, both when loading and during use, with a description of the new symbol name. This data would also make automatic upgrading possible, of course, using Clojure to perform a big search and replace on the source code. I think that this is a nicer solution than the denormalisation (http://www.russet.org.uk/blog/1470) or “colour cube” solution (http://www.russet.org.uk/blog/2040) that I previously suggested for Manchester syntax. It also shows off the advantage of using a programming language, rather than a static format; I, or any other user of the library, can just add this, as I choose, without having to wait for a standardisation process and tool support to catch up.
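
The idea can be sketched in plain Clojure. This is an illustration under stated assumptions, not the memorise implementation: the function names, the EDN file format and the example.org IRIs are all hypothetical, and the mappings are modelled simply as a map from symbol name to IRI.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.set :as set])

;; Hypothetical persistence helpers: write the name -> IRI map as EDN text
;; and read it back again on the next load.
(defn save-mappings! [file mappings] (spit file (pr-str mappings)))
(defn restore-mappings [file] (edn/read-string (slurp file)))

;; Made-up data: names saved from an earlier load, and the names generated
;; from the current release. One label has changed; its IRI has not.
(def saved   {"planned_process" "http://example.org/obo/EX_0000011"
              "assay"           "http://example.org/obo/EX_0000070"})
(def current {"planned_procedure" "http://example.org/obo/EX_0000011"
              "assay"             "http://example.org/obo/EX_0000070"})

(defn stale-aliases
  "Returns a map of old-name -> new-name for every saved name that has
  disappeared while its IRI still exists under a different name."
  [saved current]
  (let [iri->name (set/map-invert current)]
    (into {}
          (for [[old-name iri] saved
                :when (not (contains? current old-name))
                :let [new-name (iri->name iri)]
                :when new-name]
            [old-name new-name]))))

(defn alias-warnings
  "One warning string per stale alias, naming the replacement symbol."
  [aliases]
  (for [[old-name new-name] (sort aliases)]
    (str "WARNING: " old-name " is obsolete; use " new-name)))

(stale-aliases saved current)
;; => {"planned_process" "planned_procedure"}
```

The same old-name to new-name map is what a search and replace over source code would consume.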

This will still leave a secondary problem; it is dependent on the IRI, which for pre-release versions of OBI is not fixed, as documented. Of course, this problem could go away if OBI used a tool like URIGen, or alternatively if OBI released more regularly. Still, the data should also allow a reverse lookup: finding out what IRI a label now has.
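
As a rough sketch of that reverse lookup, assuming again that the saved data is a map from symbol name to IRI (the function name and the example.org IRIs are hypothetical, not taken from memorise or OBI):

```clojure
;; Made-up data: the label is unchanged between loads, but the pre-release
;; IRI behind one of the labels has moved.
(def saved-names   {"assay"  "http://example.org/obo/EX_0000070"
                    "device" "http://example.org/obo/EX_0000968"})
(def current-names {"assay"  "http://example.org/obo/EX_0000070"
                    "device" "http://example.org/obo/EX_0400002"})

(defn moved-iris
  "Returns name -> {:old iri :new iri} for every name present in both
  maps whose IRI differs between them."
  [saved current]
  (into {}
        (for [[nm old-iri] saved
              :let [new-iri (get current nm)]
              :when (and new-iri (not= old-iri new-iri))]
          [nm {:old old-iri :new new-iri}])))

(moved-iris saved-names current-names)
;; => {"device" {:old "http://example.org/obo/EX_0000968"
;;               :new "http://example.org/obo/EX_0400002"}}
```

This only helps when the label itself is stable; if both label and IRI change in the same release, neither lookup can connect the old term to the new one.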

I think these are the main tools that are needed to build against an external resource. The 0.5 version of Tawny is now available on Clojars and Github.

4 Comments

  1. Ignazio says:

    I wonder at the concept of semantics free identifiers and how it is used: granted, having semantics in the meaning of the identifier is bad, as it is not formalized, logical knowledge, and as such it does not fit well inside a logical language, apart from being hard to update when needed.

    But, on the other hand, debugging a problem in either an API to access an ontology or inside a reasoner (you might guess why I picked these two examples :-) yes, been there done that), when the IRIs flying around are fairly long strings of numbers, can make a developer’s eyes fill up with tears. It reminds me of my first days of programming, when I debugged C data structures and had to eyeball pointer values to spot the errors.

    I wonder if the issue is not with semantics rich identifiers, but rather with /readable/ identifiers. Suppose I have a convention to call all my classes JackFolla, JillSmithers, ThomasJefferson and JohnDoe (the last one is an anonymous one, of course): I reckon this would make debugging much easier than 0045, 000340056, 0055340056, and _334532, without violating the semantics free rule.

    Thoughts?
    Cheers,
    I.

  2. Phillip Lord says:

    I think this is a reasonable idea, although there is the practical problem
    of how to come up with names for the classes. A dictionary based approach
    might help.

    However, I fear that the tide of history is against you. For instance, in
    many parts of biology, we used to give genes semantics-freeish but memorable
    names. So “hedgehog” or its human equivalent “sonic hedgehog”. The problem
    comes when names get slightly humorous as with sonic. What happens when
    a doctor talks to a patient and tells them they have a hereditary condition
    in sonic hedgehog?

    Ultimately, though, I think we need to think about what we mean by a name
    change that *does not* change the semantics. Of course, minor spelling
    corrections do not fall into this, but anything more substantive?

  3. Ignazio says:

    Of course it wouldn’t be a client side label, same as you’d not show OBI0003342 to anyone that you want to understand what’s being shown to them :-) Actually I could whip together such a utility for my local use and see how it works from there. Would be funny if randomly picked words turned out to match the intended meaning of the classes – talk about serendipity.

    About name changes – under the assumption that the IRI has no meaning, no syntactic change would make any difference except when the IRIs are shared among ontologies. But I suppose that in this second case any change, be it a spelling mistake or an intentional rename, would have bad consequences; and, to a different extent, changes in labels would have the same consequence, I fear. I know there are a number of people searching ontologies like SNOMED with regular expressions on entity labels.

    Actually, the more I think about what problems might come up, the more it seems to me that names shouldn’t be changed :-)

  4. Phillip Lord says:

    Client-side labels always leak. This is, for example, a reason why OBO uses
    numeric IDs, when UUIDs would work just as well otherwise.

    In terms of random naming, you might want to have a look at Simon Jupp’s
    URIGen, which allows shared generation of URIs. It should be possible to coerce this
    to use a dictionary based approach, and Protege already has a plugin. I will
    add a client for tawny-owl at some point also.

    The reality is that names DO change, often where there is no intended change
    in semantics. This includes spelling corrections and changes in overall naming
    policy. No name changes is an easy and convenient policy, of course, but it
    doesn’t necessarily work. Besides which in a multi-lingual environment, the
    question would be which label?
