Archive for the ‘Tech’ Category

Although it appears fairly innocuous, the last commit to tawny-owl seems momentous to me. While I still need to go through the spec line-by-line, and the code needs some clean up, this commit essentially represents the completion of the tawny.owl namespace; the addition of data properties and data types was the last part of the spec that I had to fulfil.

When I started off the tawny-owl library in October (http://www.russet.org.uk/blog/2214) I was most interested in getting a test environment, and the ability to use a normal editor. Subsequently, and particularly in the course of writing up my first paper on this library (http://www.russet.org.uk/blog/2366), it became obvious to me that I needed to support all of OWL2. I think I have achieved my original design motivations and some more besides. I have also learned a lot about OWL, the OWL API and Manchester syntax. It is also a strange project, because it is the first time I have fulfilled a specification in quite this way. I cannot recall the last time I could reasonably be said to have finished something, as research is generally open-ended.

I did not, however, start with a regular syntax in mind. In general, the conversion to lisp has worked reasonably well: the object side of OWL in particular falls into a prefix, lisp syntax very naturally; the individual side less so. The data side of OWL had another surprise in store for me: it looks very similar to the object side, so I wanted to share syntax. However, all the Java method calls are named differently and take different types and numbers of parameters.

In the end, I have supported this through a multi-method and some heuristics to guess which call is wanted. For instance, with these two calls from the pizza ontology:

(owlsome hasTopping CheeseTopping)
(owlsome hasCalorificContentValue (span =< 400))

we generate quite different types of OWL object. The owlsome method defers to either object-some or data-some respectively, which can also be used directly. In this case, the difference is obvious; however, tawny also takes strings in most of these places; in this case, we convert to an IRI and check whether it exists in the ontology or any ontology we know about first. I suspect that these heuristics will work in most cases, but fail in some; only time and experience will tell me about these.
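The dispatch described above can be sketched in plain Clojure. This is only an illustration of the idea, not the tawny implementation: the `known-properties` registry, the `property-kind` heuristic, the `some-restriction` multimethod and the `hasCalories` property are all hypothetical names, and plain maps stand in for the OWL API objects that tawny really produces.

```clojure
;; Hypothetical registry of property kinds; tawny itself inspects OWL API
;; objects and the loaded ontologies rather than a plain map like this.
(def known-properties
  {"hasTopping"  :object
   "hasCalories" :data})

(defn property-kind
  "Guess whether prop is an object or a data property.
  Returns :object, :data or nil when no guess is possible."
  [prop]
  (cond
    (map? prop)    (:kind prop)
    (string? prop) (get known-properties prop)
    :else          nil))

(defmulti some-restriction
  "Build an existential restriction, dispatching on the property kind."
  (fn [prop _filler] (property-kind prop)))

(defmethod some-restriction :object [prop filler]
  {:type :object-some :property prop :filler filler})

(defmethod some-restriction :data [prop filler]
  {:type :data-some :property prop :filler filler})

;; When the heuristic cannot decide, fail loudly rather than guess.
(defmethod some-restriction :default [prop _filler]
  (throw (IllegalArgumentException.
          (str "Cannot tell which kind of property this is: " prop))))
```

In this sketch, `(some-restriction "hasTopping" "CheeseTopping")` takes the object branch, while a string that is not in the registry raises an exception; that is the kind of case where a heuristic could go wrong in practice.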

Before the next release, 0.12, I will finish the inline function documentation and update the tutorial. After this I plan to sit on the API a while, think about the functions and the syntax to make sure I am happy; the release after should be 1.0 and, as is the way of these things, I will be stuck with the appearance of the API for quite a while. This also allows me to avoid a 0.13 release without accusation of superstition.

There are still many parts of tawny that I wish to improve; in particular, I need to extend the repl facilities with doc and apropos features; my attempts to hijack the Clojure native facilities have failed despite extensive efforts. And explanation code needs to go in; currently, waiting for protege to reason and produce these results is a soul-destroying experience; I want my continuous integration tests to automatically dump explanations whenever inconsistencies happen.

But new features are for the future; for this iteration, tawny-owl is finished and now will be polished.


Tawny-OWL (http://www.russet.org.uk/blog/2214) is a library which enables the programmatic construction of OWL (http://www.russet.org.uk/blog/2366). One of the limitations of tawny as it stood was that it did not implement numeric, semantics-free identifiers (http://www.russet.org.uk/blog/2040); tawny builds identifiers from the clojure symbols used to describe the class. So, in my pizza ontology, for instance, PizzaTopping gets an iri ending in PizzaTopping. Semantics-free identifiers have some significant advantages; the principal one is that they establish an identity for an object which can persist even if the properties (the labels for instance) change, as I have described previously (http://www.russet.org.uk/blog/1908).

However, semantics-free identifiers do not come for free; they also have significant disadvantages, mainly that they make the life of developers harder and code less readable (http://www.russet.org.uk/blog/2040). I’ve previously suggested solutions to this problem when it afflicts OWL Manchester syntax (http://www.russet.org.uk/blog/1470).

With tawny, the IRIs that are used to identify concepts can easily be separated from the clojure symbols that are used to identify them; the initial link between them was simply one of convenience. So supporting numeric IRIs was possible with very little adjustment; the core owl.clj required only that one fixed function call become a call to a first-class function.
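As a rough illustration of what replacing a fixed call with a first-class function means, the sketch below lets an ontology carry its own IRI strategy. The names `name-based-iri`, `uuid-based-iri` and `make-entity`, the two-argument strategy signature, and the use of a plain map for the ontology are all assumptions made for the example; they are not the code in owl.clj.

```clojure
(defn name-based-iri
  "Default strategy: the IRI ends in the entity name."
  [base entity-name]
  (str base "#" entity-name))

(defn uuid-based-iri
  "Alternative strategy: the IRI ends in a random UUID and ignores the name."
  [base _entity-name]
  (str base "#" (java.util.UUID/randomUUID)))

(defn make-entity
  "Create an entity description, asking the ontology for its IRI strategy.
  The ontology is a plain map here; :iri-gen is optional."
  [ontology entity-name]
  (let [iri-gen (get ontology :iri-gen name-based-iri)]
    {:name entity-name
     :iri  (iri-gen (:iri ontology) entity-name)}))
```

With no `:iri-gen` key the entity gets a name-based IRI; adding `:iri-gen uuid-based-iri` to the map changes the IRI without touching `make-entity` at all.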

One of the purposes of tawny is to enable a more agile development methodology than we have at present, so clearly I did not want the developer to have to manage this process by hand. Moreover, as recent discussions on the OBI mailing list have shown, the co-ordination of identifiers can be a significant difficulty. As James Malone has recently described, the URIgen tool offers a solution to this problem (http://jamesmaloneebi.blogspot.co.uk/2013/04/keeping-it-agile-secret-to-fitter.html). Simon Jupp, who is the primary developer of URIgen, kindly discussed the details with me, which has helped me form my ideas about a suitable workflow, and I have borrowed heavily from URIgen (and the protege plugin) for this. While I will probably implement a URIgen client for tawny in the future, my initial approach uses a slightly different idea. In general, with tawny, I have been advocating using standard software development tools, instead of specific ontology ones (http://www.russet.org.uk/blog/2366); rather than co-ordinating developers through the use of a centralised server, it seems to me to make more sense to use whatever version control system is already in place. To that end, I have implemented a file-based system for storing identifiers; given that most bio-ontologies remain under the 50,000 terms size, I think that this is plausible, especially as it is simple in tawny to modularise the source (if not the ontology, which remains a hard research problem). In this case, I have used a properties file, since it is a simple and human-readable format.

This works as follows. First, we define a new ontology, with an iri-gen frame, which uses the obo-iri-generate function. Of course, this is generic so it is possible to use arbitrary strategies for generating an IRI.

(defontology pizzaontology
  :iri "http://www.ncl.ac.uk/pizza-obo"
  :prefix "piz:"
  :comment "An example pizza using OBO style ids"
  :versioninfo "Unreleased Version"
  :annotation (seealso "Manchester Version")
  :iri-gen tawny.obo/obo-iri-generate
  )

Next, we need to restore the mapping between names and IRIs. We need to do this before we create any classes. In the first instance, this file will be empty, and will contain no mappings; this is not problematic.

(tawny.obo/obo-restore-iri "./src/tawny/obo/pizza/pizza_iri.props")

Now, we define concepts, properties and so forth as normal.

(defclass CheeseTopping
  :label "Cheese Topping")
(defclass MeatTopping
  :label "Meat Topping")

The difference in how the IRI is created should be transparent to the developer at this point. Behind the scenes we are using this logic.

(defn obo-iri-generate-or-retrieve
  [name remembered current]
  (or (get remembered name)
      (get current name)
      (str obo-pre-iri "#"
           (java.util.UUID/randomUUID))))

Or, in English: if the name (“CheeseTopping”) has been stored in our properties file, use that IRI; if the name has already been used in the current session, use that IRI; failing that, create a random UUID. I have used a UUID rather than autominting new identifiers because tawny is programmatic; it is very easy to create 1000 concepts where you meant to create 10, which would result in a lot of new identifiers. It makes more sense to mint permanent identifiers explicitly, as part of a release process.
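The lookup order can be tried out in isolation with plain maps. In this sketch the function body mirrors the logic shown above, but the value of `obo-pre-iri`, the function name `generate-or-retrieve` and the two example maps are assumptions chosen purely for illustration.

```clojure
;; Assumed value for illustration only; the real prefix is defined in tawny.obo.
(def obo-pre-iri "http://example.org/pizza-temp")

(defn generate-or-retrieve
  "Same precedence as described above: stored mapping first, then the
  current session, then a fresh random UUID."
  [entity-name remembered current]
  (or (get remembered entity-name)
      (get current entity-name)
      (str obo-pre-iri "#" (java.util.UUID/randomUUID))))

;; Hypothetical contents of the properties file and of the current session.
(def remembered {"CheeseTopping" "http://example.org/pizza/PIZZA_000001"})
(def current    {"MeatTopping"   "http://example.org/pizza-temp#session-iri"})

(generate-or-retrieve "CheeseTopping" remembered current) ; stored IRI wins
(generate-or-retrieve "MeatTopping" remembered current)   ; session IRI reused
(generate-or-retrieve "FishTopping" remembered current)   ; fresh UUID-based IRI
```

Only the third call creates anything new, which is why repeated loads of the same file do not change the identifiers of existing classes.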

This also works for programmatic use of tawny, regardless of whether concepts are added to the local namespace. This code creates many classes all at once, but does not add them to the namespace. Their IDs will still be stored.

(doseq [n (map #(str "n" %) (range 1 20))]
  (owlclass n))

Finally, we need to store the IRIs we have created. Both full IDs and UUIDs are stored; so new classes will get a random UUID, but it will persist over time, providing some interoperability with external users who can use the short-term identifier in the knowledge that it may change.

(tawny.obo/obo-store-iri "./src/tawny/obo/pizza/pizza_iri.props")

At the same time, we report obsolete terms. These are those with permanent identifiers, which are present in the properties file, but have not been created in the current file. Currently, these are just printed to screen, but I could generate classes and place them under an “obsolete” superclass.

(tawny.obo/obo-report-obsolete)
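The obsolete check described above amounts to a filtered difference between two mappings. The following is a minimal sketch under stated assumptions: `permanent-iri?` and `obsolete-names` are hypothetical names, both mappings are plain name-to-IRI maps, and a permanent identifier is assumed to be recognisable by ending in an underscore followed by digits.

```clojure
(defn permanent-iri?
  "Assumption: permanent IRIs end in an underscore plus digits (PIZZA_000001),
  whereas temporary ones end in a UUID."
  [iri]
  (boolean (re-find #"_\d+$" iri)))

(defn obsolete-names
  "Names that have a permanent IRI in the stored mapping but were not
  created in the current session, in alphabetical order."
  [remembered current]
  (sort
   (for [[entity-name iri] remembered
         :when (and (permanent-iri? iri)
                    (not (contains? current entity-name)))]
     entity-name)))
```

A name that only ever had a temporary UUID is deliberately not reported here, since nobody outside the project should have relied on it.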

Finally, at release point, a single function is called to generate the new IDs. This is done numerically, starting from the largest ID. If there are multiple developers, this step has to be co-ordinated, or it is going to break; but this is little different from a release point of any software project.

(tawny.obo/obo-generate-permanent-iri "./src/tawny/obo/pizza/pizza_iri.props" "http://www.ncl.ac.uk/pizza-obo/PIZZA_")
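Numeric minting from the largest existing ID could look roughly like the sketch below. It is not the tawny.obo code: `permanent-number` and `mint-permanent` are hypothetical names, the mapping is a plain name-to-IRI map, and the six-digit zero padding is an assumption.

```clojure
(defn permanent-number
  "Numeric part of iri if it is prefix followed only by digits, else nil."
  [prefix iri]
  (when (.startsWith ^String iri prefix)
    (let [suffix (subs iri (count prefix))]
      (when (re-matches #"\d+" suffix)
        (Long/parseLong suffix)))))

(defn mint-permanent
  "Replace every non-permanent IRI in mapping with the next free number
  after the largest one already in use. Names are processed alphabetically
  so that the result is reproducible."
  [prefix mapping]
  (let [largest (reduce max 0 (keep #(permanent-number prefix %) (vals mapping)))
        pending (sort (for [[entity-name iri] mapping
                            :when (nil? (permanent-number prefix iri))]
                        entity-name))
        fresh   (zipmap pending
                        (map #(format "%s%06d" prefix %)
                             (iterate inc (inc largest))))]
    (merge mapping fresh)))
```

Because the next number depends on the largest one present, two developers running this independently would hand out the same numbers, which is the co-ordination problem mentioned above.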

I think this workflow makes sense, but only use in practice will show for sure. If the requirement for co-ordination over minting of real IDs is problematic, then URIgen would provide a nice solution. I can also see problems with my use of props files; I have sorted them numerically which makes them easier to read (and predictably ordered), but this has the disadvantage that changes are likely to happen near the end, which is likely to result in conflicts. While these would be relatively simple conflicts, merging is necessarily painful. This could be avoided by storing permanent IDs in one file, and UUIDs in per-developer files.

This is the last feature I am planning to add to the current iteration of tawny; I want to complete the documentation for all functions (this has already been done for owl.clj, but not the other namespaces), and the tutorial. For the 0.12 cycle, I plan to make tawny complete for OWL2 (basically, this means adding datatypes).

This article describes a SNAPSHOT of tawny, available on github (https://github.com/phillord/tawny-owl). All the examples shown here come from (yet another!) version of the pizza ontology, also available on github (https://github.com/phillord/tawny-obo-pizza).



Abstract

The Tawny-OWL library provides a fully-programmatic environment for ontology building; it enables the use of a rich set of tools for ontology development, by recasting development as a form of programming. It is built in Clojure, a modern Lisp dialect, and is backed by the OWL API. Used simply, it has a similar syntax to OWL Manchester syntax, but it provides arbitrary extensibility and abstraction. It builds on existing facilities for Clojure, which provides a rich and modern programming tool chain, for versioning, distributed development, build, testing and continuous integration. In this paper, we describe the library, this environment and its potential implications for the ontology development process.

  • Phillip Lord

Plain English Summary

In this paper, I describe some new software, called Tawny-OWL, that addresses the issue of building ontologies. An ontology is a formal hierarchy, which can be used to describe different parts of the world, including biology which is my main interest.

Building ontologies in any form is hard, but many ontologies are repetitive, having many similar terms. Current ontology building tools tend to require a significant amount of manual intervention. Rather than look to creating new tools, Tawny-OWL is a library written in a full programming language, which helps to redefine the problem of ontology building to one of programming. Instead of building new ontology tools, the hope is that Tawny-OWL will enable ontology builders to just use existing tools that are designed for general purpose programming. As there are many more people involved in general programming, many tools already exist and are very advanced.

This is the first paper on the topic, although it has been discussed before here.

This paper was written for the OWLED workshop in 2013.


Reviews

Reviews are posted here with the kind permission of the reviewers. Reviewers are identified or remain anonymous (also to myself) at their option. Copyright of the review remains with the reviewer and is not subject to the overall blog license. Reviews may not relate to the latest version of this paper.

Review 1

The given paper is a solid presentation of a system for supporting the development of ontologies – and therefore not really a scientific/research paper.

It describes Tawny OWL in a sufficiently comprehensive and detailed fashion to understand both the rationale behind as well as the functioning of that system. The text itself is well written and also well structured. Further, the combination of the descriptive text in conjunction with the given (code) examples make the different functionality highlights of Tawny OWL very easy to grasp and appraise.

As another big plus of this paper, I see the availability of all source code which supports the fact that the system is indeed actually available – instead of being just another description of a “hidden” research system.

The possibility to integrate Tawny OWL in a common (programming) environment, the abstraction level support, the modularity and the testing “framework” along with its straightforward syntax make it indeed very appealing and sophisticated.

But the just said comes with a little warning: My above judgment (especially the last comment) are highly biased by the fact that I am also a software developer. And thus I do not know how much the above would apply to non-programmers as well.

And along with the above warning, I actually see a (more global) problem with the proposed approach to ontology development: The mentioned “waterfall methodologies” are still most often used for creating ontologies (at least in the field of biomedical ontologies) and thus I wonder how much programmatic approaches, as implemented by Tawny OWL, will be adapted in the future. Or in which way they might get somehow integrated in those methodologies.

Review 2

This review is by Bijan Parsia.

This paper presents a toolkit for OWL manipulation based on Clojure. The library is interesting enough, although hardly innovative. The paper definitely oversells it while neglecting details of interest (e.g., size, facilities, etc.). It also neglects relevant related work, Thea-OWL, InfixOWL, even KRSS, KIF, SXML, etc.

I would like to see some discussion of the challenges of making an effective DSL for OWL, esp. when you incorporate higher abstractions. For example, how do I check that a generative function for a set of axioms will always generate an OWL DL ontology? (That seems to be the biggest programming language theoretic challenge.)

Some of the discussion is rather cavalier as well, e.g.,

“Alternatively, the ContentCVS system does support offline concurrent modification. It uses the notion of structural equivalence for comparison and resolution of conflicts[4]; the authors argue that an ontology is a set of axioms. However, as the named suggests, their versioning system mirrors the capabilities of CVS - a client-server based system, which is now considered archaic.”

I mean, the interesting part of ContentCVS is the diffing algorithm (note that there’s a growing literature on diff in OWL). This paper focuses on the inessential aspect (i.e., really riffing off the name) and ignores the essential (i.e., what does diff mean). Worse, to the degree that it does focus on that, it only focuses on the set like nature of OWL according to the structural spec. The challenges of diffing OWL (e.g., if I delete an axiom have I actually deleted it) are ignored.

Finally, the structural specification defines an API for OWL. It would be nice to see a comparison and/or critique.

The Mercurial repository for KnowledgeBlog (http://knowledgeblog.org) has been starting to show the strain for a while now. Firstly, when it was created we were all new to Mercurial; for instance, it contains the trunk directory, which is really a Subversion metaphor. The second problem is that it is a single large repository, which maps to the development directory on my hard drive; there is now a lot of experimental software on my hard drive which I don’t want in a public environment, so I am now faced with either an enormous .hgignore or more “untracked” files than tracked. Not ideal.

At the same time, I have more recently moved mostly toward using git; actually, I still think Mercurial is nicer than git; the interface to the commands is cleaner, and the functionality is not that different. However, there is a fantastic UI, magit, for Emacs, while the equivalent for Mercurial is not as good. This is important to me. So, I wanted to try and address both of the issues at the same time: splitting the repository up, and moving to git.

The process for achieving this turned out to be relatively simple; Mercurial comes with a fantastic extension called convert. This is actually a general purpose extension to convert from other version control systems into Mercurial; however, it will also convert one hg repo to another. It has the ability to both filter the existing repo and rename locations at the same time. To create my new repository I used these commands:

mkdir mathjax-latex-hg
cd mathjax-latex-hg
## create filemap.txt
hg init
hg convert --filemap filemap.txt devel-hg-old/ .

which creates a new Mercurial repository, and converts the data from the old, tangled repository. The filemap.txt file contains a couple of lines only:

include trunk/plugins/mathjax-latex
rename trunk/plugins/mathjax-latex .

These filter for just the mathjax-latex plugin and move all its files to top level. This is the only part of the process that needs changing to export different parts of the repo, as I have done four times now. This now gives me a Mercurial repository in the right shape. Now, we create a new git repo, and import the untangled Mercurial repo into git. Again, reasonably straightforward. hg-fast-export is the name of the command on Ubuntu, which is more sensible than the original fast-export, which is both overly generic and a hostage to the future.

cd ..
git init mathjax-latex-git
cd mathjax-latex-git
hg-fast-export -r ../mathjax-latex-hg
git checkout HEAD

Finally, the repo needs to be made publicly available, in this case on github. And all is complete.

git remote add origin git@github.com:phillord/mathjax-latex.git
git push -u origin master

Of course, mathjax-latex does not actually need updating, because it is feature complete and working. However, the WordPress plugin page now includes a nasty warning, so I probably need to update it just to avoid this. Bit of a pain, especially as the only way of doing this involves updating the Subversion repository, which I don’t actually use.


Tawny OWL, my library for building ontologies (http://www.russet.org.uk/blog/2254) is now reaching a nice stage of maturity; it is possible to build ontologies, reason over them and so forth. We have already started to use the programmable nature of Tawny, trivially with disjoints (http://www.russet.org.uk/blog/2275), as well as allowing the ontology developer to choose the identifiers that they use to interact with the concepts (http://www.russet.org.uk/blog/2303). However, I wanted to explore further the usefulness of a programmatic environment.

One standard facility present in most languages is a test harness, and Clojure is no exception in this regard. Tawny already comes with a set of predicates for testing superclasses, both asserted and inferred, which provides a good basis for unit testing. So, this example using my test Pizza ontology essentially tests the definition of CheesyPizza; it does so in both a positive and a negative sense.

(deftest CheesyShort
  (is (r/isuperclass? p/FourCheesePizza p/CheesyPizza))
  (is (r/isuperclass? p/MargheritaPizza p/CheesyPizza))
  (is
   (not (r/isuperclass? p/MargheritaPizza p/FourCheesePizza))))

While this is nice, it is not enough in some cases where I want to test that things do not happen. For this I introduce a new macro, with-probe-entities, which adds “probe classes” into the ontology; that is, classes which are there only for the purpose of a test. In this case, I test the definition of VegetarianPizza to see whether MargheritaPizza reasons correctly as a subclass. Additionally, though, I also check to see whether a subclass of VegetarianPizza and CajunPizza (which contains sausage) is inconsistent. This test could be more specific, as it tests for general coherency, although I do check for this independently. The with-probe-entities macro cleans up after itself: all entities (which can be of any kind and not just classes) are removed from the ontology afterwards, so independence of testing is not compromised.

(deftest VegetarianPizza
  (is
   (r/isuperclass? p/MargheritaPizza p/VegetarianPizza))

  (is
   (not
    (o/with-probe-entities
      [c (o/owlclass "probe"
                     :subclass p/VegetarianPizza p/CajunPizza)]
      (r/coherent?)))))
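The clean-up behaviour of a probe macro can be sketched with try and finally. This is a simplified stand-in rather than the real with-probe-entities: the `ontology` atom holding a set of names, and the single-entity `with-probe` macro, are assumptions made so that the example runs without the OWL API.

```clojure
;; A toy "ontology": just a set of entity names held in an atom.
(def ontology (atom #{"MargheritaPizza" "CajunPizza"}))

(defmacro with-probe
  "Add entity to the toy ontology, evaluate body, then remove the entity
  again, even if body throws."
  [entity & body]
  `(let [e# ~entity]
     (swap! ontology conj e#)
     (try
       ~@body
       (finally
         (swap! ontology disj e#)))))

;; The probe is visible inside the body and gone afterwards.
(with-probe "probe"
  (contains? @ontology "probe"))
```

Putting the removal in `finally` is what keeps tests independent in this sketch: a failing or throwing body still leaves the ontology as it found it.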

Of course, a natural consequence of the addition of tests is the desire to run them frequently; moreover, the desire to run them in a clean environment. The solution to this turns out to be simple. Travis-CI integrates nicely with github, so the addition of a simple YAML file of this form enables Continuous Integration of both the Pizza ontology and the environment (such as Tawny, for instance).

language: clojure
lein: lein2
jdk:
  - openjdk7

The output of this process is available for all to read, along with the tests for my mavenized version of Hermit, and also tawny itself. This is not the first time that ontologies have been continuously integrated (http://bio-ontologies.knowledgeblog.org/405); however, the nice advantage of this is that I have not had to install anything. It even works against external ontologies: so we have both GO and OBI. Currently, these work against static versions of GO and OBI. I could automate this process from the respective repositories of these projects, by pulling with git-svn and pushing again to github.

All in all, though, the process of recasting ontology building as a programming task is turning out to be an interesting experience. Much of the tooling that enables collaborative ontology building just works. It holds much promise for the future.


I have been working on a Clojure library for developing OWL ontologies (http://www.russet.org.uk/blog/2214). There have been two significant advances with this library recently. First, I have changed its name from clojure-owl to tawny-owl. I was never really happy with the original name; I think it is bad practice to name something after the language it uses (even partly, as the many jlibraries attest), and there were several other libraries around for manipulating OWL in clojure, albeit in different ways. “Tawny” is simple, straightforward and memorable, I think. At the same time, I moved to Github because I can now just update readme.md, rather than having to update a separate website.

Perhaps, more importantly, I have put in new code for handling change to external ontologies, which is particularly important for external libraries.

Throughout the development of tawny-owl, I have focused on providing an environment that is easy to use for the developer; so, classes, properties and other entities are represented as lisp symbols (http://www.russet.org.uk/blog/2254). This works well and produces very attractive looking code in, for example, my version of the Pizza ontology. I have also written code so that ontologies only available as OWL files can be treated as first-class citizens: very easy in a highly dynamic language like Lisp.

However, it causes problems when combined with an ontology such as OBI (10.1186/2041-1480-1-S1-S7). The difficulty here is that OBI uses semantics-free identifiers (http://www.russet.org.uk/blog/2040). While there are some good reasons for this, it would result in Clojure of the form:

(defclass OBI:0034322
     :subclass OBI:0034321)

Clearly, this is not good, and something that I want to avoid. So, instead, we apply a transform function to OBI when importing it; basically, this munges the rdfs:label annotation, turning it into something that is a legal Clojure symbol.

  :transform
  ;; fix the space problem
  (fn [e]
    (clojure.string/replace
     ;; with luck these will always be literals, so we can do this
     ;; although not true in general
     (.getLiteral
      ;; get the value of the annotation
      (.getValue
       (first
        ;; filter for annotations which are labels
        ;; is lazy, so doesn't eval all
        (filter
         #(.. % (getProperty) (isLabel))
         ;; get the annotations
         (.getAnnotations e
                          (owl.owl/get-current-jontology))))))
     #"[ /]" "_"
     ))
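The string handling at the heart of this transform can be tried on plain labels, without any OWL objects. The helper name `label->symbol-name` and the example labels are made up for illustration; only the regular expression is taken from the code above.

```clojure
(require '[clojure.string :as string])

(defn label->symbol-name
  "Replace spaces and slashes with underscores, as the transform above does.
  Other characters that are awkward in a Clojure symbol are left untouched."
  [label]
  (string/replace label #"[ /]" "_"))

(label->symbol-name "material entity")   ; spaces become underscores
(label->symbol-name "input/output role") ; so do slashes
```

Labels containing parentheses, or starting with a digit, would need further handling that this sketch does not attempt.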

All well and good. However, there is a problem. The label in OBI has two characteristics. First, it is human readable, which is good, and the reason why we are using it. Second, however, it does not carry formal semantics; the developers are free to change these labels whenever they like. Of course, any ontology that I build against my tawnyized version of OBI will break, because the label has changed. This is not a problem for a GUI like protege, because, perhaps ironically, GUIs are not WYSIWYG: what you see is actually a view of the underlying datamodel. So, protege shows you the label, but actually you are manipulating the URI. A dependency can change its labels, and when Protege reloads it, this is what the developer will see.

With code, on the other hand, there is no separation at all. If the label changes, I will have to update anything that refers to this, which seems a substantial problem. However, I have now managed to work around this. My new library memorise saves all the mappings into a file, then restores them when OBI is loaded. Any old labels that no longer exist but which point to an IRI that still does exist are generated as duplicate symbols pointing to the same OWL object; however, I have done this in a way that they will emit warnings both when loading, and during use, with a description of the new symbol name. This data would also make automatic upgrading possible, of course, using Clojure to perform a big search and replace on the source code. I think that this is a nicer solution than the denormalisation (http://www.russet.org.uk/blog/1470) or “colour cube” solution (http://www.russet.org.uk/blog/2040) that I previously suggested for Manchester syntax. It also shows off the advantage of using a programming language, rather than a static format; I, or any other user of the library, can just add this, as I choose, without having to wait for a standardisation process and tool support to catch up.
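The comparison between the saved and the freshly loaded mappings can be sketched with plain maps. This is an assumption-laden illustration, not the memorise code: `stale-aliases` and `deprecation-message` are hypothetical names, both mappings go from symbol name to IRI, and each IRI is assumed to have exactly one current symbol.

```clojure
(require '[clojure.set :as set])

(defn stale-aliases
  "Map each old symbol that has disappeared to the new symbol now attached
  to the same IRI. Old symbols whose IRI is gone entirely are ignored."
  [old-mapping new-mapping]
  (let [iri->new (set/map-invert new-mapping)]
    (into {}
          (for [[old-sym iri] old-mapping
                :let [new-sym (get iri->new iri)]
                :when (and new-sym
                           (not (contains? new-mapping old-sym)))]
            [old-sym new-sym]))))

(defn deprecation-message
  "Text of the warning to show whenever an old symbol is used."
  [old-sym new-sym]
  (str "'" old-sym "' is deprecated; use '" new-sym "' instead"))
```

The resulting map is also the data a search-and-replace upgrade would need, since it pairs every outdated name with its replacement.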

This will still leave a secondary problem; it is dependent on the IRI which for pre-release versions of OBI is not fixed, as documented. Of course, this problem could go away, if OBI used a tool like URIGen, or alternatively if OBI released more regularly. Still, the data should also allow a reverse lookup — finding out what IRI a label now has.

I think these are the main tools that are needed to build against an external resource. The 0.5 version of Tawny is now available on Clojars and Github.
