An Exercise in Irrelevance - Supporting OBO style identifiers in Tawny

Tawny-OWL (n.d.a) is a library which enables the programmatic construction of OWL (n.d.b) One of the limitations with tawny as it stands is that it did not implement numeric, semantics free identifiers (n.d.c) tawny builds identifiers from the clojure symbols used to describe the class. So, in my pizza ontology, for instance, PizzaTopping gets an iri ending in PizzaTopping. Semantics free identifiers have some significant advantages; the principle one is that the establish an identity for an object which can persist even if the properties (the labels for instance) change, as I have described previously (n.d.d)

However, semantics-free identifiers do not come for free; they also have significant disadvantages, mainly that they make the life of developers harder and code less readable (n.d.c) I’ve previously suggested solutions to this problem when it afflicts OWL Manchester syntax (n.d.e)

With tawny, the IRIs that are used to identify concepts can easily be separated from the clojure symbols that are used to identify them; the initial link between them was simply one of convienience. So supporting numeric IRIs was possible with very little adjustment of the core owl.clj required one fixed function call to become a call to a first-class function.

One of purposes of tawny is to enable to a more agile development methodology than we have at present, so clearly I did not want the developer to have to manage this process by hand. Moreover, as recent discussions on the OBI mailing list, the issue of co-ordination of identifiers can be a significant difficult. As James Malone has recently described, there the URIgen tool offers a solution to this problem (n.d.f) Simon Jupp who is the primary developer of URIgen kindly discussed the details with me, which has helped me form my ideas about a suitable workflow, and I have borrowed heavily from URIgen (and the protege plugin) for this. While I will probably implement a URIgen client for tawny in the future, my initial approach uses a slightly different idea. In general, with tawny, I have been advocating using standard software development tools, instead of specific ontology ones (n.d.b) rather than co-ordinating developers through the use of a centralised server, it seems to me to make more sense to use whatever version control system. To that end, I have implemented a file based system for storing identifiers; given that most bio-ontologies remain under the 50,000 terms size, I think that this is plausible, especially as it is simply in tawny to modularise the source (if not the ontology which remains a hard research problem). In this case, I have used a properties files, since it is a simple and human-readable format.

This works as follows. First, we define a new ontology, with an iri-gen frame, which use the obo-iri-generate function. Of course, this is generic so it is possible to use arbitrary strategies for generating an IRI.

(defontology pizzaontology
  :iri "http://www.ncl.ac.uk/pizza-obo"
  :prefix "piz:"
  :comment "An example pizza using OBO style ids"
  :versioninfo "Unreleased Version"
  :annotation (seealso "Manchester Version")
  :iri-gen tawny.obo/obo-iri-generate
  )

Next, we need to restore the mapping between names and IRIs. We need to do this before we create any classes. In the first instance, this file will be empty, and will contain no mappings; this is not problematic.

(tawny.obo/obo-restore-iri "./src/tawny/obo/pizza/pizza_iri.props")

Now, we define concepts, properties and so forth as normal.

(defclass CheeseTopping
  :label "Cheese Topping")
(defclass MeatTopping
  :label "Meat Topping")

The difference in how the IRI is created should be transparent to the developer at this point. Behind the scenes were are using this logic.

(defn obo-iri-generate-or-retrieve
  [name remembered current]
  (or (get remembered name)
      (get current name)
      (str obo-pre-iri "#"
           (java.util.UUID/randomUUID))))

Or, in English: if the name (“CheeseTopping”) has been stored in our properties file, use this IRI; or if the name has already been used in the current session use this IRI, failing that, create a random UUID. I have used a UUID rather than autominting new identifiers because tawny is programmatic; it is very easy to create 1000 concepts where you meant to create 10 which would result in a lot of new identifiers. It makes more sense to mint permanent identifiers explicitly, as part of a release process.

This also works for programmatic use of tawny, regardless of whether concepts are added to the local namespace. This code creates many classes all at once, but does not add them to the namespace. Their IDs will still be stored.

(doseq [n (map #(str "n" %) (range 1 20))]
  (owlclass n)
   )

Finally, we need to store the IRIs we have created. Both full IDs and UUIDs are stored; so new classes will get a random UUID, but it will persist over time, providing some interoperability with external users who can use the short-term identifier in the knowledge that it may change.

(tawny.obo/obo-store-iri "./src/tawny/obo/pizza/pizza_iri.props")

At the same time, we report obsolete terms. These are those with permanent identifers, which are present in the properties file, but have not been created in the current file. Currently, these are just printed to screen, but I could generate classes and place them under an “obsolete” superclass.

(tawny.obo/obo-report-obsolete)

Finally, at release point, a single function is called to generate the new IDs. This is done numerically, starting from the largest ID. If there are multiple developers, this step has to be co-ordinated, or it is going to break; but this is little different from a release point of any software project.

(tawny.obo/obo-generate-permanent-iri "./src/tawny/obo/pizza/pizza_iri.props" "http://www.ncl.ac.uk/pizza-obo/PIZZA_")

I think this workflow makes sense, but only use in practice will show for sure. If the requirement for co-ordination over minting of real IDs is problematic, then URIgen would provide a nice solution. I can also see problems with my use of props files; I have sorted them numerically which makes them easier to read (and predicatably ordered), but this has the disadvantage that changes are likely to happen near the end, which is likely to result in conflicts. While these would be relatively simple conflicts, merging is necessarily painful. This could be avoiding by storing permanent IDs in one file, and UUIDs in per-developer files.

This is the last feature I am planning to add to the current iteration of tawny; I want to complete the documentation for all functions (this has already been done for owl.clj, but not the other namespaces), and the tutorial. For the 0.12 cycle, I plan to make tawny complete for OWL2 (basically, this means adding datatypes).

This articles describes a SNAPSHOT of tawny, available on github (https://github.com/phillord/tawny-owl). All the examples shown here, come from (yet another!) version of the pizza ontology, also available on github (https://github.com/phillord/tawny-obo-pizza).

n.d.b. https://www.russet.org.uk/blog/2366.

———. n.d.a. https://www.russet.org.uk/blog/2214.

———. n.d.c. https://www.russet.org.uk/blog/2040.

———. n.d.d. https://www.russet.org.uk/blog/1908.

———. n.d.e. https://www.russet.org.uk/blog/1470.

———. n.d.f. https://jamesmaloneebi.blogspot.co.uk/2013/04/keeping-it-agile-secret-to-fitter.html.