## Archive for the ‘Tech’ Category

### Further Experiments with Literate Programming

Literate programming comes in many forms and disguises but is essentially the notion that the documentation and programmatic code should be written together, so that the documentation supports the code and vice versa. In this post, I discuss some of the problems with literate programming, my early attempts to circumvent these with respect to ontology development. Finally, I finish up with a description of some new technology which, I think, offers a solution.

## Literate Programming for Ontologies

The reality is, I think, that literate programming has never really take off; there are a large number of reasons for this of course. Code does not naturally have an linear narrative and is not necessarily read in this way: rather, when read by an experienced programmer, they often track the flow of execution through the code . A secondary problem is apparently quite trivial but the editing environment for literate programmes tends to be poor. I cannot find any good research on this, but this is both my experience and that of others .

For ontology development, I think a literate approach seems to make more sense. Again, in my experience, ontologies do have a somewhat more narrative approach than code — at least in the sense that the lack loops and the like.

## Initial Approaches

I have now been experimenting with literate techniques since 2009. The first version used a single latex file, and pulled these out into a Manchester syntax file . This worked quite nicely but suffered from the poor editor problem: I was building ontologies embedded in LaTeX, so lacked even the basic features (such as syntax highlighting) that I got when editing Manchester syntax files directly. This was a problem even with the very limited feature set from tools like omn-mode.el. The disadvantage would have been worse if I had been used to a richer environment for Manchester syntax.

My second attempt was took the opposite approach; now I used two files — a Manchester syntax file and a LaTeX one with a method for referring between the two . This worked okay but had a poor implementation which I later refined .

These approaches have their advantages but do both suffer from a poor editing environment; either in having two files to switch and link between, or favouring documentation over ontology or vice versa. They also suffered from a secondary issue, which is that they are based around Manchester syntax. While this is nice enough, since writing Tawny-OWL , this style of ontology development just feels not rich enough.

## Marginalia

One of the declared advantages of using a real programming language as the basic for Tawny-OWL was the ability to use the tools from that language; I have used a number of these both within Tawny-OWL and with ontologies written with Tawny: mostly obviously, the test environment, but also serialisation, properties support and, of course, the entire editing environment.

This raises the question as to whether I could use literate programming tools from Clojure as well. To my knowledge, the only real option in town here is Marginalia. Marginalia uses markdown as the documentation format and builds a nice presentation with code on one side, and comments on the other.

However, it has problems. Firstly, it presents all comments as text — you cannot comment the comments as it were which is irritating for boilerplate such as licence text. Secondly, the side-by-side presentation breaks the flow of reading as you have to move your eyes around the screen all the time. And, finally, it’s Markdown. While Markdown is nice at what it does, it’s very limited, and I missed the extra power of something like LaTeX.

The main difficulty, though, remains the editing environment. Without special support, while editing the comments show up as just comments. I can never remember the order of brackets in Markdown links — I rely on syntax highlighting to tell me that I have it correct.

Is there a way that Clojure and LaTeX can be made to work together?

My first thought, in experimenting with LaTeX was a remarkably cheap and cheerful one. Consider a document such as this:

 ;; \documentclass{article} ;; \begin{document} ;; \begin{code} (println "hello world") ;; \end{code} ;; \end{document}

This is a valid Clojure file, and is nearly valid latex as well. The only illegal part is that ;; occurs before the documentclass macro, although, in practice having ;; appear randomly throughout the document would not be ideal either.

Now, LaTeX as an embedded markup language has a very plastic syntax, and I have used that in this case. It is actually very easy to just ignore the ; character entirely, through the use of Catcodes; we can put this into a driver file which then inputs our Clojure file like so:

 \catcode;=9 \input{file.clj}

This way we maintain the validity of our Clojure file (otherwise the first line would be illegal). This is a remarkably cheap and cheerful way of achieving our aims; albeit at the cost of losing the ability to use semi-colons in our writing.

## Indirect-buffers

What, however, about the editing environment. My own preferred environment — Emacs — has nice modes which edit both LaTeX and Clojure code, and it is possible to switch between the two, when I want to move between editing code and editing documentation. This is quite clunky, but there is a second option which is “indirect-buffers”. This is a piece of Emacs arcana where two buffers share some of the same data structures but not all, which means that they can have different modes. Unfortunately, my experience is that the buffers share too much — as well as the text, they also share “text-properties” which unfortunately both LaTeX and Clojure mode use. In practice, this means syntax highlighting fails (or rather than two representations fight with each other). As a second problem, although the file is valid LaTeX it is not normal LaTeX; simply things like wrapping text in paragraphs fails because of the ;; comments at the beginning of each line.

So, this experiment fails the editing environment test.

My next attempt was to use block comments. Consider this file which is valid lisp using #| and |# block comments.

 #| \documentclass{article} \begin{document} \begin{code} |# (println "hello world") #| \end{code} \end{document} |#

We can use a similar (but not identical) trick with catcodes to make this valid latex also:

 \catcode#=\active \def#{\catcode#=6} \catcode|=\active \def|{\catcode|=12} \input{hello_world.lisp}

The first call makes the # character active — that is, definable as a macro. We then define # as a macro which will set the catcode of # to 6 (which is it’s default). Then, we do the same with |. The practical upshot of this is that the opening #| does nothing other than reset everything in the driver file; effectively it’s ignored.

This actually works quite nicely in the editing environment; the opening #| effectively makes no difference to Emacs, and the mode works well. The only real disadvantage is that every code block needs two delimiters — one to open the code block in latex, and end the comment in Lisp.

Now there are various multi-mode tools around for Emacs which should help solve the otherwise clunky editing environment, although even here I am not convinced that this is the right route. Multi-mode tools are complex and to some extent are not what I want — when editing code I want to suppress the documentation, give it a low visual immediacy, and when editing documentation, I want the reverse.

There is, however, a bigger problem — while the last example is valid Common Lisp, Clojure does not have block comments, nor does the programmer have the ability to extend the reader in this way. So, while this seems a nice solution, it depends on a specific language feature which Clojure lacks.

## Emacs Experiments: formats

My next idea was to use formats. Emacs allows transformations to happen between the text that is visualised on screen, and how it is saved to file. The main reason for this is to support the many non-ascii text formats that exist. But it is (perhaps unsurprisingly) fully extensible within Emacs and could be used for any purpose. So, why not convert line-commented Clojure on file into block-comments on screen; this will give editable latex on screen and valid Clojure on file and a driver file to give valid Latex on file also.

Unfortunately, it fails. While Emacs latex support is file based, Clojure (and specifically cider) has a tighter integration; it can communicate the contents of a buffer without saving to file. This circumvents the formatting — the block comments are sent to Clojure process which complains.

I am now experimenting with another option. indirect-buffers place exactly the same text (and text-properties) into two buffers. Instead of sharing all the text, why not have two buffers with a function that can transform the text bi-directionally between the two. The practical result is two views over the same content. Surprisingly, this works pretty well, as you can see here even though my current implementation is very simple — the whole buffer is copied every keypress. We could achieve the same thing with indirect-buffers, but as well as simple copying, however, we can also transform the text on the fly so that both buffers are valid for their respective modes.

The broad idea is not that new — it’s similar to web/weave or SWeave for instance, except that it is embedded into the editor; this means we can take advantage of existing support for both languages from the editor and the author gets immediate feedback about the transformation — so messing up the syntax is pretty obvious.

It also provides a superset of functionality provided by other techniques: indirect-buffers as mentioned previously, shadowfile.el (which creates a second copy of a file somewhere else on every save), and it could also mimic shadow.el which generates a secondary file by a command invocation on every save (although an invocation of an external command every keypress would probably not be performant).

The first release of linked-buffer was a month ago. I am currently unhappy with the configuration and will change this so code is in flux at the moment, but I am using it in anger which is a good sign. Currently, it does a latex <-> clojure transformation, but I will add a few more as time goes on.

## Discussion

It has taken me quite a while to get to this stage, and a number of experiments along the way, but my feeling is that I now have a workable literate environment. It also validates my decision to build Tawny . Having a rich textual language for building ontologies is a bit of a game changer; providing programmatic extensions to the language has been helpful, but the access to other tools, git, travis, tests and a repl has really made the difference. Now adding a literate environment to this as well changes the way that I can use ontologies and is a paradigm shift in their development.

## Bibliography

### Tawny 1.0

I am please to announce the first full release of Tawny-OWL, my library for fully programmatic development of OWL ontologies. The library now has a fairly large feature set:

• Complete support for OWL2

• Integrated support for reasoning with HermiT or ELK

• Profile checking

• Fixtures and support macros for unit testing

• Use of external ontologies available only as OWL files

• Rendering of OWL API objects to Tawny code.

• Support for generating and using ontologies with numeric IDs.

• Support for multilingual labels.

Additionally, I now have initial integration with Protege, described later.

The library is now available from clojars or on github.

• Background

A little over a year ago, I first described my experiments with building a programmatic environment for ontology construction . The need arose out of frustration with existing ontology tools; Protege, for example, provides a nice graphical environment, but it has many limitations. It does not easily allow automated generation of ontology entities, for example, and it also does not provide access to tools which are common place in an IDE: versioning tools, diffs, test cases and so forth. While ontology specific variations of these tools do exist, they were not as good as the ones I was used to use when programming.

Tawny seeks to bridge this gap, by using a full programmatic environment to generate OWL ontologies. I chose Clojure because of its syntactic plasticity; at its simplest, when using tawny, it does not feel like a programming language, just a syntax and evaluation engine for writing ontologies. However, the full power of the programming language is there and can be used when necessary .

Since the first blog post, there have now been a further 8, as well as three papers, describing tawny itself , the karyotype ontology and our use of patterns, higher-level abstractions within the karyotype ontology and applied to SIO. From an initial experiment, tawny has become a useful tool which we are using on a daily basis.

• In Early Release

Included in this release is our initial integration with Protege. Tawny builds on the OWL API, which is also the basis for Protege. I always assumed that Protege would be used to view OWL files generated by Tawny, but it is actually possible to integrate them much more comprehensively than this. It is now possible to directly manipulate the data structures of Protege using Tawny; in short, Protege can display what ever tawny has generated immediately, and without a file in between.

We have achieved this in two ways; firstly with protege-tawny which provides a command line environment directly inside Protege. This is useful, but does not provide the rich programmatic IDE that I want. However, the protege-nrepl environment allows exactly this; protege launches Clojure, and launches a NREPL server to which you connect with Emacs, Eclipse or any of the other Clojure IDEs. Finally, lein-sync allows syncing classpaths and dependencies with an existing Clojure project. The practical upshot can be seen in screencast; tawny can be used as normal, with Protege following.

Currently, Protege and Tawny use different versions of the OWL API, so while the protege-nrepl can be used with the current release of Protege, periodic crashes happen. In the meantime, a hand-built distribution of Protege is available including nrepl.

• For the Future

I have three main aims for the next few releases of Tawny. First, we need to provide access to explanation code; currently, this has to be accessed within Protege, which is less than ideal for a process than can take many minutes to run. I wish to integrate this with the Clojure unit test environment so that explanations will be generated by failing test cases.

Second, tawny currently allows the development of ontologies, but does not allow easy querying over them. I have several possibilities here: including integration of a SPARQL engine; the current rendering engine, combined perhaps with core.match, or, finally, fully-fledged support for core.match directly.

Finally, I wish to experiment with and add support for connection points , to better enable modular ontology development.

## Bibliography

### Tawny 0.12

I am pleased to announce the release of tawny-owl, Version 0.12.

This package allows users to construct OWL ontologies in a fully programmatic environment, namely Clojure. This means the user can take advantage of programmatic language to automate and abstract the ontology over the development process; also, rather than requiring the creation of ontology specific development environments, a normal programming IDE can be used; finally, a human readable text format means that we can integrate with the standard tooling for versioning and distributed development.

OWL is a W3C standard ontology representation language; an ontology is a fully computable set of statements, describing the things and their relationships. These statements can be reasoned over, inferences made and contradictions detected automatically using an off-the-shelf reasoner.

0.12 is planned to be the final, feature complete release before the 1.0 release. New features will not be added before 1.0. Key new features in 0.12 are:

• Complete support for OWL 2, include data types
• OWL documentation can be queries as normal clojure metadata
• New namespaces, query and fixture
• Completion of rendering functionality
• Regularisation of interfaces: where relevant functions now take an ontology as the first argument.
• Updated to Hermit 1.3.7.3, OWL API 3.4.5

Tawny is available at https://github.com/phillord/tawny-owl, or as a maven artifact from http://clojars.org. The development of tawny-owl is documented in my journal at http://www.russet.org.uk/blog/category/all/professional/tech/tawny-owl

### Data Properties in Tawny

Although it appears fairly innocuous, the last commit to tawny-owl seems momentus to me. While I still need to go through the spec line-by-line, and the code needs some clean up, this commit essentially represents the completion of the tawny.owl namespace; the addition of data properties and data types was the last part of the spec that I have to fulfil.

When I started off the tawny-owl library in October I was most interested in getting a test environment, and the ability to use a normal editor. Subsequently, and particularly in the course of writing up my first paper on this library , it became obvious to me that I needed to support all of OWL2. I think I have achieved my original design motivations and some more besides. I have also learned a lot about OWL, the OWL API and Manchester syntax. It is also a strange project, because it is the first time I have fulfilled a specification in quite this way. I cannot recall the last time I could reasonably be said to have finished something, as research is generally open-ended.

I did not, however, start with a regular syntax in mind. In general, the conversion to lisp has worked reasonably well: the object side of OWL in particular falls into a prefix, lisp syntax very naturally; the individual side less so. The data side of OWL had another surprise in store for me: it looks very similar to the object side; so I wanted to share syntax. However, all the Java method calls are named differently and take different types and parameter number.

In the end, I have supported this through a multi-method and some heuristics to guess which call is wanted. For instance, with these two calls from the pizza ontology:

 (owlsome hasTopping CheeseTopping) (owlsome hasCalorifiContentValue (span =< 400))

we generate quite different types of OWL object. The owlsome method defers to either object-some or data-some respectively, which can also be used directly. In this case, the difference is obvious; however, tawny also takes strings in most of these places; in this case, we convert to an IRI and check whether it exists in the ontology or any ontology we know about first. I suspect that these heuristics will work in most cases, but fail in some; only time and experience will tell me about these.

Before the next release, 0.12, I will finish both the inline, function documentation and update the tutorial. After this I plan to sit on the API a while, think about the functions and the syntax to make sure I am happy; the release after should be 1.0 and as is the way of these things, I will be stuck with the apperance of the API for quite a while. This also allows me to avoid a 0.13 release without accusation of superstition.

There are still many parts of tawny that I wish to improve; in particular, I need to extend the repl facilities with doc and apropos features — my attempt to hijack the Clojure native facilities have failed despite extensive efforts. And explanation code needs to go in; currently, waiting for protege to reason and produce these results in a soul destroying experience; I want me continuous integration tests to automatically dump explanations whenever inconsistencies happen.

But new features are for the future; for this iteration, tawny-owl is finished and now will be polished.

## Bibliography

### Supporting OBO style identifiers in Tawny

Tawny-OWL is a library which enables the programmatic construction of OWL . One of the limitations with tawny as it stands is that it did not implement numeric, semantics free identifiers ; tawny builds identifiers from the clojure symbols used to describe the class. So, in my pizza ontology, for instance, PizzaTopping gets an iri ending in PizzaTopping. Semantics free identifiers have some significant advantages; the principle one is that the establish an identity for an object which can persist even if the properties (the labels for instance) change, as I have described previously .

However, semantics-free identifiers do not come for free; they also have significant disadvantages, mainly that they make the life of developers harder and code less readable . I’ve previously suggested solutions to this problem when it afflicts OWL Manchester syntax .

With tawny, the IRIs that are used to identify concepts can easily be separated from the clojure symbols that are used to identify them; the initial link between them was simply one of convienience. So supporting numeric IRIs was possible with very little adjustment of the core owl.clj required one fixed function call to become a call to a first-class function.

One of purposes of tawny is to enable to a more agile development methodology than we have at present, so clearly I did not want the developer to have to manage this process by hand. Moreover, as recent discussions on the OBI mailing list, the issue of co-ordination of identifiers can be a significant difficult. As James Malone has recently described, there the URIgen tool offers a solution to this problem . Simon Jupp who is the primary developer of URIgen kindly discussed the details with me, which has helped me form my ideas about a suitable workflow, and I have borrowed heavily from URIgen (and the protege plugin) for this. While I will probably implement a URIgen client for tawny in the future, my initial approach uses a slightly different idea. In general, with tawny, I have been advocating using standard software development tools, instead of specific ontology ones ; rather than co-ordinating developers through the use of a centralised server, it seems to me to make more sense to use whatever version control system. To that end, I have implemented a file based system for storing identifiers; given that most bio-ontologies remain under the 50,000 terms size, I think that this is plausible, especially as it is simply in tawny to modularise the source (if not the ontology which remains a hard research problem). In this case, I have used a properties files, since it is a simple and human-readable format.

This works as follows. First, we define a new ontology, with an iri-gen frame, which use the obo-iri-generate function. Of course, this is generic so it is possible to use arbitrary strategies for generating an IRI.

 (defontology pizzaontology :iri "http://www.ncl.ac.uk/pizza-obo" :prefix "piz:" :comment "An example pizza using OBO style ids" :versioninfo "Unreleased Version" :annotation (seealso "Manchester Version") :iri-gen tawny.obo/obo-iri-generate )

Next, we need to restore the mapping between names and IRIs. We need to do this before we create any classes. In the first instance, this file will be empty, and will contain no mappings; this is not problematic.

 (tawny.obo/obo-restore-iri "./src/tawny/obo/pizza/pizza_iri.props")

Now, we define concepts, properties and so forth as normal.

 (defclass CheeseTopping :label "Cheese Topping") (defclass MeatTopping :label "Meat Topping")

The difference in how the IRI is created should be transparent to the developer at this point. Behind the scenes were are using this logic.

 (defn obo-iri-generate-or-retrieve [name remembered current] (or (get remembered name) (get current name) (str obo-pre-iri "#" (java.util.UUID/randomUUID))))

Or, in English: if the name (“CheeseTopping”) has been stored in our properties file, use this IRI; or if the name has already been used in the current session use this IRI, failing that, create a random UUID. I have used a UUID rather than autominting new identifiers because tawny is programmatic; it is very easy to create 1000 concepts where you meant to create 10 which would result in a lot of new identifiers. It makes more sense to mint permanent identifiers explicitly, as part of a release process.

This also works for programmatic use of tawny, regardless of whether concepts are added to the local namespace. This code creates many classes all at once, but does not add them to the namespace. Their IDs will still be stored.

 (doseq [n (map #(str "n" %) (range 1 20))] (owlclass n) )

Finally, we need to store the IRIs we have created. Both full IDs and UUIDs are stored; so new classes will get a random UUID, but it will persist over time, providing some interoperability with external users who can use the short-term identifier in the knowledge that it may change.

 (tawny.obo/obo-store-iri "./src/tawny/obo/pizza/pizza_iri.props")

At the same time, we report obsolete terms. These are those with permanent identifers, which are present in the properties file, but have not been created in the current file. Currently, these are just printed to screen, but I could generate classes and place them under an “obsolete” superclass.

 (tawny.obo/obo-report-obsolete)

Finally, at release point, a single function is called to generate the new IDs. This is done numerically, starting from the largest ID. If there are multiple developers, this step has to be co-ordinated, or it is going to break; but this is little different from a release point of any software project.

 (tawny.obo/obo-generate-permanent-iri "./src/tawny/obo/pizza/pizza_iri.props" "http://www.ncl.ac.uk/pizza-obo/PIZZA_")`

I think this workflow makes sense, but only use in practice will show for sure. If the requirement for co-ordination over minting of real IDs is problematic, then URIgen would provide a nice solution. I can also see problems with my use of props files; I have sorted them numerically which makes them easier to read (and predicatably ordered), but this has the disadvantage that changes are likely to happen near the end, which is likely to result in conflicts. While these would be relatively simple conflicts, merging is necessarily painful. This could be avoiding by storing permanent IDs in one file, and UUIDs in per-developer files.

This is the last feature I am planning to add to the current iteration of tawny; I want to complete the documentation for all functions (this has already been done for owl.clj, but not the other namespaces), and the tutorial. For the 0.12 cycle, I plan to make tawny complete for OWL2 (basically, this means adding datatypes).

This articles describes a SNAPSHOT of tawny, available on github (https://github.com/phillord/tawny-owl). All the examples shown here, come from (yet another!) version of the pizza ontology, also available on github (https://github.com/phillord/tawny-obo-pizza).

## Abstract

The Tawny-OWL library provides a fully-programmatic environment for ontology building; it enables the use of a rich set of tools for ontology development, by recasting development as a form of programming. It is built in Clojure - a modern Lisp dialect, and is backed by the OWL API. Used simply, it has a similar syntax to OWL Manchester syntax, but it provides arbitrary extensibility and abstraction. It builds on existing facilities for Clojure, which provides a rich and modern programming tool chain, for versioning, distributed development, build, testing and continuous integration. In this paper, we describe the library, this environment and the its potential implications for the ontology development process.

• Phillip Lord

## Plain English Summary

In this paper, I describe some new software, called Tawny-OWL, that addresses the issue of building ontologies. An ontology is a formal hierarchy, which can be used to describe different parts of the world, including biology which is my main interest.

Building ontologies in any form is hard, but many ontologies are repetitive, having many similar terms. Current ontology building tools tend to require a significant amount of manual intervention. Rather than look to creating new tools, Tawny-OWL is a library written in full programming language, which helps to redefine the problem of ontology building to one of programming. Instead of building new ontology tools, the hope is that Tawny-OWL will enable ontology builders to just use existing tools that are designed for general purpose programming. As there are many more people involved in general programming, many tools already exist and are very advanced.

This is the first paper on the topic, although it has been discussed before here.

This paper was written for the OWLED workshop in 2013.

## Reviews

Reviews are posted here with the kind permission of the reviewers. Reviewers are identified or remain anonymous (also to myself) at their option. Copyright of the review remains with the reviewer and is not subject to the overall blog license. Reviews may not relate to the latest version of this paper.

### Review 1

The given paper is a solid presentation of a system for supporting the development of ontologies – and therefore not really a scientific/research paper.

It describes Tawny OWL in a sufficiently comprehensive and detailed fashion to understand both the rationale behind as well as the functioning of that system. The text itself is well written and also well structured. Further, the combination of the descriptive text in conjunction with the given (code) examples make the different functionality highlights of Tawny OWL very easy to grasp and appraise.

As another big plus of this paper, I see the availability of all source code which supports the fact that the system is indeed actually available – instead of being just another description of a “hidden” research system.

The possibility to integrate Tawny OWL in a common (programming) environment, the abstraction level support, the modularity and the testing “framework” along with its straightforward syntax make it indeed very appealing and sophisticated.

But the just said comes with a little warning: My above judgment (especially the last comment) are highly biased by the fact that I am also a software developer. And thus I do not know how much the above would apply to non-programmers as well.

And along with the above warning, I actually see a (more global) problem with the proposed approach to ontology development: The mentioned “waterfall methodologies” are still most often used for creating ontologies (at least in the field of biomedical ontologies) and thus I wonder how much programmatic approaches, as implemented by Tawny OWL, will be adapted in the future. Or in which way they might get somehow integrated in those methodologies.

### Review 2

This review is by Bijan Parsia.

This paper presents a toolkit for OWL manipulation based on Clojure. The library is interesting enough, although hardly innovative. The paper definitely oversells it while neglecting details of interest (e.g., size, facilities, etc.). It also neglects relevant related work, Thea-OWL, InfixOWL, even KRSS, KIF, SXML, etc.

I would like to seem some discussion of the challenges of making an effect DSL for OWL esp. when you incorporate higher abstractions. For example, how do I check that a generative function for a set of axioms will always generate an OWL DL ontology? (That seems to be the biggest programming language theoretic challenge.)

Some of the dicussion is rather cavalier as well, e.g.,

“Alternatively, the ContentCVS system does support oine concurrent mod-ication. It uses the notion of structural equivalence for comparison and resolution of conflicts[4]; the authors argue that an ontology is a set of axioms. However, as the named suggests, their versioning system mirrors the capabilitiesof CVS { a client-server based system, which is now considered archaic.”

I mean, the interesting part of ContentCVS is the diffing algorithm (note that there’s a growing literature on diff in OWL). This paper focuses on the inessential aspect (i.e., really riffing off the name) and ignores the essential (i.e., what does diff mean). Worse, to the degree that it does focus on that, it only focuses on the set like nature of OWL according to the structural spec. The challenges of diffing OWL (e.g., if I delete an axiom have I actually deleted it) are ignored.

Finally, the structural specification defines an API for OWL. It would be nice to see a comparison and/or critique.