## Archive for the ‘Professional’ Category

### Further Experiments with Literate Programming

Literate programming comes in many forms and disguises but is essentially the notion that the documentation and programmatic code should be written together, so that the documentation supports the code and vice versa. In this post, I discuss some of the problems with literate programming, my early attempts to circumvent these with respect to ontology development. Finally, I finish up with a description of some new technology which, I think, offers a solution.

## Literate Programming for Ontologies

The reality is, I think, that literate programming has never really take off; there are a large number of reasons for this of course. Code does not naturally have an linear narrative and is not necessarily read in this way: rather, when read by an experienced programmer, they often track the flow of execution through the code . A secondary problem is apparently quite trivial but the editing environment for literate programmes tends to be poor. I cannot find any good research on this, but this is both my experience and that of others .

For ontology development, I think a literate approach seems to make more sense. Again, in my experience, ontologies do have a somewhat more narrative approach than code — at least in the sense that the lack loops and the like.

## Initial Approaches

I have now been experimenting with literate techniques since 2009. The first version used a single latex file, and pulled these out into a Manchester syntax file . This worked quite nicely but suffered from the poor editor problem: I was building ontologies embedded in LaTeX, so lacked even the basic features (such as syntax highlighting) that I got when editing Manchester syntax files directly. This was a problem even with the very limited feature set from tools like omn-mode.el. The disadvantage would have been worse if I had been used to a richer environment for Manchester syntax.

My second attempt was took the opposite approach; now I used two files — a Manchester syntax file and a LaTeX one with a method for referring between the two . This worked okay but had a poor implementation which I later refined .

These approaches have their advantages but do both suffer from a poor editing environment; either in having two files to switch and link between, or favouring documentation over ontology or vice versa. They also suffered from a secondary issue, which is that they are based around Manchester syntax. While this is nice enough, since writing Tawny-OWL , this style of ontology development just feels not rich enough.

## Marginalia

One of the declared advantages of using a real programming language as the basic for Tawny-OWL was the ability to use the tools from that language; I have used a number of these both within Tawny-OWL and with ontologies written with Tawny: mostly obviously, the test environment, but also serialisation, properties support and, of course, the entire editing environment.

This raises the question as to whether I could use literate programming tools from Clojure as well. To my knowledge, the only real option in town here is Marginalia. Marginalia uses markdown as the documentation format and builds a nice presentation with code on one side, and comments on the other.

However, it has problems. Firstly, it presents all comments as text — you cannot comment the comments as it were which is irritating for boilerplate such as licence text. Secondly, the side-by-side presentation breaks the flow of reading as you have to move your eyes around the screen all the time. And, finally, it’s Markdown. While Markdown is nice at what it does, it’s very limited, and I missed the extra power of something like LaTeX.

The main difficulty, though, remains the editing environment. Without special support, while editing the comments show up as just comments. I can never remember the order of brackets in Markdown links — I rely on syntax highlighting to tell me that I have it correct.

Is there a way that Clojure and LaTeX can be made to work together?

My first thought, in experimenting with LaTeX was a remarkably cheap and cheerful one. Consider a document such as this:

 ;; \documentclass{article} ;; \begin{document} ;; \begin{code} (println "hello world") ;; \end{code} ;; \end{document}

This is a valid Clojure file, and is nearly valid latex as well. The only illegal part is that ;; occurs before the documentclass macro, although, in practice having ;; appear randomly throughout the document would not be ideal either.

Now, LaTeX as an embedded markup language has a very plastic syntax, and I have used that in this case. It is actually very easy to just ignore the ; character entirely, through the use of Catcodes; we can put this into a driver file which then inputs our Clojure file like so:

 \catcode;=9 \input{file.clj}

This way we maintain the validity of our Clojure file (otherwise the first line would be illegal). This is a remarkably cheap and cheerful way of achieving our aims; albeit at the cost of losing the ability to use semi-colons in our writing.

## Indirect-buffers

What, however, about the editing environment. My own preferred environment — Emacs — has nice modes which edit both LaTeX and Clojure code, and it is possible to switch between the two, when I want to move between editing code and editing documentation. This is quite clunky, but there is a second option which is “indirect-buffers”. This is a piece of Emacs arcana where two buffers share some of the same data structures but not all, which means that they can have different modes. Unfortunately, my experience is that the buffers share too much — as well as the text, they also share “text-properties” which unfortunately both LaTeX and Clojure mode use. In practice, this means syntax highlighting fails (or rather than two representations fight with each other). As a second problem, although the file is valid LaTeX it is not normal LaTeX; simply things like wrapping text in paragraphs fails because of the ;; comments at the beginning of each line.

So, this experiment fails the editing environment test.

My next attempt was to use block comments. Consider this file which is valid lisp using #| and |# block comments.

 #| \documentclass{article} \begin{document} \begin{code} |# (println "hello world") #| \end{code} \end{document} |#

We can use a similar (but not identical) trick with catcodes to make this valid latex also:

 \catcode#=\active \def#{\catcode#=6} \catcode|=\active \def|{\catcode|=12} \input{hello_world.lisp}

The first call makes the # character active — that is, definable as a macro. We then define # as a macro which will set the catcode of # to 6 (which is it’s default). Then, we do the same with |. The practical upshot of this is that the opening #| does nothing other than reset everything in the driver file; effectively it’s ignored.

This actually works quite nicely in the editing environment; the opening #| effectively makes no difference to Emacs, and the mode works well. The only real disadvantage is that every code block needs two delimiters — one to open the code block in latex, and end the comment in Lisp.

Now there are various multi-mode tools around for Emacs which should help solve the otherwise clunky editing environment, although even here I am not convinced that this is the right route. Multi-mode tools are complex and to some extent are not what I want — when editing code I want to suppress the documentation, give it a low visual immediacy, and when editing documentation, I want the reverse.

There is, however, a bigger problem — while the last example is valid Common Lisp, Clojure does not have block comments, nor does the programmer have the ability to extend the reader in this way. So, while this seems a nice solution, it depends on a specific language feature which Clojure lacks.

## Emacs Experiments: formats

My next idea was to use formats. Emacs allows transformations to happen between the text that is visualised on screen, and how it is saved to file. The main reason for this is to support the many non-ascii text formats that exist. But it is (perhaps unsurprisingly) fully extensible within Emacs and could be used for any purpose. So, why not convert line-commented Clojure on file into block-comments on screen; this will give editable latex on screen and valid Clojure on file and a driver file to give valid Latex on file also.

Unfortunately, it fails. While Emacs latex support is file based, Clojure (and specifically cider) has a tighter integration; it can communicate the contents of a buffer without saving to file. This circumvents the formatting — the block comments are sent to Clojure process which complains.

I am now experimenting with another option. indirect-buffers place exactly the same text (and text-properties) into two buffers. Instead of sharing all the text, why not have two buffers with a function that can transform the text bi-directionally between the two. The practical result is two views over the same content. Surprisingly, this works pretty well, as you can see here even though my current implementation is very simple — the whole buffer is copied every keypress. We could achieve the same thing with indirect-buffers, but as well as simple copying, however, we can also transform the text on the fly so that both buffers are valid for their respective modes.

The broad idea is not that new — it’s similar to web/weave or SWeave for instance, except that it is embedded into the editor; this means we can take advantage of existing support for both languages from the editor and the author gets immediate feedback about the transformation — so messing up the syntax is pretty obvious.

It also provides a superset of functionality provided by other techniques: indirect-buffers as mentioned previously, shadowfile.el (which creates a second copy of a file somewhere else on every save), and it could also mimic shadow.el which generates a secondary file by a command invocation on every save (although an invocation of an external command every keypress would probably not be performant).

The first release of linked-buffer was a month ago. I am currently unhappy with the configuration and will change this so code is in flux at the moment, but I am using it in anger which is a good sign. Currently, it does a latex <-> clojure transformation, but I will add a few more as time goes on.

## Discussion

It has taken me quite a while to get to this stage, and a number of experiments along the way, but my feeling is that I now have a workable literate environment. It also validates my decision to build Tawny . Having a rich textual language for building ontologies is a bit of a game changer; providing programmatic extensions to the language has been helpful, but the access to other tools, git, travis, tests and a repl has really made the difference. Now adding a literate environment to this as well changes the way that I can use ontologies and is a paradigm shift in their development.

## Bibliography

### Tawny 1.0

I am please to announce the first full release of Tawny-OWL, my library for fully programmatic development of OWL ontologies. The library now has a fairly large feature set:

• Complete support for OWL2

• Integrated support for reasoning with HermiT or ELK

• Profile checking

• Fixtures and support macros for unit testing

• Use of external ontologies available only as OWL files

• Rendering of OWL API objects to Tawny code.

• Support for generating and using ontologies with numeric IDs.

• Support for multilingual labels.

Additionally, I now have initial integration with Protege, described later.

The library is now available from clojars or on github.

• Background

A little over a year ago, I first described my experiments with building a programmatic environment for ontology construction . The need arose out of frustration with existing ontology tools; Protege, for example, provides a nice graphical environment, but it has many limitations. It does not easily allow automated generation of ontology entities, for example, and it also does not provide access to tools which are common place in an IDE: versioning tools, diffs, test cases and so forth. While ontology specific variations of these tools do exist, they were not as good as the ones I was used to use when programming.

Tawny seeks to bridge this gap, by using a full programmatic environment to generate OWL ontologies. I chose Clojure because of its syntactic plasticity; at its simplest, when using tawny, it does not feel like a programming language, just a syntax and evaluation engine for writing ontologies. However, the full power of the programming language is there and can be used when necessary .

Since the first blog post, there have now been a further 8, as well as three papers, describing tawny itself , the karyotype ontology and our use of patterns, higher-level abstractions within the karyotype ontology and applied to SIO. From an initial experiment, tawny has become a useful tool which we are using on a daily basis.

• In Early Release

Included in this release is our initial integration with Protege. Tawny builds on the OWL API, which is also the basis for Protege. I always assumed that Protege would be used to view OWL files generated by Tawny, but it is actually possible to integrate them much more comprehensively than this. It is now possible to directly manipulate the data structures of Protege using Tawny; in short, Protege can display what ever tawny has generated immediately, and without a file in between.

We have achieved this in two ways; firstly with protege-tawny which provides a command line environment directly inside Protege. This is useful, but does not provide the rich programmatic IDE that I want. However, the protege-nrepl environment allows exactly this; protege launches Clojure, and launches a NREPL server to which you connect with Emacs, Eclipse or any of the other Clojure IDEs. Finally, lein-sync allows syncing classpaths and dependencies with an existing Clojure project. The practical upshot can be seen in screencast; tawny can be used as normal, with Protege following.

Currently, Protege and Tawny use different versions of the OWL API, so while the protege-nrepl can be used with the current release of Protege, periodic crashes happen. In the meantime, a hand-built distribution of Protege is available including nrepl.

• For the Future

I have three main aims for the next few releases of Tawny. First, we need to provide access to explanation code; currently, this has to be accessed within Protege, which is less than ideal for a process than can take many minutes to run. I wish to integrate this with the Clojure unit test environment so that explanations will be generated by failing test cases.

Second, tawny currently allows the development of ontologies, but does not allow easy querying over them. I have several possibilities here: including integration of a SPARQL engine; the current rendering engine, combined perhaps with core.match, or, finally, fully-fledged support for core.match directly.

Finally, I wish to experiment with and add support for connection points , to better enable modular ontology development.

## Bibliography

### Ontology Connection Points

In this post, I will describe what I call connection points and explain how they can be used to enable modularity and overcome problems with scalability of reasoning in OWL.

One of the recurrent problems with building ontologies is mission creep; what starts simple rapidly expands until many different areas of the world are described.

I faced this problem recently, when I was asked about the axiomatisation that I described in my paper about function . Well, the axiomatisation exists, but it was never very complete; so, I thought I should redo it, probably with Tawny-OWL .

To start off with a simple declaration of function, we might choose something like this:

 (defclass Function :subclass (only realisedIn nProcess))

Or, in rough English, a function is something that displays itself only when involved in a process (the n in nProcess is to avoid a name clash). Now, immediately, we hit the mission-creep problem. Traditionally, functions have been considered to be some strain of continuant, and so it might be expected that we would only need to describe classes that are continuants to define a function. And, yet, straight away, we have a process. To make this definition meaningful, we need to distinguish between processes and everything else, and pretty quickly, our ontology of function requires most of an upper ontology.

This has important consequences. First, if the upper ontology in use is any size at all, or alternatively has a complex axiomatisation, then immediately a lot of axioms have to be reasoned over, and this can take considerable time.

Second, and probably more importantly, the choice of an upper ontology can be quite divisive. We have argued that a single representation for knowledge is neither plausible nor desirable  — this limits the ability to abstract, meaning that all of the complexity has to be dealt with all of the time; in essence, an extreme example of mission creep. If, for example, BFO is used, then the representation of entities whose existence we are unsure about becomes difficult. Conversely, if SIO is used, uncertain objects come regardless.

In the rest of this post, I will describe the how we can use the OWL import mechanism to define what I term connection points to work around this problem.

## Identifiers and Imports

One of the interesting things about OWL is that, as a web based system, it uses global identifiers in the form of IRIs (or URIs, or URLs, as you wish); I can make statements about your concepts, you can make statements about mine. However, not all OWL ontologies share the same axiom space; this is controlled explicitly, through the OWL import mechanism. In short, while you can make statements about my ontology, I do not have to listen. The practical upshot of this is that it is possible to share identifiers between two ontologies without sharing any axioms, or to share axioms in one direction only.

One nice use of this is with a little upper ontology that I built mostly to try out Tawny, called tawny.upper. This comes in two forms, one in EL profile, and one in DL; the latter has more semantics but is slower to reason over. The DL version imports the EL version but, unusually, introduces no new identifiers at all, it just refines the terms in the EL version with the desired additional semantics. Downstream users can switch between EL and DL semantics by simply adding or removing an OWL import statement.

## Alternative forms of import

The ability to share identifiers but not axioms has been used by others, as it provides a partial solution to the problem of big imports. MIREOT , for example, defines an alternative import mechanism. MIREOT is described as a minimal information standard ; in this it is rather simple, as the minimal information required to reference (identify) an ontology term its identifier and that of its ontology. In practice MIREOT is a set of tools that, at its simplest, involves sharing just the identifier and not the semantics. This can help to reduce the size of an ontology significantly.

An extreme use-case for this would be in our karyotype ontology ; if we wished “human” to refer to the NCBI taxonomy, we could import 100,000s of classes to use one, increasing the size of the ontology by several orders of magnitude in the process. One solution is to just use the identifier and not owl import the NCBI taxonomy.

However, this causes two problems. First, following our example we can no longer infer that, for example, a Human karyotype is a Mammalian karyotype; these semantics are present only in the NCBI taxonomy, and we must import its semantics if we wish to know this; similarly, we would be free to state that, for example, a human karyotype was also a fly karyotype. The second problem is that, in tools like Protege, the terms becomes unidentifiable, because the rdfs:label for the term has not been imported, and the NCBI taxonomy uses numeric identifiers.

The MIREOT solution is to extract a subset of the axioms in the upstream ontology, and then import these; obvious subsets would be all the labels of terms used in a downstream ontology, although MIREOT uses a slightly more complex system . This would solve the problem of terms being unidentifiable; still, though, human would not be known to be mammalian. Another subset would be all terms from mammal downwords (with their labels). Now, human would be known to be a mammal, but not known to not be a fly. As you increase the size of the subset, you increase the inferences that you can make, but the reasoning process will get slower.

From my perspective, the second of these seems sensible; large ontologies reason slowly and there is no way around this, until reasoner technology gets better. For this reason, I will probably implement something similar in tawny (with an improvement suggested later). The first, however, seems less justified. We are effectively duplicating all the labels in the upstream ontology, with all this entails, for the purpose of display; we can minimise these problems, by regularly regenerating the imported subset from the source ontology regularly, but this is another task that needs to be done.

Tawny is less affected by this from the start, since the name that a developer uses can exist only in Clojure space; more over, when displaying documentation, tawny can use data from any ontologies, rather than those imported into the current ontology. We do not need to duplicate the MIREOT subset, we just need to know about it.

## Connection Points

While MIREOT is a sensible idea, it is nonetheless seen as a workaround, a compromise solution to a difficult problem . However, in this section, I will discuss a simpler, and more general solution that helps to address the problem of modularity.

Consider, a reworked version of the definition above, with one critical change. The nProcess term is now referencing an independent Clojure namespace. The generated OWL from this ontology will include nProcess simply as a reference.

 (defclass Function :subclass (only realisedIn connection.upper/nProcess))

This is different from the MIREOT approach which maintains that the minimal information is the identifier for the term and the identifier for the ontology. In this case, we only have the former. This difference is important, as I will describe later.

In one sense, we have achieved something negative. We now have a term in our function ontology, with no semantics or annotations. Oops has this in their catalogue of ontology errors:

P8. Missing annotations: ontology terms lack annotations properties. This kind of properties improves the ontology understanding and usability from a user point of view.

— OOPS

However, this problem can be fixed by the editing environment; and, indeed, using Tawny it is. We have a meaningful name, despite a meaningless identifier, and we can see the definition of nProcess should we choose. I call these form of references connectors, and they have interesting properties. In this case, using nProcess is a required connector. The function ontology needs it to have its full semantic meaning, but it is not provided.

So, let us consider how we might use these connection points. First, for this example, we need a small upper ontology; in this case, I use the simplest possible ontology to demonstrate my point.

 (defontology upper) (as-disjoint (defclass nProcess) (defclass NotProcess))

Now, considering our function definition earlier; imagine that we wish to use this in a downstream ontology to define some functions. In this case, we define a child of Function which is realisedIn something which is NotProcess. The simplest possible way of doing this is to use all three of the entities (Function, realisedIn and NotProcess) as required connection points. We import no other ontologies here, so we can infer nothing that is not already stated.

 (defontology use-one) (defclass FunctionChild :subclass connection.function/Function (owl-some connection.function/realisedIn connection.upper/NotProcess))

In our second use, we now import our function ontology. At this point, the value of the shared identifier space starts to show its value; we now understand the semantics of our Function term because it uses the same identifier as the term in the function ontology.

This does, now, allow us to draw an additional inference; any individual of FunctionChild must be realisedIn an instance of NotProcess which, itself, we can infer to be a child of Process because the function ontology claims this. Or, in short, NotProcess and Process cannot be disjoint, if our ontology is to remain coherent. This ontology remains coherent, however, because we have not imported the upper ontology.

 (defontology use-two) (owl-import connection.function/function) ;; this ontology looks much the same as use-one (defclass FunctionChild :subclass connection.function/Function (owl-some connection.function/realisedIn connection.upper/NotProcess))

In the final use, we import both ontologies. The function import allows us to conclude that NotProcess and Process cannot be disjoint, while out upper ontology tells us that they are, and at this point, our ontology becomes incoherent. The required connection point in the function ontology has now been provided by term in our upper ontology.

 (defontology use-three) (owl-import connection.function/function) (owl-import connection.upper/upper) (defclass FunctionChild :subclass connection.function/Function (owl-some connection.function/realisedIn connection.upper/NotProcess))`

The critical point is that while the function ontology references some term in its definition, the exact semantics of that term are not specified. These semantics are at the option of the downstream user of function ontology; in use-three, we have decided to fully specify these semantics. But we could have imported a totally different upper ontology had we chosen, either using the same identifiers, or through a bridge ontology making judicious use of equivalent/sameAs declarations. In short, the semantics has become late binding.

We can use this technique to improve on MIREOT. Instead of importing our derived ontology, we can now use connection points instead. The karyotype ontology can reference the NCBI taxonomy, and leave the end user to choose the semantics they need; if the user wants the whole taxonomy, and is prepared to deal with the reasoning speed, then have this option. This choice can even be made contextually; for example, an OWL import could be added on a continuous integration platform when reasoning time is less important, but not during development or interactive testing.

## Future Work

While the idea of connection points seems sound, it has some difficulties; one obvious problem is that the developer of an ontology must choose the modules, with connection points for themselves. We plan to test this using SIO; we have already been working on a tawnyified version of this, to enable investigation of pattern-driven ontology development. We will build on this work by attempting to modularise the ontology, with connection points between them.

Currently, the use of this form of connection points adds some load to the downstream ontology developer. It would be relatively easy for a developer to build an ontology like use-one or use-two above by mistake, accidentally forgetting to add an OWL import. Originally, when I built tawny, I wanted to automate this process — a Clojure import would mean an OWL import, but decided against it; obviously this was a good thing as it allows the use of connection points. I think we can work around this by adding formal support for connection points, so that for example, the function ontology can declare that nProcess needs to be defined somewhere, and to issue warnings if it it is not.

## Conclusions

In this post, I have addressed the problem of ontology modularity and described the use of connection points, enabling a form of late binding. In essence, we achieve this by building on OWLs web nature — shared identifiers do not presuppose shared semantics, in different ontologies. While further investigation is needed, this could change the nature of ontology engineering, allowing a more modular, more scalable and more pragmatic form of development.

## Bibliography

### Kblog-Include

I have finally got around to releasing kblog-include, a plugin that I first alluded sometime previously . This plugin allows WordPress to transclude content from arXiv and potentially any OAI-PMH repository. When used from arXiv the date, authors and title are set (and advertised if kblog-metadata is installed also), and the abstract is added in place.

I’ve been using this for sometime now; in fact my last article is an example.

Feedback welcome, as always.

## Abstract

A constant influx of new data poses a challenge in keeping the annotation in biological databases current. Most biological databases contain significant quantities of textual annotation, which often contains the richest source of knowledge. Many databases reuse existing knowledge, during the curation process annotations are often propagated between entries. However, this is often not made explicit. Therefore, it can be hard, potentially impossible, for a reader to identify where an annotation originated from. Within this work we attempt to identify annotation provenance and track its subsequent propagation. Specifically, we exploit annotation reuse within the UniProt Knowledgebase (UniProtKB), at the level of individual sentences. We describe a visualisation approach for the provenance and propagation of sentences in UniProtKB which enables a large-scale statistical analysis. Initially levels of sentence reuse within UniProtKB were analysed, showing that reuse is heavily prevalent, which enables the tracking of provenance and propagation. By analysing sentences throughout UniProtKB, a number of interesting propagation patterns were identified, covering over 100, 000 sentences. Over 8000 sentences remain in the database after they have been removed from the entries where they originally occurred. Analysing a subset of these sentences suggest that approximately 30% are erroneous, whilst 35% appear to be inconsistent. These results suggest that being able to visualise sentence propagation and provenance can aid in the determination of the accuracy and quality of textual annotation. Source code and supplementary data are available from the authors website.

• Michael J. Bell
• Matthew Collison
• Phillip Lord

## Plain English Summary

There are many database resources which describe biological entities such as proteins, and genes available to the researcher. These are used by both biologists and medics to understand how biological systems work which has implications for many areas. These databases store information of various sorts, called annotation: some of this is highly organised or structured knowledge; some is free text, written in English.

The quantity of this material available means that having a computation method to check the annotation is desirable. The structured knowledge is easier to check because it is organised. The free text knowledge is much harder.

Most methods of analysing free text are based around “normal” English; biological annotation uses a highly specialised form of English, heavily controlled and with many jargon words. In this paper, we exploit this specialised form to infer provenance, to understand when sentences were first added to the database, and how they change over time. By analysing these patterns of provenance, we were able to identify patterns which are indicative of inconsistency or erroneous annotation.

### Tawny 0.12

I am pleased to announce the release of tawny-owl, Version 0.12.

This package allows users to construct OWL ontologies in a fully programmatic environment, namely Clojure. This means the user can take advantage of programmatic language to automate and abstract the ontology over the development process; also, rather than requiring the creation of ontology specific development environments, a normal programming IDE can be used; finally, a human readable text format means that we can integrate with the standard tooling for versioning and distributed development.

OWL is a W3C standard ontology representation language; an ontology is a fully computable set of statements, describing the things and their relationships. These statements can be reasoned over, inferences made and contradictions detected automatically using an off-the-shelf reasoner.

0.12 is planned to be the final, feature complete release before the 1.0 release. New features will not be added before 1.0. Key new features in 0.12 are:

• Complete support for OWL 2, include data types
• OWL documentation can be queries as normal clojure metadata
• New namespaces, query and fixture
• Completion of rendering functionality
• Regularisation of interfaces: where relevant functions now take an ontology as the first argument.
• Updated to Hermit 1.3.7.3, OWL API 3.4.5

Tawny is available at https://github.com/phillord/tawny-owl, or as a maven artifact from http://clojars.org. The development of tawny-owl is documented in my journal at http://www.russet.org.uk/blog/category/all/professional/tech/tawny-owl