At Neuroinformatics 2009, David Sutherland and I talked about the problems of ontology building. One of the current (and past!) difficulties is to choose an appropriate language for representing the knowledge in your ontology. I thought I would write my thoughts up as a post; this will probably result in the most boring thing I have ever written (I am sure someone will point out worse offenses); syntax is dull but distressingly important.

In bioinformatics, there are essentially two choices: OWL and OBO format. A second issue is finding a good environment for developing the ontology; this divides between Protege, OBO-Edit and the ever-present “text editor”. It’s often the case that we want to use both languages at the same time. Take, for example, OBI, which I am involved in. While the ontology itself is being developed in OWL, many of its dependent ontologies are built using OBO; being purist and demanding one is really not an option. OWL itself has many different syntaxes; at the moment, I generally prefer Manchester syntax because you can edit it with a text editor, which is really not so easy with any of the XML representations.

While these two languages have somewhat different expressivity, how to translate both the syntax and the semantics between them has been described elsewhere a number of times. One of the recurrent problems, however, stems from the best practices and the syntax of identifiers.

OBO makes use of a numerical, semantics-free identifier and a namespace, with a syntax of NAMESPACE:IDENTIFIER. So, a Gene Ontology term looks like GO:0003674. The namespace is not constrained to be two letters, and its mechanism for world-uniqueness is that people talk to each other and sort it out if they clash. The use of a semantics-free identifier means that term names can be changed while the identifier keeps its implied meaning; the label for the term, meanwhile, provides a human-readable version, which can be shown to users of the ontology. I will call these the OBO identifier and OBO label respectively.

Translating this into OWL, including Manchester syntax, however, causes significant problems. The natural translation is to turn the OBO identifier into the identifier in OWL; the OBO namespace would become an XML namespace, the OBO identifier would become an XML identifier. Unfortunately, this doesn’t work. First, the OBO identifier is genuinely just a short string and an XML namespace requires a URI; so a mapping between OBO identifiers and URIs is necessary. Second, the OBO identifier is numerical; unfortunately, while the identifiers in OWL can contain numbers, they have to start with a non-numerical character. The standard translation, therefore, uses in most cases an OBO-wide URL (http://purl.obolibrary.org/obo/), although some ontologies have their own namespace (GO uses http://purl.org/obo/owl/GO#). The OBO identifier is mapped to a valid identifier by sticking a prefix onto the numbers. So, we have identifiers such as GO:GO_0042101 or obo:OBI_1110045. There are also some OBO ontologies for which this does NOT occur; for instance, BFO classes in OBI come out with identifiers of the form snap:Continuant or span:Process, except for one which is bfo:Entity.
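
As a rough illustration of that mapping, here is a minimal Python sketch. The function name obo_id_to_uri is my own invention, not part of any OBO tooling, and it only covers the simple case of gluing the namespace onto the number with an underscore; real converters handle more cases than this.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"  # the OBO-wide base URL mentioned above


def obo_id_to_uri(obo_id: str, base: str = OBO_BASE) -> str:
    """Map an OBO identifier such as 'OBI:1110045' to a URI (hypothetical helper).

    The namespace is reused as a prefix on the local name, so the local
    part no longer starts with a digit.
    """
    namespace, sep, local_id = obo_id.partition(":")
    if not sep or not namespace or not local_id:
        raise ValueError(f"not a NAMESPACE:IDENTIFIER string: {obo_id!r}")
    return f"{base}{namespace}_{local_id}"


print(obo_id_to_uri("OBI:1110045"))
# http://purl.obolibrary.org/obo/OBI_1110045
print(obo_id_to_uri("GO:0042101", base="http://purl.org/obo/owl/GO#"))
# http://purl.org/obo/owl/GO#GO_0042101
```

With the prefixes obo: and GO: bound to those two base URLs, the results abbreviate to obo:OBI_1110045 and GO:GO_0042101.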

Again, all perfectly reasonable, but unfortunately, when converted to Manchester syntax it means that we end up with classes that look like this slightly elided class from OBI:

Class: obo:OBI_1110161

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

which completely defeats the aim of a human-readable syntax. Now OBO format has much the same problem; relationships to other classes are specified using cross-references to their identifiers which are, essentially, unreadable. OBO format works around this with a denormalisation, as can be seen from this somewhat elided example from IAO:

[Term]
id: IAO:0000027
name: data item
def: "a data item is an information content entity that is intended...."
is_a: IAO:0000030 ! information content entity

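The cross-reference in this case is a subsumption link to IAO:0000030; the text after the ! is a comment that repeats the label of the referenced term, which is the denormalisation.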
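
To make the structure of such a line concrete, here is a small Python sketch that pulls it apart. The helper name split_obo_tag_line is hypothetical, and the splitting is deliberately naive: it assumes the value contains no ! of its own, which holds for plain is_a lines but not for quoted text such as def: strings.

```python
from typing import Optional, Tuple


def split_obo_tag_line(line: str) -> Tuple[str, str, Optional[str]]:
    """Split an OBO tag-value line into (tag, value, trailing comment).

    Hypothetical helper; assumes the value itself contains no '!'.
    """
    tag, _, rest = line.partition(":")
    value, bang, comment = rest.partition("!")
    return tag.strip(), value.strip(), comment.strip() if bang else None


tag, target, label = split_obo_tag_line(
    "is_a: IAO:0000030 ! information content entity"
)
print(tag, target, label)
# is_a IAO:0000030 information content entity
```

The identifier is what a parser uses; the label after the ! is there only for the human reader.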

One solution would be to use the rdfs:label in place of the identifier. So, we would have something that looked like this:

Class: "T cell epitope ELISA IL-1b assay" @en

    Annotations:
        obo:identifier "1110161"

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

The other identifiers would have to be changed in the same way. I’ve also added the obo:identifier line (which I think would be valid, but might require the creation of an OWL individual). Without this, it would not be possible to go backward.

However, this is problematic as it changes the mapping between the OWL Manchester syntax and the other syntaxes of OWL. The class identifier has to be URI legal, and the OBO label here is not. We could do a syntactic conversion (e.g. T%20cell%20epitope) but this, again, reduces readability, defeating the point. Also, the rdfs:label would become part of the final identifier URI, which then becomes a semantics-heavy identifier. Finally, it would require an OBO-specific loading of the Manchester syntax, taking the URI identifier from the annotation block, and the rdfs:label from the class name.
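
For the curious, this is roughly what the syntactic conversion would produce, sketched with Python's standard urllib.parse.quote. Appending the result to the OBO-wide base URL is only my assumption about how such a scheme might work.

```python
from urllib.parse import quote

OBO_BASE = "http://purl.obolibrary.org/obo/"  # base URL assumed for illustration

label = "T cell epitope ELISA IL-1b assay"

# Percent-encode everything that is not URI-safe; each space becomes %20.
encoded = quote(label, safe="")
print(encoded)
# T%20cell%20epitope%20ELISA%20IL-1b%20assay

print(OBO_BASE + encoded)
# http://purl.obolibrary.org/obo/T%20cell%20epitope%20ELISA%20IL-1b%20assay
```

The encoded form is legal, but it is hardly easier on the eye than the numeric identifier.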

So, is there any solution? First, there are tooling solutions. In Protege, it is already possible to use any component of the definition in the display. So, you can set the rdfs:label as the main display form. Tooling solutions are attractive, but there is a problem; you have to extend all tools to support this view. I realise that the number of freaks who wish to edit OWL with emacs is not that large, so this might not seem an issue. However, many people wish to develop ontologies collaboratively using version control; if you want to compare versions you use diff, so we now need a Manchester syntax diff viewer. Also, if you want to do some perl hacking, or straightforward search and replace, again, it’s all harder.
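
To see the diff problem, here is a small sketch using Python's standard difflib. The two "versions" are made up for illustration: I have simply swapped one superclass identifier from the OBI example for another, and the file names old.omn and new.omn are invented.

```python
import difflib

# Two hypothetical versions of a class; only the superclass identifier differs.
old_version = """Class: obo:OBI_1110161
    SubClassOf:
        obo:OBI_0000661
""".splitlines()

new_version = """Class: obo:OBI_1110161
    SubClassOf:
        obo:OBI_0000299
""".splitlines()

diff_lines = list(
    difflib.unified_diff(old_version, new_version, "old.omn", "new.omn", lineterm="")
)
for line in diff_lines:
    print(line)
```

The changed lines in the output contain nothing but the two identifiers, so a reviewer cannot tell what the change means without looking both up.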

To some extent this might seem trivial, but then the entire purpose of Manchester syntax (and the functional syntax) is to be easy to read and manipulate, which the XML version of OWL is not. This purpose is defeated if it’s hard to read.

So, a second, non-tooling solution. The obvious answer is to take the OBO approach and add comments. Now, the Manchester syntax includes a comment character (#), although the last time I tried, the Protege parser didn’t implement this. Nonetheless, it allows this:

Class: obo:OBI_1110161 #"T cell epitope ELISA IL-1b assay"@en

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

This is not too bad, but it doesn’t work well for complex class expressions. I can’t be bothered to look up the labels and have reused one, but you get something like:

Class: obo:OBI_1110161 #"T cell epitope ELISA IL-1b assay"@en,

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661, #"T cell epitope ELISA IL-1b assay"@en
        obo:OBI_0000299 #"T cell epitope ELISA IL-1b assay"@en
        some (obo:IAO_0000109 #"T cell epitope ELISA IL-1b assay"@en
        and (obo:IAO_0000136 #"T cell epitope ELISA IL-1b assay"@en
             some obo:OBI_11101 #"T cell epitope ELISA IL-1b assay"@en
             ))

This has three problems. Firstly, we have used comments “meaningfully”, yet nothing distinguishes these comments from other, normal comments. Secondly, we have had to reformat the output because we have only a “to-end-of-line” comment character. Thirdly, it looks horrible.

So, my minimal solution would be this: we introduce some new comment characters, which are treated as comments normally, but which carry enough semantics to allow a warning when they are wrong; rather like Javadoc, which is a comment wrt the language, but is structured and meaningful wrt the documentation. Tooling could be used to check that the labels masquerading as comments are correct wrt the identifiers.

Class: obo:OBI_1110161 [T cell epitope ELISA IL-1b assay],

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661 [blah],
        obo:OBI_0000299 [longer blah]
        some (obo:IAO_0000109 [more]
        and (obo:IAO_0000136 [stuff]
        some obo:OBI_11101 [OBI Thing]
        ))

This is still not ideal; it would require an extension to Manchester syntax, but it’s minimal, and it does support the semantics-free identifiers in OBO in a way which does not require extensive tooling. It’s worth reiterating here that OBO’s semantics-free identifiers are a good thing; so, supporting them supports other people who may wish to do the same, sensible thing. It does have the disadvantage of duplicating information, but at least in a way that is checkable.
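
To show what "checkable" might mean in practice, here is a minimal Python sketch of such a checker. Everything in it is an assumption on my part: the square-bracket notation is only the proposal above, the function name check_bracket_labels is invented, and the label table is hard-coded where a real tool would read the rdfs:label annotations from the ontology.

```python
import re
from typing import Dict, List, Tuple

# An identifier such as obo:OBI_0000661 followed by a bracketed label.
ANNOTATED_ID = re.compile(r"(\w+:\w+)\s*\[([^\]]*)\]")


def check_bracket_labels(
    text: str, labels: Dict[str, str]
) -> List[Tuple[int, str, str, str]]:
    """Return (line number, identifier, label found, label expected) per mismatch.

    An identifier missing from `labels` is reported with expected label "".
    """
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ident, found in ANNOTATED_ID.findall(line):
            expected = labels.get(ident, "")
            if found.strip() != expected:
                problems.append((lineno, ident, found.strip(), expected))
    return problems


# Hypothetical label table; "blah" is the placeholder label from the example above.
labels = {
    "obo:OBI_1110161": "T cell epitope ELISA IL-1b assay",
    "obo:OBI_0000661": "blah",
}

text = """Class: obo:OBI_1110161 [T cell epitope ELISA IL-1b assay]
    SubClassOf:
        obo:OBI_0000661 [wrong label]
"""

for problem in check_bracket_labels(text, labels):
    print(problem)
# (3, 'obo:OBI_0000661', 'wrong label', 'blah')
```

A check like this could run as a warning at load time or as a commit hook, so a stale label gets flagged rather than silently misleading the reader.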

Comments welcome!