Archive for the ‘Tech’ Category

At Neuroinformatics 2009, David Sutherland and I talked about the problems of ontology building. One of the current (and past!) difficulties is to choose an appropriate language for representing the knowledge in your ontology. I thought I would write my thoughts up as a post; this will probably result in the most boring thing I have ever written (I am sure someone will point out worse offenses); syntax is dull but distressingly important.

In bioinformatics, there are essentially two choices: OWL and OBO format. A second issue is finding a good environment for developing the ontology; this divides between Protege, OBO-Edit and the ever-present “text editor”. It’s often the case that we want to use both languages at the same time. Take, for example, OBI, which I am involved in. While the ontology itself is being developed in OWL, many of its dependent ontologies are built using OBO; being purist and demanding one is really not an option. OWL itself has many different syntaxes; at the moment, I generally prefer Manchester syntax because you can edit it with a text editor, which is really not so easy with any of the XML representations.

While these two languages have somewhat different expressivity, how to translate both the syntax and the semantics has been described elsewhere a number of times. One of the recurrent problems, however, stems from the best practices and the syntax of identifiers.

OBO makes use of a numerical, semantics-free identifier and a namespace, with a syntax of NAMESPACE:IDENTIFIER. So, a Gene Ontology term looks like GO:0003674. The namespace is not constrained to be two letters and has a mechanism for world-uniqueness, in that people talk to each other and sort it out if they clash. The use of a semantics-free identifier means that term names can be changed while maintaining the meaning implied by the term; the label for the term, meanwhile, provides a human-readable version, which can be shown to users of the ontology. I will call these the OBO identifier and OBO label respectively.

Translating this, however, into OWL, including Manchester syntax, causes significant problems. The natural translation is to turn the OBO identifier into the identifier in OWL; the OBO namespace would become an XML namespace, the OBO identifier would become an XML identifier. Unfortunately, this doesn’t work. First, the OBO identifier is genuinely just a short string and XML requires a URI; so a mapping between OBO identifiers and URIs is necessary. Second, the OBO identifier is numerical; unfortunately, while the identifiers in OWL can contain numbers, they have to start with a non-numerical character. The standard translation, therefore, uses in most cases an OBO-wide URL (http://purl.obolibrary.org/obo/), although some ontologies have their own namespace (GO uses http://purl.org/obo/owl/GO#). The OBO identifier is mapped to a valid identifier by sticking a prefix onto the numbers. So, we have identifiers such as GO:GO_0042101 or obo:OBI_1110045. There are also some OBO ontologies for which this does NOT occur; for instance, BFO classes in OBI come out with identifiers of the form snap:Continuant or span:Process, except for one which is bfo:Entity.
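
A minimal sketch of that mapping in Python may make it concrete. The function name and the error handling are illustrative assumptions, not part of any actual OBO-to-OWL converter; the two base URLs are the ones quoted above.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"  # the OBO-wide URL quoted above


def obo_id_to_uri(obo_id, base=OBO_BASE):
    """Map an OBO identifier such as 'OBI:1110161' to a URI.

    The namespace is reused as a prefix on the local name, so the local
    part no longer starts with a digit. (Hypothetical helper.)
    """
    namespace, sep, local = obo_id.partition(":")
    if not sep or not namespace or not local:
        raise ValueError(f"not a NAMESPACE:IDENTIFIER string: {obo_id!r}")
    return f"{base}{namespace}_{local}"


print(obo_id_to_uri("OBI:1110161"))
print(obo_id_to_uri("GO:0042101", base="http://purl.org/obo/owl/GO#"))
```

The BFO exceptions mentioned above (snap:Continuant and friends) would need a lookup table of their own; this sketch does not attempt that.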

Again, all perfectly reasonable, but unfortunately, when converted to Manchester syntax it means that we end up with classes that look like this slightly elided class from OBI:


Class: obo:OBI_1110161

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

which completely defeats the aim of a human-readable syntax. Now OBO format has much the same problem; relationships to other classes are specified using cross-references to their identifiers which are, essentially, unreadable. OBO format works around this with a denormalisation, as can be seen from this somewhat elided example from IAO:


[Term]
id: IAO:0000027
name: data item
def:"a data item is an information content entity that is intended...."
is_a: IAO:0000030 ! information content entity

The cross reference in this case is a subsumption link to IAO:0000030; the text after the “!” is a comment repeating that term’s name.
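
To show how little is needed to read that denormalised form, here is a small Python sketch that splits such a line into tag, value and trailing comment, and checks the comment against a table of names. It is a simplification under stated assumptions: it ignores escaped or quoted “!” characters, and the `names` table is made up for the example.

```python
def split_obo_line(line):
    """Split 'tag: value ! comment' into (tag, value, comment).

    Simplified: does not handle escaped or quoted '!' characters.
    """
    tag, _, rest = line.partition(":")
    value, _, comment = rest.partition("!")
    return tag.strip(), value.strip(), comment.strip()


# Hypothetical lookup table from identifier to term name.
names = {"IAO:0000030": "information content entity"}

tag, value, comment = split_obo_line(
    "is_a: IAO:0000030 ! information content entity"
)
print(tag, value, comment, sep=" | ")
print("comment matches name:", names.get(value) == comment)
```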

One solution would be to use the rdfs:label in place of the identifier. So, we would have something that looked like this:


Class: "T cell epitope ELISA IL-1b assay" @en

    Annotations:
        obo:identifier "1110161"

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

Other identifiers would have to be changed as well. I’ve also added the obo:identifier line (which I think would be valid, but might require the creation of an OWL individual). Without this, it would not be possible to go backward.

However, this is problematic as it changes the serialization between the OWL Manchester syntax and other syntaxes of OWL. The class identifier has to be URI legal, and the OBO label here is not. We could do a syntactic conversion (e.g. T%20cell%20epitope) but this, again, reduces readability, defeating the point. Also, the rdfs:label would become part of the final identifier URI, which then becomes a semantics-heavy identifier. Finally, it would require an OBO-specific loading of the Manchester syntax, taking the URI identifier from the annotation block, and the rdfs:label from the class name.
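
For what it’s worth, the syntactic conversion is a one-liner with Python’s standard library; the sketch below just percent-encodes the label from the example to show what the resulting identifier would look like.

```python
from urllib.parse import quote, unquote

label = "T cell epitope ELISA IL-1b assay"

# Percent-encode the label so that it is legal inside a URI.
encoded = quote(label)
print(encoded)

# The conversion is reversible, so the label itself is not lost.
print(unquote(encoded) == label)
```

Legal, and reversible, but hardly something you would want to read or type.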

So, is there any solution? First, there are tooling solutions. In Protege, it is already possible to use any component of the definition in the display. So, you can set the rdfs:label as the main display form. Tooling solutions are attractive, but there is a problem; you have to extend all tools to support this view; I realise that the number of freaks who wish to edit OWL with emacs is not that large, so this might not seem an issue. However, many people wish to develop ontologies collaboratively using version control; if you want to compare versions you use diff, so we now need a Manchester syntax diff viewer. Also, if you want to do some perl hacking, or straightforward search and replace, again, it’s all harder.

To some extent this might seem trivial, but then the entire purpose of Manchester syntax (and the functional syntax) is to have an easy to read and manipulate syntax which the XML version of OWL is not. This purpose is defeated if it’s hard to read.

So, a second non-tooling solution. The obvious answer is to take the OBO approach and add comments. Now, the Manchester syntax includes a comment character (#), although last time I tried the Protege parser didn’t implement this. Nonetheless, it allows this:


Class: obo:OBI_1110161 #"T cell epitope ELISA IL-1b assay"@en

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

This is not too bad, but it doesn’t work well for complex class expressions. I can’t be bothered to look up the labels and have reused one, but you get something like:


Class: obo:OBI_1110161 #"T cell epitope ELISA IL-1b assay"@en,

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661, #"T cell epitope ELISA IL-1b assay"@en
        obo:OBI_0000299 #"T cell epitope ELISA IL-1b assay"@en
        some (obo:IAO_0000109 #"T cell epitope ELISA IL-1b assay"@en
        and (obo:IAO_0000136 #"T cell epitope ELISA IL-1b assay"@en
             some obo:OBI_11101 #"T cell epitope ELISA IL-1b assay"@en
             ))

This has three problems. Firstly, we have used comments “meaningfully”, yet we can’t distinguish between these comments and other, normal comments. Secondly, we have had to reformat the output because we have only a “to-end-of-line” comment character. Thirdly, it looks horrible.

So, my minimal solution would be this; we introduce some new comment characters, which are treated as comments normally, but which carry enough semantics to allow a warning when they are wrong; rather like Javadoc, which is a comment wrt the language, but is structured and meaningful wrt the documentation. Tooling could be used to check that the labels masquerading as comments are correct wrt the identifiers.


Class: obo:OBI_1110161 [T cell epitope ELISA IL-1b assay],

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661 [blah],
        obo:OBI_0000299 [longer blah]
        some (obo:IAO_0000109 [more]
        and (obo:IAO_0000136 [stuff]
        some obo:OBI_11101 [OBI Thing]
        ))

This is still not ideal; it would require an extension to Manchester syntax, but it’s minimal, and it does support the semantics-free identifiers in OBO in a way which does not require extensive tooling. It’s worth reiterating here that OBO’s semantics-free identifiers are a good thing; so, supporting them supports other people who may wish to do the same, sensible thing. It does have the disadvantage of duplicating information, but at least in a way that is checkable.
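
The “checkable” part could be quite small. Below is a rough Python sketch of such a checker, assuming the bracket syntax proposed above and a table mapping identifiers to their rdfs:label values; the regular expression, the function name and the sample labels are all illustrative assumptions, and no existing tool works this way.

```python
import re

# A prefixed name followed by a bracketed label, e.g. "obo:OBI_0000661 [blah]".
LABELLED = re.compile(
    r"(?P<id>[A-Za-z][\w.-]*:[A-Za-z_][\w.-]*)\s*\[(?P<label>[^\]]*)\]"
)


def check_labels(text, labels):
    """Return (identifier, bracketed label, expected label) per mismatch."""
    problems = []
    for match in LABELLED.finditer(text):
        ident = match.group("id")
        found = match.group("label").strip()
        expected = labels.get(ident)  # None if the identifier is unknown
        if expected != found:
            problems.append((ident, found, expected))
    return problems


snippet = """Class: obo:OBI_1110161 [T cell epitope ELISA IL-1b assay]
    SubClassOf:
        obo:OBI_0000661 [blah],
        obo:OBI_0000299 [longer blah]
"""
# Made-up label table; in practice it would come from the rdfs:label annotations.
labels = {
    "obo:OBI_1110161": "T cell epitope ELISA IL-1b assay",
    "obo:OBI_0000661": "blah",
    "obo:OBI_0000299": "something else",
}
for ident, found, expected in check_labels(snippet, labels):
    print(f"warning: {ident} is labelled [{found}] but should be [{expected}]")
```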

Comments welcome!

Make has been driving me mad for the last week. It keeps on complaining about “modification time in the future”. Normally, this happens because you’re using remote files from a server which doesn’t have sync’d time. But this is rare these days. Anyway, it was complaining that the file was 10E+06 seconds in the future; that’s a really, really big clock skew.

Did a bit of poking around. One possibility I found was that it was due to a limitation in FAT32; hmmm, not likely. Didn’t have time for more. I am at a conference; supposed to be paying some attention.

Anyway, the solution came to me today. Or rather the cause, because the solution was obvious. Turns out, when I changed timezone to Czech, I pushed the month back to August. What I don’t understand is that I was sure windows synced to an NTP server running somewhere. What does it do when you change the month?
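
If you hit the same warning, a quick way to see how far out things are is to compare file modification times against the local clock. The Python sketch below does that for a directory tree; the function name and the one-second tolerance are arbitrary choices for illustration.

```python
import os
import time


def future_files(root, now=None, tolerance=1.0):
    """Yield (path, seconds_ahead) for files under root modified 'in the future'."""
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                ahead = os.stat(path).st_mtime - now
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if ahead > tolerance:
                yield path, ahead


if __name__ == "__main__":
    for path, ahead in future_files("."):
        print(f"{path}: {ahead / 86400:.1f} days in the future")
```

If every file in the tree shows up with roughly the same skew, the clock is the likelier culprit than the files.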

This year, our clusters are going to be moved over to Vista, so I’ve decided to downgrade my windows box from XP to vista. It’s been an inevitable fun-filled afternoon as a result.

Tried a remote installation to save the effort of finding disks. Unfortunately, we tried an installation which booted into Windows 7, and then allowed you to install vista from there; this results in a mysterious 100M partition for use with bitlocker; vista doesn’t know about this, so mounted it as D drive and, as it’s marked as a system partition, you can’t change this. Three installations later, it was gone, and Vista is installed.

Next up, install synergy. Turns out that this is hosed because of UAC — the Vista access control. How to Geek was very helpful, although their technique doesn’t completely work. I have some ideas, but basically, had to turn off all UAC elevation dialogs (as synergy doesn’t work then, which rather defeats the point), and I have to start it by hand every login. At this juncture, a hardware KVM seems an option, but it’s clunky in comparison to synergy.

Cygwin installation has been okay, except for some mysterious “Program Compatibility” dialog which tells me that I have done things wrong and offers to make my life better. Next up is the problem of getting security permissions on my files on D, which think that they are owned by another user (from my old OS). Normal Windows problem: I can’t get the permissions set up, or percolating downward, whatever I do.

(At this point, a friend popped in and said, “Why don’t you install Windows 7 instead”. Not the first to ask).

Think I now have the security permissions set, although it’s going to take about 2 hours to find out for sure, as it traverses my file system. Cygwin appears to have another strange problem where a bash window doesn’t respond to a click—if you want to move it from the front, you have to use the taskbar.

Emacs, skype, miktex all seem to have installed okay; neither webcam nor sound drivers worked in the default installation, but vista did manage to find them, so no complaints there really. I’ve also found one major advantage; when you switch the irritating desktop sounds off, windows no longer asks you whether you want to save the old scheme (yes, being the default); well worth the billions of dollars spent on vista. The machine balked after all these installs, with explorer up to 100% CPU. Restart has solved.

Installing cygwin sshd was a bit hard; the trick is to run ssh-host-config in a cygwin.bat run as administrator. It all works fine then, except for the bit where you try to ssh in to the machine. Then you always get Connection Closed. Giving up for now.

How would I have got this far without the wondrous Gerry Tomlinson to help me out? No idea.

On the flip side, though, I was interested to see that one of my own great ideas, first expounded in my work on Generating Sewage Systems, has been taken up by the Institute of Mechanical Engineering, in a report which has even got as far as the BBC. Yep, algae reactors down the side of buildings. It’s the way forward.

I’ve generally been reasonably impressed with wordpress since I moved to it from my old, emacs-driven system. It seems to work mostly and it’s reasonably easy to manage.

One problem has been the regularity of the updates; worse, they all tend to be security updates (2.8.4 was to correct a problem where a crafted URI allowed overwrite of the admin password). So, you have to update. Often.

Fortunately, wordpress provides an automatic mechanism for achieving this. Less fortunately, it doesn’t work for me. We’ve finally pinned down why, which is too tedious to explain, but I don’t like the mechanism anyway, as I have to give wordpress my username/password (for the command line, not for wordpress).

So, I’m trying another solution. Check the whole thing out of SVN. I’ve just moved over to this mechanism for the 2.8.4 upgrade and it seems to work. This is actually the same amount of effort as a regular manual upgrade; you just svn co rather than wget/unzip. In future, it should be much easier, though. Just a simple svn switch. No fiddling with moving wp-config across, and wp-content should be unaffected. Even better, the one hack that I have had to apply to formatting.php every time should be automatically merged in, or will conflict, in which case it will be good to be warned.

I’ll post again in a few updates’ time if it all works; if this blog suddenly goes offline, well, probably this wasn’t such a good idea after all.

Following my holiday, I’ve decided to create two new categories for my blog, one for all my professional pieces and one for my personal.

This blog fulfils two main purposes. Firstly, it serves as a memory aid for myself; I can look back at the things and the ideas that I’ve had in the past. Secondly, I use it to publish these ideas. I’m aware that the former is more important than the latter; like most blogs, this site is not heavy traffic.

I do publish about my personal life here, but this is not a full disclosure blog; it’s called “an Exercise in Irrelevance” for exactly this reason. I put occasional reviews of things up; places I’ve visited or music that I’ve listened to. All about my reactions to public events. This blog isn’t meant to be a soap opera.

I also publish posts about my work here. I think, over time, these will become more important; recently, I’ve been using the blog as a lab book but I think it will also start to become a more formal publication route.

Given this, I think it makes sense to separate the two strands, to enable the few subscribers that I have to choose whether to read about my life outside science or not. Personal, Professional or Everything, the choice is yours.

I think I now have my blogging environment as I want it. I’ve been using blogpost.py to do my posting. I couldn’t let go of my text only environment. I don’t care if it’s old fashioned, but I like the separation of editing and viewing. In this case, I’ve even had to learn asciidoc, but it was worth the effort.

Today, I think I have fiddled with blogpost.py for the last time. I can now set both categories and status (published or unpublished) from within the blogfile. I’d added a post command previously; originally, blogpost used to have a create and update command.

The big advantage with this is that all the information about the blog is apparent from the file; this means I can use a single make file to compile the lot. Any changes that I make while on the road will automatically publish to the web when I get online again. I can even put a catch-up in my backfile to make sure everything is up-to-date.

Okay, so I am sad; so sue me.

Blogs are generally seen as a slightly dubious part of the scientific publishing landscape. This is not, of course, unreasonable. I put stuff up here, for example, such as my idea for IDs that I’ve thought about for a few days, but that I am unlikely to follow any further, or opinion pieces on bees, about which I have as little expertise as the average journalist.

Fundamentally, though, despite its current use, a blog is just a media channel; you can use one to transfer anything you like. A scientific paper, for instance. This might be useful. While, for instance, I love open access publication, it’s quite expensive, particularly as the cash tends to come out of my own budget, at least until I can get the library to pay.

So, I’ve been thinking about a cheap and cheerful blog-based system. It would work like this. The author would simply publish their paper onto their own blog. Next, they would send a request (using one of these pingback or trackback thingies that I haven’t worked out yet) to a “journal” which would also be a blog, in this case a private one. The editor would then invite comments from willing reviewers using the same technique. Reviewers could then read the blog post and comment on it using their own blog. After the normal revision cycle, the editor would make a decision. If it was accepted, the author’s blog post would be linked from the journal’s main feed (probably grabbing an archival copy at the same time). If it was not accepted, the author could try another journal, this time with initial reviews in hand; the process would not need to be reiterated.

This would have several advantages over the current system. Formatting and presentational problems would disappear because they would be controlled by the authors. Prepublication would become unnecessary, because submission and publication would become the same thing. The role of the journal would be limited to what they are best at; getting reviewers in and rubber stamping a seal of approval on worthy papers. Finally, the tireless work of reviewers would be publicly acknowledged; their own blogs would have a record of every review that they have ever done.

All the technology for this already exists; it just needs some social conventions layering on top.

Ah, it goes on and on. After my last attempt at literate OWL programming, called omnsplit, I decided that there was a problem; this version splits the OWL file into individual statements, and puts them into files with the same name as the OWL class (property, or whatever).

The problem is that, for an ontology like OBI, you get 1400 individual files; this is just inconvenient as many applications don’t like this many files in a directory. Also, there is a naming constraint; you can only use characters legal in the file system; this doesn’t include “:” if you want to be Windows (NTFS) compliant.

So, for my new system, I decided to generate an index file, which just points at locations in the ontology file. Initially, I was just going to index the main ontology file; in the end, I decided a partial copy was the way forward; generating both the index and the indexed file ensures that they will stay in sync.

It required a bit of nasty latex hacking; the basic problem was avoiding the limitation of being only able to use legal LaTeX macro characters (that is letters). The system now works like this:



%% This is generated by python which also generates the
%% function_ont.spt file which is a copy of the ontology (with a
%% few new lines gone).

%% This just defines a new macro in what appears to be an
%% unnecessarily complex way.
\expandafter\def\csname OmnEntityHeaderheader\endcsname%
{\lstinputlisting[language=omn,firstline=1,lastline=8]{function_ont.spt}}

%% But the use of \expandafter and \csname means that you can
%% use any character you like, including underscores and numbers
%% in the macro name.
\expandafter\def\csname OmnEntityObjectPropertyhas_role\endcsname%
{\lstinputlisting[language=omn,firstline=206,lastline=219]{function_ont.spt}}

%% We can now define two commands in the style file. Again
%% we use \csname so that we are not bound to characters legal
%% in latex macros.
\newcommand{\omnclass}[2]{\csname OmnEntityClass#1#2\endcsname}
\newcommand{\omnobjprop}[2]{\csname OmnEntityObjectProperty#1#2\endcsname}

%% now in our source, we can do things like this.
\omnobjprop{}{has_role}

Using an index in this way also has another advantage. I’ve had to make a decision whether to go with rdfs:label or the entity name. I can now back out of this; I can just use both in the index file, without too much extra space, so that either would be referencable within the latex.
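
For the curious, the Python side of this is not much code. The sketch below is an illustration of the idea rather than the actual implementation: it assumes each entity frame starts at the beginning of a line with a keyword such as `Class:` or `ObjectProperty:`, that a frame runs until the next one, and that the colon in a prefixed name is simply dropped from the macro name. The function name is made up.

```python
import re

# Frame keywords assumed to start an entity description at column 0.
FRAME = re.compile(
    r"^(Class|ObjectProperty|DataProperty|AnnotationProperty|Individual):\s*(\S+)"
)


def build_index(lines, listing_file):
    """Return LaTeX \\csname definitions, one per entity frame in `lines`."""
    starts = [
        (i, m.group(1), m.group(2))
        for i, line in enumerate(lines, start=1)
        if (m := FRAME.match(line))
    ]
    out = []
    for n, (first, kind, name) in enumerate(starts):
        # A frame runs up to the line before the next frame (or end of file).
        last = starts[n + 1][0] - 1 if n + 1 < len(starts) else len(lines)
        while last > first and not lines[last - 1].strip():
            last -= 1  # drop trailing blank lines
        macro = f"OmnEntity{kind}{name.replace(':', '')}"
        out.append(
            f"\\expandafter\\def\\csname {macro}\\endcsname%\n"
            f"{{\\lstinputlisting[language=omn,firstline={first},lastline={last}]"
            f"{{{listing_file}}}}}"
        )
    return out


example = [
    "ObjectProperty: has_role",
    "    Domain: owl:Thing",
    "",
    "Class: obo:OBI_1110161",
    "    SubClassOf:",
    "        obo:OBI_0000661",
]
print("\n".join(build_index(example, "function_ont.spt")))
```

Adding a second definition keyed on the rdfs:label would just mean appending another entry with the same line range.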

To me, this feels like the right solution. It’s relatively simple (with a bit of nasty latex, which is nicely hidden), it doesn’t depend on the file system. It needs a bit more work to bring it to completion, but not that much.

Sadly bio-ontologies looms, so next week will be getting ready for that; perhaps I can finish this off on the way back. “Sadly” is perhaps a poor choice of words; I’m greatly looking forward to it, but I’ve kind of had the bit between my teeth with python and latex hacking for the last few weeks.

Just upgraded to Wordpress 2.8. The automatic update didn’t work; this seems to be a continual problem which stems from wordpress not being in the default location. For some reason, it wants to push from the new version rather than pull under these circumstances. Not good.

So, I did the manual upgrade; unfortunately the admin page crashed out with an error:

PHP Fatal error: Call to a member function read() on a non-object in wp-includes/theme.php on line 387

This has been reported here and here

It’s this bit of code causing the problems.


$template_dir = @ dir("$theme_root/$template");
                if ( $template_dir ) {
                        while ( ($file = $template_dir->read()) !== false ) {
// etc

It appeared to be only my modified version of the theme (Evanesence) causing the problem; it’s not very modified, so I removed the modifications one by one. For no readily apparent reason the problem appears to be a subdirectory called “images.old”. Surely, not a good reason for a crash.

Weird and wonderful.

After a bit of struggle, I now have another literate OWL tool working, along the lines discussed in a previous blog post. Rather than generating the OWL documentation, I now split a Manchester syntax file up, so that I can refer to bits of it. I have this working with OBI, using Protege to produce a single merged ontology file, in Manchester syntax.

The current implementation is rather simple; it produces one file per entity in the OWL file, which I don’t think is entirely good. When run on OBI, it creates over 1400 files, which is a lot. The other problem is that I’ve had to do some dubious hacking to get the file names to work out. Firstly, I have to remove spaces and “\”’s, as well as “:” which is illegal on NTFS.
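
The file name hacking amounts to something like the following Python sketch. It is not the code from the tool; the function names and the “.omn” suffix are assumptions made for the example. Since stripping characters can make two different entities collide, the sketch also fails loudly when that happens.

```python
import re

# Characters mentioned above as needing removal: spaces, backslashes, colons.
UNSAFE = re.compile(r"[ \\:]")


def entity_to_filename(entity, suffix=".omn"):
    """Strip characters that are awkward or illegal in file names."""
    return UNSAFE.sub("", entity) + suffix


def filenames_for(entities):
    """Map each entity name to a file name, raising on clashes."""
    seen = {}
    for entity in entities:
        fname = entity_to_filename(entity)
        if fname in seen and seen[fname] != entity:
            raise ValueError(f"{entity!r} and {seen[fname]!r} both map to {fname}")
        seen[fname] = entity
    return {entity: fname for fname, entity in seen.items()}


print(filenames_for(["obo:OBI_1110161", "owl:Thing"]))
```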

There’s also a problem with some of the OWL. Unfortunately, the OBI to OWL conversion process has a reification step which I don’t quite understand the purpose of. This comes out as this sort of anonymous individual. I’m not sure at all how the definition has come out as the rdfs:label, but, for sure, you can’t use this as a filename!


Individual: relationship:genid7

    Annotations:
        rdfs:label "C located_in C' if and only if: given any c that
instantiates C at a time t, there is some c' such that: c' instantiates
C' at time t and c *located_in* c'. (Here *located_in* is the
instance-level location relation.)"@en,
        oboInOwl:hasDbXref relationship:genid8

    Types:
        oboInOwl:Definition

I think I might change the implementation a bit, though. Having 1400 files in one directory is not good. My idea is to serialize the entire file out as latex, with lots of macros, autogenerated.


%% this would appear in the generated file
\newcommand{\OwlClassowlThing}{
  \begin{omn}
Class: owl:Thing
  \end{omn}
}

%% then in your latex file you would do
\owlclass{owl}{Thing}

%% which would just resolve to the class above

The only worry with this is that latex would then have to read a large file, even if most of the macros are not used. This might be really, really slow. Well, we can but try.

As before, the current version is available at git://github.com/phillord/literate_omn.git.