Archive for the ‘Tech’ Category

Ah, it goes on and on. After my last attempt at literate OWL programming, called omnsplit, I decided that there was a problem; that version splits the OWL file into individual statements, and puts them into files with the same name as the OWL class (property, or whatever).

The problem is that, for an ontology like OBI, you get 1400 individual files; this is just inconvenient, as many applications don’t like this many files in a directory. Also, there is a naming constraint; you can only use characters legal in the file system; this doesn’t include “:” if you want to be Windows (NTFS) compliant.

So, for my new system, I decided to generate an index file, which just points at locations in the ontology file. Initially, I was just going to index the main ontology file; in the end, I decided a partial copy was the way forward; generating both the index and the indexed file ensures that they will stay in sync.

It required a bit of nasty LaTeX hacking; the basic problem was getting around the restriction that macro names may only contain legal LaTeX macro characters (that is, letters). The system now works like this:



%% This is generated by python which also generates the
%% function_ont.spt file which is a copy of the ontology (with a
%% few new lines gone).

%% This just defines a new macro in what appears to be an
%% unnecessarily complex way.
\expandafter\def\csname OmnEntityHeaderheader\endcsname%
{\lstinputlisting[language=omn,firstline=1,lastline=8]{function_ont.spt}}

%% But the use of \expandafter and \csname means that you can
%% use any character you like, including underscores and numbers
%% in the macro name.
\expandafter\def\csname OmnEntityObjectPropertyhas_role\endcsname%
{\lstinputlisting[language=omn,firstline=206,lastline=219]{function_ont.spt}}

%% We can now define two commands in the style file. Again
%% we use \csname so that we are not bound to characters legal
%% in latex macros.
\newcommand{\omnclass}[2]{\csname OmnEntityClass#1#2\endcsname}
\newcommand{\omnobjprop}[2]{\csname OmnEntityObjectProperty#1#2\endcsname}

%% now in our source, we can do things like this.
\omnobjprop{}{has_role}
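
The Python side of this is not shown above, so here is a minimal sketch of how such an index could be generated. It is not the real script: the function names `index_frames` and `macro_for`, the list of frame keywords, the trimming of trailing blank lines and the dropping of “:” from prefixed names are all assumptions made for illustration. The line numbers it produces refer to whatever file is passed in, which would be the .spt copy rather than the original ontology.

```python
import re

# Assumed set of Manchester syntax keywords that start a new entity frame.
FRAME_RE = re.compile(
    r"^(Class|ObjectProperty|DataProperty|AnnotationProperty|Individual|Datatype):\s*(\S+)"
)


def index_frames(lines):
    """Return (kind, name, first_line, last_line) per frame, 1-indexed."""
    starts = []
    for number, line in enumerate(lines, start=1):
        match = FRAME_RE.match(line)
        if match:
            starts.append((match.group(1), match.group(2), number))

    frames = []
    for i, (kind, name, first) in enumerate(starts):
        # A frame runs up to the line before the next frame, or to end of file.
        last = starts[i + 1][2] - 1 if i + 1 < len(starts) else len(lines)
        # Trim trailing blank lines so the listing stays tight.
        while last > first and not lines[last - 1].strip():
            last -= 1
        frames.append((kind, name, first, last))
    return frames


def macro_for(kind, name, first, last, listing_file):
    """Build one \\csname definition pointing at a line range of listing_file."""
    # Assumption: "owl:Thing" becomes "owlThing", matching \omnclass{owl}{Thing}.
    macro_name = "OmnEntity" + kind + name.replace(":", "")
    return (
        "\\expandafter\\def\\csname %s\\endcsname%%\n"
        "{\\lstinputlisting[language=omn,firstline=%d,lastline=%d]{%s}}"
        % (macro_name, first, last, listing_file)
    )
```

Writing the index is then just a matter of calling `macro_for` on each tuple from `index_frames` and writing the results to a file that the style file inputs.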

Using an index in this way also has another advantage. I’ve had to make a decision whether to go with rdfs:label or the entity name. I can now back out of this; I can just use both in the index file, without too much extra space, so that either would be referenceable within the LaTeX.

To me, this feels like the right solution. It’s relatively simple (with a bit of nasty latex, which is nicely hidden), it doesn’t depend on the file system. It needs a bit more work to bring it to completion, but not that much.

Sadly bio-ontologies looms, so next week will be getting ready for that; perhaps I can finish this off on the way back. “Sadly” is perhaps a poor choice of words; I’m greatly looking forward to it, but I’ve kind of had the bit between my teeth with python and latex hacking for the last few weeks.

Just upgraded to WordPress 2.8. The automatic update didn’t work; this seems to be a continual problem which stems from wordpress not being in the default location. For some reason, it wants to push from the new version rather than pull under these circumstances. Not good.

So, I did the manual upgrade; unfortunately the admin page crashed out with an error:

PHP Fatal error: Call to a member function read() on a non-object in wp-includes/theme.php on line 387

This has been reported here and here

It’s this bit of code causing the problems.


$template_dir = @ dir("$theme_root/$template");
                if ( $template_dir ) {
                        while ( ($file = $template_dir->read()) !== false ) {
// etc

It appeared to be only my modified version of the theme (Evanesence) causing the problem; it’s not very modified, so I removed the modifications one by one. For no readily apparent reason the problem appears to be a subdirectory called “images.old”. Surely, not a good reason for a crash.

Weird and wonderful.

After a bit of struggle, I now have another literate OWL tool working, along the lines discussed in a previous blog post. Rather than generating the OWL documentation, I now split a Manchester syntax file up, so that I can refer to bits of it. I have this working with OBI, using Protege to produce a single merged ontology file, in Manchester syntax.

The current implementation is rather simple; it produces one file per entity in the OWL file, which I don’t think is entirely good. When run on OBI, it creates over 1400 files, which is a lot. The other problem is that I’ve had to do some dubious hacking to get the file names to work out. Firstly, I have to remove spaces and “\”s, as well as “:”, which is illegal on NTFS.
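
For what it’s worth, the file name hacking amounts to something like the sketch below. The function name `entity_filename`, the `kind_name` layout and the `.pomn` extension are assumptions for illustration, not what the script actually does; only the set of removed characters comes from the description above.

```python
import re

# Characters stripped from entity names: whitespace, backslash and colon.
UNSAFE = re.compile(r"[\s\\:]")


def entity_filename(kind, name, extension=".pomn"):
    """Build a file name for an entity that is legal on NTFS."""
    return kind + "_" + UNSAFE.sub("", name) + extension
```

Stripping characters like this can make two different entity names collide, which is one more reason to be unhappy with the one-file-per-entity approach.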

There’s also a problem with some of the OWL. Unfortunately, the OBI to OWL conversion process has a reification step which I don’t quite understand the purpose of. This comes out as this sort of anonymous individual. I’m not sure at all how the definition has come out as the rdfs:label, but, for sure, you can’t use this as a filename!


Individual: relationship:genid7

    Annotations:
        rdfs:label "C located_in C' if and only if: given any c that
instantiates C at a time t, there is some c' such that: c' instantiates
C' at time t and c *located_in* c'. (Here *located_in* is the
instance-level location relation.)"@en,
        oboInOwl:hasDbXref relationship:genid8

    Types:
        oboInOwl:Definition

I think I might change the implementation a bit, though. Having 1400 files in one directory is not good. My idea is to serialize the entire file out as latex, with lots of macros, autogenerated.


%% this would appear in the generated file
\newcommand{\OwlClassowlthing}{
  \begin{omn}
Class: owl:Thing
  \end{omn}
}

%% then in your latex file you would do
\owlclass{owl}{Thing}

%% which would just resolve to the class above

The only worry with this is that latex would then have to read a large file, even if most of the macros are not used. This might be really, really slow. Well, we can but try.

As before, the current version is available at git://github.com/phillord/literate_omn.git.

Well, after a reasonable degree of struggle, I managed to get the first version of my literate OWL system working. As well as learning python, I’ve had a go with git; my repo is hosted on github at git://github.com/phillord/literate_omn.git. There are three components.

omnextract.py this pulls out all the referenced omn files from the TeX document and produces the complete omn file.
omn.sty this is a driver for the listings package which does syntax highlighting in TeX.
omndoc.sty this provides commands for including files into the TeX. It’s a thin wrapper around the listings package.

I decided to make omn.sty separate from omndoc.sty as it works standalone, if you just want to use the listings package on its own. At the moment, you can only include files; environments don’t work. You can see the pdf it creates from this TeX


\documentclass{article}

\usepackage[pdftex]{color}
\usepackage{omndoc}

\title{A Test Document for OMNDoc}
\author{Phillip Lord}
%% should be ignored by latex, but read by python

\omndoc{all_test.omn}

\begin{document}
\maketitle

Here is a piece of OWL that should be readable in the documentation and in the
OMN output.

\begin{omn}
Class: FirstClass
\end{omn}

\omn{first.pomn}

Here is a piece of OWL that should be readable in the OMN output but is too
boring to be worthy of consideration for the documentation.

% \ignore{
%   \begin{omn}
%     Class: BoringOWL
%   \end{omn}
% }

\ignore{\omn{second.pomn}}

Here is a piece of broken OWL that should be rendered in the documentation (as
broken!) but should be ignored in the OMN.

% \begin{notomn}
% Clazz: BrokenOmn
% \end{notomn}

\notomn{third.pomn}

\end{document}

I’m starting to debate with myself, though, whether I have gone the right route here. The problem is that splitting the omn file up into bits is a pain. It only supports one way of working; if you want to use Protege, for example, to edit the file, you can’t; you can only view. We even miss the big advantage of literate programming; one source for both document and computation. But, then, you are stuck with a poor editing environment for either the documentation or computational representation.

I’ve been thinking instead of a system which would look like this:


\omndoc{function.omn}

\omnClass{Function}

\omnProperty{has_role}

\omnSummary{}
\omnMissing{}

Now, the python component would split the function.omn file instead of combining it. Each class, individual or property would be put into its own file. The \omnClass macro would then just be a simple include (again using the listings package); it would show the class inline. \omnSummary would include some TeX (generated from python) saying how many classes and so forth were in the omn file; \omnMissing would produce a list of Classes that are not explicitly included. Given a big monitor, you could work on the two sources (documentation and ontology) side-by-side, with only a little bit of editing to support jump-to or equivalent. Finally, it would be more syntax-independent. The TeX would not need to be changed to support, for example, the XML syntax. Just some python to split the XML document up into snippets.
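
As a rough idea of what the python behind \omnSummary and \omnMissing might compute, here is a small sketch. Nothing here is written yet; the function names `summarise` and `missing_classes`, the list of frame keywords and the way the TeX is scanned for \omnClass are all assumptions.

```python
import re
from collections import Counter

# Assumed frame keywords; each match yields a (kind, name) pair.
FRAME_RE = re.compile(
    r"^(Class|ObjectProperty|DataProperty|Individual):\s*(\S+)", re.MULTILINE
)
# Finds the argument of every \omnClass{...} in the TeX source.
USED_RE = re.compile(r"\\omnClass\{([^}]*)\}")


def summarise(omn_text):
    """Count how many frames of each kind the omn file contains."""
    return Counter(kind for kind, _ in FRAME_RE.findall(omn_text))


def missing_classes(omn_text, tex_text):
    """List classes declared in the omn file but never included in the TeX."""
    declared = [name for kind, name in FRAME_RE.findall(omn_text) if kind == "Class"]
    used = set(USED_RE.findall(tex_text))
    return [name for name in declared if name not in used]
```

The counts would be written out as a bit of generated TeX for \omnSummary, and the list from `missing_classes` as an itemize for \omnMissing.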

I shall start coding this over the next couple of days. I think I already have most of the python that I need so, hopefully, it should not take too long.

Learning a new language is always a bit stressful. I thought that I would learn python; I need a new rapid-development, build-some-scripts, but-doesn’t-look-as-awful-as-perl type of language. I’ve recently learnt lua, which was fun, but then it’s meant as a very small, quick language. It’s nice, but not really the perl-u-like that I wanted.

I have actually been through the process of learning python in the past; I used to generate my website with ht2html, which was quite cute and did the job; it was written in python, and I needed some skills to fiddle with its output. In the end, I decided that table-within-table presentation was not ideal and that CSS was the way to go, so I moved to muse, which I still use nowadays.

As always, learning a new language is frustrating as you realise that you don’t know how to do even the most elementary things, and bugs are a nightmare to hunt down. Simon has been helping me lots with some of my more “I’ve really screwed this up” questions and I now have a working version of my literate owl system. I’ll post the results of this soon; there are a few tweaks that need to happen first.

Along the way, I came across a very weird problem. My script was failing totally; it always appeared to crash with a syntax error. It took several seconds to do this and, at the same time, the mouse cursor changed into a cross. I came across a thread which looked like the same thing, but in a totally different setting. The cause? Well, my script was…


#/usr/bin/env python

import re
import sys

def main():
    TheProgramHere()

The problem is on the first line; the second character should be !. Without this the script is interpreted by bash, and import is part of ImageMagick. I finally worked out what was happening when I found two large files, one called “re” and one called “sys”, in the local directory. Computers can be irritating at times.
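
A check for this is almost embarrassingly small. The sketch below is just an illustration; the function name `missing_shebang` is made up, and it only looks at the first two characters of the script text.

```python
def missing_shebang(source):
    """True if the script text does not begin with the '#!' interpreter marker."""
    return not source.startswith("#!")


broken = "#/usr/bin/env python\nimport re\n"
fixed = "#!/usr/bin/env python\nimport re\n"

print(missing_shebang(broken))  # True
print(missing_shebang(fixed))   # False
```

Run over the text of every executable script in a directory, it would have saved me an afternoon.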

Well, this is it; although I have been using this blog for a week or so now, I haven’t told anyone about it because it wasn’t quite ready. Today, with a little VirtualHost hacking, it’s finally up and running. I’m not totally happy with the theme yet, but that can change over time. The basic content upload, commentary and permalinks all seem to be working. Many thanks to Dan Swan who set up wordpress and has lent me a bit of his virtual machine. An excellent job, as ever.

Say good bye to my old trusty site which is now decommissioned. Exercise in irrelevance is dead, long live….

While it’s not a major problem, the inability to uniquely and reliably identify a particular scientist is a niggle; a few years ago, I was distressed to find that I was scheduled to give a talk at an eScience conference about security; anyone who knows me will understand how implausible this was. I hadn’t considered the possibility that there was another Phillip Lord in eScience. It’s not that common a name.

So, what would we want from such an ID system? I think that the basic requirements would be:

  • the IDs should be unique; one ID only ever refers to one scientist.
  • the reverse should also be true; one scientist should not need to change their ID.
  • the ID should be printable, so that it can appear in papers.
  • the ID should be usable with a resolution system.

I think that this is it. I would say, also, that there are some softer requirements. Firstly, I think that the IDs should be useful to the scientist (above and beyond being able to link all their papers and research results); this would give them more immediate feedback, so that they would find the system to be a good thing, rather than a burden. Secondly, the system should be familiar and easy to use. Finally, as an anti-requirement, the system need not be secure; that is, it would be possible for someone to pretend to be me; this is not to say we couldn’t layer a secure identification system on top of the IDs.

So I thought about what form the ID would take. My first thought was just to layer the system on top of the first name and surname of the scientist. This has the big advantage, of course, that it makes the system easy to use; scientists already know their own names (mostly) and so does everyone else. People will remember the IDs easily. The problem is, of course, that people’s names are not fixed; women, particularly, are likely to change their names, and once the link between name and identifier is broken the advantage is lost.

My second thought was that we could use identifiers chosen by the scientist; this is not a bad idea; of course, it’s harder for humans to link between the ID and (other) scientists, but in time you would come to know IDs for most of the people in your domain. However, this form of identifier is also likely to become broken over time: firstly, many scientists will just want to choose their names, so we have the same problem as before; secondly, some scientists will just want to change their IDs; while peanutbutter or DullHunk might work now, it is possible that the owners of these names will come to regret them, like the “Phil loves Newcastle United” tattoo that I don’t have on my forehead.

In the end, I’ve come to the conclusion that only a semantics-free identifier actually makes any sense. This is clearly the least memorable route, but even here it’s not too bad; I know my NI number by heart because I use it a lot (or used it a lot at one point in time). In practice, most scientists read stuff on the web, so this could be resolved to show the full name automatically; in most cases, with papers for instance, it would be augmented with a standard name anyway.

So, what form of ID do we want? Well, the simplest form would be a six-letter code. This gives a little over 300,000,000 alternatives (26^6); if we add in numbers this rises to a little over 2 billion (36^6). Probably more than enough for scientists now and into the future. The system could be extended if the name space ran out. However, I think we could improve the system by adding an extra letter to make 7; this would now mean that we could ensure that no two scientists had an ID with only a single edit difference; essentially, one letter would be redundant. Finally, we could add a final letter to make a checksum: basically, treat the letters as base 26 numbers, multiply them, divide by 26, take the remainder and use this as the last letter. This would allow an easy validation step. We might also want to do a dictionary-based block on some names; pity the poor scientist who ended up as NOBRAINS or other far worse 8 letter IDs.
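
To make the arithmetic and the validation step concrete, here is a small sketch. Note that the check letter is not the multiply-and-take-the-remainder scheme described above: I have swapped in a position-weighted sum modulo 26, because a plain product loses too much information (a single A, counted as zero, would wipe it out). The weights, the function names and the letters-only alphabet are all assumptions. Because every weight is coprime with 26, changing any single letter of the 7-letter body changes the check letter.

```python
import string

LETTERS = string.ascii_uppercase
# One weight per body letter; each is coprime with 26 (assumed choice).
WEIGHTS = (1, 3, 5, 7, 9, 11, 15)


def check_letter(body):
    """Compute the check letter for a 7-letter body (weighted sum mod 26)."""
    if len(body) != len(WEIGHTS) or any(c not in LETTERS for c in body):
        raise ValueError("body must be 7 upper-case letters")
    total = sum(w * LETTERS.index(c) for w, c in zip(WEIGHTS, body))
    return LETTERS[total % 26]


def make_id(body):
    """Append the check letter to give the full 8-letter ID."""
    return body + check_letter(body)


def is_valid(candidate):
    """Validate an 8-letter ID by recomputing its check letter."""
    return (
        len(candidate) == 8
        and all(c in LETTERS for c in candidate)
        and check_letter(candidate[:7]) == candidate[7]
    )


print(26 ** 6)  # 308915776 six-letter codes
print(36 ** 6)  # 2176782336 codes once digits are allowed
```

The single-edit-distance guarantee and the dictionary block would sit in whatever service hands out the IDs, so they are not shown here.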

As it stands, I don’t think that this would place too much load on scientists, but it would also not appeal to people; the big win would come when they could use these IDs to make their daily life easier. This could be achieved by sticking an authentication protocol on top, OpenID being the obvious one, although the IDs are generic enough that any authentication system could be stuffed on the end; as the IDs are not going to change over the life of a scientist this should reduce the management load of yet another identifier. Potentially, we could login to eduroam, various academic tools, wikis and the like all with a single ID. At the last RIN/DCC meeting, many people argued that they need username/password registration; I suggested that this was a significant pain and barrier to reuse; this is true, but the barrier gets a lot lower if the registration process either disappears or every scientist gets to reuse the same ID.

Technologically, I don’t think that this would take a lot of effort to set up. Socially, the demands would be huge; for it to work, the basic technology is not enough; we would need to put in infrastructure to make sure key tools supported the system; JeS and Shibboleth would be obvious first points of contact; adding an OpenID provider would support less formal resources (such as project Wikis), but collaborating with Wikipedia and paying them to add support would help.

In some sense, I look forward to the day that I cease to be Phil Lord and become ADSJWOSK.

It’s finally happened. I’ve decided to move from generating my blog with muse to using a WordPress hosted version. The muse generated version is a set of static pages; I like the simplicity of this, but it’s just not powerful enough. I wanted to keep the ability to edit my posts with a text editor; for this, I am using asciidoc and blogpost which I hope will function as easily as muse. It’s going to take writing a bit of support code, but it should be relatively light; in the meantime, it should stop people moaning about my awful blog design.

As I post this, I’ve not gone 100% live yet; there are still a few things left to do. When it’s all finally ready, I shall post my last note to my old blog, and the change over will have happened.

Feels a bit sad, after three years using the old technique, but change happens to us all in the end.

Today, iplayer tells me "You have download 2.22 of content" with a checkbox saying "Do not show this message". Robbed of a unit the former looks messy, robbed of "again" the latter looks a bit "Do not press this button again".

Download times have come down a bit. Still — 4 hours now for a 60 min programme. I even managed to get something to play today; the frame rate appeared to be about 5/second.

Originally published on my old blog site.

What a flurry of posts! I went mad today and joined twitter and friendfeed both at the same time. Gosh, what a time waster this stuff all is.

Right, just got to twitter about posting on my blog.

Originally published on my old blog site.