Archive for the ‘Labbook’ Category

Recently, my PhD student, Jennifer Warrender, and I have become interested in the representation of karyotypes. These are descriptions of the chromosome complement of an individual. In essence, they are a bird's-eye view of the genome. Normally, they are described using a karyotype string, so my karyotype would be 46,XY (probably!), which is a normal male. When describing abnormalities, these can get very complex; take, for example, 46,XX,der(9),t(9;11)(p22;q23)t(11;12)(p13;q22)c,der(11)t(9;11)t(11;12)c,der(12)t(11;12)c[20] which describes a patient with multiple translocations.
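To give a feel for the structure of these strings, here is a minimal Python sketch that splits a karyotype string into a chromosome count, a sex chromosome complement and a list of abnormality descriptions. It is only an illustration: the names `SimpleKaryotype` and `parse_simple_karyotype` are made up for this example, the split on commas is naive, and it makes no attempt to interpret the abnormalities or to cover the ISCN rules.

```python
from typing import NamedTuple, Tuple


class SimpleKaryotype(NamedTuple):
    """Hypothetical container for the top-level fields of a karyotype string."""
    chromosome_count: int
    sex_chromosomes: str
    abnormalities: Tuple[str, ...]


def parse_simple_karyotype(text: str) -> SimpleKaryotype:
    """Split a karyotype string on commas; assumes 'count,sex[,abnormality...]'."""
    fields = [field.strip() for field in text.strip().split(",")]
    if len(fields) < 2 or not fields[0].isdigit():
        raise ValueError("expected something like '46,XY': %r" % text)
    sex = fields[1]
    if not sex or any(ch not in "XY" for ch in sex):
        raise ValueError("unexpected sex chromosome field: %r" % sex)
    # Everything after the first two fields is kept as an opaque string.
    return SimpleKaryotype(int(fields[0]), sex, tuple(fields[2:]))


if __name__ == "__main__":
    print(parse_simple_karyotype("46,XY"))
```

Even this much shows the problem: the first two fields are easy, but the abnormality descriptions stay opaque strings unless they are given a proper, computable representation.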

There are a couple of reasons why we thought that it would be interesting to turn this into an ontological representation. The karyotype strings are not very parsable, and lack a computable specification, which makes it very hard to check that they are correct, or to search and query over them. Having an ontological representation should help with the specification; additionally, using OWL we should be able to reason over both the specification and individual karyotypes.

So far, it is turning out to be quite an interesting experience. The definitive resource for these strings is called ISCN (http://www.worldcat.org/title/iscn-2009/oclc/496946115). One of the things that we thought would happen is that we would detect some inconsistencies in this specification; this often happens when producing a computable specification from a human-readable one, and it is not a reflection on the authors of the original. Sure enough, this is turning out to be the case. On the second page of the specification (after front matter), for instance, we find this statement:

Group G (21-22,Y): Short acrocentric chromosomes with satellites. The Y chromosome bears no satellites

pg7
— ISCN

The two sentences here contradict each other: either the Y chromosome should not be in Group G, or Group G should not be defined to bear satellites.

There is an apparently similar statement on the page before, which says:

Not all the chromosomes in the D and G groups show satellites on their short arms in a single cell

pg6
— ISCN

However, there is a significant difference here. To fill in the cytogenetic background, a satellite is a differentially staining visible body on the chromosome, found near the centromere of some chromosomes. The name “satellite” actually comes from a different property of satellite DNA: it is often a different density from bulk genomic DNA so, when spun in a density gradient, it appears as a smaller band segregated from the main genomic band. Nowadays, satellite DNA is known to be highly repetitive sequence that we are no longer allowed to call Junk DNA (http://www.bbc.co.uk/news/health-19202141). In humans, the most common sequence is known as the alpha satellite. The different densities sometimes seen in gradients occur where the GC content is different from bulk. The differential staining patterns seen cytogenetically occur because the repetitive DNA is normally packed as heterochromatin.

This leads us to understand the difference between our two quotes from ISCN. Although repetitive DNA is variable in detail, it is not that variable, and will remain constant within an individual; if one Chromosome 22 has alpha satellite at the centromere, then so will another. The key here is the caveat “in a single cell”. In a chromosome spread from a single cell, an individual chromosome may or may not be at the right stage of condensation for the satellite to be visible.

So, we have two usages of the word chromosome. In the first quote, we are referring to a canonical chromosome; so, canonically, it is true that all human chromosome 22s contain a satellite. In the second statement, we are referring to a single chromosome, in a single cell, from a single human. There is no contradiction here because the existence of a single chromosome 22 without satellites is not inconsistent with the canonical chromosome 22 having satellites.

Actually, ontologically, the situation is slightly more complex still. The second quote says “Not all..show satellites” (our emphasis). It is not an issue of whether the chromosomes have or do not have the relevant DNA, it is whether we happen to be able to see this after appropriate staining.

We will consider both of these issues, canonicalization and visualisation, in future posts.


Authors

This post was authored by Phillip Lord and Jennifer Warrender.

I have described my experiences of using Emacs for writing ontologies previously (http://www.russet.org.uk/blog/2161). I was not entirely happy with omn-mode, even after recent changes (http://www.russet.org.uk/blog/2185), so I have taken the opportunity to update it a little more. This article mostly describes some implementation changes.

Originally, omn-mode was based on generic.el; this is a package which enables development of simple major modes. However, the emphasis was on simple, and my code was getting a little bit complex; generic.el was starting to get in the way. Moving to define-derived-mode was not pain-free; it involved redoing already functioning code, which is always a bit disheartening but probably worthwhile.

One thing that I did have problems with was getting comments to work properly (Emacs syntax tables are nasty). This never worked properly anyway, as I had defined “# ” as the comment starter; the reason for this is that Manchester syntax also uses URIs, within which any character is basically legal, including “#” but not a space. I think I have now found a better workaround for this. Manchester syntax requires URIs to be <surrounded>. I’ve now defined “<” and “>” to be string delimiters, which means that I can now use “#” as a comment starter without it being recognised in a URI. This isn’t perfect; in particular, Emacs will not recognise “<” and “>” as paired, so <URI> is equivalent to <URI< and >URI>, although only the former is correct.
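The idea of hiding “#” inside angle brackets can be sketched outside Emacs too. The following Python function is only an illustration of the principle, not the omn-mode code: the name `strip_comment` is invented here, and unlike the syntax-table approach it does pair “<” with “>”.

```python
def strip_comment(line: str) -> str:
    """Remove a trailing '#' comment, ignoring any '#' found inside <...>."""
    inside_iri = False
    for index, char in enumerate(line):
        if char == "<" and not inside_iri:
            inside_iri = True
        elif char == ">" and inside_iri:
            inside_iri = False
        elif char == "#" and not inside_iri:
            # First '#' outside angle brackets starts the comment.
            return line[:index].rstrip()
    return line.rstrip()


if __name__ == "__main__":
    print(strip_comment("Class: <http://example.org/onto#Thing> # a comment"))
```

The example URI is made up; the point is just that the “#” in the fragment survives while the trailing comment does not.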

I’ve also updated the indentation logic somewhat. This now uses the syntax-table parser rather than font-lock to work out whether point is in a comment or string. Of course, it should have been this way all along but the syntax-ppss function is new to me.

Finally, I have added some electric features. Because the indentation engine works on keywords rather than brackets, the indentation level really needs to be calculated when a line is finished rather than when it is started; electric mode means that the indentation updates as the user types. It’s a little jarring in some ways, although Python does something similar and for the same reason.

So, no major changes from the point of view of the user, except that it should all just work a little better. Code at Google code or website.

I have just started to build an ontology and I have to admit that it has been a while since I have done this; I think that the last time was when writing a paper about function (10.1186/2041-1480-1-S1-S4), so I was interested to see how it would work. I have been engaged in discussions recently about syntactic aspects of OWL (http://www.russet.org.uk/blog/2040); the main reason for this is my long-held belief in the need for editing tools that work at the syntactic level; this allows us to plug in to the enormous body of programming tools supporting building, collaborative development, versioning and so on. So, I decided to build the entire thing using Emacs; the nature of the ontology also meant that I wanted to reboot my long-neglected attempts to bring literate development to ontologies (http://www.russet.org.uk/blog/1213). While it is not a large ontology, I did manage 60 classes in an afternoon, so I am quite pleased with the results.

My basic working environment is as follows:

Emacs: for editing
omn-mode.el: providing basic OWL Manchester Syntax support
pabbrev.el: dynamic and automatic abbreviation support
Protege: for viewing the ontology, and running reasoners

As an environment, this works quite well now. Although I have tried it before, Protege seems to work much better now when (mis)-used as a display environment. First, when loading a file into Protege it gives a report of errors, but has a nice “reload” button so that I can now fix the errors (or at least try to). Second, after an ontology has been loaded, Protege will now detect that the file has changed and offer to reload it. In general, it works quite well. There are still some issues (I have not had time to work this out reproducibly enough for a bug report), but there are times when Protege breaks, particularly when I change the file and break the syntax. I can live with this: while restarting Protege is slow in computer time with the Java loading, it doesn’t require lots of clicking to get back to where I started (“Open Recent”), so it’s quick for the user. Finally, and most importantly of all, Protege’s Manchester syntax parser now seems to support comment characters correctly; at least in my hands it treats “#” as a comment character.

Using Emacs as an editing environment over Manchester syntax has some considerable advantages over using Protege raw. I am very keyboard-centric while Protege is very mouse-centric; just moving backward and forward to class definitions with incremental search is much better than in Protege. Simple things like search and replace, especially with regexps, just happen naturally in Emacs and there is no equivalent in Protege.

I wrote omn-mode.el a long time ago now; I don’t remember when, although the last update to its original Subversion repository is from 2005 according to the :Date : inside the file. omn-mode is based on generic.el, and it is starting to get a little stretched for this now; I should move it to using the normal define-derived-mode functionality. However, this does reflect what it is best at, which is syntax highlighting. I had to fiddle with this a bit, to add support for some extra keywords and just make it more consistent. I am working on the basis that everything should be syntax-highlighted; although this makes Manchester Syntax files a little garish, it helps get the syntax correct. I also improved some of the regexps: so " some " has been changed to "\\<some\\>".
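The difference between the two regexps is easiest to see with a keyword at the end of a line. The sketch below uses Python’s re module rather than Emacs Lisp, so `\b` stands in for the Emacs `\<` and `\>` word-boundary operators; the sample axioms and pattern names are invented for illustration.

```python
import re

# Old style: the keyword must have a literal space on both sides.
SPACE_DELIMITED = re.compile(r" some ")
# New style: the keyword only needs word boundaries around it.
WORD_BOUNDED = re.compile(r"\bsome\b")


def has_keyword(pattern, line):
    """Return True if the pattern finds the keyword anywhere in the line."""
    return pattern.search(line) is not None


if __name__ == "__main__":
    for line in ["hasPart some Wheel", "hasPart some", "awesome:Class"]:
        print(repr(line),
              has_keyword(SPACE_DELIMITED, line),
              has_keyword(WORD_BOUNDED, line))
```

The space-delimited version misses a keyword that ends a line or sits next to a bracket, while the word-bounded version still ignores “some” buried inside a longer name.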

As Protege is now doing comments, I have added proper support for these also, although as a comment character “#” is a bit irritating, given that it is also a valid part of a URL. So I fudged a bit here and used “# ” as a two character start. This means that multiple comment characters such as “##”, which I use in Lisp, are not going to work. But fixing the situation would be much, much harder. It seems to me that “#” is not an ideal choice; a Lisp-like “;” would be better; I think, technically, you can find a “;” in a URL, but I have never seen one.

While the previous indentation engine (based purely on the previous line) worked surprisingly well, I have also improved this now. Actually Manchester syntax is surprisingly easy to indent reasonably well; an ontology is essentially a bag of axioms, so it has relatively little structure at the syntactic level, which means that my engine only uses three indentation levels.
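A three-level, keyword-driven scheme of this kind can be sketched in a few lines. The Python below is a guess at the general shape, not a port of omn-mode: the two keyword sets, the indent width and the function names are all assumptions made for the example, and comments are simply treated like any other continuation line.

```python
# Assumed keyword sets; a real mode would need the full Manchester syntax list.
FRAME_KEYWORDS = {"Prefix:", "Ontology:", "Class:", "ObjectProperty:",
                  "DataProperty:", "AnnotationProperty:", "Individual:"}
SECTION_KEYWORDS = {"Annotations:", "SubClassOf:", "EquivalentTo:",
                    "DisjointWith:", "Domain:", "Range:", "Types:", "Facts:"}
INDENT_WIDTH = 4  # assumed width, purely for illustration


def indentation_level(line):
    """Return 0 for frame keywords, 1 for section keywords, 2 otherwise."""
    tokens = line.split()
    first = tokens[0] if tokens else ""
    if first in FRAME_KEYWORDS:
        return 0
    if first in SECTION_KEYWORDS:
        return 1
    return 2


def reindent(text):
    """Reindent every non-blank line according to its first token."""
    result = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            result.append("")
            continue
        result.append(" " * (INDENT_WIDTH * indentation_level(line)) + stripped)
    return "\n".join(result)


if __name__ == "__main__":
    print(reindent("Class: Pizza\nSubClassOf:\nFood,\nhasBase some Base"))
```

Because the level depends only on the first token of the line, no bracket matching or lookback is needed, which is why so little structure is enough.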

Finally, I’ve also updated the mode to recognise both ":" and "_" as word constituents, for reasons that should become clear.

pabbrev.el is my own dynamic “as you type” abbreviation expansion package, and it is still the nicest thing that I have written for Emacs. I use it every day (and am using it now). I did notice one minor bug with it which was making it misbehave, but the main change comes from the update to omn-mode’s syntax table. pabbrev.el now expands prefixes and terms as a single word. For me, at least, this seems to work well. The change to the syntax table will also affect other dynamic abbreviation packages, including dabbrev and hippie-expand. Similarly, underscore_separated_terms should expand.
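The effect of the syntax-table change on what counts as a “word” can be imitated with two regular expressions. This Python sketch is only an analogy for how an abbreviation package would see the buffer; the pattern names and the sample identifier are invented for the example.

```python
import re

# Words as seen before the change: letters and digits only.
PLAIN_WORD = re.compile(r"[A-Za-z0-9]+")
# Words as seen after the change: ':' and '_' also count as constituents.
OMN_WORD = re.compile(r"[A-Za-z0-9:_]+")


def words(text, pattern):
    """Return every maximal run of word-constituent characters in the text."""
    return pattern.findall(text)


if __name__ == "__main__":
    sample = "obo:GO_0008150 some"
    print(words(sample, PLAIN_WORD))
    print(words(sample, OMN_WORD))
```

With the wider definition, a prefixed name is collected, and so offered for expansion, as one unit rather than as three fragments.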

The combination of these changes means that, for me anyway, Emacs is now quite a capable Manchester syntax editor. I have only really touched here on the things that I have written, but editing OWL ontologies at the textual level also really does open up many possibilities. Having an environment that I control is also useful. I would like to extend Manchester syntax to support semantics-free identifiers (http://www.russet.org.uk/blog/2040). I can now make an initial implementation of this, using a pre-processor to unwind the Alias: definitions to produce a “real” Manchester syntax file.
