Archive for September, 2009

At Neuroinformatics 2009, David Sutherland and I talked about the problems of ontology building. One of the current (and past!) difficulties is to choose an appropriate language for representing the knowledge in your ontology. I thought I would write my thoughts up as a post; this will probably result in the most boring thing I have ever written (I am sure someone will point out worse offenses); syntax is dull but distressingly important.

In bioinformatics, there are essentially two choices: OWL and OBO (format). A second issue is finding a good environment for developing the ontology; this divides between Protege, OBO-Edit and the ever-present “text editor”. It’s often the case that we want to use both of these at the same time. Take, for example, OBI, which I am involved in. While the ontology itself is being developed in OWL, many of its dependent ontologies are built using OBO; being purist and demanding one is really not an option. OWL itself has many different syntaxes; at the moment, I generally prefer Manchester syntax because you can edit it with a text editor, which is really not so easy with any of the XML representations.

While these two languages have somewhat different expressivity, how to translate both the syntax and the semantics has been described elsewhere a number of times. One of the recurrent problems, however, stems from the best practices and the syntax of identifiers.

OBO makes use of a numerical, semantics-free identifier and a namespace, with a syntax of NAMESPACE:IDENTIFIER. So, a Gene Ontology term looks like GO:0003674. The namespace is not constrained to be two letters and has mechanisms for world-uniqueness, in that people talk to each other and sort it out if they clash. The use of a semantics-free identifier means that term names can be changed while maintaining the meaning implied by the term; the label for the term, meanwhile, provides a human-readable version, which can be shown to users of the ontology. I will call these the OBO identifier and OBO label respectively.

Translating this into OWL, including Manchester syntax, however, causes significant problems. The natural translation is to turn the OBO identifier into the identifier in OWL; the OBO namespace would become an XML namespace, the OBO identifier would become an XML identifier. Unfortunately, this doesn’t work. First, the OBO identifier is genuinely just a short string and XML requires a URI; so a mapping between OBO identifiers and URIs is necessary. Second, the OBO identifier is numerical; unfortunately, while the identifiers in OWL can contain numbers, they have to start with a non-numerical character. The standard translation, therefore, uses in most cases an OBO-wide URL (http://purl.obolibrary.org/obo/), although some ontologies have their own namespace (GO uses http://purl.org/obo/owl/GO#). The OBO identifier is mapped to a valid identifier by sticking a prefix onto the numbers. So, we have identifiers such as GO:GO_0042101 or obo:OBI_1110045. There are also some OBO ontologies for which this does NOT occur; for instance, BFO classes in OBI come out with identifiers of the form snap:Continuant or span:Process, except for one which is bfo:Entity.
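
As a rough sketch of the prefixing trick just described (Python, purely for illustration): the function names are mine and do not come from any OBO or OWL tool, and the sketch assumes the OBO-wide base URL rather than a per-ontology namespace such as the one GO uses.

```python
# Assumption: the OBO-wide base URL quoted in the post; ontologies with their
# own namespace (GO, BFO) would need a different base or different handling.
OBO_BASE = "http://purl.obolibrary.org/obo/"


def obo_id_to_uri(obo_id, base=OBO_BASE):
    """Map an OBO identifier such as 'OBI:1110161' to a URI.

    The namespace is stuck onto the front of the number, so the local
    part of the URI no longer starts with a digit.
    """
    namespace, sep, local = obo_id.partition(":")
    if not sep or not namespace or not local:
        raise ValueError(f"not a NAMESPACE:IDENTIFIER string: {obo_id!r}")
    return f"{base}{namespace}_{local}"


def uri_to_obo_id(uri, base=OBO_BASE):
    """Reverse the mapping, for URIs that follow the same convention."""
    if not uri.startswith(base):
        raise ValueError(f"URI does not start with {base!r}: {uri!r}")
    namespace, sep, local = uri[len(base):].partition("_")
    if not sep or not namespace or not local:
        raise ValueError(f"no NAMESPACE_IDENTIFIER part in: {uri!r}")
    return f"{namespace}:{local}"


print(obo_id_to_uri("OBI:1110161"))
```

The round trip only works because the identifier carries no meaning of its own; nothing here knows or cares what the term is called.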

Again, all perfectly reasonable, but unfortunately, when converted to Manchester syntax it means that we end up with classes that look like this slightly elided class from OBI:


Class: obo:OBI_1110161

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

which completely defeats the aim of a human-readable syntax. Now OBO format has much the same problem; relationships to other classes are specified using cross-references to their identifiers which are, essentially, unreadable. OBO format works around this with a denormalisation, as can be seen from this somewhat elided example from IAO:


[Term]
id: IAO:0000027
name: data item
def: "a data item is an information content entity that is intended...."
is_a: IAO:0000030 ! information content entity

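The cross-reference in this case is a subsumption link to IAO:0000030; the text after the “!” is a comment repeating the label of the referenced term, and that repetition is the denormalisation.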

One solution would be to use the rdfs:label in place of the identifier. So, we would have something that looked like this:


Class: "T cell epitope ELISA IL-1b assay" @en

    Annotations:
        obo:identifier "1110161"

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

Other identifiers would have to be changed as well. I’ve also added the obo:identifier line (which I think would be valid, but might require the creation of an OWL individual). Without this, it would not be possible to go backward.

However, this is problematic as it changes the serialization between the OWL Manchester syntax and other syntaxes of OWL. The class identifier has to be URI legal, and the OBO label here is not. We could do a syntactic conversion (e.g. T%20cell%20epitope) but this, again, reduces readability, defeating the point. Also, the rdfs:label would become part of the final identifier URI, which then becomes a semantics-heavy identifier. Finally, it would require an OBO-specific loading of the Manchester syntax, taking the URI identifier from the annotation block, and the rdfs:label from the class name.
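
To see how ugly the syntactic conversion gets, here is a small sketch using Python’s standard urllib.parse. Sticking the encoded label onto the OBO base URL is a hypothetical of mine, just to show what a label-derived URI would look like; nobody actually does this.

```python
from urllib.parse import quote, unquote

# Base URL taken from the post; using it for label-derived names is hypothetical.
OBO_BASE = "http://purl.obolibrary.org/obo/"

label = "T cell epitope ELISA IL-1b assay"

# Percent-encode everything that is not an unreserved URI character.
encoded = quote(label, safe="")
uri = OBO_BASE + encoded

print(encoded)  # T%20cell%20epitope%20ELISA%20IL-1b%20assay
print(unquote(encoded) == label)  # True
```

The encoding is reversible, so nothing is lost, but the result is no easier on the eye than the numeric identifier, and the label is now baked into the URI.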

So, is there any solution? First, there are tooling solutions. In Protege, it is already possible to use any component of the definition in the display. So, you can set the rdfs:label as the main display form. Tooling solutions are attractive, but there is a problem; you have to extend all tools to support this view. I realise that the number of freaks who wish to edit OWL with Emacs is not that large, so this might not seem an issue. However, many people wish to develop ontologies collaboratively using version control; if you want to compare versions you use diff, so we now need a Manchester syntax diff viewer. Also, if you want to do some Perl hacking, or straightforward search and replace, again, it’s all harder.

To some extent this might seem trivial, but then the entire purpose of Manchester syntax (and the functional syntax) is to have an easy to read and manipulate syntax which the XML version of OWL is not. This purpose is defeated if it’s hard to read.

So, a second, non-tooling solution. The obvious answer is to take the OBO approach and add comments. Now, the Manchester syntax includes a comment character (#), although last time I tried, the Protege parser didn’t implement this. None the less, it allows this:


Class: obo:OBI_1110161 #"T cell epitope ELISA IL-1b assay"@en

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

This is not too bad, but it doesn’t work well for complex class expressions. I can’t be bothered to look up the labels and have reused one, but you get something like:


Class: obo:OBI_1110161 #"T cell epitope ELISA IL-1b assay"@en,

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661, #"T cell epitope ELISA IL-1b assay"@en
        obo:OBI_0000299 #"T cell epitope ELISA IL-1b assay"@en
        some (obo:IAO_0000109 #"T cell epitope ELISA IL-1b assay"@en
        and (obo:IAO_0000136 #"T cell epitope ELISA IL-1b assay"@en
             some obo:OBI_11101 #"T cell epitope ELISA IL-1b assay"@en
             ))

This has three problems. Firstly, we have used comments “meaningfully”, yet we can’t distinguish between these comments and other, normal comments. Secondly, we have had to reformat the output because we have only a “to-end-of-line” comment character. Thirdly, it looks horrible.

So, my minimal solution would be this: we introduce some new comment characters, which are treated as comments normally, but which carry enough semantics to allow a warning when they are wrong; rather like Javadoc, which is a comment wrt the language, but is structured and meaningful wrt the documentation. Tooling could be used to check that the labels masquerading as comments are correct wrt the identifiers.


Class: obo:OBI_1110161 [T cell epitope ELISA IL-1b assay],

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661 [blah],
        obo:OBI_0000299 [longer blah]
        some (obo:IAO_0000109 [more]
        and (obo:IAO_0000136 [stuff]
        some obo:OBI_11101 [OBI Thing]
        ))

This is still not ideal; it would require an extension to Manchester syntax, but it’s minimal, and it does support the semantics-free identifiers in OBO in a way which does not require extensive tooling. It’s worth reiterating here that OBO’s semantics-free identifiers are a good thing; so, supporting them supports other people who may wish to do the same, sensible thing. It does have the disadvantage of duplicating information, but at least in a way that is checkable.
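
To give an idea of how little tooling the check would need, here is a minimal sketch in Python. Everything in it is hypothetical: the square-bracket syntax is only my proposal, the regular expression is a guess at how it might be recognised, and the table of known labels is made up (in practice it would come from the rdfs:label annotations in the ontology).

```python
import re

# Hypothetical pattern for the proposed syntax: a prefixed identifier
# followed by a label in square brackets, e.g. obo:OBI_0000661 [blah].
BRACKETED = re.compile(r"(\w+:\w+)\s*\[([^\]]*)\]")


def check_bracket_labels(text, labels):
    """Return one warning per bracketed label that disagrees with `labels`.

    `labels` maps identifiers to their rdfs:label; how that map is built
    (from the ontology itself, say) is left out of this sketch.
    """
    warnings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ident, shown in BRACKETED.findall(line):
            expected = labels.get(ident)
            if expected is None:
                warnings.append(f"line {lineno}: no label known for {ident}")
            elif expected != shown.strip():
                warnings.append(
                    f"line {lineno}: {ident} is labelled {expected!r}, "
                    f"but the bracket says {shown.strip()!r}"
                )
    return warnings


# Made-up data: the second label is a placeholder, not the real OBI label.
known_labels = {
    "obo:OBI_1110161": "T cell epitope ELISA IL-1b assay",
    "obo:OBI_0000661": "placeholder label",
}
source = "\n".join([
    "Class: obo:OBI_1110161 [T cell epitope ELISA IL-1b assay]",
    "    SubClassOf:",
    "        obo:OBI_0000661 [stale label]",
])

for warning in check_bracket_labels(source, known_labels):
    print(warning)
```

Because the check works line by line on plain text, it would sit happily in a commit hook next to diff, which is rather the point of the exercise.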

Comments welcome!

This is the third year in a row that I have been to Neuroinformatics (or its forerunner, Databasing the Brain). It’s still turning out to be an enjoyable meeting, even though there is still lots of it that I don’t understand. Come to think of it, perhaps because there is lots of it that I don’t understand.

Pilsen (or Plzen) is, perhaps, a strange place for the meeting. It’s a bit of a pig to get to, as the airport is in Prague. Likewise, the conference centre was a bit out of town, so you had to get a taxi if you wanted food in the evening. Still the venue itself worked well. Slightly flaky wireless, but it had tables upstairs on a balcony; a lot of people migrated up there as the meeting went on, making the auditorium a little deserted.

Although I’ve said I didn’t understand lots of it, many of the keynotes this year were on bioinformatics, systems biology or data integration, which I know well. As well as that, there was a (semantic) web and ontology section. I enjoyed Tim Clark’s talk, as he’s made stuff that lots of people are actually using, although I don’t think he explained why during his talk.

The section on high performance computing was probably the least relevant. While they’ve become interested in power consumption recently, these guys are still obsessed with teraflops (…now petaflops…now exaflops). To be honest, I don’t care. With more power, you can build more granular, higher resolution models, but I doubt that will bring you anything, unless you also have more granular data. They should be worried about discs — always the Cinderella of the hardware world, only slightly more interesting than printers — but it’s discs which carry the data. While we are at it, spinning discs use lots of power. And they have more flashing lights than CPUs. The hardware guys should be talking about disc space. The neuroscientists should be worrying about filling discs up. Neuroinformaticians should make sure that they end up with an exabyte dataset; not 1000 petabyte datasets or worse, 1,000,000 gigabyte datasets.

I tried to get a bit of Web 2.0 stuff happening at the meeting. David Sutherland set up a friendfeed room. First day, we were sitting next to each other like two sad blokes at a party full of women, sending each other messages on our iPhones. Although, it was a neuroinformatics meeting, so largely without the women. Second day, mostly it was just me, sad, lonely and pathetic. Still, having said that, I did manage to meet almost all of those subscribed to the room, which you couldn’t achieve at ISMB nowadays. Pavan Ramkumar said hello at lunch, and then later at the airport. I met Sarah Maynard at her poster; it had ontologies, OWL and information content-based similarity measures; bound to make me happy. Only Lisa Kjonigsen remained in cyberspace. With luck, next year, more people will join; not least because I’ll probably not go to Japan.

I had a quick go at live blogging also; to be honest, I am not a natural. The problem is I have too much desire to editorialise. The roboblogger tells me that she just blogs the notes that she would have taken anyway; my notes, on the other hand, are full of comment, invective and questions. Perhaps I could just put these into the asciidoc source of my blog as comments. I stopped live blogging on the last day, not for these reasons, but largely as a desire not to hold my crushing ignorance of the topics being discussed up to public scrutiny.

Neuroinformatics (the meeting) is changing. I have to believe that if there is more about genomic and multiomic data integration, this has to be a good thing. The brain is a hard thing to figure out; I have to believe that using more data, more types of data and a heavier use of nice, simple, model organisms is going to increase the rate of advance; with all the fuss about systems biology, it’s easy to forget the fabulous success of the last 100 years of reductionist biology, which made systems biology possible. This has to be the way forward for neuroscience. Even if it does make the meeting more usual and, perhaps, less interesting for me as a result.

Make has been driving me mad for the last week. It keeps on complaining about “modification time in the future”. Normally, this happens because you’re using remote files from a server which doesn’t have sync’d time. But this is rare these days. Anyway, it was complaining that the file was 10E+06 seconds in the future; that’s a really, really big clock skew.

Did a bit of poking around. One possibility I found was that it was due to a limitation in FAT32; hmmm, not likely. Didn’t have time for more. I am at a conference; supposed to be paying some attention.

Anyway, the solution came to me today. Or rather the cause, because the solution was obvious. Turns out, when I changed timezone to Czech, I pushed the month back to August. What I don’t understand is that I was sure Windows synced to an NTP server running somewhere. What does it do when you change the month?
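
For what it’s worth, a quick way to see which files make is upset about is to compare their modification times with the current clock. This is only a sketch in Python; the function name and the one-second tolerance are my own choices, and it uses nothing beyond the standard os and time modules.

```python
import os
import time


def files_from_the_future(root, now=None, tolerance=1.0):
    """Yield (path, seconds_ahead) for files under `root` whose
    modification time is more than `tolerance` seconds after `now`."""
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            ahead = os.path.getmtime(path) - now
            if ahead > tolerance:
                yield path, ahead


if __name__ == "__main__":
    for path, ahead in files_from_the_future("."):
        print(f"{path}: {ahead:.0f} seconds in the future")
```

If everything written over a few days shows up with a similar, enormous offset, the clock was wrong when the files were written, and the file system is not to blame. Once the clock is right again, calling os.utime(path) on each offender resets its modification time to the present.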

This is a live blog from Neuroinformatics 2009.

Data management: View from 50,000 feet — dimensions are amount of structure and the number of data sources. More structure, fewer data sources.

Distinguishes between parallelisation and heterogeneity. Can distribute data across tables in an organised way — this is parallelisation; or, you can have lots of data, spread across resources, with multiple entities and with no common plan.

Outline — data integration and suggest data spaces as a solution.

Databases are so successful because they provide a level of abstraction over the data. Data integration is a higher level of abstraction still because you don’t have to worry how the data is stored or structured.

Mediated schema, uses a mediation language, a mapping tool, and then a set of wrappers over the datasources, which map them to a common syntax (relational database for example).

So, we know how to do it, but the cost of building data integration systems is really high. Creating the mediated schema or ontology is hard; sometimes it’s impossible. Mapping source to mediated schema can be a nightmare, because you need many people from both sides of the mediation. There are some automated systems, but a human is always needed. Data level mappings (changing IDs, synonyms and so on). Social costs.

One of the problems with data integration is that it costs a lot early, but yields very little until quite a long time on, when it’s all done. What we really want is pay-as-you-go data management; want useful data out early and constantly.

Every time a human does something with data, they are telling you some information about the data. If you can capture this information then you can do useful stuff with it.

Structured data on the web: the deep web, which is data behind forms; and two others. So, deep web. Knowledge which is not accessible through general purpose search engines — cars, houses and so on are examples of this. Uses data spaces as a way of doing this; learned 5000 different data sources in two months.

One possible way to access the deep web is to put queries against web forms. Have to guess what to put in; one way is to just use words on the form page in the first place. Currently, Google gives much knowledge from this deep web; has the biggest impact on the deep web.

Web tables; can we exploit the knowledge from the tables better? There are 14 billion tables on the web, of which about 154 million are interesting — rest formatting or whatever. First problem is to identify schema elements; these are expressible in HTML but actually no one uses them. So have to guess. They got 2.6 million schemas. Would be good to put these into autocomplete (although not sure where).

Fusion tables lets you upload data and collaborate on the visualisation of it. Changes the visualisation options depending on the data types.

Conclusions — bottom up data-integration, which is more realistic than top-down. Dataspaces are an approach. Fusion tables is good.

Guiding principles of NIF. Builds heavily on existing technologies. Information resources come in all sorts of sizes and shapes.

Highest level NIF registry. Web index of resources which are relevant to neurosciences.

NIF resource diversity — three different levels of data, with increasing amount of structure.

Is GRM1 in cerebral cortex? NIF system allows searching over multiple different resources. But problems; inconsistent and sparse annotation of scientific data. Many different names of the same thing and so on. Added to this there are over 2000 databases in the registry.

Uses mixed searching, with both ontological information and string-based matching; the latter is important where there is no annotation. Can also do query expansion with ontology to get better querying.

Building ontologies is difficult even for limited domains, never mind all of neurosciences. Trying to do this with multiple levels. NeuroLex — single inheritance, lexicon. NIFSTD, standardize modules under same upper ontology. NIFPlus — create intra-domain and more useful hierarchies using properties and restrictions.

Using logical classification as a result of properties of the entities.

Question — how to get the community involved. Need to provide an easy to use platform for community collaboration. They have a semantic wiki for contributing to NeuroLex. Really lowers the barriers to entry for domain experts who wish to use (and extend!) these terms.

Lots of people are starting to use the resources (they find this out because people complain when the systems are broken!).

Contributing to NeuroLex. Don’t need an account, but better if you have one; everything online. The main thing that they are looking for is content, content, content. The more stuff the better. Finally, getting people to value ontologists is really important.

This is a live blog from Neuroinformatics 2009.

Motivation, what is the common feature of a set of disorders. They are all complex disorders, which we don’t really understand.

Alzforum is a nice example of an early web community. Alzheimers forum. Works as an ongoing journal club, with curated discussions. Started off during the early days of the web.

Developed StemBook, which is an online book, launched about a year ago. Discussion of stuff that is happening. PD Online Research is another alzheimers website, using a toolkit that they have developed. Linking across these forums can be a problem; need some form of shared terminology server. Science Collaboration Framework. Based around Drupal, allows common collaborative tools for biomedicine, shared ontologies/vocabulary and so on.

How do you link between these communities? Issues of semantic annotation; how does this happen? There are systems which allow you to guess what an ontology is; building a system which should work across lots of different content management facilities. This can bring lots of benefits, as the additional semantics allows you to work around synonyms etc.

Discourse ontologies. SALT — semantic annotation of latex.

Need to support a spectrum of different knowledge structures, thesauri and so on. Less complex == more tractable to biologists. Complex and formalised, tractable to computers.

Are now integrating the discourse ontology into myExperiment and others.

Using existing work on entity recognition and try and produce a provenance aware representation of these results.

This is a live blog from Neuroinformatics 2009.

Creative Commons is based around issues with data and copyright, trying to change the idea that not sharing is the default. Science Commons looks at the issues specific to science.

Semantic web in a nutshell; adds to web standards and practices, encouraging common naming, ontology development, expression in a knowledge representation language, easy integration over multiple sources; works both inside and outside the organisational boundaries.

Why should you want this? Network effects, people can use their own skills, and combine knowledge from many different sources. Provides efficiencies at the global scale.

Copy and paste for the semantic web; a mashup with knowledge from the Allen brain institute and the Google API. Had to screenscrape Allen brain for this.

Trying to look for druggable targets in pyramidal neurons. Google provides too many results, so does PubMed. Shows complex SPARQL query over the knowledge from the web; crossing from MeSH to gene to GO. This may not be the best query, but it’s none the less useful and will make biologists happy.

A brief jump into ontology making. Terms that mix up material and neurotransmitter. Uses example, peptide, neurotransmitter, hormone and ligand; all of these could be peptides, although not necessarily. Need to untangle these. In many cases, these have already been done (ChEBI). Move from English to OWL.

How to build consensus in ontology building — somewhat related to OBO Foundry rules. Another program is the INCF program for ontology of neural structures.

Challenges — building bigger ontologies is hard. Barriers to sharing are a major difficulty.

This is a live blog from Neuroinformatics 2009.

All of our observations about the brain are in some sense reductionist. We are looking at only one thing at a time, and hope to infer knowledge from this. The knowledge is multi-technique — no single experiment is going to give the entire answer. Need to combine and integrate. Most of our data is descriptive — MRI is not that different from phrenology in one sense.

Process of dissemination — the web and equivalent — has been transformative of neurosciences. Large scale consortia are also important; has been involved in lots of these — sometimes painful — but useful. Good to learn the lessons from these.

The biggest lesson from multisite brain mapping projects — the data needs to be open. If that data is open people will come, so long as it’s described.

Are new techniques coming along all the time; every year there is a new way of looking at stuff. Need to combine these forms of the data with knowledge from the past. There is a cost to this — digitizing and representing histology, for instance, creates a lot of data. Currently, 10 micrometre resolution on a whole brain comes to terabytes of data.

One of the big issues is that lots of the data is under patient confidentiality. Often can only store and check deidentified data. Are problems with metadata — some places have sent “phantom” images — which are used to calibrate the equipment, with a patient name on it. This sort of thing reduces the value. Need to check the data constantly.

Data sharing and access control. Is a spectrum. Can release the data the instant it’s produced, six months after deposition, after publication, or never. Have a system to support this, with the acquirer having control over this.

Hardware — spend lots of money and eventually it will work. Have a 4PB system now. Uses a robotized tape system because spinning disks are too expensive.

Computer crashed out at this point, and I had to reboot, but he talked about Alzheimers. Gives a nice hypothesis that multi image databases could potentially answer.

With BIRN, data does not necessarily need to be centralised — it is possible to support distributed, but federated, databases. Have managed to aggregate and bring together information from many different resources. Databases need to have a suite of ancillary tools which we can use to look at the data.

Last example, ADNI — Alzheimers Disease, naturalistic study of AD progression. About 800 individuals with a variety of different techniques. Data is immediately released (same day often). Are about 90,000 images in the database; downloads are highly periodic (not sure why!).

Data needs to be sufficiently well described, with integration across different datasets.

What works and what doesn’t. First, data — the data must be describable enough so that they can be understood. Second, the experiments need to be coordinated or they are hard to integrate. Tools must be good. Needs to be a good focus. Size: the data needs to be big enough to have statistical power. Duration: databases must last, so must have enough funding. Mission: is it well enough defined. People: common purpose and leadership to carry forward. Sociology: do people agree what should be shared and when. Expertise: need this. Funding: need sustainability.

This is a live blog from Neuroinformatics 2009.

Neuron systems are incredibly stable over time. Looking at a number of systems, including pyloric pattern generator — stomachs in crabs (?). Is a pacemaker system; it’s very stable between individuals and over time. Despite this, for example, the maximal conductance in the neurons varies pretty widely. How come this variability doesn’t affect stability?

Have generated a single compartment model, looking at 8 dimensional parameter space and making a big database. Trying to replicate the variance that they see within the biological systems. Tend, with their model, to get similar output for these different conditions. Conclusion of this is that neuronal systems have a large solution space within which they maintain their functioning.

Question, how are these solution spaces distributed within the total parameter space? Could be a single unique solution, could be islands, etc. Been collaborating in high-dimensional space visualisation using dimensional stacking. Think it’s slices through the dimensional space, stacked out side-by-side.

Talks about cell-type specific co-regulation of ion channels; there are lots of correlations between different forms of current. Interestingly, most of the relationships are linear and, so far, all of them are positive; it’s not clear that there are any negative correlations.

Have classified their model space. Found that correlation between channels is always in the same direction — a correlation which is positive in one cell type will never be negative in another. However, their models show a negative relationship in some circumstances where the experiments show a positive relationship; this is an open question.

This talk was given as a keynote at Neuroinformatics 2009.

This is my first attempt at live blogging a conference talk, so please read it in this light.

There is an overlap between neuroinformatics and bioinformatics; one example of this is the necessity for data integration between the two. Looking to the future; suggests that every database will have a canonical atlas; high-throughput measurements; dynamic live-brain imaging and mesoscopic biology; relationship to disease and pathology.

First step was taken by the Allen brain atlas, linking expression of genes to an atlas, which is needed for it to be of any use. Atlas is now linked out to pretty much everything else; mostly through genes and gene IDs.

Example of systems biology approach to prion disease — injected prion into a variety of different mouse backgrounds. Looked for changes in expression in many different genes. Are a number of factors affecting prion disease; distinct prion strains cause different effects in the same background.

Highlights the necessity for standards in mass spectrometry if you wish to make quantitative comparisons. More generally, this allows integration from many different data types, producing extensive descriptions, for example, of a macrophage response.

Building a big integrated database of lipid metabolism.

Looked at oxidative stress in endothelial cells; again, did this by integrating knowledge from many different forms of experiment.

Next gen sequencing, ChIPing and digital gene expression. ChIP is massive sequencing of immunoprecipitated chromatin DNA. Requires no PCR, so no amplification bias, which is a problem for repetitive DNA.

Molecular imaging in vitro and in vivo; again gives a set of examples of where this is being used in Xenopus and human; suggests relating fMRI data to genomic and other data will be the next big challenge.

Molecular modelling is also useful for integrating data. Gives a set of examples, including calcium control within the cell. Were able to reproduce the calcium profile of many different gene knockouts and knockdowns from this sort of model.

One of the questions was:

How much data can you share — answer, all of it, with metadata if you want it to be useful!