
Okay, so I am totally sad and writing a blog post on Christmas day. Well, the thing is that I’ve been teaching for months and moving house. This is the first still period that I’ve had for ages; well, thinking is inevitable.

One of the things that I am looking forward to next year is the last ontogenesis meeting. It’s been a lot of fun doing these; I’ve enjoyed them all. The last one is my idea, and I think it’s going to be good. As an ontologist, you get a lot of questions about how to build ontologies and whether there is a book. At the moment, there isn’t really one and it’s a problem. So, for ontogenesis, we decided to write a set of book chapters; here is the clever bit: we just stick them on a blog, because the process of formal publication as a book is long-winded, tiresome and error-prone. I’m calling the process knowledge blogging; it’s peer-reviewed, formal and with no intention of being regular; articles come when they are written.

I set up the blog some time ago. I haven’t, as yet, had a lot of time to fiddle with theme or organisation. There is some content, but it’s just the wordpress default theme. Not ideal, and I hope I will have some time for fixing things after I get back from holidays. I’ve noticed two problems already though. First is that with longer articles you need section headings and wordpress doesn’t do them; I’ve found a solution for this, in the shape of a contents table plugin, although subsequent googling also came up with others. This should make navigation a bit better.

The other issue is references — I don’t have a good idea about how to do these sanely. I’ve been looking for DOI wordpress plugins, but can only find one from crossref which doesn’t do what I want. This allows you to search for citations; what I wanted was to put a DOI in code and have it present properly.

Still I think I know how to do this; I’ve found a tool for linking references to the Mormon books; not normally something I would download, but the principle is the same. So I can replace DOIs with a proper link, using a DOI resolver. What I’d really like to do is have a proper in-text citation also. The documentation on DOIs and metadata harvesting is all rather nasty though; a nice simple REST API would do the trick.
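As a rough sketch of the "replace DOIs with a proper link" idea: the snippet below is not a wordpress plugin, just the string substitution such a plugin would perform. The `[doi]...[/doi]` shortcode syntax and the function name are made up for illustration; https://doi.org/ is the public DOI resolver.

```python
import html
import re
from urllib.parse import quote

# Hypothetical shortcode syntax: [doi]10.1000/xyz123[/doi]
DOI_SHORTCODE = re.compile(r"\[doi\]\s*(10\.\d{4,9}/\S+?)\s*\[/doi\]", re.IGNORECASE)


def link_dois(text, resolver="https://doi.org/"):
    """Replace each [doi]...[/doi] shortcode with an HTML link through the resolver."""
    def to_link(match):
        doi = match.group(1)
        href = resolver + quote(doi, safe="/")  # percent-encode unusual characters
        return '<a href="{}">doi:{}</a>'.format(href, html.escape(doi))
    return DOI_SHORTCODE.sub(to_link, text)


print(link_dois("See [doi]10.1000/xyz123[/doi] for details."))
```

This only produces the link; a proper in-text citation would still need the metadata behind the DOI, which is the part that is hard to get at.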

It all confirms my long-held concerns about DOIs; they are a tool for the publishers. Still, perhaps pubmed will come to my rescue. Next place to look.

Happy Christmas to all my subscribers of whom there are very few.

Four days of ontology bashing at an OBI meeting; this leaves me extremely glad to be going home. The meeting was long, hard and tiring. We got a lot done in the time available, though, and that was impressive. All the people in the room knew what they were doing, and we managed to work together and in parallel to an impressive extent. Even while listening to the main conversation, most people were also skype chatting about something else to those in and outside the room.

I spent a considerable time working on the paper, which will accompany the release. I got this job mostly to regularise and clean up the English, but in the end did rather more than this; I hope people are not upset about the stuff that I took out. The whole thing was done “pair programming style”, although I had different pairs for different sections.

Despite all the efforts, though, there are still tracker items open for the 1.0 release, and that’s not ideal, but it is good that we are much closer to it.

Philly was much as I remember it; it’s a reasonably pleasant city. It doesn’t feel too aggressive and it’s relatively quiet. As I had a late flight out today (meeting finished yesterday), I spent the time wandering around town; like too many US cities Philly has been built to be easy to drive through, rather than good to live in, but Philly is okay for walking around. They have a nice parkway area on JFK boulevard; I had a nice guided tour around the Rodin museum, which was wonderful, even if lacking The Thinker which is normally their showpiece entranceway sculpture. Rodin was big on hands, it turns out, and rather fond of the musculature of backs; the captions on the bronzes suggest that he was having affairs with many of his models, so I wonder if this stems from…well, you can work it out.

After that, I wandered up to the art museum past the twee statue of Rocky Balboa, and the converse footprints sculpted in the stairs. The art museum itself is huge; the Thinker is temporarily here, so I got to see it after all, but I think it needs to be outside. As well as the traditional galleries, strangely, they also have a lot of furniture there, and have imported whole rooms from various places. For me, the Asian section was the best; they had an Indian temple, dark and brooding in the half-light, and a Chinese room with the most amazing timbers. I felt the indoor Romanesque outside courtyard (erm…) was taking it a bit too far.

Not much left to be done after an afternoon full of culture; on the way back to the hotel, I looked for a little park on Chestnut that I had wanted to see and a falafel shop which I had seen signposted. I found neither; the park had left no traces at all, the falafel shop I found a poster for, but I walked the whole street and as far as I can tell 1740 Sansome is a multistory parking lot.

Back where I started, sitting in the airport; tick, tock, tick, tock.

Bleary eyed, stacks of chocolate muffins obscuring the “healthy snacking” sign, kids on heelers. Yes, I’m in the airport at stupid-o-clock on Saturday morning. I’m heading out to Philadelphia for an OBI meeting. It’s an important meeting; OBI has been a long time in gestation, but this should constitute the 1.0 release; it’s going to be a mass tidy up session.

I’m quite looking forward to it, in some ways. I quite like Philadelphia, at least if my memory serves me well; I’ve only been there once, for the SOFG conference many, many moons ago, certainly in my pre-blog days. I remember it as a pleasant town, with a water-front, only slightly scarred by the enormous roads that make US cities less livable than European. I’m also hoping to catch up with Robin McEntire, who was one of the co-chairs of Bio-Ontologies at ISMB, and is local.

I’m rather unprepared for the meeting. There has been a lot of activity on the mailing list recently, some of it concerned with paper preparation. But I’ve been trying to get the rest of my teaching preparation finished (nearly done now) which has left me very busy over the last few weeks; I haven’t even had time to look at the paper; I’ve hardly read even the mailing list subject lines. Still, the next week is entirely given over to OBI, which will have to be enough. Travel at this time of year messes with my life to an extent that I’m certainly not going to feel guilty about it.

The flip side of being busy is that I am now in the process of writing about 5 papers, with the next 2 in my head. After the confusion of moving to Newcastle, working out what research to do and learning to teach, my research was getting a bit stuck; I was running out of ideas for the simple reason of not having time to think. Having an enormous backlog of nearly finished, half-finished, and hardly started good ideas (most of which will, in time, turn out not to be) for papers makes me feel like a proper academic again.

I was most entertained to read about EPSRC’s funding policy changes. Basically, they have taken a long hard look at their system for funding, they have decided that the peer-review system has fundamental problems, and have therefore issued their well thought out and considered solution to the problem: blame the users.

Their idea is this: if you are on too many grants that fail, then you won’t be allowed to submit again until you have been to some sort of re-education camp. The basic criteria appear to be these: three or more unfunded proposals, ranked in the bottom half, and lower than 25% success over the same two years.

The first criterion is problematic because it is based on an aggregate score; it is impossible to judge in advance whether you are going to be in the bottom half; your proposal could be brilliant and internationally outstanding (EPSRC is like Lake Wobegon, all the grants are above average) and you could still be in the bottom half. The second criterion is also interesting; if you submit a single proposal and it gets rejected then you fall into this category straight away. It’s also going to mean that it’s going to be harder to get people to do collaborative grants, as it might bring their stats down. This is after EPSRC have been pushing us for years to put at least 5 different institutions on each proposal if we want it to be funded.

At the same time, information about the REF which is to follow on from the wondrous RAE is starting to trickle out. Nice to see that they are still going to reinforce the existing closed publication system with more bibliometric data. The “You are the REF” website offers itself as a way to work out your score. Excitingly the first question is “What is your discipline?”; Computer Scientist or Biologist. This seems reflective of the REF documentation that I have seen already. It works on this basis: different disciplines have different rules, so we will make different decisions in each, which is fine, because no one can be in two anyway.

Glad to see that the REF is carrying on the RAE tradition of encouraging multi-disciplinary research.

At Neuroinformatics 2009, David Sutherland and I talked about the problems of ontology building. One of the current (and past!) difficulties is to choose an appropriate language for representing the knowledge in your ontology. I thought I would write my thoughts up as a post; this will probably result in the most boring thing I have ever written (I am sure someone will point out worse offenses); syntax is dull but distressingly important.

In bioinformatics, there are essentially two choices: OWL and OBO (format). A second issue is finding a good environment for developing the ontology; this divides between Protege, OBO-Edit and the ever-present “text editor”. It’s often the case that we want to use both of these languages at the same time. Take, for example, OBI, which I am involved in. While the ontology itself is being developed in OWL, many of its dependent ontologies are built using OBO; being purist and demanding one is really not an option. OWL itself has many different syntaxes; at the moment, I generally prefer Manchester syntax because you can edit it with a text editor, which is really not so easy with any of the XML representations.

While these two languages have somewhat different expressivity, how to translate both the syntax and the semantics has been described a number of times elsewhere. One of the recurrent problems, however, stems from the best practices and the syntax of identifiers.

OBO makes use of a numerical, semantics-free identifier and a namespace, with a syntax of NAMESPACE:IDENTIFIER. So, a Gene Ontology term looks like GO:0003674. The namespace is not constrained to be two letters and has mechanisms for world-uniqueness, in that people talk to each other and sort it out, if they clash. The use of a semantics-free identifier means that term names can be changed while maintaining the meaning implied by the term; the label for the term, meanwhile, provides a human readable version, which can be shown to users of the ontology. I will call these the OBO identifier and OBO label respectively.

Translating this into OWL, including Manchester syntax, however, causes significant problems. The naturalistic translation is to turn the OBO identifier into the identifier in OWL; the OBO namespace would become an XML namespace, the OBO identifier would become an XML identifier. Unfortunately, this doesn’t work. First, the OBO identifier is genuinely just a short string and XML requires a URI; so a mapping between OBO identifiers and URIs is necessary. Second, the OBO identifier is numerical; unfortunately, while the identifiers in OWL can contain numbers they have to start with a non-numerical character. The standard translation, therefore, uses in most cases an OBO-wide URL (http://purl.obolibrary.org/obo/), although some ontologies have their own namespace (GO uses http://purl.org/obo/owl/GO#). The OBO identifier is mapped to a valid identifier by sticking a prefix onto the numbers. So, we have identifiers such as GO:GO_0042101 or obo:OBI_1110045. There are also some OBO ontologies for which this does NOT occur; for instance, BFO classes in OBI come out with identifiers of the form snap:Continuant or span:Process, except for one which is bfo:Entity.
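The prefixing rule is easy enough to write down. Here is a minimal sketch in Python, assuming only the two base URLs mentioned above; the function name and the override table are mine, and the BFO-style exceptions are not handled.

```python
import re

OBO_BASE = "http://purl.obolibrary.org/obo/"
# Ontologies with their own namespace; only GO is listed here.
NAMESPACE_OVERRIDES = {"GO": "http://purl.org/obo/owl/GO#"}

OBO_ID = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):(\d+)$")


def obo_id_to_iri(obo_id):
    """Map an OBO identifier such as 'OBI:1110045' to a full identifier.

    The namespace is reused as a prefix on the number, so the local part
    no longer starts with a digit.
    """
    match = OBO_ID.match(obo_id)
    if match is None:
        raise ValueError("not a NAMESPACE:NUMBER identifier: %r" % obo_id)
    namespace, number = match.groups()
    base = NAMESPACE_OVERRIDES.get(namespace, OBO_BASE)
    return "{}{}_{}".format(base, namespace, number)


print(obo_id_to_iri("OBI:1110045"))
print(obo_id_to_iri("GO:0042101"))
```

The number is kept as a string throughout, so leading zeros survive the round trip.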

Again, all perfectly reasonable, but unfortunately, when converted to Manchester syntax it means that we end up with classes that look like this slightly elided class from OBI:


Class: obo:OBI_1110161

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

which completely defeats the aim of a human-readable syntax. Now OBO format has much the same problem; relationships to other classes are specified using cross-references to their identifiers which are, essentially, unreadable. OBO format works around this with a denormalisation as can be seen from this somewhat elided example from IAO:


[Term]
id: IAO:0000027
name: data item
def:"a data item is an information content entity that is intended...."
is_a: IAO:0000030 ! information content entity

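The cross reference in this case is a subsumption link to IAO:0000030; the text after the ! is a comment which repeats the name of the referenced term for the benefit of the human reader.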
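Because the comment duplicates information, it can drift out of date, but it is also checkable. A small sketch of that check, using the stanza above; the function names are hypothetical and the parsing is deliberately naive (a " ! " inside a quoted definition would be mis-split).

```python
def split_obo_line(line):
    """Split an OBO 'tag: value ! comment' line into (tag, value, comment)."""
    tag, _, rest = line.partition(":")
    value, bang, comment = rest.partition(" ! ")
    return tag.strip(), value.strip(), (comment.strip() if bang else None)


def stale_comments(lines, names):
    """Return (identifier, comment, expected_name) for every trailing comment
    that disagrees with the name recorded for the referenced term."""
    problems = []
    for line in lines:
        _tag, value, comment = split_obo_line(line)
        if comment is not None and names.get(value) != comment:
            problems.append((value, comment, names.get(value)))
    return problems


stanza = [
    "id: IAO:0000027",
    "name: data item",
    "is_a: IAO:0000030 ! information content entity",
]
names = {"IAO:0000030": "information content entity"}

print(split_obo_line(stanza[2]))
print(stale_comments(stanza, names))
```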

One solution would be to use the rdfs:label in place of the identifier. So, we would have something that looked like this:


Class: "T cell epitope ELISA IL-1b assay" @en

    Annotations:
        obo:identifier "1110161"

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

Other identifiers would have to be changed, also. I’ve also added the obo:identifier line (which I think would be valid, but might require the creation of an OWL individual). Without this, it would not be possible to go backward.

However, this is problematic as it changes the serialization between the OWL Manchester syntax and other syntaxes of OWL. The class identifier has to be URI legal, and the OBO label here is not. We could do a syntactic conversion (e.g. T%20cell%20epitope) but this, again, reduces readability, defeating the point. Also, the rdfs:label would become part of the final identifier URI, which then becomes a semantics-heavy identifier. Finally, it would require an OBO-specific loading of the Manchester syntax, taking the URI identifier from the annotation block, and the rdfs:label from the class name.
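To see what the syntactic conversion does to the full label, here is a quick illustration using Python's standard percent-encoding; gluing the result onto the OBO-wide base URL is purely an assumption for the sake of the example.

```python
from urllib.parse import quote, unquote

label = "T cell epitope ELISA IL-1b assay"

# Percent-encode everything that is not legal in a URI path segment.
encoded = quote(label)

# Hypothetical: use the encoded label as the local part of the identifier.
iri = "http://purl.obolibrary.org/obo/" + encoded

print(encoded)
print(iri)
print(unquote(encoded) == label)
```

The conversion is reversible, so nothing is lost, but the encoded form is hardly more pleasant to read than the numbers it replaces.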

So, is there any solution? First, there are tooling solutions. In Protege, it is already possible to use any component of the definition in the display. So, you can set the rdfs:label as the main display form. Tooling solutions are attractive, but there is a problem; you have to extend all tools to support this view; I realise that the number of freaks who wish to edit OWL with emacs is not that large, so this might not seem an issue. However, many people wish to develop ontologies collaboratively using version control; if you want to compare versions you use diff, so we now need a Manchester syntax diff viewer. Also, if you want to do some perl hacking, or straight-forward search and replace, again, it’s all harder.

To some extent this might seem trivial, but then the entire purpose of Manchester syntax (and the functional syntax) is to have an easy to read and manipulate syntax which the XML version of OWL is not. This purpose is defeated if it’s hard to read.

So, a second non-tooling solution. The obvious answer is to take the OBO approach and add comments. Now, the Manchester syntax includes a comment character (#), although last time I tried the Protege parser didn’t implement this. None the less, it allows this:


Class: obo:OBI_1110161 #"T cell epitope ELISA IL-1b assay"@en

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

This is not too bad, but it doesn’t work well for complex class expressions. I can’t be bothered to look up the labels and have reused one, but you get something like:


Class: obo:OBI_1110161 #"T cell epitope ELISA IL-1b assay"@en,

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661, #"T cell epitope ELISA IL-1b assay"@en
        obo:OBI_0000299 #"T cell epitope ELISA IL-1b assay"@en
        some (obo:IAO_0000109 #"T cell epitope ELISA IL-1b assay"@en
        and (obo:IAO_0000136 #"T cell epitope ELISA IL-1b assay"@en
             some obo:OBI_11101 #"T cell epitope ELISA IL-1b assay"@en
             ))

This has three problems. Firstly, we have used comments “meaningfully” as we can’t distinguish between these comments and other normal comments. Secondly, we have had to reformat the output because we have only a “to-end-of-line” comment character. Thirdly, it looks horrible.

So, my minimal solution would be this; we introduce some new comment characters, which are treated as comments normally, but which carry enough semantics to allow a warning when they are wrong; rather like Javadoc, which is a comment wrt the language, but is structured and meaningful wrt the documentation. Tooling could be used to check that the labels masquerading as comments are correct wrt the identifiers.


Class: obo:OBI_1110161 [T cell epitope ELISA IL-1b assay],

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661 [blah],
        obo:OBI_0000299 [longer blah]
        some (obo:IAO_0000109 [more]
        and (obo:IAO_0000136 [stuff]
        some obo:OBI_11101 [OBI Thing]
        ))

This is still not ideal; it would require an extension to Manchester syntax, but it’s minimal, and it does support the semantics-free identifiers in OBO in a way which does not require extensive tooling. It’s worth reiterating here that OBO’s semantics-free identifiers are a good thing; so, supporting them supports other people who may wish to do the same, sensible thing. It does have the disadvantage of duplicating information, but at least in a way that is checkable.
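The checking part needs very little machinery. Below is a sketch of such a checker for the proposed (and, to be clear, currently invalid) bracket syntax; the regular expression, the function name and the labels in the lookup table are all placeholders, not real OBI labels.

```python
import re

# A prefixed identifier followed by a bracketed label, e.g. obo:OBI_0000661 [blah]
LABELLED_ID = re.compile(r"([A-Za-z][\w.-]*:[\w.-]+)\s*\[([^\]]*)\]")


def check_labels(text, labels):
    """Return (identifier, found, expected) for every bracketed label that
    does not match the label recorded for its identifier."""
    problems = []
    for identifier, found in LABELLED_ID.findall(text):
        found = found.strip()
        expected = labels.get(identifier)
        if expected != found:
            problems.append((identifier, found, expected))
    return problems


text = """Class: obo:OBI_1110161 [T cell epitope ELISA IL-1b assay]

    SubClassOf:
        obo:OBI_0000661 [blah]
"""

# Placeholder lookup table; in practice this would come from the rdfs:label annotations.
labels = {
    "obo:OBI_1110161": "T cell epitope ELISA IL-1b assay",
    "obo:OBI_0000661": "not blah",
}

for identifier, found, expected in check_labels(text, labels):
    print("warning: %s is labelled %r, expected %r" % (identifier, found, expected))
```

A parser that does not care about the labels could strip the brackets with the same regular expression and carry on as before.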

Comments welcome!

This is the third year in a row that I have been to Neuroinformatics (or its forerunner, Databasing the Brain). It’s still turning out to be an enjoyable meeting, even though there is still lots of it that I don’t understand. Come to think of it, perhaps because there is lots of it that I don’t understand.

Pilsen (or Plzen) is, perhaps, a strange place for the meeting. It’s a bit of a pig to get to, as the airport is in Prague. Likewise, the conference centre was a bit out of town, so you had to get a taxi if you wanted food in the evening. Still the venue itself worked well. Slightly flaky wireless, but it had tables upstairs on a balcony; a lot of people migrated up there as the meeting went on, making the auditorium a little deserted.

Although I’ve said I didn’t understand lots of it, many of the keynotes this year were bioinformatics, systems biology or data integration which I know well. As well as that, there was a (semantic) web and ontology section. I enjoyed Tim Clark’s talk, as he’s made stuff that lots of people are actually using, although I don’t think he explained why during his talk.

The section on high performance computing was probably the least relevant. While they’ve become interested in power consumption recently, these guys are still obsessed with teraflops (…now petaflops…now exaflops). To be honest, I don’t care. With more power, you can build more granular, higher resolution models, but I doubt that will bring you anything, unless you also have more granular data. They should be worried about discs; they are always the Cinderella of the hardware world, only slightly more interesting than printers, but it’s discs which carry the data. While we are at it, spinning discs use lots of power. And they have more flashing lights than CPUs. The hardware guys should be talking about disc space. The neuroscientists should be worrying about filling discs up. Neuroinformaticians should make sure that they end up with an exabyte dataset; not 1000 petabyte datasets or worse, 1,000,000 gigabyte datasets.

I tried to get a bit of Web 2.0 stuff happening at the meeting. David Sutherland set up a friendfeed room. Second day, we were sitting next to each other like two sad blokes at a party full of women, sending each other messages on their iphones. Although, it was a neuroinformatics meeting, so largely without the women. Second day, mostly it was just me, sad, lonely and pathetic. Still, having said that, I did manage to meet almost all of those subscribed to the room, which you couldn’t achieve at ISMB nowadays. Pavan Ramkumar said hello at lunch, and then later at the airport. I met Sarah Maynard at her poster; it had ontologies, OWL and information content-based similarity measures; bound to make me happy. Only Lisa Kjonigsen remained in cyberspace only. With luck, next year, more people will join; not least because I’ll probably not go to Japan.

I had a quick go at live blogging also; to be honest, I am not a natural. The problem is I have too much desire to editorialise. The roboblogger tells me that she just blogs the notes that she would have taken anyway; my notes, on the other hand, are full of comment, invective and questions. Perhaps I could just put these into the asciidoc source of my blog as comments. I stopped live blogging on the last day, not for these reasons, but largely as a desire not to hold my crushing ignorance of the topics being discussed up to public scrutiny.

Neuroinformatics (the meeting) is changing. I have to believe that if there is more about genomic and multiomic data integration, this has to be a good thing. The brain is a hard thing to figure out; I have to believe that using more data, more types of data and a heavier use of nice, simple, model organisms is going to increase the rate of advance; with all the fuss about systems biology, it’s easy to forget the fabulous success of the last 100 years of reductionist biology, which made systems biology possible. This has to be the way forward for neuroscience. Even if it does make the meeting more usual and, perhaps, less interesting for me as a result.

This is a live blog from Neuroinformatics 2009.

Data management: view from 50,000 feet; dimensions are amount of structure and the number of data sources. More structure, fewer data sources.

Distinguishes between parallelisation and heterogeneity. Can distribute data across tables in an organised way — this is parallelisation; or, you can have lots of data, spread across resources, with multiple entities and with no common plan.

Outline — data integration and suggest data spaces as a solution.

Databases are so successful because they provide a level of abstraction over the data. Data integration is a higher level of abstraction still because you don’t have to worry how the data is stored or structured.

Mediated schema, uses a mediation language, a mapping tool, and then a set of wrappers over the datasources, which map them to a common syntax (relational database for example).
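A toy illustration of the wrapper idea, not anything shown in the talk: two made-up sources with different field names, each with a wrapper that maps its records onto one mediated schema, and a query that only ever sees that schema. All source names, field names and values are hypothetical.

```python
# Mediated schema: every row is a dict with the keys "gene" and "region".

def wrap_source_a(record):
    """Source A (hypothetical) calls the fields 'symbol' and 'brain_area'."""
    return {"gene": record["symbol"], "region": record["brain_area"]}


def wrap_source_b(record):
    """Source B (hypothetical) calls the fields 'GeneName' and 'Location'."""
    return {"gene": record["GeneName"], "region": record["Location"]}


def query_mediated(sources, **criteria):
    """Run an equality query against all sources through their wrappers."""
    results = []
    for wrapper, records in sources:
        for record in records:
            row = wrapper(record)
            if all(row.get(key) == value for key, value in criteria.items()):
                results.append(row)
    return results


sources = [
    (wrap_source_a, [{"symbol": "geneA", "brain_area": "region1"}]),
    (wrap_source_b, [{"GeneName": "geneB", "Location": "region1"},
                     {"GeneName": "geneC", "Location": "region2"}]),
]

print(query_mediated(sources, region="region1"))
```

Even in this tiny form the cost is visible: somebody who understands each source has to write and maintain its wrapper.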

So, we know how to do it, but the cost of building data integration systems is really high. Creating the mediated schema or ontology is hard; sometimes it’s impossible. Mapping source to mediated schema can be a nightmare, because you need many people from both sides of the mediation. There are some automated systems, but a human is always needed. Data level mappings (changing IDs, synonyms and so on). Social costs.

One of the problems with data integration is that it costs a lot early, but yields very little until quite a long time on, when it’s all done. What we really want is pay-as-you-go data management; want useful data out early and constantly.

Every time a human does something with data, they are telling you some information about the data. If you can capture this information then you can do useful stuff with it.

Structured data on the web: the deep web, which is data behind forms; and two others. So, deep web. Knowledge which is not accessible through general purpose search engines; cars, houses and so on are examples of this. Uses data spaces as a way of doing this; learned 5000 different data sources in two months.

One possible way to access the deep web is to put queries against web forms. Have to guess what to put in; one way is to just use words on the form page in the first place. Currently, google gives much knowledge from this deep web; has the biggest impact on the deep web.
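A rough sketch of the "use words on the form page" guess, as I understood it; this is my own illustration and not how google does it. It only picks candidate query terms from the visible text of a page; submitting them to the form is left out, and the class and function names are made up.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def candidate_queries(page_html, limit=5, min_length=4):
    """Return the most frequent words on the page as guesses for form input."""
    parser = TextExtractor()
    parser.feed(page_html)
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(word for word in words if len(word) >= min_length)
    return [word for word, _count in counts.most_common(limit)]


page = "<form><label>Make</label> used cars, cars by make, cars</form>"
print(candidate_queries(page, limit=2))
```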

Web tables; can we exploit the knowledge from the tables better? There are 14 billion tables on the web, of which about 154 million are interesting (the rest are formatting or whatever). First problem is to identify schema elements; these are expressible in HTML but actually no one uses it. So have to guess. They got 2.6 million schemas. Would be good to put these into autocomplete (although not sure where).

Fusion tables lets you upload data and collaborate on the visualisation of it. Changes the visualisation options depending on the data types.

Conclusions — bottom up data-integration, which is more realistic than top-down. Dataspaces are an approach. Fusion tables is good.

Guiding principles of NIF. Builds heavily on existing technologies. Information resources come in all sorts of size and shape.

Highest level NIF registry. Web index of resources which are relevant to neurosciences.

NIF resource diversity — three different levels of data, with increasing amount of structure.

Is GRM1 in cerebral cortex? NIF system allows searching over multiple different resources. But problems; inconsistent and sparse annotation of scientific data. Many different names of the same thing and so on. Added to this there are over 2000 databases in the registry.

Uses mixed searching, combining ontological information with string-based matching, which is important where there is no annotation. Can also do query expansion with the ontology to get better querying.
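Query expansion in miniature, as a sketch only: the two dictionaries below are a hand-made toy standing in for an ontology, and nothing here reflects how NIF actually implements it. The query term is expanded with its synonyms and with its narrower terms, so that plain string matching has more to work with.

```python
# Toy stand-in for an ontology: synonyms and narrower (subclass) terms.
SYNONYMS = {
    "neuron": ["nerve cell"],
    "pyramidal neuron": ["pyramidal cell"],
}
NARROWER = {
    "neuron": ["pyramidal neuron", "interneuron"],
}


def expand_query(term):
    """Return the term plus its synonyms and all narrower terms, breadth first."""
    expanded = []
    queue = [term]
    while queue:
        current = queue.pop(0)
        if current in expanded:
            continue
        expanded.append(current)
        for synonym in SYNONYMS.get(current, []):
            if synonym not in expanded:
                expanded.append(synonym)
        queue.extend(NARROWER.get(current, []))
    return expanded


print(expand_query("neuron"))
```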

Building ontologies is difficult even for limited domains, never mind all of neurosciences. Trying to do this with multiple levels. NeuroLex: single inheritance, lexicon. NIFSTD: standardize modules under same upper ontology. NIFPlus: create intra-domain and more useful hierarchies using properties and restrictions.

Using logical classification as a result of properties of the entities.

Question: how to get the community involved? Need to provide an easy to use platform for community collaboration. They have a semantic wiki for contributing to neurolex. Really lowers the barriers to entry for domain experts who wish to use (and extend!) these terms.

Lots of people are starting to use the resources (they find this out because people complain when the systems are broken!).

Contributing to Neurolex. Don’t need an account, but better if you have one; everything is online. The main thing that they are looking at is content, content, content. The more stuff the better. Finally, getting people to value ontologists is really important.

This is a live blog from Neuroinformatics 2009.

Motivation, what is the common feature of a set of disorders. They are all complex disorders, which we don’t really understand.

Alzforum is a nice example of an early web community. Alzheimers forum. Works as an ongoing journal club, with curated discussions. Started off during the early days of the web.

Developed StemBook which is an online book, launched about a year ago. Discussion of stuff that is happening. pd online research, is another alzheimers website, using a toolkit that they have developed. Linking across these forums can be a problem; need some forms of shared terminology server. Science Collaboration Framework. Based around drupal, allows common collaborative tools for biomedicine, shared ontologies/vocabulary and so on.

How do you link between these communities? Issues of semantic annotation; how does this happen? There are systems which allow you to guess what an ontology is; building a system which should work across lots of different content management facilities. This can bring lots of benefits, as the additional semantics allows you to work around synonyms etc.

Discourse ontologies. SALT — semantic annotation of latex.

Need to support a spectrum of different knowledge structures, thesauri and so on. Less complex == more tractable to biologists. Complex and formalised, tractable to computers.

Are now integrating the discourse ontology into myexperiment and others.

Using existing work on entity recognition to try to produce a provenance-aware representation of these results.

This is a live blog from Neuroinformatics 2009.

Creative Commons is based around issues with data and copyright, trying to change the idea that not sharing is the default. Science Commons looks at the issues specific to science.

Semantic web in a nutshell; adds to web standards and practices, encouraging common naming, ontology development, expression in a knowledge representation language, and easy integration over multiple sources; works both inside and outside organisational boundaries.

Why should you want this? Network effects, people can use their own skills, and combine knowledge from many different sources. Provides efficiencies at the global scale.

Copy and paste for the semantic web; a mashup with knowledge from Allen brain institute, and google API. Had to screenscrape Allen brain for this.

Trying to look for druggable targets in pyramidal neurons. Google provides too many results, so does pubmed. Shows complex SPARQL query over the knowledge from the web; crossing from MESH to gene to GO. This may not be the best query, but it’s none the less useful and will make biologists happy.

A brief jump into ontology making. Terms that mix up material and neurotransmitter. Uses example, peptide, neurotransmitter, hormone and ligand; all of these could be peptides, although not necessarily. Need to untangle these. In many cases, these have already been done (ChEBI). Move from English to OWL.

How to build consensus in ontology building: somewhat related to OBO Foundry rules. Another program is the INCF program for ontology of neural structures.

Challenges: building bigger ontologies is hard. Barriers to sharing are a major difficulty.