Archive for the ‘Science’ Category


Abstract

A constant influx of new data poses a challenge in keeping the annotation in biological databases current. Most biological databases contain significant quantities of textual annotation, which is often the richest source of knowledge. Many databases reuse existing knowledge: during the curation process, annotations are often propagated between entries. However, this is often not made explicit. Therefore, it can be hard, potentially impossible, for a reader to identify where an annotation originated. Within this work we attempt to identify annotation provenance and track its subsequent propagation. Specifically, we exploit annotation reuse within the UniProt Knowledgebase (UniProtKB), at the level of individual sentences. We describe a visualisation approach for the provenance and propagation of sentences in UniProtKB which enables a large-scale statistical analysis. Initially, levels of sentence reuse within UniProtKB were analysed, showing that reuse is highly prevalent, which enables the tracking of provenance and propagation. By analysing sentences throughout UniProtKB, a number of interesting propagation patterns were identified, covering over 100,000 sentences. Over 8,000 sentences remain in the database after they have been removed from the entries where they originally occurred. Analysing a subset of these sentences suggests that approximately 30% are erroneous, whilst 35% appear to be inconsistent. These results suggest that being able to visualise sentence propagation and provenance can aid in the determination of the accuracy and quality of textual annotation. Source code and supplementary data are available from the authors' website.

  • Michael J. Bell
  • Matthew Collison
  • Phillip Lord


Plain English Summary

There are many database resources available to the researcher which describe biological entities such as proteins and genes. These are used by both biologists and medics to understand how biological systems work, which has implications for many areas. These databases store information of various sorts, called annotation: some of this is highly organised or structured knowledge; some is free text, written in English.

The quantity of this material available means that having a computational method to check the annotation is desirable. The structured knowledge is easier to check because it is organised. The free-text knowledge is much harder.

Most methods of analysing free text are based around “normal” English; biological annotation uses a highly specialised form of English, heavily controlled and with many jargon words. In this paper, we exploit this specialised form to infer provenance: to understand when sentences were first added to the database, and how they change over time. By analysing these patterns of provenance, we were able to identify those which are indicative of inconsistent or erroneous annotation.
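The core idea can be sketched in a few lines of Clojure (a toy illustration of the approach, not the authors' actual code): walk the database releases in order, split each entry's annotation into sentences, and record the first release in which each sentence appears.

    ;; Toy sketch, not the paper's code. releases is an ordered sequence
    ;; of [version entries] pairs, where entries maps an accession to its
    ;; annotation text. Returns a map from sentence to the version in
    ;; which it first appeared.
    (require '[clojure.string :as str])

    (defn sentences [text]
      (map str/trim (str/split text #"(?<=\.)\s+")))

    (defn first-appearance [releases]
      (reduce (fn [seen [version entries]]
                (reduce (fn [seen sentence]
                          (if (seen sentence)
                            seen
                            (assoc seen sentence version)))
                        seen
                        (mapcat sentences (vals entries))))
              {}
              releases))

Comparing such a map against the current release is then enough to spot, for example, sentences that survive in the database after disappearing from the entry where they first occurred.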


Abstract

Semantic publishing can enable richer documents with clearer, computationally interpretable properties. For this vision to become reality, however, authors must benefit from this process, so that they are incentivised to add these semantics. Moreover, the publication process that generates final content must allow and enable this semantic content. Here we focus on author-led or "grey" literature, which uses a convenient and simple publication pipeline. We describe how we have used metadata in articles to enable richer referencing of these articles and how we have customised the addition of these semantics to articles. Finally, we describe how we use the same semantics to aid in digital preservation and non-repudiability of research articles.

  • Phillip Lord
  • Lindsay Marshall


Plain English Summary

Academic literature makes heavy use of references: effectively, links to other, previous work that supports, or contradicts, the current work. This referencing is still largely textual, rather than using a hyperlink as is common on the web. As well as being time consuming for the author, it is also difficult to extract the references computationally, as the references are formatted in many different ways.

Previously, we have described a system which works with identifiers such as ArXiv IDs (used to reference this article above!), PubMed IDs and DOIs. With this system, called kcite, the author supplies the ID, and kcite generates the reference list, leaving the ID underneath, where it is easy to extract computationally. The data used to generate the reference comes from specialised bibliographic servers.
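For illustration, a kcite reference in a post looks something like this (the DOI here is a placeholder, not a real identifier):

    This claim is supported by earlier work [cite]10.xxxx/example-doi[/cite].

kcite expands the shortcode into a formatted reference at display time, while the DOI itself remains in the stored post, where other tools can find it.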

In this paper, we describe two new systems. The first, called Greycite, provides similar bibliographic data for any URL; it is extracted from the URL itself, using a wide variety of markup and some ad-hoc tricks, which the paper describes. As a result it works on many web pages (we predict about 1% of the total web, or a much higher percentage of “interesting” websites). Our second system, kblog-metadata, provides a flexible system for generating this data. Finally, we discuss ways in which the same metadata can be used for digital preservation, by helping to track articles as and when they move across the web.
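The style of harvesting Greycite performs can be sketched like this (a simplified illustration, not Greycite's actual implementation): look in the page's HTML for common bibliographic markup, such as Open Graph or Dublin Core meta tags, and fall back to the page title.

    ;; Simplified sketch, not Greycite's code: pull a likely title out of
    ;; raw HTML using common metadata conventions. Real pages also put
    ;; content= before property=, which this toy regex does not handle.
    (defn meta-content [html property]
      (second (re-find (re-pattern (str "<meta[^>]+(?:property|name)=[\"']"
                                        property
                                        "[\"'][^>]+content=[\"']([^\"']+)[\"']"))
                       html)))

    (defn guess-title [html]
      (or (meta-content html "og:title")   ; Open Graph
          (meta-content html "DC.title")   ; Dublin Core
          (second (re-find #"<title>([^<]+)</title>" html))))

The same pattern extends to authors, dates and site names, which is roughly the data a reference needs.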

This paper was first written for the Sepublica 2013 workshop.


Abstract

The Tawny-OWL library provides a fully-programmatic environment for ontology building; it enables the use of a rich set of tools for ontology development, by recasting development as a form of programming. It is built in Clojure, a modern Lisp dialect, and is backed by the OWL API. Used simply, it has a similar syntax to OWL Manchester syntax, but it provides arbitrary extensibility and abstraction. It builds on existing facilities for Clojure, which provides a rich and modern programming tool chain, for versioning, distributed development, build, testing and continuous integration. In this paper, we describe the library, this environment and its potential implications for the ontology development process.

  • Phillip Lord


Plain English Summary

In this paper, I describe some new software, called Tawny-OWL, that addresses the issue of building ontologies. An ontology is a formal hierarchy, which can be used to describe different parts of the world, including biology, which is my main interest.

Building ontologies in any form is hard, but many ontologies are repetitive, having many similar terms. Current ontology building tools tend to require a significant amount of manual intervention. Rather than creating new tools, Tawny-OWL is a library written in a full programming language, which helps to redefine the problem of ontology building as one of programming. Instead of building new ontology tools, the hope is that Tawny-OWL will enable ontology builders to simply use existing tools that are designed for general purpose programming. As there are many more people involved in general programming, many tools already exist and are very advanced.
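To give a flavour, defining classes in Tawny-OWL looks something like this (a minimal sketch in the style of the paper's examples; the exact keyword names have varied between Tawny-OWL versions, so treat the details as approximate):

    ;; Minimal Tawny-OWL flavour: entities are ordinary Lisp forms, so
    ;; they can be generated by functions, abstracted over and unit
    ;; tested. Keyword names are approximate and version-dependent.
    (use 'tawny.owl)

    (defontology pizza
      :iri "http://example.com/pizza")

    (defclass Pizza)
    (defclass TomatoTopping)
    (defoproperty hasTopping)

    (defclass MargheritaPizza
      :super Pizza
      (owl-some hasTopping TomatoTopping))

Because MargheritaPizza is just the result of evaluating code, a repetitive family of such classes could equally be produced by a function, which is exactly the kind of abstraction the paper argues for.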

This is the first paper on the topic, although it has been discussed before here.

This paper was written for the OWLED workshop in 2013.


Reviews

Reviews are posted here with the kind permission of the reviewers. Reviewers are identified or remain anonymous (also to myself) at their option. Copyright of the review remains with the reviewer and is not subject to the overall blog license. Reviews may not relate to the latest version of this paper.

Review 1

The given paper is a solid presentation of a system for supporting the development of ontologies – and therefore not really a scientific/research paper.

It describes Tawny OWL in a sufficiently comprehensive and detailed fashion to understand both the rationale behind the system as well as its functioning. The text itself is well written and also well structured. Further, the descriptive text in conjunction with the given (code) examples makes the different functionality highlights of Tawny OWL very easy to grasp and appraise.

As another big plus of this paper, I see the availability of all source code which supports the fact that the system is indeed actually available – instead of being just another description of a “hidden” research system.

The possibility to integrate Tawny OWL in a common (programming) environment, the abstraction level support, the modularity and the testing “framework” along with its straightforward syntax make it indeed very appealing and sophisticated.

But all of the above comes with a little warning: my judgment (especially the last comment) is highly biased by the fact that I am also a software developer. And thus I do not know how much of the above would apply to non-programmers as well.

And along with the above warning, I actually see a (more global) problem with the proposed approach to ontology development: the mentioned “waterfall methodologies” are still most often used for creating ontologies (at least in the field of biomedical ontologies), and thus I wonder how much programmatic approaches, as implemented by Tawny OWL, will be adopted in the future. Or in which way they might somehow get integrated into those methodologies.

Review 2

This review is by Bijan Parsia.

This paper presents a toolkit for OWL manipulation based on Clojure. The library is interesting enough, although hardly innovative. The paper definitely oversells it while neglecting details of interest (e.g., size, facilities, etc.). It also neglects relevant related work: Thea-OWL, InfixOWL, even KRSS, KIF, SXML, etc.

I would like to see some discussion of the challenges of making an effective DSL for OWL, especially when you incorporate higher abstractions. For example, how do I check that a generative function for a set of axioms will always generate an OWL DL ontology? (That seems to be the biggest programming language theoretic challenge.)

Some of the discussion is rather cavalier as well, e.g.,

“Alternatively, the ContentCVS system does support offline concurrent modification. It uses the notion of structural equivalence for comparison and resolution of conflicts[4]; the authors argue that an ontology is a set of axioms. However, as the name suggests, their versioning system mirrors the capabilities of CVS – a client-server based system, which is now considered archaic.”

I mean, the interesting part of ContentCVS is the diffing algorithm (note that there’s a growing literature on diff in OWL). This paper focuses on the inessential aspect (i.e., really riffing off the name) and ignores the essential (i.e., what diff means). Worse, to the degree that it does focus on that, it only focuses on the set-like nature of OWL according to the structural spec. The challenges of diffing OWL (e.g., if I delete an axiom, have I actually deleted it?) are ignored.

Finally, the structural specification defines an API for OWL. It would be nice to see a comparison and/or critique.

Drinking coffee in Italy is a quite different experience from drinking coffee in many UK coffee shops. In Italy, first you go into a bar — ”bar” in Italian doesn’t really have a direct translation into English, as it’s not the same thing as a British pub, although they do have large and impressive counters — the bar itself. The person behind the bar is called a barista, which is Italian for “barman”. The barman is normally casually dressed. Assuming you want a coffee rather than food, you ask for a coffee in Italian, which is, of course, the local language. The barman will turn around, fiddle with the coffee machine for a moment or two, give you a coffee and then take the 1 euro or so that is the normal charge. Most people drink this at the bar, without sitting down.

In the UK, you enter the coffee shop experience; the shops are often quite large, and involve sofas. The shop assistant is not a shop assistant but a “barista” which is not English. Baristas are, of course, trained and have the stars on their name badge to show it. You will ask for what you want, which you will describe also not in English, such as a “skinny, grande latte” which is Italian for, well, actually very little. The barista will fiddle with their machines for several minutes — thump, thump, thump to clean the old grounds, tsch, tsch, tsch to create the new, clunk, clunk clunk — pssssss, ahhhh. The coffee will then be served, often with a sprinkle of chocolate patterned with a pleasing corporate logo. You will give them the 3 pounds which is the normal charge. They will stamp your loyalty card.

The coffee will fail notably to taste any better than in Italy.

The reason for all of this fuss is called market segmentation: in the UK, coffee is a luxury experience; in Italy, it is a drink. You need all of this additional fuss to validate the price that you are paying; otherwise, you would feel like you were being ripped off. The irony, of course, is that the fuss costs money to provide, so the price goes up even more. In the UK, I rarely drink coffee, which is a pity, as a coffee (or espresso as we like to call it here) is quite nice in the morning.

My experience with academic authoring and publishing is rather like this. The process is surrounded by an enormous amount of mystique and hard work which adds relatively little to the process, but whose purpose is to convince the author that it is all really important, and well worth the cost (either 1000 pounds or copyright assignment, whichever is the case), and time.

So, which parts of the publishing process do not actually make the coffee taste any better? To think about this, we need to think about the point of publishing in the first place: what are we trying to achieve? The process runs something like this: I, the scientist, do some work, which generates some knowledge about something; I, the author, then turn this into a form suitable for communication; others, co-authors and peer-reviewers, help to check that this has been achieved; finally, it is published or made available to the world. Other scientists then read this and the world becomes a better place. The last part is, of course, an aspiration and not always a reality.

This process is actually very simple. In fact, it is so simple that I achieve most parts of it with standard technology such as the WordPress installation producing this page. Peer-review can be easily layered on top of this as we have with KnowledgeBlog (http://www.knowledgeblog.org/). I am still in two minds about whether I value peer-review. It can be valuable scientifically, but in many cases it boils down to comments about how the reviewer would have written the paper; like most scientists, I am careful about my work, and get others to check much of it before I publish. And even when it does add value, it can slow down publication enormously, sometimes to the extent that publishers appear to wish to finesse the issue (http://svpow.com/2012/10/03/dear-royal-society-please-stop-lying-to-us-about-publication-times/).

Now, Chris Surridge recently characterised open access as “discovering what [we] can do without” (http://www.nature.com/spoton/2012/10/the-futures-bright-the-futures-orange-2/). I take a rather different view of things; I see this as an opportunity to actively rid ourselves of some baggage. What things do I actively not want from the publishers, then, rather like the bumping and banging in a coffee shop?

So, here is my list. I am sure that most academics out there could easily come up with their own list.

PDF
Years ago, I used to prefer paper over screen. Slowly and painfully I changed. I have now reached the same position with PDFs. Millions and millions of people use the various PDF viewers every year, to view millions of documents. But billions use the web. The difference shows.
English Correction
A quick look at this blog will show that my English is not perfect and I make typos. But my English is understandable. It’s enough.
Grammar Fascism
No, experiments are not performed. Sentences are not improved by use of the passive voice. For that matter data is not plural. These things are personal decisions. Get over it.
Dialect Correction
I wrote code to convert between British, Canadian and American English. Why did I do this?
Bibliography Styles
“Endnote offers more than 5,000 bibliography styles”. This is good. How?
Forced Structure
But you have to have a materials and methods section, because everybody really needs to know where you buy your computers and how much RAM they have.
Issues
Even the Royal Society is going continuous, so surely this is the way forward.
Complex Submission Workflow
Obviously, this increases quality, particularly if it is different from everyone else’s.
Submission Templates
Badly written LaTeX, or a dodgy Word template. Totally different from everyone else’s. With minor quirks. Oh, and instructions not to write any new commands in your LaTeX. What about Sweave? Pretty much, no you can’t have that.
Colour Image Charges
Colour is not all I want; I want movies (http://www.russet.org.uk/blog/2189).
Type Setting
Given that I have just spent three weeks writing a paper and checking that I have got it all right, why have someone type it all out again, and then ask me to check that they have done it right? Especially when they often haven’t.
DOIs
I’ve argued about these before (http://www.russet.org.uk/blog/1849), and no doubt will do again.
Impact Factors
Statistically illiterate (http://occamstypewriter.org/scurry/2012/08/13/sick-of-impact-factors/). Irritating.

Now, of course, we cannot lay the blame for all of these at the door of the publishers. The passive voice, for example, is a pretension that gets enforced at almost all levels of science. And there are also failings which are a nasty combination of all of these; the decision of an individual scientist to save results incognito for a bigger paper comes from a nasty combination of Impact Factors and their uses, page-limited publishing and the complexity of the submission workflow.

The move toward open access, though, is a backdrop. The publishing world has changed for me substantially, and I now have a strong baseline. arXiv (http://arxiv.org/) does nearly everything that I want; it is easy, straightforward and rapid. Were it not for my first pet hate (PDFs), then arXiv would be everything I need (I know arXiv doesn’t take PDFs, but it does expect papers to be, well, paper shaped, so PDFs are how you see them). Publishing in this way, using my own CMS, is the most pleasurable of all; I have modified the environment to fit me and that makes things easy.

It is still not ideal; in my case, I maintain my own server which comes with a high (time) cost particularly when things go wrong (http://www.russet.org.uk/blog/1939), and it would be nice to be rid of this hassle. But I value the ability to add to the environment, with tools like Kcite (http://knowledgeblog.org/kcite-plugin).

I do not know what the future of academic publishing will be; what I do know is that I want the process to be as easy as this, and I can see no reason why this should not be achievable. A one-euro coffee bar with no fuss near my office would be nice as well.

Updates

I have made some spelling/typo corrections.

Addendum

I forgot to add another of my pet hates. As a reviewer, I absolutely loathe getting manuscripts with double spacing and the figures at the end. Or, worse, in a different file. Why? This just makes them hard to read. It is painful for the authors (unless they are using LaTeX, when it’s one line). I doubt that it even helps the publishers these days. And I don’t care even if it does.


Recently, I was surprised to be told that we could not have colour figures in our paper (http://www.russet.org.uk/blog/2170), even though it was online only. Our assumption is that this is an enormous legacy issue; the publisher in question, OUP, still produces a tree-based version of the journal, Bioinformatics, and it is in print that the distinction between colour and monochrome matters.

Of course, it is easy to criticise others for being trapped in a legacy situation. The reality is, though, that it can happen to anyone; it is not always possible to take a step back, to reassess standard procedure, to think whether it still makes sense. The paper-based publication process still affects all of our ways of thinking, and this includes my own.

The paper in question contains graphs showing a time course, in this case showing an analysis of all the versions of UniProt/SwissProt from the first archived version. For our print version (now available at arXiv (1208.2175)), we used a four-panel graph (yes, in colour), showing the first version, the last, and two in the middle.

However, while writing the talk for ECCB12, we realised that what we really needed was movies: animated images showing the change over time. An animation actually takes much less screen space than a four-panel display and displays the results very clearly. My co-author, Michael Bell, has also published the same images online (http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=610). In this process, he has managed to achieve with a simple web page what conventional scientific publishing cannot: both colour and motion. He has also shown us how bound we are by the paper-based legacy of scientific publishing; this is how we should have presented the data all along, and it should not have required a talk for us to realise this.


Over the last couple of days, I have moved the host machine for this blog; many thanks to Dan Swan, who has provided me with pro bono hosting since I moved to WordPress in 2009 (http://www.russet.org.uk/blog/2009/05/new-day-new-blog/). As far as I can tell at the moment (assuming I do not discover anything broken), the move has gone seamlessly. The only tricky bit is testing the new site — the solution here is to fiddle with /etc/hosts to point the test client at the new site, without having to change the DNS. When I do change the DNS, everything should be ready.
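Concretely, that means a line like the following in /etc/hosts on the test machine only (the IP address here is a made-up documentation example; use the new server's real address):

    # Test client only: resolve the blog to the new server before the
    # DNS change. 203.0.113.10 is a placeholder address.
    203.0.113.10    www.russet.org.uk russet.org.uk

Once the DNS has been switched and has propagated, the line can simply be deleted.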

I have taken the opportunity to do some house-keeping at the same time, as I could try this out on the new, non-live version. The riskiest thing that I have done is to move the permalink structure to a semantics-free version. I have tried this before with variable success (http://www.russet.org.uk/blog/1908). This time, WordPress appears to be behaving like a good citizen; the old permalinks are still working, but now forward to the semantics-free versions. If it all works, this will clearly have been a good thing to do. So fingers crossed.

Update

  • If you are reading this, then it has all worked.
