
LaTeX to WordPress

Phillip Lord

This post describes the process of posting to WordPress from a LaTeX source file, using tools generated as part of the Knowledgeblog project.

1 Introduction

About a month ago, we managed to get funding from JISC for knowledgeblog; the idea is to turn a blog platform from something for light commentary into a framework for serious scientific publication. One of the key requirements for this is to fit in with people’s existing working practices; and for this, we need a good document creation environment. This means Word and LaTeX. I’ve been working mostly on the latter, and this post is the first outcome. It’s generated totally automatically from LaTeX. This is an advance on my paper on realism, which was semi-automatically converted, with some hand editing of the HTML.

At the moment, the tool-chain is a little bit clunky, but it will improve! This is not meant to be an announcement that all is ready, just an early alpha release and proof-of-principle.

2 Implementation

The implementation of this tool-chain uses three pieces of software:

latextowordpress:

This package, which I have written, uses plasTeX to parse and render the LaTeX into HTML. Most of the work is performed by plasTeX out-of-the-box, although using a non-default configuration. Math-mode is treated separately, however, rather than using plasTeX’s default image rendering approach.

blogpost:

blogpost is used to actually post the generated HTML onto the web. The HTML can also be cut-and-pasted directly into WordPress, but blogpost is easier for me, as it’s the tool I usually use anyway (normally over asciidoc source). blogpost is unmodified.

mathjax-latex:

This is a WordPress plugin, which I have written, that uses MathJax to render math-mode from the original LaTeX in the browser. The plugin just injects the MathJax JavaScript headers into a post on demand (i.e. only on posts with math-mode in them).
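
The on-demand behaviour of the plugin can be illustrated with a minimal sketch. This is not the plugin's code (a WordPress plugin would be written in PHP); it is a Python illustration of the idea only. The script URL and the function names `has_math` and `inject_mathjax` are assumptions made for the example.

```python
import re

# Hypothetical script location; the real plugin's URL and configuration may differ.
MATHJAX_SRC = "https://example.org/mathjax/MathJax.js"

# Inline \( ... \) or display \[ ... \] math delimiters.
MATH_PATTERN = re.compile(r"\\\(.+?\\\)|\\\[.+?\\\]", re.DOTALL)


def has_math(post_html: str) -> bool:
    """Return True if the post contains at least one math-mode span."""
    return MATH_PATTERN.search(post_html) is not None


def inject_mathjax(post_html: str) -> str:
    """Prepend the MathJax script tag only when the post needs it."""
    if not has_math(post_html):
        return post_html
    script = f'<script type="text/javascript" src="{MATHJAX_SRC}"></script>\n'
    return script + post_html
```

Posts without math are returned unchanged, so readers of ordinary posts never download the MathJax script.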

Currently, this is all held together with some dodgy makefiles; this will be improved in time.

The first and last of these tools are available from knowledgeblog. I’ve tested them on Ubuntu 10.04 and they are in alpha. Comments are welcome, to knowledgeblog-discuss.
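
For readers who want to see what the posting step amounts to, here is a rough sketch using Python's standard `xmlrpc.client` and the metaWeblog `newPost` call that WordPress exposes over XML-RPC. It is an illustration under assumptions, not a description of how blogpost works internally: the endpoint, credentials, blog id and function names are all placeholders.

```python
import xmlrpc.client


def build_new_post_request(title, html_body, username, password,
                           blog_id="1", publish=False):
    """Serialise a metaWeblog.newPost call without sending it."""
    content = {"title": title, "description": html_body}
    params = (blog_id, username, password, content, publish)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")


def post_to_wordpress(endpoint, title, html_body, username, password):
    """Send the generated HTML as a draft and return the server's reply."""
    # endpoint is assumed to be the blog's XML-RPC URL, e.g. .../xmlrpc.php
    server = xmlrpc.client.ServerProxy(endpoint)
    content = {"title": title, "description": html_body}
    return server.metaWeblog.newPost("1", username, password, content, False)
```

Building the request separately from sending it makes it possible to inspect the payload without touching a live blog.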

3 Key Features

At the moment, I haven’t fully explored which features of LaTeX are well supported. However, all the structural elements (sections, lists), bibliographies, and links via the hyperref package seem to work well.

The math mode rendering works well. I’ve been using one famous equation, \(E=mc^2\), as my main test. But more complex examples also work. This is from MathJax: \(J_\alpha (x) = \sum _{m=0}^\infty \frac{(-1)^ m}{m! \, \Gamma (m + \alpha + 1)}{\left({\frac{x}{2}}\right)}^{2 m + \alpha }\).

I’ve also made a few tweaks to this for common idioms. So the less-than symbol is written in math mode in LaTeX but rendered directly in HTML: <.
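
The split between math that is passed through for MathJax and idioms that are rendered directly can be sketched as follows. This is a simplified stand-alone illustration, not the latextowordpress code: the real tool works on a plasTeX document rather than on raw strings, and the `DIRECT_HTML` table and `render_inline` name are hypothetical.

```python
import html
import re

# Inline math spans of the form \( ... \).
MATH_SPAN = re.compile(r"\\\((.+?)\\\)", re.DOTALL)

# Hypothetical table of math-mode idioms that are rendered directly as HTML.
DIRECT_HTML = {"<": "&lt;", ">": "&gt;"}


def render_inline(text: str) -> str:
    """Escape ordinary text, keep math spans for MathJax, shortcut simple idioms."""
    parts = []
    last = 0
    for match in MATH_SPAN.finditer(text):
        parts.append(html.escape(text[last:match.start()], quote=False))
        body = match.group(1).strip()
        if body in DIRECT_HTML:
            # A lone symbol needs no MathJax; emit the HTML entity instead.
            parts.append(DIRECT_HTML[body])
        else:
            # Keep the delimiters so MathJax can find the span in the browser.
            parts.append("\\(" + html.escape(body, quote=False) + "\\)")
        last = match.end()
    parts.append(html.escape(text[last:], quote=False))
    return "".join(parts)
```

For example, `render_inline(r"a \(<\) b")` gives `a &lt; b`, while `\(E=mc^2\)` is left in place for the browser to typeset.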

4 Future Work

There are many things left to do yet. The process needs to be made smoother, with a single tool to hook the current tool-chain together; it would also be good to attach a PDF generated from the LaTeX. Currently, titles are set independently (which is why this post appears to have two titles). The MathJax plugin needs configuration options (it overwrites wp-latex functionality at the moment). And there is significant testing to do to see what advanced features (figures critically!) work and don’t work. Still, it’s good to see that most of the tools that I needed to get this to work already existed. With luck, most of the other tools we need will be as good.

I don’t normally use my blog to engage in conversations the way that some people do. I already spend enough time on mailing lists, so using the blog seems redundant for this. However, I will change the habit of a life-time this once, because of an interesting discussion on institutional repositories, which I have previously written about myself.

To me, the difficulty with institutional repositories is this. First, they are a resource. Then, someone says, this is good, everyone should do this. Then, someone else says, hey this is great, we could use this for our RAE (REF, whatever) return.

Now, you have to deposit things in your IR. But people object, on various “data is mine” grounds, so perhaps they make the IR non-public. The data model gets tweaked with various additional data (which school, who your line manager is) necessary for RAE. At the same time, your co-authors also have to deposit into their IR. And, if you move, you have to type your entire back catalogue into various repositories for your new institution.

Currently I am supposed to deposit papers in various IRs, including at University and school level, as well as add bibliographic information to various databases. And then, of course, there are project wikis. And the funders want the information in various databases. All of which is very time consuming, and produces highly duplicated and often error-prone data. In short, it’s a bad thing.

The irony is, if you google for any of my papers, the main source from which they are scraped is my website. I set this up myself many years ago now; it’s a simple bibtex to HTML thing (actually not so simple nowadays; it grew over time). So, the simplest and most straight-forward solution also turns out to be the best. The most important thing is this: the bibtex files are the ones that I use, for my own work, for citing myself (which, like any good scientist, I do as often as possible even when the citation is largely irrelevant). The website is what I use when on the road to get the PDF of my own papers; if I want to give a reference to someone, I’ll email a link to my website. So, I keep it up to date, because it’s to my benefit to do so.

We need a few simple and easy-to-use standards for bibliographic data. They have to be simple, because they need to fit in with people’s current work practices; this means they need to be supported in a heterogeneous environment, by many different tools. And they won’t be, if the standard is hard to develop against.

For data, of course, the issues are somewhat different. Mostly because data needs more structure than human-readable information, and because the data is often large. However, two issues remain: first, we still need to fit with people’s working practices; second, with data, engaging in the institutional football we see with bibliographic data will still be a bad thing.

Again, simple data standards are what we need. After that, people will choose whatever they choose; the data standard will be enough to bring it all together in the best way that we can.

 

I’m very pleased that our grant for knowledgeblog has been accepted by JISC. I shall follow the tradition that I set with my last post, of publishing all my primary scientific output on this blog. In this case, I’m using Word, which, like the LaTeX that I used last time, isn’t perfect. Still, improving this process is part of the knowledgeblog proposal, so this post is also attacking a key deliverable for the grant!

The main content for this post is also available on the knowledgeblog events blog.

 

Outline Project Description

The project extends existing blogging tools for use as a lightweight, semantically linked publication environment. This enables researchers to create a hub in the linked-data environment, which we call knowledge blogs or k-blogs. K-blogs are convenient and straight-forward for authors to use, integrating into researchers’ existing work practices and tools. They provide readers with distributed feedback and commenting mechanisms. We will support three communities (microarray, public health and workflow), providing immediate benefit, in addition to the long term benefit of the platform as a whole. Additionally, this will enable a user-centric development approach, while showcasing the platform as the basis for next generation research publishing.

1. Introduction

This document describes a proposal for a project within the JISC “Managing Research Data” call. Data comes in many forms, from raw statistics, to highly structured databases, through to textual reports; natural language, although hard to search and manage, is still the richest form of representation; data in the form of reports and publications are the central hub around which all other data sit. This project, therefore, will provide a lightweight, yet extensible, framework for scientific publishing, incorporating a software-supported peer-review process. Bi-directional links will be maintained both between publications and to other forms of data, using semantic markup to enhance the meaning of these links. We will also customize this framework for three communities which, as well as being directly useful, will provide real-world requirements. The project will largely develop “glue” between existing, widely-used, open-source software systems, ensuring its sustainability and usefulness past the end of the funding.

2. Fit to Programme Objectives and Project Outline


The project call identifies the complexity and hybrid nature of the UK research data environment; despite this, one central focal point remains: most researchers spend considerable amounts of time discussing their data in the form of “paper” publications. For some more theoretical disciplines, such as parts of computer science, the paper is the sole output; in others, such as biology, datasets are associated with papers and the barriers between “publication” and “data” are breaking down. Most data sources in biology are rich in annotation: text that supports and explains the raw data. It is normally the annotation, not the raw data, which defines the quality of the resource. In these cases, text is an intrinsic part of the data.

However, the conventional publication process has changed relatively little; web technologies have largely been adopted only as a distribution mechanism. Publications are still expensive, either at subscription or publication time depending on the business model of the publisher, and involve considerable, time-consuming interactions between author and publisher, often relating to display and presentation issues. This is in stark contrast to, for example, the biological data centres where both raw and annotated data are often made available within hours of their generation.

This situation is unfortunate because it limits the ability of researchers to customise their publication process for the requirements of their own discipline. As demonstrated by Shotton et al. and Rousay et al., it is possible to add considerable value, both enhancing the paper for the reader and providing direct and semantically enhanced links to underlying data. The cost of the existing process, however, makes this form of publication unlikely for some data; for example, few scientists publish papers about negative results, resulting in an acknowledged publication bias. As a result, it is hard for the semantically enhanced publication to take its place as the central hub for a linked data environment as envisioned by Coles and Frey, linking to and between research datasets, and the published knowledge about these datasets.

In the last decade, the blog has become a common, web-based publication framework. There are now numerous off-the-shelf tools and platforms for managing blogs, providing a high degree of functionality. Many scientists blog about their work, about other published work (research blogging) or “live blog” about conferences and talks as they happen. In this case, the researcher is in charge of their own publication environment, can extend it to their requirements, and publication happens immediately. However, the blog has not yet become a standard means of publication for primary research output.

Recently, as part of the EPSRC-funded Ontogenesis network (ref), we trialled the Knowledge Blog process; in this case aimed at producing an educational resource describing many aspects of ontology development and usage, which might previously have been published in book form. We have shown that with this technology base, it is possible to replicate many of the features of the open peer-review, scientific book publication process; following two small meetings, we have written around 20 articles, and the website maintains around 1000 post reads per month (not simple hits!). To achieve this, we used only two features of the blog: trackbacks (bidirectional links) and categories (hierarchical keywords). Although we used the WordPress blogging software, these features are supported by most other systems. We call these articles k-blogs.

Currently, however, the k-blog process is not fully supported with blog software alone, nor does it fully support the referencing, advanced linking and provenance needed specifically for research publications. For this project, we propose to provide extensions to support data-rich publications, deeply and semantically linked to other k-blogs and to other forms of data repository. Therefore, the project addresses the objectives and aims of the call through four main workpackages.

1) A documented k-blog process (WP1.1) describing different levels of peer-review suitable for different forms of research data, and an implementation (WP1.2), the k-blog platform, of these processes based around open-source, off-the-shelf software.

2) Extensions to the k-blog platform supporting linking. This includes full support for referencing including COinS metadata on posts (WP2.1), client-side and permanently linked versions (WP2.2) and bidirectional links (WP2.3) to other data sets. We will add semantics to these links using the Citation Ontology (CiTO) (WP2.4).

3) Support for three specialist environments: healthcare (WP3.1), microarray (WP3.2) and workflows (WP3.3). All are useful in their own right and showcase the extensibility of the framework.

4) Documentation and tooling to integrate the k-blog process into scientists’ existing working practice and tooling; scientists will be able to publish from Word, OpenOffice, Google Docs or LaTeX (WP4.1). We will add tooling and documentation, as WP4.2, to support the use of reference management tools such as Endnote, Mendeley or Zotero, making use of deliverables from WP2.

3. Quality of proposal and Robustness of Workplan

 

3.1 WP1: Knowledge Blog Process



In this project, we aim to develop a light-weight publication framework, including the desirable aspects of the formal peer-review process. However, different forms of scientific publication require different levels of peer-review. For example, for http://ontogenesis.knowledgeblog.org, we require two reviews from an editorial board, assessing quality, appropriate for an educational resource. However, for http://process.knowledgeblog.org, which is intended to contain informal “how-to” and request for comment documents, a much lighter-weight, single editorial review assessing scope alone is more appropriate. Deliverable WP1.1 will consist of documentation describing both formally and informally, a number of levels for the knowledge blog process, and how these can be achieved using a blog. These documents will, themselves, be published on http://process.knowledgeblog.org.

These processes will be implemented as Deliverable WP1.2, comprising freely available and widely used pieces of software, with additional “glue”. The basic publication framework will use WordPress 3 (WoP), an open-source, multi-site, multi-author blogging system used to provide the hosted blog service at http://www.wordpress.com. While we have found that WoP supports many aspects of this process, particularly from the reader’s perspective, a significant degree of “book-keeping” is required from authors, reviewers and editors. Readers know whether a paper has been reviewed or not, but authors have to remember for themselves who is reviewing the paper. Therefore, we will use a “ticket system”, specifically Request Tracker 3 (RT) (http://bestpractical.com/rt/). Both WoP and RT are extensible with plugins and will be extended and adapted to reflect the k-blog levels of WP1.1.

We will use this extensibility to provide a light-weight integration. RT operates as an email response system; by extending WoP to send email on submission of new papers, this can provide both an integration point and the main point of interaction for authors, reviewers and editors. To provide editorial and reviewer functionality, tickets can be moved between queues; extensions to RT will use standard blogging XML-RPC calls to feed back to WoP by, for example, re-categorising papers once accepted. OpenID (http://openid.net) will be used to integrate the user accounts between the two systems. WoP already supports this fully, while RT supports it in skeleton form.
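
As an illustration of the email integration point, the message that a new submission might send to the ticket system could look like the sketch below. It uses Python's standard `email.message` module for illustration only; the queue address, subject format and function name are hypothetical, and the real extension would live inside WoP.

```python
from email.message import EmailMessage

# Hypothetical queue address; a real deployment would use its own RT mail gateway.
RT_QUEUE_ADDRESS = "kblog-review@example.org"


def submission_email(post_title, post_url, author_email):
    """Build the notification that opens a review ticket for a new paper."""
    msg = EmailMessage()
    msg["From"] = author_email
    msg["To"] = RT_QUEUE_ADDRESS
    msg["Subject"] = f"New k-blog submission: {post_title}"
    msg.set_content(
        "A new paper has been submitted for review.\n\n"
        f"Title: {post_title}\n"
        f"URL: {post_url}\n"
    )
    return msg
```

Because the author's address is the sender, replies from the ticket system would reach the author directly, which is what makes email the main point of interaction.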

Although we will provide an implementation of the k-blog process, it will be described sufficiently generically to support complete and independent implementation.

 

3.2 WP2: References and Metadata

For k-blogs to become an integral part of the scientific record, they must fully support the semantic and linked data environment. Although WoP supports standard URI based linking to resources, and bidirectional “trackback” linking to other resources, it lacks complete referencing functionality suitable for research communities. This is a rare example of functionality that is not already provided by WoP or an associated plugin. Deliverable WP2.1 will fulfil this need; we will support the insertion of at least DOIs and PubMed IDs (PMIDs), which will be resolved to full human-readable reference lists for display, using APIs provided by CrossRef and NCBI eUtils respectively. To fully support computational agents wishing to access the same information, references will also support COinS metadata, embedded into the display HTML.
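
To make the COinS idea concrete, the sketch below builds the empty `span` element that carries a reference as an OpenURL ContextObject in its `title` attribute. It is a minimal illustration for a journal article only, assuming the metadata has already been resolved; the function name and the sample values used with it are hypothetical, and a real plugin would cover more reference types and fields.

```python
import html
from urllib.parse import urlencode


def coins_span(title, journal, year, authors, doi=None):
    """Return a COinS span describing one journal article."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.atitle", title),
        ("rft.jtitle", journal),
        ("rft.date", str(year)),
    ]
    # One rft.au entry per author.
    fields.extend(("rft.au", author) for author in authors)
    if doi:
        fields.append(("rft_id", f"info:doi/{doi}"))
    # URL-encode the key/value pairs, then escape "&" for use in an attribute.
    title_attr = html.escape(urlencode(fields), quote=True)
    return f'<span class="Z3988" title="{title_attr}"></span>'
```

The span is invisible to human readers; tools that understand COinS look for the `Z3988` class and decode the attribute.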

K-blog posts will also require outward facing metadata that describe the resources they provide in a standards-compliant manner. The Open Archives Initiative (OAI) provides standards that aim to facilitate the efficient dissemination of content. Specifically, the Object Reuse and Exchange specification (OAI-ORE) is a standard for the description and exchange of compound digital objects (such as a WoP post or page). The WordPress OAI-ORE plugin provides link header elements that implement this specification.

Our initial investigations into the k-blog process showed that WoP support for versioning and provenance is lacking; the k-blog process involves updating papers after submission but before final acceptance. While WoP stores all these versions, they are currently visible only to authors or editors through the administration interface. Whilst existing plugins for WoP already provide some of this functionality, Deliverable WP2.2 will expose these versions to readers, along with a defined permalink scheme for access to all versions, providing full provenance.

WoP supports bi-directional links in the form of trackbacks; this is mediated by XML-RPC calls between resources when a link is made. This will support linking to data where, for example, the data is another k-blog; however, general data resources may lack support for this process. Therefore, as Deliverable WP2.3, we will provide a trackback proxy, hosted on the http://knowledgeblog.org server, storing and presenting these links for resources that cannot directly process trackbacks.
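
The core of such a proxy is a store that remembers, for each data resource, which posts link to it. The sketch below shows only that bookkeeping, held in memory; the class and method names are hypothetical, and a deployed proxy would also need persistence, an HTTP interface and spam protection.

```python
from collections import defaultdict


class TrackbackProxy:
    """In-memory store of incoming links for targets that cannot record them."""

    def __init__(self):
        self._links = defaultdict(list)

    def record(self, source_url, target_url, title=""):
        """Store one link from a post to a data resource; duplicates are ignored."""
        entry = {"source": source_url, "title": title}
        if entry not in self._links[target_url]:
            self._links[target_url].append(entry)

    def links_to(self, target_url):
        """Return a copy of the links recorded for one target resource."""
        return list(self._links.get(target_url, []))
```

A page for a dataset could then call `links_to` with its own URL to present the k-blog posts that discuss it.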

To complete this work package, we will add semantics to the links using CiTO, as Deliverable WP2.4. Therefore, as well as enabling easier data linking and provenance, we will also enable addition of meaning to these links.
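
One way to picture a semantically typed link is an ordinary HTML anchor whose `rel` attribute names a CiTO relationship. The sketch below is an assumption about how this could look, not the design of WP2.4: the namespace URI should be checked against the current CiTO documentation, and the function name and example relation are illustrative.

```python
import html

# Assumed namespace URI for CiTO; verify against the ontology's documentation.
CITO_NS = "http://purl.org/spar/cito/"


def cito_link(href, text, relation="cites"):
    """Render a link annotated with a CiTO relationship in its rel attribute."""
    safe_href = html.escape(href, quote=True)
    safe_text = html.escape(text, quote=False)
    return (
        f'<a xmlns:cito="{CITO_NS}" rel="cito:{relation}" '
        f'href="{safe_href}">{safe_text}</a>'
    )
```

With this, a post could distinguish a link that merely cites a dataset from one marked, for instance, `citesAsDataSource`.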

 

3.3 WP3: Specialist Environments



The k-blog platform and process are designed to be flexible and adaptable to the needs of specialist environments. We will use three main use cases to ensure real world applicability of the software, as well as fulfilling the immediate needs of these communities.

For Deliverable WP3.1, we will add additional features for supporting the microarray community. Currently, the microarray community is well serviced in terms of metadata capture (MIAME) and deposition in public repositories (ArrayExpress, GEO). As part of WP2, we will support linking to these datasets through stable URIs. However, these resources deal only with data generation. Post-processing and analysis is largely captured at the publication stage, often in supplementary material.

A substantial amount of this analysis uses BioConductor: a widely used, open-source platform for statistical microarray analysis based on the R statistical programming language. We will extend k-blog with specific support for R and BioConductor. Authors will be able to directly embed code into k-blog papers, along with the figures that result; as a result reviewers and readers will be able to see a computationally precise description of methods and replicate the generation of figures should they choose.

Finally, we will investigate the possibility of publication to a k-blog using only R code and references to public databases, in a process similar to Sweave: figures will be generated on the server, providing guarantees of correctness and precise provenance. The limited scope of this call means this part of WP3.1 will be proof-of-principle only.

For WP3.2, we will focus on the public health community (PHC): a key workforce in delivering quality and effective healthcare by providing timely and accurate public health intelligence (PHI). PHI is a varied environment, performing statistical analyses and producing information figures, diagrams and reports to communicate results to the wider health community. However, the PHC operates in small groups with little knowledge networking. The main aim of the k-blog is to improve the availability of health information, data and knowledge, to inform decisions for health protection and care standards as supported by the Quality Improvement Productivity and Prevention initiative. The NWeHealth e-Lab project, hosted at The University of Manchester, provides an environment to bring together research objects into a single location. As elsewhere, textual data forms the key hub that links together all the other forms of knowledge. By linking to e-Lab research objects from a k-blog, this link will be made explicit, available, interpretable and directly valuable to the PHC; as a result WP3.2 is synergistic with the rest of the proposal. This community also brings a set of access control requirements. To support these we will use existing WoP facilities, providing a simple, easy-to-use three level access model.

 

For WP3.3, we will generate k-blog content about Taverna workflows and methods for building them. Workflows have become a popular way of realizing computational analyses and have become an important form of data. The JISC-funded myExperiment project is widely used to disseminate the workflows themselves. Knowledge about issues surrounding workflows is, however, more difficult to produce and disseminate. A k-blog, with its ability to produce short, targeted articles as the need arises and the resources become available for writing, suits the need for Taverna workflow documentation. We will seek k-blogs on Taverna issues such as: the basics of workflow design; how to choose among a set of similar services in producing a workflow; and the testing of workflows. We will implement a light-weight mechanism, using trackbacks, to link between the k-blog and myExperiment.

 

As part of WP3, we will also hold four workshops, at 3-month intervals, each focusing on one particular k-blog and community. These workshops will be of the form previously trialled as part of the Ontogenesis network, and will serve several purposes: requirements gathering and feedback for us, education for the community, and development of content that demonstrates the process to the general readership.

 

3.4 WP4: Integration with Existing Working Practices



For the k-blog process to be acceptable to communities such as those described in WP3, it must fit with existing working practices. Researchers mostly write documents using a word-processor. Fortunately, as the k-blog platform is based on the widely-used WoP, which in turn offers a widely-supported API, this style of working can be readily integrated. It is already possible to author using Word (2007 onward), OpenOffice, Google Docs and LaTeX using integrated or existing technologies, as demonstrated by our previous work at http://ontogenesis.knowledgeblog.org. For Deliverable WP4.1, user-oriented documentation describing these tools will be developed. This documentation will also describe clearly how to present and organise papers in a way which is optimized for the k-blog process. While we expect this documentation to take a significant time-span to produce, refining it as a result of user feedback, it is important to note that a k-blog is already useful and possible.

To take maximal advantage of linking technologies developed in WP2, we will need to integrate with existing technologies for referencing. As Deliverable WP4.2, we will add tooling to enable the use of bibliographic tools such as Endnote, Mendeley, Zotero or BibTeX to insert references that k-blog can directly translate. Largely, this should consist of “styles”, modifying the in-text citation, as the reference plugin of WP2.1 will generate reference lists. As with other deliverables, this tooling will include substantial documentation, developed using the k-blog process.

4. Project Timeline

 

| Name   | Start      | End        | Staff | Notes                                       |
|--------|------------|------------|-------|---------------------------------------------|
| WP 1   | 02/08/2010 | 30/10/2010 |       |                                             |
| WP 1.1 | 02/08/2010 | 31/08/2010 | All   | A documented k-blog process                 |
| WP 1.2 | 01/09/2010 | 30/10/2010 | DS,SC | Implementation with off-the-shelf software  |
| WP 2   | 01/11/2010 | 30/04/2011 |       |                                             |
| WP 2.1 | 01/11/2010 | 26/02/2011 | SC    | COinS metadata on posts                     |
| WP 2.2 | 01/11/2010 | 29/01/2011 | SC    | Client-side, permanently linked versions    |
| WP 2.3 | 03/01/2011 | 26/02/2011 | DS    | Bi-directional links to other datasets      |
| WP 2.4 | 01/03/2011 | 30/04/2011 | PL    | Semantic linking with CiTO                  |
| WP 3   | 01/11/2010 | 30/07/2011 |       |                                             |
| WP 3.1 | 01/11/2010 | 30/07/2011 | GM    | Specialist environment – Healthcare         |
| WP 3.2 | 01/11/2010 | 30/07/2011 | DS    | Specialist environment – Microarrays        |
| WP 3.3 | 01/11/2010 | 30/07/2011 | RS    | Specialist environment – Workflows          |
| WP 4   | 02/08/2010 | 30/06/2011 |       |                                             |
| WP 4.1 | 02/08/2010 | 30/04/2011 | GM,DS | Authoring documentation and tools           |
| WP 4.2 | 02/05/2011 | 30/06/2011 | GM,SC | Referencing documentation and tools         |

 

5. Project Management Arrangements

The project will be managed from Newcastle University; the primary management will be from Dr Lord who will be responsible for:

  • Developing Project Management Plans;
  • Ensuring that the Project technical objectives are met;
  • Prioritising and reconciling conflicting opportunities;
  • Reporting and collaborating with JISC programme Manager;
  • Dissemination of the k-blog platform.

Project progress will be evaluated through scheduled, short, “stand-up” meetings on a weekly basis, conducted face-to-face, via Skype or phone as appropriate. Although most project staff are co-located, primary unscheduled communication will be via public mailing list, ensuring maximum visibility and openness. User consultation will be via public mailing list, as well as through a “dogfooding” k-blog. All project staff have been handpicked; they are highly experienced and self-directed, as outlined elsewhere. All are associated with several other projects and duties (research, research support, teaching and training), and are responsible for managing these independent workloads.

 

5.1 Risks

Staff Risk – as with all projects, loss of staff could negatively impact on this project; however, all staff are on permanent contracts and have long histories in research, so this is less likely. Additionally, by dividing the work between five individuals, we limit the risk should a single person leave.

WoP3 and other dependencies – the project depends on other software, most notably WoP for which a new version (3.0) is now in beta; however the software is widely supported. Other software is replaceable.

Standards Shifting – the project depends on a number of standards and these may change. In this project, we will NOT support standards, but rather use those that support us. Where standards change rapidly, their implementation will be delayed (until they stabilize) or dropped. None of the standards described here is critical to the success of the project.

 

5.2 IPR Position

All code will be developed under open source licences. WoP and RT are licensed under GPL, so code linking to these will be likewise licensed. Code that is separable will be released under LGPL. Code will remain copyright of respective institutions or authors. Any documentation produced by project staff relating to the project will be licensed under Creative Commons Attribution license. Licensing of individual k-blogs will be delegated, but permissive licenses will be encouraged.

 

5.3 Sustainability

This project is largely based around innovative, novel and leading use of existing software. As such, the sustainability of the majority of the technology base is not dependent on project members but on large companies with established and proven business models. The k-blog process will be cleanly separated from its implementation, ensuring only weak dependencies on underlying software. Where we produce software “glue”, public and widely supported APIs will be used where possible. This will ensure that components are replaceable. All code, including historical versions, will be publicly available. Documents produced by project staff will be publicly available and clearly licensed, so will be archived through the internet “cloud” resources; we are also seeking explicit support for archiving from the British Library.

 

5.4 Staff Recruitment

All staff are already in post.

 

5.5 Key Beneficiaries

Our key beneficiaries are the public health, microarray and workflow communities; as the k-blog process is based around commodity software, these groups can use the basic environment from the first day of the project to generate and share content. As the project progresses, so will the process, the software to support it and the documentation to explain it; at all stages, the k-blog process fulfils a clear and immediate need. While we are specifically targeting these communities, the k-blog process and platform are sufficiently generic that they can support a wide range of research activities.

Although presented here as a single platform, the process and components are separable and can benefit communities independently. In particular, the tools and documentation from WP2 and WP4 will find use within the research blogging community, who find, in particular, the lack of tooling for referencing difficult. Finally, the statement of a peer-review process, and its implementation within RT will be applicable to any peer-review environment regardless of the form of publication. This includes publications published using wiki or other Content Management Systems.

 

5.6 Engagement with Community

29We consider the mechanism for engagement with four kinds of community: engagement with our core content generating community is an intrinsic part of this proposal, as described in WP3. Further interaction with more disparate groups will be maintained through personal contacts; each of the five individuals named in this proposal are experienced and embedded in different communities (health care, microarray, ontology, proteomics). Engagement with our core content consuming community is, again, an intrinsic part of the proposal; all project communications will be via open mailing list or k-blog. Project members are active users of Web 2.0 social technologies; our initial trials as part of Ontogenesis showing this approach to be highly effective form of dissemination, with minimal effort. Engagement with software users will be via website and direct interaction. All software will be released or advertised via normal channels (website, versioning, and mailing list), including a (debian) package repository for those wishing to set up their own server. Finally, developer communities will not be specifically targeted, but our open source, continually integrated development plan will be attractive, and we will accept suitably licensed contributions.

All communities will benefit from the open and agile development methodology we will adopt; changes to the environment will be integrated and released rapidly, ensuring continual improvement and facilitating rapid feedback cycles.

 

6. Previous Experience and Project Team

 

Dr. Phillip Lord is a Lecturer in Computing Science at Newcastle University. He has a PhD in yeast genetics from the University of Edinburgh, after which he moved into bioinformatics. He is well known for his work on ontologies in biology, as well as his contributions to eScience beginning with his role as an RA on the myGrid project. Since his move to Newcastle, he has been an investigator on three more eScience projects (CARMEN, ONDEX and InstantSOAP), as well as maintaining an active engagement in standards development (OBI, MIGS, MIBBI), and publishing on the fundamentals of ontology design. He was an active participant in the Ontogenesis network, and developed the initial idea for knowledge blogs as part of this. He is an active blogger and developer.

 

Dr. Georgina Moulton is an Education and Development Fellow at The University of Manchester. Since 2005 her main roles have been to co-ordinate the development and delivery of multi-disciplinary bio/health informatics education programmes, and to facilitate the engagement of biological and health communities in a variety of bio and health informatics research projects (e.g., ONDEX, Obesity e-Lab). For 3 years, Georgina was the EPSRC funded Ontogenesis Network Manager, in which role she co-ordinated the activities of the network, expanded it by facilitating the development of new activities, and was involved in the trial k-blog process. More recently her work includes the development and delivery, in conjunction with NHS partners, of an education and development programme tailored to match the needs of North West public health analysts and the wider healthcare workforce.

 

Dr. Daniel Swan has a PhD in developmental biology and continued to work in developmental biology as a post-doctoral researcher before moving into bioinformatics in 2001. Subsequent positions included working for Bart’s and the London Genome Centre and the Centre for Ecology and Hydrology in informatics driven roles dealing with large, distributed biological datasets generated by large user communities. Currently the manager of the Newcastle University Bioinformatics Support Unit, he leads a small team helping biological researchers to generate, capture, store and analyse their digital data. His interdisciplinary background means he has grounding in both computer and biological sciences and is comfortable working on CS focused projects (CARMEN, InstantSOAP, Bio-Linux) as well as acting in a research capacity analysing high-throughput data.

 

Dr. Simon Cockell has a PhD in Genetics from Leicester University, and refocussed into Bioinformatics with a Masters degree from Leeds in 2005. From there he moved to Newcastle, and the Bioinformatics Support Unit. Since coming to Newcastle, Simon has worked on a range of projects involving large scale analyses (AptaMEMS-ID), data integration (Ondex) and health informatics (MRC Mitochondrial Disease Cohort).

 

Dr. Robert Stevens is a senior lecturer in Bioinformatics in the Bio and Health Informatics group at the University of Manchester. His main areas of research are in the development and use of semantics within the life sciences. This is blended with the use of e-Science platforms to gather and manage the data and knowledge of the life sciences. He was PI on the Ontogenesis network that ran the meetings for the first k-blog. He is or has been a co-investigator on the myGrid and myExperiment grants that will provide both content and technical input to this project. As well as the JISC funded myExperiment project, Stevens was an investigator on the JISC funded CO-ODE project that developed Protégé 4. On the back of this, Stevens has led the OWL training activities at Manchester that have directly fed into the Ontogenesis k-blog. This range of experience makes Stevens an ideal partner to lead the development of content within this project.

 

This post carries the text of a paper accepted for PLoS One. I publish it here as a pre-print because of the recent discussion on OBO discuss about realism. I have converted this from the original latex, which isn’t perfect. Apologies for errors.

The [PDF] is available here.

Adding a little reality to building ontologies for biology
Phillip Lord and Robert Stevens
School of Computing Science
Claremont Road
Newcastle University
Newcastle-upon-Tyne, UK
phillip.lord@newcastle.ac.uk
School of Computer Science
The University of Manchester
Oxford Road
Manchester, UK
robert.stevens@manchester.ac.uk

Abstract

Background: Many areas of biology are open to mathematical and computational modelling. The application of discrete, logical formalisms defines the field of biomedical ontologies. Ontologies have been put to many uses in bioinformatics. The most widespread is for description of entities about which data have been collected, allowing integration and analysis across multiple resources. There are now over 60 ontologies in active use, increasingly developed as large, international collaborations.

There are, however, many opinions on how ontologies should be authored; that is, what is appropriate for representation. Recently, a common opinion has been the “realist” approach that places restrictions upon the style of modelling considered to be appropriate.

Methodology/Principal Findings: Here, we use a number of case studies for describing the results of biological experiments. We investigate the ways in which these could be represented using both realist and non-realist approaches; we consider the limitations and advantages of each of these models.

Conclusions/Significance: From our analysis, we conclude that while realist principles may enable straight-forward modelling for some topics, there are crucial aspects of science and the phenomena it studies that do not fit into this approach; realism appears to be over-simplistic which, perversely, results in overly complex ontological models. We suggest that it is impossible to avoid compromise in modelling ontology; a clearer understanding of these compromises will better enable appropriate modelling, fulfilling the many needs for discrete mathematical models within computational biology.

Introduction

Ontologies are now widely used for describing and enhancing biological resources and biological data, largely following on from the success of the Gene Ontology [1]. Ontologies have been used for many purposes, from schema integration to value reconciliation to query interfaces [2]. Ontologies have also become a cornerstone of computational biology and bioinformatics. As computationally amenable artifacts they are, themselves, a direct part of computational biology; many computational biologists are involved in their production and maintenance. Many more use ontologies to summarise their data, often by looking for over-representation [3], as the basis for drawing computational inferences about data [4], or as the basis for determining semantic similarity [5]. Even those not making direct computational use of ontologies are likely to come into contact with them, for example, when preparing annotation as part of their data release [6].

It is, therefore, of vital interest to computational biologists that ontologies for use within biomedicine are fit for purpose. One effort that aims to increase the quality of the ontologies available within biomedicine is the “OBO Foundry” [7]. The main tool that it uses for this is “an evolving set of shared principles governing ontology development”. The initial eleven principles of the OBO Foundry [8] were largely concerned with what might be termed ‘good engineering practice’ (ontologies must, for example, be openly available, with a common syntax, well documented, and used). These principles have later been joined by a further eleven [9]; these include principles such as “textual definitions will use the genus-species form”, “Use of Basic Formal Ontology” and, the somewhat quixotic, “terms […] should correspond to instances in reality”. These stem not from engineering practice, but from a perspective called realism.

The many different uses for ontologies that we have described are reflected in different understandings and methodologies about how and what to represent in an ontology. Over the last few years, for many uses the paradigm has moved from “a conceptualization of the application domain” toward “a description of the key entities in reality”; it is this latter approach that defines realism [10]. This approach to ontology is typified by the Basic Formal Ontology (BFO); a small upper-ontology for use within science in general and biomedical ontology building in particular [11].

There has been significant discussion regarding the possibility of representing only “real entities” in computational ontologies [12]. Likewise, there has been significant discussion about the philosophy surrounding realism and the role of ontology in its representation [10]. While it is argued by some that it is possible to represent only reality when making a domain description, there has, however, been little discussion on whether it is necessarily desirable to do so.

In this paper, we consider the implications that realism has for the choices that are open to the ontologist while they are modelling their domain of interest. In particular, we consider the implications that this has for the computational capabilities of any resultant ontology, in terms of its ability to represent scientific knowledge in a computationally amenable form, as well as the ability to perform automated inference or statistics over this knowledge. We suggest that the application of realism results in ontologies that are over-complex, awkward or limited; as such, realism falls far short of its aim of increasing the fitness-for-purpose of ontologies. This approach, therefore, is unlikely to fulfil the needs of computational biologists, who form a substantial part of both the user and developer community for bio-ontologies.

Methods

In this paper, we take the approach of a number of worked exemplars; this is a complementary approach to an in-depth consideration of the modelling decisions for a particular area or particular ontology, which we have used previously [13], as it allows broader conclusions about the general principles of ontology development. For each section, as well as the main exemplars, a number of related examples are briefly discussed, to reinforce that the issues raised are, indeed, general.

The exemplars have been selected by several criteria. First, the main exemplars are all taken from within biomedicine; this is also true for the majority of the related examples. Second, we have chosen exemplars that provide as wide a coverage of biology as possible. Third, for practical reasons, we have chosen exemplars where the underlying science is relatively basic to much of biology and is likely to be immediately clear to the reader without significant explanation.

We have chosen exemplars requiring as little knowledge of specific ontologies as possible. We refer to only three. The first is BFO (see “What is Realism?”), which is a canonical example of a realist ontology. BFO is described as a cross-domain, upper-ontology; as a result, most of its terms fail the criteria given above; they are of poor biomedical relevance, and are not basic science or immediately clear. We have, therefore, also used PATO (see http://obofoundry.org/wiki/index.php/PATO:Main_Page); this defines “qualities” that we might consider attributes of other entities; so, the authors of this paper have a height, weight and shape, all of which are considered to be qualities of the authors. Finally, we use the relationship ontology [14]; this describes the relations between entities. So, for example, the height of the author inheres_in the author.

As discussed in this and other works [15, 16], “realism” is itself poorly defined. Where this lack of definition makes the consequences of realism hard to determine, we have taken the practical course of showing the consequences as they play out in practice; to an extent, therefore, these three ontologies are not only exemplars for realism, but define it, as it is currently practiced. In short, for this paper, when we say “realism”, we largely mean “realism as practiced by BFO”. We do not claim, in this paper, to address all the philosophical perspectives that have, through time, carried the name “realism”.

Results

What is Realism?

Building ontologies based on reality is obviously appealing to most scientists; after all, the study of reality to determine its behaviour and laws is the goal of scientists. A brief consideration, however, shows that this notion cannot define a methodology for the building of ontologies.

Within the context of science “reality” would normally be taken to mean our experimental or observational data; but the statement that science (ontologies) should be based on experimental or observational data is a truism and, as such, has no explanatory power. The “real” in realism refers, in fact, to the belief that the categories that we can use to divide entities are, themselves, real.

This distinction stems from an old argument from philosophy: realism against conceptualism. Again, both sides of the argument agree that the world we can perceive and, as scientists, experiment on is mind-independent. The conceptualist, however, argues that the categories, which they term concepts, are a product of social agreement. Conversely, the realist argues that these categories, which they term universals, are themselves real, that is, mind-independent in their own right, like the entities they describe.

This distinction may seem fairly confusing; as Russell [15] says “if I have failed to make Aristotle’s theory of universals clear, that is (I maintain) because it is not clear”. In fact, there is a third possibility that is a more empirical view—that is, if categories (or other models) help in describing and predicting experimental data, then they are useful regardless of whether they are real or otherwise [17]. As an example, the Mendelian notion of segregating units of inheritance was defined and useful many years before a complete mechanistic description of their cause was available. In this context, we note that there is no commonly used term to express this form of category; most commonly, “concept” is used.

For a field with a core activity of providing definitions, there is surprisingly little agreement on the meaning of the word “ontology”; as there have been many papers on the topic, we consider just a few that reflect the distinction between these approaches. Probably the most commonly cited definition [18] describes an ontology as “a specification of a conceptualization”. This definition emphasises the formality (i.e. logical and, therefore, computationally amenable) aspect to ontology development.

This is countered with a realist definition; while the requirements from Gruber’s definition—a formal specification—are necessary, realist ontologies add the requirement that “the nodes and edges correspond not to concepts but, rather, to entities in reality” [19].

What does “reality” in this context actually mean? Definitions such as “that which exists” are strangely circular, leaving the question of what “exists” means. Smith [12] adds the proviso that reality is “captured in scientific laws”. Being a scientific law is not strictly enough, as some are later shown to be wrong, but a scientific law is the current best attempt at reality; this possibility does not make an ontology non-realist. For a realist ontology, the nodes are “universals” (entities in reality) rather than concepts; at least one particular must exist for every universal.

This still leaves the difficulty of applying the realist definition in practice. Most scientists will happily accept, for example, that a cell is real as it is an entity that can be observed, interacted with and manipulated. However, concepts such as “function” [13] have raised more discussion [20]; is this “real” or just a word biologists use as a point of reference? While definitions involving “entities in reality” may be of philosophical interest, they are hard to turn into a specific assay: how do we test whether a particular concept is also a universal? Instead of a clear assay for existence, realism offers direction about what concepts are NOT reality, rather than those that are reality. For example, and perhaps ironically given the negative practical definition of reality, a statement such as:

  Dog is_a not Cat

is not held to be a statement about reality as it is a logically constructed example of subsumption (an is_a relationship); there is no real universal containing particular not Cats in existence. Likewise,

  Dog is_a (Dog or Cat)

is not held to be a statement about reality either, as the existence of particular Dogs and Cats does not mean that there are any particular Dog or Cats (examples modified from [12]).

This is not meant to provide a complete introduction to “realism”, but to provide a grounding for the discussion that follows; we will consider the issues raised by realism throughout the paper. A more philosophical treatment of realism is given by Merrill [16]. It is useful to note Gruber’s [18] statement: “And it [a computational ontology] is certainly a different sense of the word than its use in philosophy.” In this paper, we are concerned with ontologies as computational artefacts.

To summarise, a realist approach to ontology says that the categories or universals into which objects or particulars fall have an existence in their own right. It is these universals and only these universals that a realist approach says should be the nodes within an ontology. In this paper we examine whether this approach is an adequate means to provide an account for the data produced by biomedicine.

Models that represent reality

In this section, we suggest that many universals have a range of representations. In some cases, the choice of representation may be obvious, such as length which has a natural scientific representation in SI units. In many cases, however, there is no clear set of criteria for choosing between representations. We consider the way that one quality, colour, could be represented ontologically.

Colour is a complex phenomenon. The colour of an object or other phenomena arises, in part, from that object and, in part, from the eye that perceives it.

A representation of the physical reality would be an account of the reflection, transmission and perception of light by an organism. Such an account of the reality of light and its perception might cover the following facts: Chlorophyll is green in reflection and red in transmission; a flower petal appears white to a human, but has UV stripes to a bee; the plant leaf and the algae appear green to humans, but have different reflection spectra because their chlorophyll co-ordinate to their Mg2+ ion in different ways.

There have been a number of different attempts to represent the complexities of colour numerically, for a number of different purposes. These are models that allow us to describe colour, without having to deal with the underlying physics or reality of colour. Probably the best known of these are RGB (Red, Green, Blue) or HSV (Hue, Saturation, Value), both of which are additive colour models appropriate for describing colour on a display screen. CMYK (Cyan, Magenta, Yellow and Black) is a subtractive colour model and commonly used for printing.

Collectively these representation schemes are known as colour models. That none of these schemes has become predominant reflects both their different uses and the preferences of different user groups.

For the ontology builder, this leaves us with a difficult choice:

  1. We bless one of the colour models, substituting the model for the underlying physics and do not describe the others.

  2. We describe all of the colour models, but do not describe that they are part of a colour model.

  3. We explicitly describe the reality of the physics, biology and the relationship to the different colour models, reflecting the practice of describing colour in much of science.

Currently, considering the PATO ontology, which is documented as being built according to realist principles, the first approach has been taken, using the HSV scheme. So, PATO has a term Color Hue (PATO:15) that is defined as:

“A chromatic scalar-circular quality inhering in an object that manifests in an observer by virtue of the dominant wavelength of the visible light; may be subject to fiat divisions, typically into 7 or 8 spectra.”

Using this model, PATO describes red (PATO:322) as:

“A color hue with high wavelength of the long-wave end of the visible spectrum, evoked in the human observer by radiant energy with wavelengths of approximately 630 to 750 nanometers.”

This modelling approach has a number of limitations.

  • The decision to choose one colour model or the other is arbitrary. While there are reasonable justifications for the use of HSV as opposed to, for example, RGB, there is no a priori justification for use of an additive colour model as opposed to a subtractive model. Both are valid, for different usage; in general, reflective colour is more common in biology (e.g. pigmentation) than emitted colour (e.g. fluorescence) which would suggest that subtractive models are more generally applicable, but a full treatment requires both.

  • There are no terms which can be used to express data described according to other colour models, necessitating a transformation between the different models into the officially “blessed” version during application of the ontology. These transformations may be lossy and not fully reversible.
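The point about lossy transformations can be illustrated with a minimal sketch. It assumes the naive, device-independent textbook formulae for converting between RGB and CMYK (real printing workflows use colour profiles instead), and the function names `rgb_to_cmyk` and `cmyk_to_rgb` are ours, introduced only for illustration; `colorsys` is the Python standard library module.

```python
import colorsys


def rgb_to_cmyk(r, g, b):
    """Naive RGB -> CMYK conversion; all channels are floats in 0..1."""
    k = 1.0 - max(r, g, b)
    if k == 1.0:
        # Pure black: the other three channels are undetermined, use 0.
        return (0.0, 0.0, 0.0, 1.0)
    c = (1.0 - r - k) / (1.0 - k)
    m = (1.0 - g - k) / (1.0 - k)
    y = (1.0 - b - k) / (1.0 - k)
    return (c, m, y, k)


def cmyk_to_rgb(c, m, y, k):
    """Naive CMYK -> RGB conversion, the inverse of the formula above."""
    return ((1.0 - c) * (1.0 - k), (1.0 - m) * (1.0 - k), (1.0 - y) * (1.0 - k))


# The same red, described in three colour models.
red_rgb = (1.0, 0.0, 0.0)
red_hsv = colorsys.rgb_to_hsv(*red_rgb)  # (hue, saturation, value)
red_cmyk = rgb_to_cmyk(*red_rgb)

# Two different CMYK descriptions collapse onto one RGB value, so data
# recorded in CMYK cannot always be recovered after conversion.
plain_black = cmyk_to_rgb(0.0, 0.0, 0.0, 1.0)
rich_black = cmyk_to_rgb(1.0, 1.0, 1.0, 1.0)
round_trip = rgb_to_cmyk(*rich_black)  # no longer (1.0, 1.0, 1.0, 1.0)
```

Under these assumptions, a data set recorded in one model and forced into the “blessed” model loses the distinction between the two blacks, which is the kind of irreversibility described above.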

The second approach is also possible. This would allow expression of data in multiple colour models, however:

  • The ontology would tend to get rather confusing as more colour models are added; colour would have children “Hue”, “Red” and “Cyan” and seven other sibling terms.

  • It is not clear which terms comprise a colour model: do values for “Hue”, “Green” and “Magenta” specify a colour?

  • It is not clear whether terms that occur in the other contexts are equivalent. Is “Red as in RGB” the same as or different from Red (PATO:322)? Is “Hue as in HSV” the same as or different from “Hue as in HSL” (HSL is another additive colour model)?

The third approach does not suffer from the limitations described. We suggest from this analysis that it is necessary, if unfortunate, for some qualities to be explicitly described with multiple representations. To avoid confusion, the universal quality, colour, would need to be explicitly described as having multiple valid models. Yet, realism argues that we should not do this, as colour is real and not a model; moreover, the focus on realism means that the documentation does not describe the choices that have been made, nor refer to the relationship between Color Hue (PATO:15) and “Hue as in HSV”. In short, realism has limited our ability to represent colour.

Related Examples

There are many different examples of this issue; having two or more models to describe the same part of reality is common. The distance between two markers on a chromosome can be measured using (one of a number of) genetic techniques. Some qualities have a bewildering array of different measurements associated with them; Wikipedia, for example, lists 13 different measurements of concentration such as molarity or \(\mathrm{g\,m^{-3}}\).

This issue has been previously recognised. In computing science, explicitly modelling one model in another is a form of metamodelling. Other, non-realist, upper-ontologies such as DOLCE use the concept of Quale to describe a cognitive abstraction (such as Colour), including those over a physical quality (such as the spectral properties of reflected light) [21].

Sequences and the Central Dogma

The central dogma of molecular biology suggests that all genetic information is encoded in the DNA of a cell, as the ordered nucleotides that comprise the DNA. RNA is transcribed from this DNA. The RNA molecule also has a defined order of nucleotides related to the DNA. Finally the RNA is translated into protein.

Consider an ontology describing these entities. First, the DNA molecule has a number of properties; as well as physical dimensions (discussed further in “sec:limits-consistency”), including a length expressed in metres, it consists of a number of monomeric units. So, for example, we might say a DNA molecule with a series of nucleotide residues represented as ‘GATC’ hasMonomericPart 4.

This causes a slight worry from a realist perspective; the number 4 may not be a realist universal. There are no instances of 4. In this case, the number 4 is being used to describe a part of reality, so this is allowable in a realist ontology. Alternatively, we could describe the same reality using units (traditionally base-pairs or bp). Therefore, the DNA molecule hasPolymerLength 4bp.

Accepting the use of natural numbers in this way, also means that we accept the use of sets and sequences to describe reality. One definition of 4 is a sequence. Stating that the DNA molecule represented with the sequence ‘GATC’ hasPolymerLength 4bp is equivalent, therefore, to stating that it hasSequence ‘NNNN’ where ‘N’ is any nucleotide residue.

It should be noted, however, that the usefulness of these statements stems from our implicit knowledge. The number 4 is a natural number, so hasMonomericPart 4.2 is not possible. If a new monomer is attached to our DNA molecule, it will now hasMonomericPart 5, because the natural numbers are additive. We understand the operation of natural numbers as part of our shared, background knowledge, and we can apply this knowledge here.

Having described that the DNA molecule represented as ‘GATC’ hasPolymerLength 4 (or hasSequence ‘NNNN’) we might wish to be more specific about the order of nucleotide residues and state hasSequence ‘GATC’. The implicit background knowledge we used previously about the natural numbers still applies here.

Next consider the process of transcription. The previous discussion about DNA likewise applies to RNA. The RNA molecule will, however, hasSequence ‘GAUC’, as RNA uses a different set of bases to DNA. Mathematically, one sequence can be determined from the other by applying a mapping; though the mapping is a human activity, not a representation of biochemical reality. To describe this, we have two options:

  • Taking the realist approach, we can continue to rely on the implicit knowledge of the biologist, as we have previously relied on an implicit understanding of the natural numbers.

  • We can be explicit about the properties of these sequences (additional to those properties shared with the naturals). We can talk about non-real world concepts such as alphabets, transformations and how these map to the real entities involved.

It should be noted that the former severely limits the ability to describe the central dogma. The transformation of DNA to RNA sequence is simple, but the transformation of RNA to protein is more complex. Again, the choice is between representing reality or representing how we practise science.
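The second option, being explicit about alphabets and transformations, can be sketched in a few lines. This is an illustration only: the names `DNA_ALPHABET`, `RNA_ALPHABET`, `check_alphabet`, `polymer_length` and `transcribe` are hypothetical, and the mapping follows the simple sequence-level convention used in the text (each T becomes U), not a model of the biochemistry.

```python
DNA_ALPHABET = set("GATC")
RNA_ALPHABET = set("GAUC")

# The transformation between the two alphabets: only T changes.
DNA_TO_RNA = str.maketrans({"T": "U"})


def check_alphabet(sequence, alphabet):
    """Raise ValueError if the sequence uses symbols outside the alphabet."""
    unknown = set(sequence) - alphabet
    if unknown:
        raise ValueError(f"symbols outside the alphabet: {sorted(unknown)}")
    return sequence


def polymer_length(sequence):
    """Number of monomers, a natural number by construction."""
    return len(sequence)


def transcribe(dna):
    """Map a DNA sequence onto the corresponding RNA sequence."""
    check_alphabet(dna, DNA_ALPHABET)
    return dna.translate(DNA_TO_RNA)


dna = "GATC"
rna = transcribe(dna)          # the RNA counterpart of 'GATC'
length = polymer_length(dna)   # the value used with hasPolymerLength
```

Everything the biologist knows implicitly (the permitted symbols, the mapping, the fact that length is a natural number) has to be written down explicitly here, and none of it is a universal in the realist sense.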

Related examples

The issues relating to sequences are fairly general. In computer science terms, these are abstract data types. The DNA sequence is a kind of sequence with special properties (a limited alphabet). Many of the physical quantities in science have special properties in this way. Consider:

Temperature:

While these look like positive real numbers, temperatures can only meaningfully be subtracted from each other, which gives information about heat-flow between two bodies. Other operations (addition, multiplication) which are useful for real numbers have little meaning for temperature.

Recombination Distance:

These look like probabilities but are not, requiring a transformation to add.
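Both cases can be sketched as abstract data types. The `Temperature` class below is a hypothetical illustration that defines subtraction and nothing else. For recombination, the text does not say which transformation is meant; we assume Haldane's mapping function (which itself assumes no crossover interference) as one common choice, and the function names are ours.

```python
import math


class Temperature:
    """A temperature in kelvin; only subtraction between two is defined."""

    def __init__(self, kelvin):
        if kelvin < 0:
            raise ValueError("temperature below absolute zero")
        self.kelvin = kelvin

    def __sub__(self, other):
        if not isinstance(other, Temperature):
            return NotImplemented
        # The result is a plain difference in kelvin, not a Temperature.
        return self.kelvin - other.kelvin


def haldane_distance(r):
    """Recombination fraction (0 <= r < 0.5) -> map distance in Morgans."""
    return -0.5 * math.log(1.0 - 2.0 * r)


def haldane_fraction(d):
    """Map distance in Morgans -> recombination fraction."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def combine_fractions(r_ab, r_bc):
    """Recombination fraction between A and C for marker order A-B-C."""
    # Fractions are transformed to distances, which do add, and back.
    return haldane_fraction(haldane_distance(r_ab) + haldane_distance(r_bc))


difference = Temperature(310.0) - Temperature(293.0)
r_ac = combine_fractions(0.1, 0.2)  # smaller than the naive sum 0.3
```

Adding two `Temperature` objects raises a `TypeError`, because no addition is defined; the recombination fractions combine to less than their arithmetic sum, which is why they cannot be treated as ordinary probabilities or lengths.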

There is a limitation on the ability to use abstract data types within a given ontology language; in most cases, the expressivity of the language will not allow arbitrary mathematical relations. Some languages, such as OWL, for example, provide “concrete domains”; these provide extension points within the ontology language where, for example, the special properties of temperature could be represented; other languages do not. In either case, there are limitations to these capabilities; for example, the constraints and behaviour of a concrete domain need to be interpreted with their own semantics within a reasoner, rather than expressed explicitly within the ontology. It may make more sense in many circumstances to describe the existence of a mathematical model as discussed in “To go where science has gone before”.

The limitations of computers

Modelling continuous properties is a common problem in ontological engineering. For example, according to statistics the western world is now facing an obesity epidemic; in short, many or most of us weigh too much. Understanding, however, exactly what “too much” means is not necessarily simple; a common technique is to use body mass index (BMI): body weight divided by the square of the height, which is a continuous value. The BMI range is split into 4 categories: Obese (>30), Overweight (>25), Normal (>18.5) and Underweight (<18.5). These categories represent ranges of the value of BMI.

This data simplification has many justifications. On an individual basis, the BMI is not a particularly accurate measure, so the simplification does not lose much accuracy. It is also easier to describe to patients, for whom a “BMI of 25” will be less comprehensible than being “overweight”.

Modelling some of this is straight-forward. Height and weight are modelled as properties of the individual. The BMI would therefore appear to be a property of the individual as it is a restatement of two existing properties. It would appear, therefore, that the category into which an individual falls should also be a property of the individual.

Consider the values of the property next. These categories are an abstraction over the real-world properties. Although height as an integer value is expressed using a non-real-world entity, it is a description of a part of reality. A range, however, in the BMI does not describe part of reality in the same sense. There are no instances of BMI “Obese”. In a realist ontology, therefore, it is unclear what the relationship is between BMI Obese and the individual person.

For the statistician or computer scientist, there is an additional advantage to the simplification; four discrete groups have better computational properties than a continuous measure. Database queries become easier to write, and quicker to run. This is also true for the ontology builder; simplifying the real-world may fulfil the needs of an application for which the ontology is built, while avoiding unnecessary complexity. This is a widely used method for representing partitions of continuous values, the appropriately named value partition [22].
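As a minimal sketch of such a value partition, the continuous BMI can be mapped onto the four categories given above. The function names are hypothetical, weight in kilograms and height in metres are assumed, and the treatment of values falling exactly on a boundary is our assumption, as the ranges quoted do not say.

```python
def body_mass_index(weight_kg, height_m):
    """BMI: body weight divided by the square of the height."""
    return weight_kg / height_m ** 2


def bmi_category(bmi):
    """Map a continuous BMI value onto one of four discrete categories."""
    if bmi > 30:
        return "Obese"
    if bmi > 25:
        return "Overweight"
    if bmi >= 18.5:  # assumption: exactly 18.5 counts as Normal
        return "Normal"
    return "Underweight"


# A query over the partition is a simple equality test, with no
# range arithmetic needed at query time.
patients = {"a": (70.0, 1.75), "b": (95.0, 1.75), "c": (50.0, 1.75)}
categories = {
    name: bmi_category(body_mass_index(weight, height))
    for name, (weight, height) in patients.items()
}
overweight_or_obese = [
    name for name, category in categories.items()
    if category in ("Overweight", "Obese")
]
```

The discrete labels are easier to query and to explain than the underlying number, but "Obese" here is a range over a derived value and has no instances of its own.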

In the case of BMI there is a pre-existing social agreement toward a set of categories; however, even in the absence of such an agreement, the ontology builder might wish to represent a continuous range as a value partition to decrease the complexity of their ontology. The value partition is useful, but many of the concepts involved are not realist universals. The choice, then, is between modelling “reality” and modelling a simplification that is easier to use and has better computational properties.

Related Examples

Splitting the two cases, there are many examples of pre-existing simplifications. From medicine, there are so many that it seems to be the norm rather than the exception: hypo- vs hyperthermic; hypo- vs hypertensive; hypo- vs hyperglycemic. In many cases, these ranges have standard interpretations akin to the BMI.

There are likewise a number of constructions or design patterns that reduce complexity, extend the effective capabilities of the language or simply provide standard solutions to common problems [23].

To go where science has gone before

Many experiments in biomedicine require the measurement of some physical property of a biological system. Take, for example, the measurement of heart rate; in standard practice, this is measured in beats per minute, and is calculated simply by counting beats (\(b\)) over a time period (\(t\)) and dividing one by the other (\(b/t\)). However, what time period is appropriate? We might choose 60s, but this raises the question, what is the meaning of heart rate over shorter periods?

Fortunately, there is a standard solution to this problem, which is to define heart rate using differential calculus; so heart rate becomes \(db/dt\).

The derivative, \(db/dt\), presents some problems from a realist perspective. As noted previously (see “sec:sequ-centr-dogma”), it is possible to associate real numbers with entities; however, \(db/dt\) is \(0/0\). It is not clear whether this quantity is a universal; it is certainly the case that the expression \(db/dt\) is not a universal. Yet such values, and calculus itself, are powerful tools within science, and not using them within ontological models is a severe restriction.

We can describe this ontologically in three ways:

  • We can model the real world entities involved (beats and time) and describe nothing else.

  • We can describe rate in mathematical terms. In this case, we are defining the heart rate as a mathematical abstraction.

  • We can model the heart rate as a real world entity, \(db/dt\) as a mathematical entity and explicitly state that \(db/dt\) is a model of heart rate.

These different solutions present different advantages. The first is consistent with realism. The second is consistent with the most common definition used within science. The third is consistent with both but it is unclear when to use which term (for example, is \(\Delta {}b/\Delta{} t\) an approximation of \(db/dt\), a quantification of the real world quality or both)?

In most cases for the description of science, the second option makes most sense; conflating the mathematical model with the real entity enables us to use the advantages of two different modelling techniques without introducing the confusion of the third option.
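The difference between counting beats over a window (\(b/t\)) and the derivative \(db/dt\) can be made concrete with a small sketch. The beat timestamps below are invented for illustration, and the function names are hypothetical; the second function is only a finite-difference estimate of \(db/dt\) from inter-beat intervals, not the derivative itself.

```python
def mean_rate_bpm(beat_times_s, window_s):
    """b/t: count beats in [0, window_s) and scale to beats per minute."""
    beats = sum(1 for t in beat_times_s if 0 <= t < window_s)
    return 60.0 * beats / window_s


def instantaneous_rates_bpm(beat_times_s):
    """Finite-difference estimate of db/dt: one beat per inter-beat interval."""
    return [60.0 / (later - earlier)
            for earlier, later in zip(beat_times_s, beat_times_s[1:])]


# Hypothetical beat timestamps in seconds: the heart speeds up over time.
beats = [0.0, 1.0, 2.0, 2.8, 3.5, 4.1, 4.6, 5.0]

print(mean_rate_bpm(beats, window_s=6.0))   # a single figure for the window
print(instantaneous_rates_bpm(beats))       # one estimate per interval
```

The windowed count gives one number that hides the acceleration, while the interval-based estimates change from beat to beat; the choice of window is exactly the question that the differential definition side-steps.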

Related Examples

There are many related examples from mechanics, electromagnetics or chemistry; as with value partitions in medicine, so many that they appear to be the norm. All of these subject areas have direct relevance to biology and, perhaps even more so, to the equipment used in the practice of biology.

Mechanical examples would include velocity (\(dr/dt\)) and acceleration (\(d^2r/dt^2\)). Electromagnetics would include current (\(dQ/dt\)) and capacitance (\(dQ/dV\)). Chemistry examples would include rate constants and pH. In biology, population biology, systems biology and neurosciences make wide use of mathematical models. The lack of a link in realist ontologies to these mathematical models is not free from consequences (described further in “sec:discussion”).

The more general issue comes not from relating to differential calculus, but from relating to pre-existing non-ontological techniques; for example, taxonomy in the Linnaean sense. There have been many discussions about whether species and higher taxa are reflective of reality; it is certainly the case that a number of higher taxa do not reflect phylogeny [24]. Given that it is of uncertain status, should we represent taxonomy as a quality of an organism, an independent conceptualisation of the biologists or both?

The limits of consistency

Physical biological entities such as cells and organisms have an extent in the real world. This paper’s first author, for example, has a height of around 1.8m; a similar value cannot be applied meaningfully to the electronic version of this document, although it may apply to the paper that it may be printed on.

There are a number of different, well-understood mechanisms for representing physical space. We can use a dimensional or cartesian model, with three perpendicular lines with a linear scale. We can use a polar model, expressing extent using angles and a single distance. Modern physics has told us, however, that all of these are limited models of reality; physics generally uses a four dimensional Minkowskian spacetime model; here the axes are not linear; motion of the observer down one will change values down the others. Alternatively, at a quantum level, length is a probability distribution.

For the ontology builder, this leaves a difficult choice and the same choice discussed previously in “sec:colo-colo-models”: Represent the reality physicists relate; bless one, ignore the rest; describe their components but not their models; explicitly describe them.

If the ontology builder is to be consistent, then, they should make the same choice in both cases; if we describe colour models, we should explicitly describe Minkowskian spacetime, quantum probability distributions, cartesian and polar systems.

There are, however, two important differences to colour models. First, there is a strong social bias toward cartesian systems. Secondly, within the scope of biology and the life sciences, four dimensional spacetime or quantum models confuse rather than simplify; the relativistic corrections produce such small differences that they are statistically meaningless; similarly, describing a leg as a probability distribution adds little other than complexity.

This leaves the ontology builder with two options:

  1. We can build an ontology with a consistent relationship to reality. So, having decided to explicitly represent colour models, this suggests that we should also explicitly model 3D space, 4D spacetime and the various co-ordinate systems that are used to describe these.

  2. We build an ontology with an inconsistent relationship to reality. So, we might be explicit about colour models, but arbitrarily bless 3 dimensional space, using cartesian co-ordinates.

The compromise here is very straight-forward. The first solution retains its consistency to reality, the second is consistent with usability and usage; for biomedicine, a 3D cartesian co-ordinate system plus time is likely to be enough for the foreseeable future and makes life easier in the meantime.

The Newtonian view of the world is the best model in this case: it is good enough. When building an ontology for biomedicine, it makes most sense to use this view as it will produce the results required. If, in the future, biomedicine advances so that relativistic or quantum representations are necessary, then current ontologies will need refactoring; even then, this future cost is likely to be offset by gains in the present.

Related examples

In the choice of units for measurement for scientific purposes, SI units are to be preferred. It should be noted, here, that there is a domain dependency; for an engineering ontology, the use of American imperial units would be inevitable.

For most of biology it is unnecessary to distinguish between the length of the calendar year and the astronomical year—the latter changing with respect to variability in the motion of the earth. There are occasions when this distinction may be important for data integration in bioinformatics as leap years and leap seconds show.

An ecologist counting the number of trees in a sampling square 100m by 100m will take the area as 10,000m²; the surface is, however, neither smooth nor a Euclidean plane, so this area is wrong in reality. For much of ecology, this distinction will not matter. Again, there is a domain dependency here; whale or bird biologists interested in migration patterns may well care about the curvature of the earth.
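To give a feel for the size of the error in the sampling-square example, here is a minimal sketch under a deliberately crude assumption: the plot is square on the map and the ground is a flat plane tilted at a uniform slope. Real terrain is rougher than this, and the function name is hypothetical.

```python
import math


def surface_area_m2(side_m: float, slope_deg: float) -> float:
    """Area of ground under a square map plot on a uniform slope.

    Assumes the plot is square in map projection and the ground is a
    flat plane tilted by slope_deg, a simplification chosen for this sketch.
    """
    planar = side_m * side_m
    return planar / math.cos(math.radians(slope_deg))


for slope in (0.0, 10.0, 30.0):
    print(slope, round(surface_area_m2(100.0, slope), 1))
```

On gentle slopes the ground area stays close to the planar figure, which is why the Euclidean approximation is usually good enough; whether it remains so is a question for the domain, not for the model.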

Discussion

Realism has been held up as a methodology for “good” ontological modelling, and the production of more tightly defined and consistent ontologies. In this paper, we have discussed five different cases, with biological examples, that we might wish to model ontologically; for each, we have presented different models, describing the same underlying science. In each case, a realist solution is possible, but places either limitations or awkwardness on the models produced.

Building an ontology with a consistent relationship to reality may help to enable interoperability [7] under some circumstances. If, however, it disallows modifications for computability (see “sec:work-around-comp”), or requires arbitrary blessing for one form of specification over another (see “sec:colo-colo-models”) it may have the opposite effect.

Nor are the issues discussed in this paper free from consequences. In “sec:go-where-science”, we discussed interoperability with existing scientific models. Mathematics and physics have produced complex, refined and expressive notation systems, representing a deep understanding of how numbers and the physical world work. These are, however, not being used in current ontologies and this results in a lack of precision, errors and omissions:

Lack of Precision:

The PATO term speed (PATO:8) is defined as:

“A physical quality inhering in a bearer by virtue of the bearer’s rate of change of position”

with a synonym of velocity; from this definition, we cannot distinguish the vector and scalar quantities of velocity and speed; indeed, it is not clear which of these two speed (PATO:8) is. Meanwhile acceleration (PATO:1028) is defined as:

“… the rate of change of the bearer’s velocity in either speed or direction”

which is implicitly a vector quantity, and contradicts the statement that speed and velocity are synonyms. The mathematical definitions (velocity as \(dr/dt\), speed \(\left|{dr/dt}\right|\), acceleration \(d^2r/dt^2\)) are precise, concise and accurate.

Errors:

Similarly, length (PATO:122) is defined as a quality; qualities have to inhere in Independent Continuants; as a Spatial Region is a child of Continuant this means that Spatial Regions cannot bear lengths. In short, in current versions of BFO, there is no intuitive way of modelling the length of a region in space.

Omissions:

BFO is mass-centric; it is currently unclear where many physical entities exist, examples including energy, waves (through a medium) or EM radiation. Likewise, it lacks a natural position for numbers (that have no particulars), patterns and distributions. Yet, these entities are key to a physical description of the world.

To our mind, these are indicative of some of the most serious flaws of realism-based ontology building. It makes little sense to replicate the models of physics using English instead of a more precise mathematical notation. If BFO had been built using direct links to a grounded physical model of the world, it seems likely that these problems would not have arisen.
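The precision of the mathematical definitions of velocity and speed discussed above can be shown in a few lines. This is a sketch using finite differences in place of the derivative; the positions and function names are invented for illustration.

```python
import math


def velocity(r0, r1, dt):
    """Finite-difference approximation of dr/dt: a vector."""
    return tuple((b - a) / dt for a, b in zip(r0, r1))


def speed(v):
    """|dr/dt|: the magnitude of the velocity, a scalar."""
    return math.sqrt(sum(c * c for c in v))


# Two hypothetical bearers moving in opposite directions.
v_one = velocity((0.0, 0.0), (3.0, 4.0), dt=1.0)
v_two = velocity((0.0, 0.0), (-3.0, -4.0), dt=1.0)

print(v_one, speed(v_one))
print(v_two, speed(v_two))
```

The two bearers have different velocities but the same speed, so the two terms cannot be synonyms; the distinction falls out of the definitions without any further prose.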

We have discussed a number of concrete examples where building an ontology by considering realist concerns has detrimental consequences for the model. We believe that the real world entities and the relationships between them are only one consideration among many: simplicity, usability, fitness for purpose are equally important.

Taken to its most extreme form, realism, it seems to these authors, would produce models unsuitable for use within science. There is a choice between a correct account of reality that does not allow the data of science to be adequately described and a description of reality that takes into account how science is performed. Fortunately, most “realist” ontologies are not really so: PATO’s representation of HSV for modelling colour is not a bad decision; it represents a straight-forward, pragmatic approach to ontology building, where the representation has been chosen on the basis of a use case, not the entities as they exist in reality. Similarly, BFO uses a 3D plus time model of reality; it suggests that lengths are properties of the entity alone, without reference to the observer. This is not a true reflection of reality, but one which is a good enough approximation for use within the biomedical sciences; in short, usability and simplicity have been considered to be more important in the modelling process than the relationship of the model to reality. In accepting these compromises, BFO has placed itself squarely as a computational rather than philosophical ontology.

Despite these concerns, realism has made a contribution to the field of biomedical ontology engineering. By emphasising the importance of real-world entities and by encouraging a more specific interpretation than the generalisation of a “conceptualisation”, realism helps to avoid the introduction of unnecessary layers of abstraction. A consideration of the entities in reality may be a part of an ontology engineering process; ontology builders should have careful and considered reasons for diverting from modelling in this way, and ontologies should explicitly describe, through annotations, the terms that do or may divert from this view. Ontology builders should, however, be free to make this decision; the acceptance of compromise with respect to reality will result in simpler and more effective knowledge artefacts.

Johansson [10], when discussing realism, asks the rhetorical question: “would you like to be treated for a physiological illness by a (non-realist) physician who is not sure that there are human bodies?” – (our emphasis). As scientists, our reply would be that, if their survival and success statistics were the best, we would not care whether they were a realist, a non-realist or a robot which admitted of no philosophical position at all; also, using a doctor who was strictly realist and thus cut off from much of the practice of science (such as determining heart rate) would disturb many patients. As bioinformaticians, we build ontologies to provide a descriptive and predictive model of the wealth of experimental data that is now available. In biology, the job of an ontologist is to describe data such that it can be analysed. Naturally this entails a description of entities in reality; it also, however, entails a description of science, and it entails compromise; we overlook this at our peril. The last 200 years of science shows the success and strength of this position; it is on this groundwork that we should build for the future.

Bibliography

[1]

Ashburner M, Ball C, Blake J, Botstein D, Butler H, et al. (2000) Gene Ontology: a tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–9.

[2]

Stevens R, Lord P (2008) Application of ontologies in bioinformatics. In: Staab S, Studer R, editors, Handbook on Ontologies in Information Systems, Springer. Second edition. URL http://www.cs.man.ac.uk/~stevensr/papers/handbook2.pdf.

[3]

Zeeberg B, Feng W, Wang G, Wang M, Fojo A, et al. (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 4: R28.

[4]

Wolstencroft K, Lord P, Tabernero L, Brass A, Stevens R (2006) Protein classification using ontology classification. Bioinformatics 22: e530-538.

[5]

Lord PW, Stevens RD, Brass A, Goble CA (2003) Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics 19: 1275–1283.

[6]

Whetzel PL, Parkinson H, Causton HC, Fan L, Fostel J, et al. (2006) The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics 22: 866–873.

[7]

Smith B, Ashburner M, Rosse C, Bard J, Bug W, et al. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25: 1251–1255.

[8]

OBO Foundry Consortium (2006). OBO Foundry Principles. http://obofoundry.org/wiki/index.php/OBO_Foundry_Principles.

[9]

OBO Foundry Consortium (2008). OBO Foundry Principles. http://obofoundry.org/wiki/index.php/OBO_Foundry_Principles.

[10]

Johansson I (2006) Bioinformatics and biological reality. J Biomed Inform 39: 274–287.

[11]

Grenon P, Smith B, Goldberg L (2004) Biodynamic ontology: applying BFO in the biomedical domain. Stud Health Technol Inform 102: 20–38.

[12]

Smith B (2004) Beyond concepts: ontology as reality representation. In: Formal ontology in information systems: proceedings of the third conference (FOIS-2004). Ios Pr Inc, p. 73.

[13]

Lord P (2009) An Evolutionary Approach to Function. In: Bio-Ontologies 2009: Knowledge in Biology. URL http://hdl.handle.net/10101/npre.2009.3228.1.

[14]

Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, et al. (2005) Relations in biomedical ontologies. Genome Biol 6: R46.

[15]

Russell B (1946) A History of Western Philosophy. Routledge.

[16]

Merrill G (2010) Ontological realism: methodology or misdirection. Applied Ontology 5: 79-108.

[17]

Dumontier M, Hoehndorf R (2010) Realism for scientific ontologies. In: 6th International Conference on Formal Ontology in Information Systems.

[18]

Gruber T (1992). What is an ontology? URL http://www-ksl.stanford.edu/kst/what-is-an-ontology.html.

[19]

Ceusters W, Smith B (2006) A realism-based approach to the evolution of biomedical ontologies. AMIA Annu Symp Proc : 121–125.

[20]

Shrager J (2003) The fiction of function. Bioinformatics 19: 1934-1936.

[21]

Seyed AP (2009) BFO/DOLCE Primitive Relation Comparison. In: BioOntologies 2009: Knowledge in Biology.

[22]

Rector A (2005). Representing specified values in owl: “value partitions” and “value sets”. W3C Working Group Note. URL http://www.w3.org/TR/swbp-specified-values/.

[23]

Egana M, Rector A, Stevens R, Antezana E (2008) Applying Ontology Design Patterns in Bio-ontologies, Springer Berlin/Heidelberg. pp. 7-16.

[24]

Schulz S, Stenzhorn H, Boeker M (2008) The ontology of biological taxa. Bioinformatics 24: i313–i321.

Elba was a lot of fun; it’s very biased toward beaches, but there are plenty of these, they are easy to get to and, generally, free. For my money, the best of the ones that we went to were Aquavivata (or something like that) and Sansone (next to each other; I swam to the latter) and Capo Bianco. All of these are within spitting distance of Portoferraio, which is the biggest town. It turns out that Capo Bianco is part of a marine reserve, with no fishing; this probably explains why the place was so rich with life that otherwise would have ended up on pasta. But, with a pebble beach, a slow sloping seabed still only 1 or 2m in depth some 50m from shore, many rocks and a headland, it’s ideal for swimming and snorkelling.

As well as Elba, I got to Pianosa. This is an ex-penal colony, with no permanent residents. It’s a strange place, full of mystery and excellent snorkelling. It’s also full of history, occurring in two of my favorite books; first, Agrippa Postumus was exiled and later killed here, as is told in I, Claudius. The exact site isn’t known, but they seem to have found his swimming pool. And, secondly, Pianosa is the setting for Catch-22, although, in reality, it’s too small to have contained the events; I didn’t manage to find out whether it was occupied during the war, but it didn’t have an airbase. The whole place is a marine reserve, and the snorkelling was the only place which beat Capo Bianco. Beautiful though Pianosa is, there is a fly in the ointment, which is the Zecce on the island; the place is infested with ticks, which means that you have a reasonable chance of coming home with a blood-sucking monstrosity attached to any accessible capillary.

After Elba, I’ve come to Lake Garda. All the Italians are complaining that it’s caldissimo; of course, back in Newcastle, they all complain it’s not hot enough. Never satisfied with the weather; just like the British.

While travelling on Elba, I suffered the misfortune of a virus attack; I don’t use AV software these days, since it tends to break other things which take a long time to fix, and it’s been many years since I’ve lost a machine to malicious software.

The process, though, was quite entertaining. First, I started getting an error stating that system.exe needed .net to run properly. After a while, a Windows update happened, along with the normal malicious software removal update. This found the virus, probably killed it, then stuck up a dialog saying “Some of your files were nasty, so they need to be restored, please insert your Windows SP3 disk”. Clicking “ok” said “I can’t find the disk, perhaps a) you put the wrong disk in or b) your drive isn’t working”. Or c) you are on holiday, and your disk is 1000 miles away, and anyway, the machine is old enough to have come with SP2. All of which sort of raises the question of why the software that I’d just downloaded from Microsoft can’t also download from Microsoft the system components to replace the ones that it’s deleted.

After the reboot, all trace of networking software had been blitzed from the machine; I couldn’t even use loopback addresses. In the end, I’ve done a complete factory reset from the recovery partition which I thought I had deleted years ago. The process took about 15 minutes to recover Windows, 1 hour to recover the Sony application layer, 2 hours to remove all the Sony application layer (one application at a time, including the 10 different wallpaper packages, because Add/Remove Programs doesn’t allow multiple select), except for the power management tweaks and drivers, then another hour trying to figure out NTFS file permissions so that I could read my files.

Actually, the process hasn’t been a complete loss; I was thinking of re-installing the OS anyway. The boot had got to around 3 to 4 minutes which was getting daft. Now, with a clean OS, a complete reboot takes under a minute. It’s also been a bit of a walk down memory lane; currently, I have no internet, so my computer is in 2005 state; there is Office 2003 trial, a rubbishy media centre thing Sony probably wrote as an answer to iTunes, and macromedia flash. I had an emacs install exe in my recycle which I managed to recover before the reset so, ironically, Emacs is the newest piece of software I have on here.

Still, this administrative nightmare makes me wonder what to do next; XP is not too long for this world, Vista is a poll tax on wheels, and I am just not sure I can be bothered with learning 7. I’ve used Windows on the road for a long time, but I think I may go for a small, light netbook running Linux. There have been a couple of times when I have needed MS Office, but it’s not that common, and there is always a work-around.

The last holiday that I went on produced a long stream of blog posts; this one, I suspect, will result in only one or two, which reflects the different character of the places. India is a place of conflicts, confusion and excitement; Elba, on the other hand, is a holiday resort, universally beautiful, relaxed; in short, wonderful for swimming, sitting on the beach and general relaxation, but not so wonderful for writing about on a blog.

I took the train from Rome to Piombino Marittima; as with other times, the Italian trains beat the British equivalent easily. While, in some ways, they are not quite as nice inside, they are plentiful, on time and cheap; the 15 Euro I paid for a three hour journey would hardly get me past the platform in Britain. Piombino itself appears to be a scenic chemical factory, while Piombino Marittima is a working ferry terminus, which says it all.

Elba itself is much, much prettier; a small island, with a large mountain range in the middle. A lesser nation would have built towns around the edge, but, as this is Italy, there are also improbable towns cemented onto impossible slopes, with hair-pin roads snaking in between. At this time of the year, though, the focus is on the beaches; I’d love to attempt the 1000m walk to the highest peak, but in this climate, the water, sun-tan cream, and sun umbrella would just weigh me down too much. I think coming back in April for hills, plants and geology would be excellent, though.

Speaking of the beaches, well, there are many. Many of these are hopelessly over-crowded, but some are a little quieter, without motor boats. The swimming is, on the whole, excellent; I bought some flippers which I’m having great fun with; I can dive deeper and stay down far longer, whizzing along through the shoals of fish.

Marciana Marina, where we are staying, is lovely, with a long promenade, several sheltered harbour beaches, and a pebble beach at the end, open to the sea. There is a jazz festival on in the main square; I get the impression they have pretty regular events there, but we’ve lucked out here. The standard has been very high, covering big band, modern trios and a jazz harpist. I’ve enjoyed it all; the crooner with the big band sang standards with an Italian accent, which was strange, but good.

In the relaxation of a beach holiday, I’ve been thinking daft ideas, which I may write about later. One was language teaching related; it’s got a crazy acronym which is Progressive Inculcation of Language by Listening to Stories (PILLS). The second was a design for inflatable flippers, which would work in the water but would also be good for walking outside. And, finally, an idea for domesticated bats as a method for insect control.

Maybe, I’ll write about them. Or, maybe not.

Some advance on the knowledge blog front this week. Firstly, Simon Cockell and I spent a short while setting up a development and testing environment and wrote our first WordPress plugin, “Peaches”, based around the Hello Dolly plugin, but with the lyrics from the Stranglers song instead. We finished this yesterday just before Automattic released WordPress 3.0. Hopefully, it will be easy to upgrade. Rather more usefully, I got the very first version of a reference list plugin working. At the moment, it just transforms DOIs into hyperlinks.
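For the curious, the DOI-to-hyperlink transformation can be sketched in a few lines. This is not the plugin's code (which runs inside WordPress); it is a stand-alone Python illustration with a simplified DOI pattern, a hypothetical function name, and the assumption that links should point at the doi.org resolver. It would also re-link DOIs that already sit inside a URL, which a real implementation would need to guard against.

```python
import re

# Simplified DOI pattern: "10.", a registrant code, "/", then a suffix.
# Real DOIs can contain characters this sketch does not handle.
DOI_PATTERN = re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)')


def link_dois(text):
    """Replace each bare DOI in text with an HTML hyperlink to a resolver."""
    def to_link(match):
        doi = match.group(1)
        # Sentence punctuation right after a DOI is kept outside the link.
        trimmed = doi.rstrip(".,;)")
        tail = doi[len(trimmed):]
        return f'<a href="https://doi.org/{trimmed}">{trimmed}</a>{tail}'
    return DOI_PATTERN.sub(to_link, text)


# 10.1000/example.123 is a made-up DOI used only for this example.
print(link_dois("See 10.1000/example.123, which is a made-up DOI."))
```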

And, secondly, I got notification from the British Library that they will be archiving the website. Good news, although there are no archives available yet.

We move forward!

For the third year in a row, I managed to ride the Northern Rock Cyclone this weekend. It was a lovely occasion as before; the weather was cool in the morning with a brisk wind, but it warmed up a little and the wind dropped by the end of the day. The numbers have gone up slightly and it was good to see so many cyclists around.

Compared to last year’s ride I was way down. I just cleared 5:30 in the saddle, or 6 hours elapsed, which is about 1 hour slower (although the route is, apparently, 2 miles longer than last year). Not unexpected, given the absence of training; this is my longest ride of the year, 40 miles being the longest otherwise. Having moved house early in the year, and with an F1 in the works, I just haven’t found the time.

This really is a great event; at the moment, it’s big enough to be an event, but small enough to still feel personal. I hope that it will get bigger, because this sort of mass participation event will help to bring cycling up the agenda, although the roads mean that it could never rival the GNR. However, having done the GNR, I know that this would almost certainly lessen the pleasure of it.

It’s been relatively quiet from me for the last few weeks. One of the reasons for this is that I have been submitting a JISC bid. I’ve not submitted a JISC bid before, so it was quite a lot of work; it’s exactly the same as a research council proposal, except for all the bits that differ.

The bid, in this case, was for extensions to the Knowledgeblog environment; we want to make sure that it supports research better than at the current time. Our initial experiences were generally good, with a few naysayers. Additionally, we wanted much better linking to external forms of data; array express, Swissprot and the like. And, finally, we wanted to trial this out against a set of specific use cases. Critically, I also got tired of writing “knowledgeblog” the entire time, so they will now be “k-blogs”.

If it gets accepted, we are proposing to develop some additional functionality, often reusing existing software. We really are trying to avoid developing any software that we don’t have to. The plans include:

  1. A documented k-blog process, including information on who does what, and how to use various existing tools (word and latex in particular).
  2. Proper support for referencing — authors should be able to drop in a PMID, or DOI and get a reference list and in-text citation automatically.
  3. Various metadata support, so that the in-text citations have semantics from the reader’s side.
  4. Trackback proxying for those resources which don’t support trackbacks.
  5. Integration and additional tooling for adding references and cross-links.

I’m hoping that we get the money; if we do, the work will give us a platform on which to build a publishing environment, a place for an educational resource, and finally, an excellent extension point for playing with semantic forms of publishing. I am not sure what the odds are; I know quite a few other proposals are going in, and there’s a reasonable chance that George Osborne will cut the money back before it’s awarded. All I can do now is wait.

I’ll probably blog the whole proposal in a few days; this gives me a chance to try out the “blogging from Word” experience. How exciting.