
This article was jointly authored by Phillip Lord and Simon Cockell.

Rhodopsin is a protein found in the eye, which mediates low-light-level vision. It is one of the 7-transmembrane domain proteins and is found in many organisms, including humans.

Rhodopsin has a number of identifiers attached to it, which allow you to get additional data about the protein. For instance, the human version is identified by the string “OPSD_HUMAN” in UniProt. If you wish, you can go to http://www.uniprot.org/OPSD_HUMAN and find additional information. Actually, this URI redirects to http://www.uniprot.org/P08100.html. P08100 is an alternative (semantic-free) identifier for the same protein; P08100 is called the accession number and it is stable, as you can read in the user manual. If you don’t like the HTML presentation, you can always get the traditional structured text so beloved of bioinformatics; this is at http://www.uniprot.org/P08100.txt. Or the UniProt XML (that is at http://www.uniprot.org/P08100.xml). Or http://www.uniprot.org/P08100.rdf if you want RDF. If you just want the sequence, that is at http://www.uniprot.org/P08100.fasta, or http://www.uniprot.org/P08100.gff if you want the sequence features. You might be worried about changes over time, in which case you can see all versions at http://www.uniprot.org/uniprot/P08100?version=*. Or if you are worried about changes in the future, then http://www.uniprot.org/uniprot/P08100.rss?version=* is the place to be. Obviously, if you want to move outward from here to the DNA sequence, or a report about the protein family, or any of the domains, then all of that is linked from here. If you don’t want to code this for yourself, there are libraries in Perl, Python and Java which will handle these forms of data for you.
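To make the pattern concrete, here is a minimal Python sketch that builds the addresses listed above from an accession number. The function names are ours, invented for illustration, and the URL layout is simply copied from this article as written; it is an assumption that the live service still follows it.

```python
# Sketch only: the URL layout is copied from the article text and is an
# assumption about the service, not a statement of its current API.
UNIPROT_BASE = "http://www.uniprot.org"
FORMATS = ("html", "txt", "xml", "rdf", "fasta", "gff")


def uniprot_url(accession, fmt="html"):
    """Build the address of one representation of a UniProt entry."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt}")
    return f"{UNIPROT_BASE}/{accession}.{fmt}"


def all_representations(accession):
    """Map every format named in the article to its address."""
    return {fmt: uniprot_url(accession, fmt) for fmt in FORMATS}


if __name__ == "__main__":
    for fmt, url in all_representations("P08100").items():
        print(fmt, url)
```

The point of the sketch is how little there is to it: one stable identifier plus a file extension is enough to name every representation.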

So this might be overkill, but the point is surely clear enough. It’s very easy to get the data in a variety of formats, through stable identifiers. The history is clear, and the future as clear as it can be. The technology is simple and straightforward for both humans and computers to access. The world of the biologist is a good place to be.

What does this have to do with DOIs? Let’s consider a selection of publications from one of us. Of course, one of the nice things about DOIs is that you can convert them into URIs. But what do they point to? Well, a variety of different things. Maybe the full HTML article. Or, perhaps an HTML abstract and a picture of the front page. Or more links. Or, bizarrely, a list of the author biographies. Or just another image of a print out of the front page of an identified digital object.

These are a selection from our conference and journal publications. Obviously, this doesn’t cover many of our conference papers, as most don’t have DOIs unless they are published by a big publisher. Or our books. These are published by big publishers, but obviously they are books which is different. I’ve also organised or been on the PC for a number of workshops. They don’t have DOIs either. All of them do have URIs.

In no case can we guarantee that what we see today will be the same as what we get tomorrow, even though DOIs are supposedly persistent. The presentation of the HTML on those pages that display HTML is wildly different; in many cases, there is no standard metadata. Given the DOI, there doesn’t appear to be a standard way to get hold of the metadata. If you poke around really hard on the DOI website, you may get to http://www.doi.org/tools.html. At this point, you probably already know about http://dx.doi.org, which allows you to resolve a DOI through HTTP. The list of links doesn’t take that long to work through, so you might eventually get to http://www.crossref.org. From here, you can perform searches, including extracting metadata for articles; obviously, you need to register, and you need an API key for this. It doesn’t always work, so if that fails, you can try http://www.pubmed.org, which returns metadata for some DOIs that CrossRef doesn’t, but doesn’t hold a DOI for every publication it lists (even those that have them), so it also fails in unpredictable ways.
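The one thing that is straightforward is turning a DOI into a URI through the resolver mentioned above. A small Python sketch, assuming the http://dx.doi.org prefix from the text; the function name and the example DOI are hypothetical:

```python
from urllib.parse import quote

# Resolver prefix as given in the article.
DOI_RESOLVER = "http://dx.doi.org/"


def doi_to_uri(doi):
    """Turn a bare DOI (optionally prefixed with 'doi:') into a resolver URI."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    if not doi.startswith("10."):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Keep the slash between prefix and suffix; escape anything unusual.
    return DOI_RESOLVER + quote(doi, safe="/")


if __name__ == "__main__":
    # "10.1234/example" is a made-up DOI used purely for illustration.
    print(doi_to_uri("doi:10.1234/example"))
```

Note that this only gives you an address. What comes back when you follow it is exactly the lottery described above.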

The difference between the two situations couldn’t really be clearer. Within biology, we have an open, accessible and usable system. With DOIs, we don’t. The DOI handbook spends an awful lot of time describing the advantages of DOIs for publishers; very little is spent on the advantages for the people generating and accessing the content. It is totally unclear to us what use case DOIs are trying to address from our point of view; whatever it is, they certainly seem to fail in their purpose.

So, why do we care about this? Well, recently, we have been implementing DOIs for kblogs. Ontogenesis articles now all have DOIs. When we were originally thinking about kblogs, our investigations on how to mint new DOIs came to very little. If DOIs are hard to use, creating them is even worse; you need a Registration Authority, and setting this up within a university would be a nightmare. Compare this to the £9 credit card transaction required for a domain name (even this can be quite hard in a University setting!). In the end, we have managed to achieve this using DataCite. Ironically, they are misusing technology intended for articles to represent data; we are misusing DataCite to represent articles again. We also have to keep a hard record of our own of the DOIs we have minted, because, despite the fact all this information is stored in the DataCite database, there is no way of discovering if a DOI points at a given URL using the DataCite API, so we have no way of doing a reverse lookup from a blogpost to discover its DOI.
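The “hard record” need not be anything sophisticated. As an illustration only (this is not the code we actually run, and the class and method names are invented), a local registry that supports the reverse lookup could look like this in Python:

```python
import json


class DoiRegistry:
    """Local record of minted DOIs, indexed in both directions."""

    def __init__(self):
        self._doi_to_url = {}
        self._url_to_doi = {}

    def record(self, doi, url):
        """Remember that `doi` was minted for the post at `url`."""
        self._doi_to_url[doi] = url
        self._url_to_doi[url] = doi

    def doi_for(self, url):
        """Reverse lookup: which DOI, if any, points at this post?"""
        return self._url_to_doi.get(url)

    def dumps(self):
        """Serialise the record so it can be kept on disk."""
        return json.dumps(self._doi_to_url, indent=2, sort_keys=True)

    @classmethod
    def loads(cls, text):
        """Rebuild a registry from text produced by `dumps`."""
        registry = cls()
        for doi, url in json.loads(text).items():
            registry.record(doi, url)
        return registry


if __name__ == "__main__":
    registry = DoiRegistry()
    # Both values below are placeholders, not real identifiers.
    registry.record("10.1234/kblog.1", "http://example.org/2010/01/a-post")
    print(registry.doi_for("http://example.org/2010/01/a-post"))
```

That a few lines of bookkeeping on our side are needed at all is, of course, the complaint.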

We’ve also created a referencing system for WordPress. This does DOI lookups for the user, currently using CrossRef, or PubMed. We are not sure yet whether we can retrieve DataCite metadata in this way also.

The irony of this is that it is all totally pointless. WordPress already creates permalinks, based on a URI. These URIs are trackback/pingback capable so can be used bi-directionally. We have added support so that URIs maintain their own version history, so that you can see all previous versions. If you do not trust us, or if we go away, then URIs are archived and versioned by the UK Web archive. Currently, we are adding features for better metadata support, which will use a simple REST style API like Uniprot. Hopefully, multiple format and subsection access will follow also.

So, why are we using DOIs at all? For the same reason as DataCite, which has as one of its aims “to increase acceptance of research data as legitimate, citable contributions to the scientific record”. We need DOIs for kblog because, although DOIs are pointless, they have become established, they are used for assigning credit, and they are used as a badge of worth. We find it unfortunate that, in the process of using DOIs, we are supporting their credentials as a badge of worth, but it seems the path of least resistance.

LaTeX to WordPress

Phillip Lord

This post describes the process of posting to WordPress from a LaTeX source file, using tools generated as part of the Knowledgeblog project.

1 Introduction

About a month ago, we managed to get funding from JISC for knowledgeblog; the idea is to turn a blog platform from something for light commentary into a framework for serious scientific publication. One of the key requirements for this is to fit in with people’s existing working practices; and for this, we need a good document creation environment. This means Word and LaTeX. I’ve been working mostly on the latter, and this post is the first outcome. It’s generated totally automatically from LaTeX. This is an advance on my paper on realism, which was semi-automatically converted, with some hand editing of the HTML.

At the moment, the tool-chain is a little bit clunky, but it will improve! This is not meant to be an announcement that all is ready, just an early alpha release and proof-of-principle.

2 Implementation

The implementation of this tool-chain uses three pieces of software:

latextowordpress:

This package, which I have written, uses plasTeX to parse and render the LaTeX into HTML. Most of the work is being performed by plasTeX out-of-the-box, although using a non-default configuration. Math-mode is being treated separately, however, rather than using plasTeX’s default image rendering approach.

blogpost:

blogpost is being used to actually post the generated HTML onto the web. The HTML can also be cut-and-pasted directly into WordPress, but blogpost is easier for me, as it’s the usual tool I use anyway (normally over asciidoc source). Blogpost is unmodified.

mathjax-latex:

This is a WordPress plugin that I have written, which uses MathJax to render math-mode from the original LaTeX in the browser. The plugin just injects the MathJax JavaScript headers into a post on demand (i.e. only on posts with math-mode in them).
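The “on demand” idea can be illustrated in a few lines. The sketch below is in Python rather than the language of the plugin itself, and it is not the plugin’s actual logic: it simply assumes that math-mode shows up in the generated HTML as \( ... \) or \[ ... \] and that the script address is supplied by the caller.

```python
import re

# Inline math \( ... \) or display math \[ ... \]; an assumption about how
# math-mode appears in the generated HTML.
MATH_PATTERN = re.compile(r"\\\(.+?\\\)|\\\[.+?\\\]", re.DOTALL)


def needs_mathjax(post_html):
    """Return True only when the post contains math-mode delimiters."""
    return MATH_PATTERN.search(post_html) is not None


def with_mathjax_header(post_html, script_url):
    """Prepend a script tag, but only for posts that actually need it.

    `script_url` is a hypothetical parameter: wherever MathJax is hosted.
    """
    if not needs_mathjax(post_html):
        return post_html
    header = f'<script type="text/javascript" src="{script_url}"></script>\n'
    return header + post_html


if __name__ == "__main__":
    print(needs_mathjax(r"<p>Energy: \(E=mc^2\)</p>"))
    print(needs_mathjax("<p>No maths (here).</p>"))
```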

Currently, this is all held together with some dodgy makefiles; this will be improved in time.

The first and last of these tools are available from knowledgeblog. I’ve tested them on Ubuntu 10.04 and they are in alpha. Comments are welcome, to knowledgeblog-discuss.

3 Key Features

At the moment, I haven’t fully explored which features of LaTeX are well supported. However, all the structural elements (sections, lists), bibliographies and links via the hyperref package seem to work well.

The math mode rendering works well. I’ve been using one famous equation, \(E=mc^2\), as my main test. But more complex examples work also. This is from MathJax: \(J_\alpha (x) = \sum _{m=0}^\infty \frac{(-1)^ m}{m! \, \Gamma (m + \alpha + 1)}{\left({\frac{x}{2}}\right)}^{2 m + \alpha }\).

I’ve made a few tweaks to this also for common idioms. So the less-than symbol is written in math-mode in LaTeX but rendered directly in HTML: <.
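As a rough illustration of that kind of tweak (a Python sketch, not the code in latextowordpress, with an invented function name), a lone less-than sign in math-mode can be swapped for its HTML entity while everything else is left for MathJax:

```python
import html


def render_simple_math(text):
    r"""Replace a lone \(<\) with an HTML entity; leave other math alone."""
    return text.replace(r"\(<\)", html.escape("<"))


if __name__ == "__main__":
    print(render_simple_math(r"a \(<\) b"))
    print(render_simple_math(r"\(E=mc^2\)"))
```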

4 Future Work

There are many things left to do yet. The process needs to be made smoother, with a single tool to hook the current tool-chain together; it would be good to attach a PDF generated from the LaTeX also. Currently, titles are set independently (which is why this post appears to have two titles). The MathJax plugin needs configuration options (it overwrites wp-latex functionality at the moment). And there is significant testing to do to see what advanced features (figures critically!) work and don’t work. Still, it’s good to see that most of the tools that I needed to get this to work already existed. With luck, most of the other tools we need will be as good.

 

I’m very pleased that our grant for knowledgeblog has been accepted by JISC. I shall follow the tradition that I set with my last post, of publishing all my primary scientific output on this blog. In this case, I’m using Word, which, like the LaTeX that I used last time, isn’t perfect. Still, improving this process is part of the knowledgeblog proposal, so this post is also attacking a key deliverable for the grant!

The main content for this post is also available on the knowledgeblog events blog.

 

Outline Project Description

The project extends existing blogging tools for use as a lightweight, semantically linked publication environment. This enables researchers to create a hub in the linked-data environment, which we call knowledge blogs or k-blogs. K-blogs are convenient and straightforward for authors to use, integrating into researchers’ existing work practices and tools. They provide readers with distributed feedback and commenting mechanisms. We will support three communities (microarray, public health and workflow), providing immediate benefit, in addition to the long term benefit of the platform as a whole. Additionally, this will enable a user-centric development approach, while showcasing the platform as the basis for next generation research publishing.

1. Introduction

This document describes a proposal for a project within the JISC “Managing research Data” call. Data comes in many forms, from raw statistics, to highly structured databases, through to textual reports; natural language, although hard to search and manage, is still the richest form of representation; data in the form of reports and publications are the central hub around which all other data sit. This project, therefore, will provide a lightweight, yet extensible, framework for scientific publishing, incorporating a software-supported peer-review process. Bi-directional links will be maintained both between publications and to other forms of data, using semantic markup to enhance the meaning of these links. We will also customize this framework for three communities which, as well as being directly useful, will provide real-world requirements. The project will largely develop “glue” between existing, widely-used, open-source software systems, ensuring its sustainability and usefulness past the end of the funding.

2. Fit to Programme Objectives and Project Outline


The project call identifies the complexity and hybrid nature of the UK research data environment; despite this, one central focal point remains: most researchers spend considerable amounts of time discussing their data in the form of “paper” publications. For some more theoretical disciplines, such as parts of computer science, the paper is the sole output; in others, such as biology, datasets are associated with papers and the barriers between “publication” and “data” are breaking down; most data sources in biology are rich in annotation: text that supports and explains the raw data. It is normally the annotation, not the raw data, which defines the quality of the resource. In these cases, text is an intrinsic part of the data.

However, the conventional publication process has changed relatively little; web technologies have largely been adopted as a distribution mechanism. Publications are still expensive, either at subscription or publication time depending on the business model of the publisher, and involve considerable, time-consuming interactions between author and publisher, often relating to display and presentation issues. This is in stark contrast to, for example, the biological data centres where both raw and annotated data are often made available within hours of their generation.

This situation is unfortunate because it limits the ability of researchers to customise their publication process for the requirements of their own discipline. As demonstrated by Shotton et al, and Rousay et al, it is possible to add considerable value, both enhancing the paper for the reader, as well as providing direct and semantically enhanced links to underlying data. The cost of the existing process, however, makes this form of publication unlikely for some data; for example, few scientists publish papers about negative results, resulting in an acknowledged publication bias. As a result, it is hard for the semantically enhanced publication to take its place as the central hub for a linked data environment as envisioned by Coles and Frey, linking to and between research datasets, and the published knowledge about these datasets.

In the last decade, the blog has become a common, web-based publication framework. There are now numerous off-the-shelf tools and platforms for managing blogs, providing a high degree of functionality. Many scientists blog about their work, about other published work (research blogging) or “live blog” about conferences and talks as they happen. In this case, the researcher is in charge of their own publication environment, can extend it to their requirements, and publication happens immediately. However, the blog has not yet become a standard means of publication for primary research output.

Recently, as part of the EPSRC funded Ontogenesis network (ref), we trialled the Knowledge Blog process; in this case aimed at producing an educational resource describing many aspects of ontology development and usage, which might previously have been published in book form. We have shown that with this technology base, it is possible to replicate many of the features of the open peer-review, scientific book publication process; following two small meetings, we have written around 20 articles, and the website maintains around 1000 post reads per month (not simple hits!). To achieve this, we used only two features of the blog: trackbacks (bidirectional links) and categories (hierarchical keywords); although we used the WordPress blogging software, these features are supported by most other systems. We call these articles k-blogs.

Currently, however, the k-blog process is not fully supported with blog software alone, nor does it fully support the referencing, advanced linking and provenance needed specifically for research publications. For this project, we propose to provide extensions to support data-rich publications, deeply and semantically linked to other k-blogs and to other forms of data repository. Therefore, the project addresses the objectives and aims of the call through four main workpackages.

1) A documented k-blog process (WP1.1) describing different levels of peer-review suitable for different forms of research data. An implementation (WP1.2), the k-blog platform, of this process based around open-source, off-the-shelf software.

2) Extensions to the k-blog platform supporting linking. This includes full support for referencing including COinS metadata on posts (WP2.1), client-side and permanently linked versions (WP2.2) and bidirectional links (WP2.3) to other data sets. We will add semantics to these links using the Citation Ontology (CiTO) (WP2.4).

3) Support for three specialist environments: healthcare (WP3.1), microarray (WP3.2) and workflows (WP3.3). All are useful in their own right and showcase the extensibility of the framework.

4) Documentation and tooling to integrate the k-blog process into scientists’ existing working practice and tooling; scientists will be able to publish from Word, OpenOffice, Google Docs or LaTeX (WP4.1). We will add tooling and documentation, as WP4.2, to support the use of reference management tools such as Endnote, Mendeley or Zotero, making use of deliverables from WP2.

3. Quality of proposal and Robustness of Workplan

 

3.1 WP1: Knowledge Blog Process



In this project, we aim to develop a light-weight publication framework, including the desirable aspects of the formal peer-review process. However, different forms of scientific publication require different levels of peer-review. For example, for http://ontogenesis.knowledgeblog.org, we require two reviews from an editorial board, assessing quality, appropriate for an educational resource. However, for http://process.knowledgeblog.org, which is intended to contain informal “how-to” and request for comment documents, a much lighter-weight, single editorial review assessing scope alone is more appropriate. Deliverable WP1.1 will consist of documentation describing, both formally and informally, a number of levels for the knowledge blog process, and how these can be achieved using a blog. These documents will, themselves, be published on http://process.knowledgeblog.org.

These processes will be implemented as Deliverable WP1.2, comprising freely available and widely used pieces of software, with additional “glue”. The basic publication framework will use WordPress 3 (WoP), an open-source, multi-site, multi-author blogging system used to provide the hosted blog service at http://www.wordpress.com. While we have found that WoP supports many aspects of this process, particularly from the readers’ perspective, a significant degree of “book-keeping” is required from authors, reviewers and editors. Readers know whether a paper has been reviewed or not, but authors have to remember for themselves who is reviewing the paper. Therefore, we will use a “ticket system”, specifically Request Tracker 3 (RT) (http://bestpractical.com/rt/). Both WoP and RT are extensible with plugins and will be extended and adapted to reflect the k-blog levels of WP1.1.

We will use this extensibility to provide a light-weight integration. RT operates as an email response system; by extending WoP to send email on submission of new papers, this can provide both an integration point, as well as the main point of interaction for authors, reviewers and editors. To provide editorial and reviewer functionality, tickets can be moved between queues; extensions to RT will use standard blogging XML-RPC calls to feed back to WoP by, for example, re-categorising papers once accepted. OpenID (http://openid.net) will be used to integrate the user accounts between the two systems. WoP already supports this fully, while RT supports it in skeleton form.
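To give a flavour of what such a call looks like, the Python sketch below builds (but does not send) an XML-RPC request that re-categorises a post. It assumes that the “standard blogging XML-RPC calls” are those of the MetaWeblog API, which WoP implements; the post id, credentials and category name are placeholders, and the function name is ours.

```python
import xmlrpc.client


def recategorise_request(post_id, username, password, categories):
    """Build the XML payload for a metaWeblog.editPost call.

    Assumption: the MetaWeblog API is used, where editPost takes
    (postid, username, password, content struct, publish flag).
    Nothing is sent over the network here.
    """
    content = {"categories": list(categories)}
    params = (str(post_id), username, password, content, True)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.editPost")


if __name__ == "__main__":
    # All values are placeholders for illustration.
    print(recategorise_request(42, "editor", "secret", ["Reviewed"]))
```

In a real extension the payload would be posted to the blog’s XML-RPC endpoint by the ticket system when a ticket changes queue.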

Although we will provide an implementation of the k-blog process, it will be described sufficiently generically to support complete and independent implementation.

 

3.2 WP2: References and Metadata
For k-blogs to become an integral part of the scientific record, they must fully support the semantic and linked data environment. Although WoP supports standard URI based linking to resources, and bidirectional “trackback” linking to other resources, it lacks complete functionality suitable for research communities. This is a rare example of functionality that is not already provided by WoP or an associated plugin. Deliverable WP2.1 will fulfil this need; we will support the insertion of at least DOIs and PubMed IDs (PMID), which will be resolved to full human-readable reference lists for display, using APIs provided by CrossRef and NCBI eUtils respectively. To fully support computational agents wishing to access the same information, references will also support COinS metadata, embedded into the display HTML.
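For readers unfamiliar with COinS, the idea is to embed an OpenURL ContextObject in the title attribute of an otherwise empty span with class Z3988. The following Python sketch shows the shape of the output for a journal article; it is an illustration under our own assumptions about which fields to include, not the WP2.1 deliverable, and the example values are invented.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(title, journal, year, doi=None):
    """Render one reference as a COinS span for embedding in HTML."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.atitle", title),
        ("rft.jtitle", journal),
        ("rft.date", str(year)),
    ]
    if doi:
        fields.append(("rft_id", f"info:doi/{doi}"))
    # URL-encode the key/value pairs, then escape '&' for use in an attribute.
    context_object = urlencode(fields)
    return f'<span class="Z3988" title="{escape(context_object)}"></span>'


if __name__ == "__main__":
    # Placeholder reference, not a real publication.
    print(coins_span("An example article", "Example Journal", 2010,
                     doi="10.1234/example"))
```

Tools such as Zotero look for these spans, so the same HTML serves both the human reader and the computational agent.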

K-blog posts will also require outward facing metadata that describe the resources they provide in a standards-compliant manner. The Open Archives Initiative (OAI) provides standards that aim to facilitate the efficient dissemination of content. Specifically, the Object Reuse and Exchange specification (OAI-ORE) is a standard for the description and exchange of compound digital objects (such as a WoP post or page). The WordPress OAI-ORE plugin provides link header elements that implement this specification.

Our initial investigations into the k-blog process showed that WoP support for versioning and provenance is lacking; the k-blog process involves updating papers after submission but before final acceptance. While WoP stores all these versions, these are only currently visible by authors or editors through the administration interface. Whilst existing plugins for WoP already provide some of this functionality, Deliverable WP2.2 will uncover these to readers, along with a defined permalink scheme for access to all versions, providing full provenance.

WoP supports bi-directional links in the form of trackbacks; this is mediated by XML-RPC calls between resources when a link is made. This will support linking to data where, for example, the data is another k-blog; however, general data resources may lack support for this process. Therefore, as Deliverable WP2.3, we will provide a trackback proxy, hosted on the http://knowledgeblog.org server, storing and presenting these links for resources that cannot directly process trackbacks.
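The core of such a proxy is just a store of who links to what. A minimal in-memory Python sketch, with invented class and method names and no claim about how WP2.3 will actually be built:

```python
from collections import defaultdict


class TrackbackProxy:
    """Remember inbound links on behalf of resources that cannot do so."""

    def __init__(self):
        self._links = defaultdict(list)

    def ping(self, target_uri, source_uri, title=""):
        """Record that `source_uri` links to `target_uri` (ignoring repeats)."""
        entry = {"source": source_uri, "title": title}
        if entry not in self._links[target_uri]:
            self._links[target_uri].append(entry)

    def links_to(self, target_uri):
        """Present every recorded link pointing at `target_uri`."""
        return list(self._links.get(target_uri, []))


if __name__ == "__main__":
    proxy = TrackbackProxy()
    # Placeholder source address; the target is the UniProt entry used earlier.
    proxy.ping("http://www.uniprot.org/P08100",
               "http://example.org/a-k-blog-post", "A post about rhodopsin")
    print(proxy.links_to("http://www.uniprot.org/P08100"))
```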

To complete this work package, we will add semantics to the links using CiTO, as Deliverable WP2.4. Therefore, as well as enabling easier data linking and provenance, we will also enable addition of meaning to these links.
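Adding meaning to a link amounts to naming the relationship between the two ends. As a sketch in Python, the snippet below writes a link as a single RDF statement using a CiTO property. The namespace URI is an assumption (it is the one used by current releases of the ontology and may differ from the one in use when this proposal was written), and the helper name and the citing address are invented.

```python
# Assumed namespace for the Citation Ontology.
CITO = "http://purl.org/spar/cito/"

# A small subset of CiTO properties, enough for illustration.
RELATIONS = {"cites", "usesDataFrom"}


def cito_triple(citing_uri, relation, cited_uri):
    """Express a typed link as one N-Triples statement."""
    if relation not in RELATIONS:
        raise ValueError(f"unsupported relation: {relation}")
    return f"<{citing_uri}> <{CITO}{relation}> <{cited_uri}> ."


if __name__ == "__main__":
    print(cito_triple("http://example.org/a-k-blog-post",
                      "usesDataFrom",
                      "http://www.uniprot.org/P08100"))
```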

 

3.3 WP3: Specialist Environments



The k-blog platform and process are designed to be flexible and adaptable to the needs of specialist environments. We will use three main use cases to ensure real world applicability of the software, as well as fulfilling the immediate needs of these communities.

For Deliverable WP3.1, we will add additional features for supporting the microarray community. Currently, the microarray community is well serviced in terms of metadata capture (MIAME) and deposition in public repositories (ArrayExpress, GEO). As part of WP2, we will support linking to these datasets through stable URIs. However, these resources deal only with data generation. Post-processing and analysis is largely captured at the publication stage, often in supplementary material.

A substantial amount of this analysis uses BioConductor: a widely used, open-source platform for statistical microarray analysis based on the R statistical programming language. We will extend k-blog with specific support for R and BioConductor. Authors will be able to directly embed code into k-blog papers, along with the figures that result; as a result reviewers and readers will be able to see a computationally precise description of methods and replicate the generation of figures should they choose.

Finally, we will investigate the possibility of publication to a k-blog using only R code and references to public databases, in a process similar to Sweave: figures will be generated on the server, providing guarantees of correctness and precise provenance. The limited scope of this call means this part of WP3.1 will be proof-of-principle only.

For WP3.2, we will focus on the public health community (PHC): a key workforce in delivering quality and effective healthcare by providing timely and accurate public health intelligence (PHI). PHI is a varied environment performing statistical analyses: producing information figures, diagrams and reports to communicate results to the wider health community. However, the PHC operates in small groups with little knowledge networking. The main aim of the k-blog is to improve the availability of health information, data and knowledge, to inform decisions for health protection and care standards as supported by the Quality Improvement Productivity and Prevention initiative. The NWeHealth e-Lab project, hosted at The University of Manchester, provides an environment to bring together research objects into a single location. As elsewhere, textual data forms the key hub that links together all the other forms of knowledge. By linking to e-Lab research objects from a k-blog, this link will be made explicit, available, interpretable and directly valuable to the PHC; as a result WP3.2 is synergistic with the rest of the proposal. This community also brings a set of access control requirements. To support these we will use existing WoP facilities, providing a simple, easy-to-use three level access model.

 

For WP3.3, we will generate k-blog content about Taverna workflows and methods for building them. Workflows have become a popular way of realizing computational analyses and have become an important form of data. The JISC funded myExperiment project is widely used to disseminate the workflows themselves. Knowledge about issues surrounding workflows is, however, more difficult to produce and disseminate. A k-blog, with its ability to produce short, targeted articles as the need arises and the resources become available for writing, suits the need for Taverna workflow documentation. We will seek k-blogs on Taverna issues such as: the basics of workflow design; how to choose among a set of similar services in producing a workflow; and the testing of workflows. We will implement a light-weight mechanism, using trackbacks, to link between the k-blog and myExperiment.

 

As part of WP3, we will also hold four workshops, at 3-month intervals, each focusing on one particular k-blog and community. These workshops will be of the form previously trialled as part of the Ontogenesis network, and will serve several purposes: requirements gathering and feedback for us, education for the community, and development of content that demonstrates the process to the general readership.

 

3.4 WP4: Integration with Existing Working Practices



For the k-blog process to be acceptable to communities such as those described in WP3, it must fit with existing working practices. Researchers mostly write documents using a word-processor. Fortunately, as the k-blog platform is based on the widely-used WoP, which in turn offers a widely-supported API, this style of working can be readily integrated. It is already possible to author using Word (2007 onward), OpenOffice, Google Docs and LaTeX using integrated or existing technologies, as demonstrated by our previous work at http://ontogenesis.knowledgeblog.org. For Deliverable WP4.1, user oriented documentation describing these tools will be developed. This documentation will also describe clearly how to present and organise papers in a way which is optimized for the k-blog process. While we expect this documentation to take a significant time-span to produce, refining it as a result of user feedback, it is important to note that a k-blog is already useful and possible.

To take maximal advantage of linking technologies developed in WP2, we will need to integrate with existing technologies for referencing. As Deliverable WP4.2, we will add tooling to enable the use of bibliographic tools such as Endnote, Mendeley, Zotero or BibTeX to insert references that k-blog can directly translate. Largely, this should consist of “styles”, modifying the in-text citation, as the reference plugin of WP2.1 will generate reference lists. As with other deliverables, this tooling will include substantial documentation, developed using the k-blog process.

4. Project Timeline

 

Name    Start       End         Staff   Notes
WP 1    02/08/2010  30/10/2010
WP 1.1  02/08/2010  31/08/2010  All     A documented k-blog process
WP 1.2  01/09/2010  30/10/2010  DS,SC   Implementation with off-the-shelf software
WP 2    01/11/2010  30/04/2011
WP 2.1  01/11/2010  26/02/2011  SC      COinS metadata on posts
WP 2.2  01/11/2010  29/01/2011  SC      Client-side, permanently linked versions
WP 2.3  03/01/2011  26/02/2011  DS      Bi-directional links to other datasets
WP 2.4  01/03/2011  30/04/2011  PL      Semantic linking with CiTO
WP 3    01/11/2010  30/07/2011
WP 3.1  01/11/2010  30/07/2011  GM      Specialist environment – Healthcare
WP 3.2  01/11/2010  30/07/2011  DS      Specialist environment – Microarrays
WP 3.3  01/11/2010  30/07/2011  RS      Specialist environment – Workflows
WP 4    02/08/2010  30/06/2011
WP 4.1  02/08/2010  30/04/2011  GM,DS   Authoring documentation and tools
WP 4.2  02/05/2011  30/06/2011  GM,SC   Referencing documentation and tools

 

5. Project Management Arrangements

The project will be managed from Newcastle University; the primary management will be from Dr Lord, who will be responsible for:

  • Developing Project Management Plans;
  • Ensuring that the Project technical objectives are met;
  • Prioritising and reconciling conflicting opportunities;
  • Reporting and collaborating with JISC programme Manager;
  • Dissemination of the k-blog platform.

Project progress will be evaluated through scheduled, short, “stand-up” meetings on a weekly basis, conducted face-to-face, via Skype or phone as appropriate. Although most project staff are co-located, primary unscheduled communication will be via public mailing list, ensuring maximum visibility and openness. User consultation will be via public mailing list, as well as through a “dogfooding” k-blog. All project staff have been handpicked; they are highly experienced and self-directed, as outlined elsewhere. All are associated with several other projects and duties (research, research support, teaching and training), and are responsible for managing these independent workloads.

 

5.1 Risks

Staff Risk – as with all projects, loss of staff could negatively impact on this project; however, all staff are on permanent contracts and have long histories in research, so this is less likely. Additionally, by dividing the work between five individuals, we limit the risk should a single person leave.

WoP3 and other dependencies – the project depends on other software, most notably WoP for which a new version (3.0) is now in beta; however the software is widely supported. Other software is replaceable.

Standards Shifting – the project depends on a number of standards and these may change. In this project, we will NOT support standards, but rather use those that support us. Where standards change rapidly, their implementation will be delayed (until they stabilize) or dropped. None of the standards described here is critical to the success of the project.

 

5.2 IPR Position

All code will be developed under open source licences. WoP and RT are licensed under GPL, so code linking to these will be likewise licensed. Code that is separable will be released under LGPL. Code will remain copyright of respective institutions or authors. Any documentation produced by project staff relating to the project will be licensed under a Creative Commons Attribution license. Licensing of individual k-blogs will be delegated, but permissive licenses will be encouraged.

 

5.3 Sustainability

This project is largely based around innovative, novel and leading use of existing software. As such, the sustainability of the majority of the technology base is not dependent on project members but on large companies with established and proven business models. The k-blog process will be cleanly separated from its implementation, ensuring only weak dependencies on underlying software. Where we produce software “glue”, public and widely supported APIs will be used where possible. This will ensure that components are replaceable. All code, including historical versions, will be publicly available. Documents produced by project staff will be publicly available and clearly licensed, so will be archived through the internet “cloud” resources; we are also seeking explicit support for archiving from the British Library.

 

5.4 Staff Recruitment

All staff are already in post.

 

5.5 Key Beneficiaries

Our key beneficiaries are the public health, microarray and workflow communities; as the k-blog process is based around commodity software, these groups can use the basic environment from the first day of the project to generate and share content. As the project progresses, so will the process, the software to support it and the documentation to explain it; at all stages, the k-blog process fulfils a clear and immediate need. While we are specifically targeting these communities, the k-blog process and platform are sufficiently generic that they can support a wide range of research activities.

Although presented here as a single platform, the process and components are separable and can benefit communities independently. In particular, the tools and documentation from WP2 and WP4 will find use within the research blogging community, who find the lack of tooling for referencing especially difficult. Finally, the statement of a peer-review process, and its implementation within RT, will be applicable to any peer-review environment regardless of the form of publication. This includes publications published using wiki or other Content Management Systems.

 

5.6 Engagement with Community

We consider the mechanism for engagement with four kinds of community. Engagement with our core content generating community is an intrinsic part of this proposal, as described in WP3. Further interaction with more disparate groups will be maintained through personal contacts; the five individuals named in this proposal are all experienced and are embedded in different communities (health care, microarray, ontology, proteomics). Engagement with our core content consuming community is, again, an intrinsic part of the proposal; all project communications will be via open mailing list or k-blog. Project members are active users of Web 2.0 social technologies; our initial trials as part of Ontogenesis showed this approach to be a highly effective form of dissemination, with minimal effort. Engagement with software users will be via website and direct interaction. All software will be released or advertised via normal channels (website, versioning, and mailing list), including a (Debian) package repository for those wishing to set up their own server. Finally, developer communities will not be specifically targeted, but our open source, continually integrated development plan will be attractive, and we will accept suitably licensed contributions.

All communities will benefit from the open and agile development methodology we will adopt; changes to the environment will be integrated and released rapidly, ensuring continual improvement and facilitating rapid feedback cycles.

 

6. Previous Experience and Project Team

 

Dr. Phillip Lord is a Lecturer of Computing Science at Newcastle University. He has a PhD in yeast genetics from the University of Edinburgh, after which he moved into bioinformatics. He is well known for his work on ontologies in biology, as well as his contributions to eScience, beginning with his role as an RA on the myGrid project. Since his move to Newcastle, he has been an investigator on three more eScience projects (CARMEN, ONDEX and InstantSOAP), as well as maintaining an active engagement in standards development (OBI, MIGS, MIBBI), and publishing on the fundamentals of ontology design. He was an active participant in the Ontogenesis network, and developed the initial idea for knowledge blogs as part of this. He is an active blogger and developer.

 

Dr. Georgina Moulton is an Education and Development Fellow at The University of Manchester. Since 2005 her main roles have been to co-ordinate the development and delivery of multi-disciplinary bio/health informatics education programmes, and to facilitate the engagement of biological and health communities in a variety of bio and health informatics research projects (e.g., ONDEX, Obesity e-Lab). For 3 years, Georgina was the EPSRC-funded Ontogenesis Network Manager, in which role she co-ordinated the activities of the network, expanded it by facilitating the development of new activities, and was involved in the trial k-blog process. More recently her work has included the development and delivery, in conjunction with NHS partners, of an education and development programme tailored to match the needs of North West public health analysts and the wider healthcare workforce.

 

Dr. Daniel Swan has a PhD in developmental biology and continued to work in developmental biology as a post-doctoral researcher before moving into bioinformatics in 2001. Subsequent positions included working for Bart’s and the London Genome Centre and the Centre for Hydrology and Ecology in informatics-driven roles dealing with large, distributed biological datasets generated by large user communities. Currently the manager of the Newcastle University Bioinformatics Support Unit, he leads a small team helping biological researchers generate, capture, store and analyse their digital data. His interdisciplinary background means he has grounding in both computer and biological sciences and is comfortable working on CS-focused projects (CARMEN, InstantSOAP, Bio-Linux) as well as acting in a research capacity analysing high-throughput data.

 

Dr. Simon Cockell has a PhD in Genetics from Leicester University, and refocussed into Bioinformatics with a Masters degree from Leeds in 2005. From there he moved to Newcastle, and the Bioinformatics Support Unit. Since coming to Newcastle, Simon has worked on a range of projects involving large-scale analyses (AptaMEMS-ID), data integration (Ondex) and health informatics (MRC Mitochondrial Disease Cohort).

 

Dr Robert Stevens is a senior lecturer in Bioinformatics in the Bio and Health Informatics group at the University of Manchester. His main areas of research are in the development and use of semantics within the life sciences. This is blended with the use of e-Science platforms to gather and manage the data and knowledge of the life sciences. He was PI on the Ontogenesis network that ran the meetings for the first k-blog. He is or has been a co-investigator on the myGrid and myExperiment grants that will provide both content and technical input to this project. As well as the JISC funded myExperiment project, Stevens was an investigator on the JISC funded CO-ODE project that developed Protégé 4. On the back of this, Stevens has led the OWL training activities at Manchester, which have fed directly into the Ontogenesis k-blog. This range of experience makes Stevens an ideal partner to lead the development of content within this project.

 

While travelling on Elba, I suffered the misfortune of a virus attack; I don’t use AV software these days, since it tends to break other things which take a long time to fix, and it’s been many years since I’ve lost a machine to malicious software.

The process, though, was quite entertaining. First, I started getting an error stating that system.exe needed .NET to run properly. After a while, a Windows update happened, along with the normal malicious software removal update. This found the virus, probably killed it, then stuck up a dialog saying “Some of your files were nasty, so they need to be restored, please insert your Windows SP3 disk”. Clicking “ok” said “I can’t find the disk, perhaps a) you put the wrong disk in or b) your drive isn’t working”. Or c) you are on holiday, and your disk is 1000 miles away, and anyway, the machine is old enough to have come with SP2. All of which rather raises the question of why the software that I’d just downloaded from Microsoft can’t also download, from Microsoft, the system components to replace the ones that it has deleted.

After the reboot, all trace of networking software had been blitzed from the machine; I couldn’t even use loopback addresses. In the end, I did a complete factory reset from the recovery partition which I thought I had deleted years ago. The process took about 15 minutes to recover Windows, 1 hour to recover the Sony application layer, 2 hours to remove all of the Sony application layer (one application at a time, including the 10 different wallpaper packages, because Add/Remove Programs doesn’t allow multiple select), except for the power management tweaks and drivers, then another hour trying to figure out NTFS file permissions so that I could read my files.

Actually, the process hasn’t been a complete loss; I was thinking of re-installing the OS anyway. The boot had got to around 3 to 4 minutes, which was getting daft. Now, with a clean OS, a complete reboot takes under a minute. It’s also been a bit of a walk down memory lane; currently, I have no internet, so my computer is in its 2005 state; there is an Office 2003 trial, a rubbishy media centre thing Sony probably wrote as an answer to iTunes, and Macromedia Flash. I had an Emacs install exe in my recycle bin which I managed to recover before the reset so, ironically, Emacs is the newest piece of software I have on here.

Still, this administrative nightmare makes me wonder what to do next; XP is not long for this world, Vista is a poll tax on wheels, and I am just not sure I can be bothered with learning 7. I’ve used Windows on the road for a long time, but I think I may go for a small, light netbook running Linux. There have been a couple of times when I have needed MS Office, but it’s not that common, and there is always a work-around.

At Neuroinformatics 2009, David Sutherland and I talked about the problems of ontology building. One of the current (and past!) difficulties is to choose an appropriate language for representing the knowledge in your ontology. I thought I would write my thoughts up as a post; this will probably result in the most boring thing I have ever written (I am sure someone will point out worse offenses); syntax is dull but distressingly important.

In bioinformatics, there are essentially two choices: OWL and OBO (format). A second issue is finding a good environment for developing the ontology; this divides between Protege, OBO-Edit and the ever-present “text editor”. It’s often the case that we want to use both languages at the same time. Take, for example, OBI, which I am involved in. While the ontology itself is being developed in OWL, many of its dependent ontologies are built using OBO; being purist and demanding one is really not an option. OWL itself has many different syntaxes; at the moment, I generally prefer Manchester syntax because you can edit it with a text editor, which is really not so easy with any of the XML representations.

While these two languages have somewhat different expressivity, a number of descriptions of how to translate both the syntax and the semantics have been published elsewhere. One of the recurrent problems, however, stems from the best practices and the syntax of identifiers.

OBO makes use of a numerical, semantics-free identifier and a namespace, with a syntax of NAMESPACE:IDENTIFIER. So, a Gene Ontology term looks like GO:0003674. The namespace is not constrained to be two letters and has mechanisms for world-uniqueness, in that people talk to each other and sort it out if they clash. The use of a semantics-free identifier means that term names can be changed while maintaining the meaning associated with the term; the label for the term, meanwhile, provides a human readable version, which can be shown to users of the ontology. I will call these the OBO identifier and OBO label respectively.

Translating this into OWL, including Manchester syntax, however, causes significant problems. The naturalistic translation is to turn the OBO identifier into the identifier in OWL; the OBO namespace would become an XML namespace, the OBO identifier would become an XML identifier. Unfortunately, this doesn’t work. First, the OBO identifier is genuinely just a short string and OWL requires a URI; so a mapping between OBO identifiers and URIs is necessary. Second, the OBO identifier is numerical; unfortunately, while the identifiers in OWL can contain numbers, they have to start with a non-numerical character. The standard translation, therefore, uses in most cases an OBO-wide URL (http://purl.obolibrary.org/obo/), although some ontologies have their own namespace (GO uses http://purl.org/obo/owl/GO#). The OBO identifier is mapped to a valid identifier by sticking a prefix onto the numbers. So, we have identifiers such as GO:GO_0042101 or obo:OBI_1110045. There are also some OBO ontologies for which this does NOT occur; for instance, BFO classes in OBI come out with identifiers of the form snap:Continuant or span:Process, except for one which is bfo:Entity.
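As a rough sketch of the mapping just described, the following Python turns an OBO identifier into a URI by reusing the namespace as a prefix and joining it to the number with an underscore. The function names are my own invention for illustration, and the snippet assumes the simple NAMESPACE:NUMBER case only; it does not attempt the BFO-style exceptions.

```python
# Assumption: the OBO-wide base URL quoted in the text above.
OBO_BASE = "http://purl.obolibrary.org/obo/"


def obo_id_to_uri(obo_id: str, base: str = OBO_BASE) -> str:
    """Map 'NAMESPACE:NUMBER' to base + 'NAMESPACE_NUMBER' (hypothetical helper)."""
    namespace, sep, local = obo_id.partition(":")
    if not sep or not namespace or not local:
        raise ValueError(f"not an OBO identifier: {obo_id!r}")
    # Prefixing with the namespace means the local name no longer starts with a digit.
    return f"{base}{namespace}_{local}"


def uri_to_obo_id(uri: str, base: str = OBO_BASE) -> str:
    """Reverse the mapping above, so that it is possible to go backward."""
    if not uri.startswith(base):
        raise ValueError(f"{uri!r} does not start with {base!r}")
    namespace, sep, local = uri[len(base):].partition("_")
    if not sep or not namespace or not local:
        raise ValueError(f"cannot recover an OBO identifier from {uri!r}")
    return f"{namespace}:{local}"


if __name__ == "__main__":
    print(obo_id_to_uri("OBI:1110045"))
    print(obo_id_to_uri("GO:0042101", base="http://purl.org/obo/owl/GO#"))
```

The round trip only works because the namespace and the number are both kept in the URI; the identifier stays semantics-free throughout.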

Again, all perfectly reasonable, but unfortunately, when converted to Manchester syntax it means that we end up with classes that look like this slightly elided class from OBI:


Class: obo:OBI_1110161

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

which completely defeats the aim of a human-readable syntax. Now OBO format has much the same problem; relationships to other classes are specified using cross-references to their identifiers which are, essentially, unreadable. OBO format works around this with a denormalisation, as can be seen from this somewhat elided example from IAO:


[Term]
id: IAO:0000027
name: data item
def:"a data item is an information content entity that is intended...."
is_a: IAO:0000030 ! information content entity

The cross reference in this case is a subsumption link to IAO:0000030; the text after the “!” is a comment which repeats the label of that term.
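To make the denormalisation concrete, here is a minimal Python sketch that splits one tag-value line from an OBO stanza into its tag, its value and the trailing “!” comment. The function name is hypothetical, and the split is deliberately naive: it assumes the first “ ! ” on the line starts the comment, which would go wrong for a quoted string that itself contains one.

```python
from typing import Optional, Tuple


def parse_tag_line(line: str) -> Tuple[str, str, Optional[str]]:
    """Split 'tag: value ! comment' into (tag, value, comment or None)."""
    # The tag ends at the first colon; the value may itself contain colons.
    tag, _, rest = line.partition(":")
    # Naive assumption: the first ' ! ' begins the human-readable comment.
    value, bang, comment = rest.partition(" ! ")
    return tag.strip(), value.strip(), comment.strip() if bang else None


if __name__ == "__main__":
    print(parse_tag_line("is_a: IAO:0000030 ! information content entity"))
    print(parse_tag_line("name: data item"))
```

The identifier is what the machine uses; the comment is there only for the reader, which is exactly the duplication discussed in the rest of this post.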

One solution would be to use the rdfs:label in place of the identifier. So, we would have something that looked like this:


Class: "T cell epitope ELISA IL-1b assay" @en

    Annotations:
        obo:identifier "1110161"

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

Other identifiers would have to be changed as well. I’ve also added the obo:identifier line (which I think would be valid, but might require the creation of an OWL individual). Without this, it would not be possible to go backward.

However, this is problematic as it changes the serialization between the OWL Manchester syntax and other syntaxes of OWL. The class identifier has to be URI legal, and the OBO label here is not. We could do a syntactic conversion (e.g. T%20cell%20epitope) but this, again, reduces readability, defeating the point. Also, the rdfs:label would become part of the final identifier URI, which then becomes a semantics-heavy identifier. Finally, it would require an OBO-specific loading of the Manchester syntax, taking the URI identifier from the annotation block, and the rdfs:label from the class name.
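The syntactic conversion is easy enough to try; this short Python example percent-encodes the label from the example class with the standard library, purely to show what the resulting identifier would look like.

```python
from urllib.parse import quote, unquote

label = "T cell epitope ELISA IL-1b assay"

# safe="" asks quote() to escape every reserved character, including "/".
encoded = quote(label, safe="")
print(encoded)  # T%20cell%20epitope%20ELISA%20IL-1b%20assay

# The conversion is reversible, but the encoded form is hardly human readable.
assert unquote(encoded) == label
```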

So, is there any solution? First, there are tooling solutions. In Protege, it is already possible to use any component of the definition in the display. So, you can set the rdfs:label as the main display form. Tooling solutions are attractive, but there is a problem; you have to extend all tools to support this view. I realise that the number of freaks who wish to edit OWL with Emacs is not that large, so this might not seem an issue. However, many people wish to develop ontologies collaboratively using version control; if you want to compare versions you use diff, so we now need a Manchester syntax diff viewer. Also, if you want to do some perl hacking, or straight-forward search and replace, again, it’s all harder.

To some extent this might seem trivial, but then the entire purpose of Manchester syntax (and the functional syntax) is to have an easy to read and manipulate syntax which the XML version of OWL is not. This purpose is defeated if it’s hard to read.

So, a second, non-tooling solution. The obvious answer is to take the OBO approach and add comments. Now, the Manchester syntax includes a comment character (#), although last time I tried, the Protege parser didn’t implement this. Nonetheless, it allows this:


Class: obo:OBI_1110161 #"T cell epitope ELISA IL-1b assay"@en

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661,
        obo:OBI_0000299 some (obo:IAO_0000109
        and (obo:IAO_0000136 some obo:OBI_1110196))

This is not too bad, but it doesn’t work well for complex class expressions. I can’t be bothered to look up the labels and have reused one, but you get something like:


Class: obo:OBI_1110161 #"T cell epitope ELISA IL-1b assay"@en,

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661, #"T cell epitope ELISA IL-1b assay"@en
        obo:OBI_0000299 #"T cell epitope ELISA IL-1b assay"@en
        some (obo:IAO_0000109 #"T cell epitope ELISA IL-1b assay"@en
        and (obo:IAO_0000136 #"T cell epitope ELISA IL-1b assay"@en
             some obo:OBI_11101 #"T cell epitope ELISA IL-1b assay"@en
             ))

This has three problems. Firstly, we have used comments “meaningfully”, yet we can’t distinguish between these comments and other normal comments. Secondly, we have had to reformat the output because we have only a “to-end-of-line” comment character. Thirdly, it looks horrible.

So, my minimal solution would be this; we introduce some new comment characters, which are treated as comments normally, but which carry enough semantics to allow a warning when they are wrong; rather like Javadoc, which is a comment wrt the language, but is structured and meaningful wrt the documentation. Tooling could be used to check that the labels masquerading as comments are correct wrt the identifiers.


Class: obo:OBI_1110161 [T cell epitope ELISA IL-1b assay],

    Annotations:
        rdfs:label "T cell epitope ELISA IL-1b assay"@en,

    SubClassOf:
        obo:OBI_0000661 [blah],
        obo:OBI_0000299 [longer blah]
        some (obo:IAO_0000109 [more]
        and (obo:IAO_0000136 [stuff]
        some obo:OBI_11101 [OBI Thing]
        ))

This is still not ideal; it would require an extension to Manchester syntax, but it’s minimal, and it does support the semantics-free identifiers in OBO in a way which does not require extensive tooling. It’s worth reiterating here that OBO’s semantics-free identifiers are a good thing; so, supporting them supports other people who may wish to do the same, sensible thing. It does have the disadvantage of duplicating information, but at least in a way that is checkable.
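To show that the duplication really is checkable, here is a minimal Python sketch of such a checker. It is not a Manchester syntax parser; it just looks for a prefixed name followed by a bracketed label, which is the extension proposed above, and compares it against a table of known labels. The function name, the regular expression and the labels in the usage example are all assumptions for illustration; a real tool would read the labels from the rdfs:label annotations in the ontology.

```python
import re
from typing import Dict, List

# A prefixed name such as obo:OBI_0000661, followed by a label in square brackets.
ANNOTATED = re.compile(r"(\b[A-Za-z]\w*:\w+)\s*\[([^\]]*)\]")


def check_labels(text: str, labels: Dict[str, str]) -> List[str]:
    """Return a warning for every bracketed label that disagrees with `labels`."""
    warnings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ident, label in ANNOTATED.findall(line):
            expected = labels.get(ident)
            if expected is None:
                warnings.append(f"line {lineno}: no known label for {ident}")
            elif expected != label:
                warnings.append(
                    f"line {lineno}: {ident} is labelled {label!r}, expected {expected!r}"
                )
    return warnings


if __name__ == "__main__":
    sample = (
        "Class: obo:OBI_1110161 [T cell epitope ELISA IL-1b assay]\n"
        "    SubClassOf:\n"
        "        obo:OBI_0000661 [blah]\n"
    )
    # The second label is a made-up placeholder, not the real OBI label.
    known = {
        "obo:OBI_1110161": "T cell epitope ELISA IL-1b assay",
        "obo:OBI_0000661": "placeholder label",
    }
    for warning in check_labels(sample, known):
        print(warning)
```

Because the bracketed labels are ignored by anything that does not care about them, a check like this could run as a commit hook without any other tool needing to change.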

Comments welcome!

Make has been driving me mad for the last week. It keeps on complaining about “modification time in the future”. Normally, this happens because you’re using remote files from a server which doesn’t have sync’d time. But this is rare these days. Anyway, it was complaining that the file was 10E+06 seconds in the future; that’s a really, really big clock skew.

Did a bit of poking around. One possibility I found was that it was due to a limitation in FAT32; hmmm, not likely. Didn’t have time for more. I am at a conference; supposed to be paying some attention.

Anyway, the solution came to me today. Or rather the cause, because the solution was obvious. Turns out, when I changed timezone to Czech, I pushed the month back to August. What I don’t understand is that I was sure Windows synced to an NTP server running somewhere. What does it do when you change the month?
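For anyone hitting the same warning, a quick way to see how bad the skew is would be something like the Python sketch below, which walks a directory and lists the files whose modification time is later than the current clock. The function name is my own; it compares against the local clock only, so it will not tell you which of the two is wrong.

```python
import os
import time
from typing import List, Optional, Tuple


def files_from_the_future(root: str, now: Optional[float] = None) -> List[Tuple[str, float]]:
    """Return (path, seconds ahead) for files modified later than `now`."""
    now = time.time() if now is None else now
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                skew = os.path.getmtime(path) - now
            except OSError:
                continue  # e.g. a broken symlink; skip it
            if skew > 0:
                hits.append((path, skew))
    return hits


if __name__ == "__main__":
    for path, skew in files_from_the_future("."):
        print(f"{path}: {skew:.0f} seconds in the future")
```

Once the clock is right again, touching the offending files (or simply rebuilding from clean) makes the warning go away.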