I’m very pleased that our grant for knowledgeblog has been accepted by JISC. I shall follow the tradition that I set with my last post, of publishing all my primary scientific output on this blog. In this case, I’m using Word, which, like the LaTeX that I used last time, isn’t perfect. Still, improving this process is part of the knowledgeblog proposal, so this post is also attacking a key deliverable for the grant!

The main content for this post is also available on the knowledgeblog events blog.

 

Outline Project Description

The project extends existing blogging tools for use as a lightweight, semantically linked publication environment. This enables researchers to create a hub in the linked-data environment, which we call knowledge blogs or k-blogs. K-blogs are convenient and straightforward for authors to use, integrating into researchers’ existing work practices and tools. They provide readers with distributed feedback and commenting mechanisms. We will support three communities (microarray, public health and workflow), providing immediate benefit, in addition to the long-term benefit of the platform as a whole. Additionally, this will enable a user-centric development approach, while showcasing the platform as the basis for next-generation research publishing.

1. Introduction

This document describes a proposal for a project within the JISC “Managing Research Data” call. Data comes in many forms, from raw statistics, to highly structured databases, through to textual reports. Natural language, although hard to search and manage, is still the richest form of representation; data in the form of reports and publications are the central hub around which all other data sit. This project, therefore, will provide a lightweight, yet extensible, framework for scientific publishing, incorporating a software-supported peer-review process. Bi-directional links will be maintained both between publications and to other forms of data, using semantic markup to enhance the meaning of these links. We will also customize this framework for three communities which, as well as being directly useful, will provide real-world requirements. The project will largely develop “glue” between existing, widely-used, open-source software systems, ensuring its sustainability and usefulness past the end of the funding.

2. Fit to Programme Objectives and Project Outline


The project call identifies the complexity and hybrid nature of the UK research data environment. Despite this, one central focal point remains: most researchers spend considerable amounts of time discussing their data in the form of “paper” publications. For some more theoretical disciplines, such as parts of computer science, the paper is the sole output; in others, such as biology, datasets are associated with papers and the barriers between “publication” and “data” are breaking down. Most data sources in biology are rich in annotation, that is, text that supports and explains the raw data. It is normally the annotation, not the raw data, which defines the quality of the resource. In these cases, text is an intrinsic part of the data.

However, the conventional publication process has changed relatively little; web technologies have largely been adopted only as a distribution mechanism. Publications are still expensive, either at subscription or publication time depending on the business model of the publisher, and involve considerable, time-consuming interactions between author and publisher, often relating to display and presentation issues. This is in stark contrast to, for example, the biological data centres where both raw and annotated data are often made available within hours of their generation.

This situation is unfortunate because it limits the ability of researchers to customise their publication process for the requirements of their own discipline. As demonstrated by Shotton et al. and Rousay et al., it is possible to add considerable value, both enhancing the paper for the reader and providing direct and semantically enhanced links to underlying data. The cost of the existing process, however, makes this form of publication unlikely for some data; for example, few scientists publish papers about negative results, resulting in an acknowledged publication bias. As a result, it is hard for the semantically enhanced publication to take its place as the central hub for a linked data environment as envisioned by Coles and Frey, linking to and between research datasets, and the published knowledge about these datasets.

In the last decade, the blog has become a common, web-based publication framework. There are now numerous off-the-shelf tools and platforms for managing blogs, providing a high degree of functionality. Many scientists blog about their work, about other published work (research blogging) or “live blog” about conferences and talks as they happen. In this case, the researcher is in charge of their own publication environment, can extend it to their requirements, and publication happens immediately. However, the blog has not yet become a standard means of publication for primary research output.

Recently, as part of the EPSRC funded Ontogenesis network (ref), we trialled the Knowledge Blog process, in this case aimed at producing an educational resource describing many aspects of ontology development and usage, which might previously have been published in book form. We have shown that with this technology base, it is possible to replicate many of the features of the open peer-review, scientific book publication process; following two small meetings, we have written around 20 articles, and the website maintains around 1000 post reads per month (not simple hits!). To achieve this, we used only two features of the blog: trackbacks (bidirectional links) and categories (hierarchical keywords). Although we used the WordPress blogging software, these features are supported by most other systems. We call these articles k-blogs.

Currently, however, the k-blog process is not fully supported with blog software alone, nor does it fully support the referencing, advanced linking and provenance needed specifically for research publications. For this project, we propose to provide extensions to support data-rich publications, deeply and semantically linked to other k-blogs and to other forms of data repository. Therefore, the project addresses the objectives and aims of the call through four main workpackages.

1) A documented k-blog process (WP1.1) describing different levels of peer-review suitable for different forms of research data, and an implementation (WP1.2), the k-blog platform, of this process based around open-source, off-the-shelf software.

2) Extensions to the k-blog platform supporting linking. This includes full support for referencing, including COinS metadata on posts (WP2.1), client-side and permanently linked versions (WP2.2) and bidirectional links (WP2.3) to other data sets. We will add semantics to these links using the Citation Ontology (CiTO) (WP2.4).

3) Support for three specialist environments: healthcare (WP3.1), microarray (WP3.2) and workflows (WP3.3). All are useful in their own right and showcase the extensibility of the framework.

4) Documentation and tooling to integrate the k-blog process into scientists’ existing working practices and tooling; scientists will be able to publish from Word, OpenOffice, Google Docs or LaTeX (WP4.1). We will add tooling and documentation, as WP4.2, to support the use of reference management tools such as Endnote, Mendeley or Zotero, making use of deliverables from WP2.

3. Quality of proposal and Robustness of Workplan

 

3.1 WP1: Knowledge Blog Process

In this project, we aim to develop a light-weight publication framework, including the desirable aspects of the formal peer-review process. However, different forms of scientific publication require different levels of peer-review. For example, for http://ontogenesis.knowledgeblog.org, we require two reviews from an editorial board, assessing quality, appropriate for an educational resource. However, for http://process.knowledgeblog.org, which is intended to contain informal “how-to” and request for comment documents, a much lighter-weight, single editorial review assessing scope alone is more appropriate. Deliverable WP1.1 will consist of documentation describing, both formally and informally, a number of levels for the knowledge blog process, and how these can be achieved using a blog. These documents will, themselves, be published on http://process.knowledgeblog.org.

These processes will be implemented as Deliverable WP1.2, comprising freely available and widely used pieces of software, with additional “glue”. The basic publication framework will use WordPress 3 (WoP), an open-source, multi-site, multi-author blogging system used to provide the hosted blog service at http://www.wordpress.com. While we have found that WoP supports many aspects of this process, particularly from the readers’ perspective, a significant degree of “book-keeping” is required from authors, reviewers and editors. Readers know whether a paper has been reviewed or not, but authors have to remember for themselves who is reviewing the paper. Therefore, we will use a “ticket system”, specifically Request Tracker 3 (RT) (http://bestpractical.com/rt/). Both WoP and RT are extensible with plugins and will be extended and adapted to reflect the k-blog levels of WP1.1.

We will use this extensibility to provide a light-weight integration. RT operates as an email response system; by extending WoP to send email on submission of new papers, this can provide both an integration point and the main point of interaction for authors, reviewers and editors. To provide editorial and reviewer functionality, tickets can be moved between queues; extensions to RT will use standard blogging XML-RPC calls to feed back to WoP by, for example, re-categorising papers once accepted. OpenID (http://openid.net) will be used to integrate the user accounts between the two systems. WoP already supports this fully, while RT supports it in skeleton form.
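The re-categorisation step could be sketched as follows, in Python. This is a minimal illustration under stated assumptions, not the project's implementation: it assumes the blog exposes the MetaWeblog call `metaWeblog.editPost`, and the post ID, credentials and category name are hypothetical placeholders. It only builds the XML-RPC request body; sending it would need a real endpoint.

```python
import xmlrpc.client


def build_recategorise_request(post_id, username, password, categories):
    """Build the XML-RPC body that asks a blog to set a post's categories.

    Assumes the MetaWeblog call metaWeblog.editPost, whose parameters are
    (postid, username, password, content struct, publish flag).
    """
    content = {"categories": list(categories)}
    params = (str(post_id), username, password, content, True)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.editPost")


# Hypothetical values: a ticket moved to an "accepted" queue triggers this call.
request_body = build_recategorise_request(42, "editor", "secret", ["Reviewed"])
```

A real integration would probably fetch the post first and resend its full content alongside the new categories, since how a blog treats omitted fields is implementation-specific.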

Although we will provide an implementation of the k-blog process, it will be described sufficiently generically to support complete and independent implementation.

 

3.2 WP2: References and Metadata
For k-blogs to become an integral part of the scientific record, they must fully support the semantic and linked data environment. Although WoP supports standard URI based linking to resources, and bidirectional “trackback” linking to other resources, it lacks complete functionality suitable for research communities. This is a rare example of functionality that is not already provided by WoP or an associated plugin. Deliverable WP2.1 will fulfil this need; we will support the insertion of at least DOIs and PubMed IDs (PMID), which will be resolved to full human-readable reference lists for display, using APIs provided by CrossRef and NCBI eUtils respectively. To fully support computational agents wishing to access the same information, references will also support COinS metadata, embedded into the display HTML.
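As an illustration of what embedded COinS metadata looks like, the following Python sketch builds the empty `span` element that carries a URL-encoded OpenURL ContextObject in its `title` attribute. The function name and the example reference are hypothetical, and the metadata is passed in directly rather than fetched from CrossRef or NCBI; a real plugin would fill these fields from the resolved record.

```python
import html
from urllib.parse import urlencode


def coins_span(doi, article_title, journal_title, year):
    """Return an HTML span carrying COinS metadata for a journal article."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft_id", "info:doi/" + doi),
        ("rft.atitle", article_title),
        ("rft.jtitle", journal_title),
        ("rft.date", str(year)),
    ]
    # The title attribute holds the URL-encoded key/value ContextObject.
    context_object = urlencode(fields)
    return '<span class="Z3988" title="{}"></span>'.format(
        html.escape(context_object, quote=True)
    )


# Hypothetical reference, as it might look after a DOI has been resolved.
span = coins_span("10.1000/example", "An Example Paper", "Journal of Examples", 2010)
```

Tools that understand COinS look for spans with the class `Z3988`, so the human-readable reference list and the machine-readable metadata can sit side by side in the same page.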

K-blog posts will also require outward-facing metadata that describe the resources they provide in a standards-compliant manner. The Open Archives Initiative (OAI) provides standards that aim to facilitate the efficient dissemination of content. Specifically, the Object Reuse and Exchange specification (OAI-ORE) is a standard for the description and exchange of compound digital objects (such as a WoP post or page). The WordPress OAI-ORE plugin provides link header elements that implement this specification.

Our initial investigations into the k-blog process showed that WoP support for versioning and provenance is lacking; the k-blog process involves updating papers after submission but before final acceptance. While WoP stores all these versions, they are currently visible only to authors or editors through the administration interface. Whilst existing plugins for WoP already provide some of this functionality, Deliverable WP2.2 will expose these versions to readers, along with a defined permalink scheme for access to all versions, providing full provenance.

WoP supports bi-directional links in the form of trackbacks; this is mediated by XML-RPC calls between resources when a link is made. This will support linking to data where, for example, the data is another k-blog; however, general data resources may lack support for this process. Therefore, as Deliverable WP2.3, we will provide a trackback proxy, hosted on the http://knowledgeblog.org server, storing and presenting these links for resources that cannot directly process trackbacks.
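The core of such a proxy is a store that maps each data resource to the posts that link to it. The Python sketch below is purely illustrative: the class name, its methods and the example URLs are hypothetical, and it keeps links in memory where a hosted service would need persistent storage and an HTTP front end.

```python
from collections import defaultdict


class TrackbackProxy:
    """Hypothetical in-memory store of inbound links for resources
    that cannot accept trackbacks themselves."""

    def __init__(self):
        self._links = defaultdict(list)

    def register(self, source_url, target_url):
        """Record that source_url links to target_url; ignore repeats."""
        if source_url not in self._links[target_url]:
            self._links[target_url].append(source_url)

    def links_to(self, target_url):
        """Return the posts known to link to target_url."""
        return list(self._links.get(target_url, []))


proxy = TrackbackProxy()
dataset = "http://example.org/dataset/1"  # placeholder data resource
proxy.register("http://example.org/kblog/post-a", dataset)
proxy.register("http://example.org/kblog/post-a", dataset)  # duplicate, ignored
proxy.register("http://example.org/kblog/post-b", dataset)
```

Presenting the links then amounts to calling `links_to` for a resource and rendering the result.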

To complete this work package, we will add semantics to the links using CiTO, as Deliverable WP2.4. Therefore, as well as enabling easier data linking and provenance, we will also enable addition of meaning to these links.
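One way to picture a semantically typed link is as an RDF triple whose predicate is a CiTO property. The Python sketch below emits a single N-Triples line; the function name and URLs are hypothetical, and the namespace used is the one under which CiTO is currently published, which is an assumption about what the project itself would have used.

```python
CITO = "http://purl.org/spar/cito/"  # assumed namespace, see lead-in


def cito_triple(citing_url, cited_url, relation="cites"):
    """Return one N-Triples line typing a link with a CiTO property."""
    return "<{}> <{}{}> <{}> .".format(citing_url, CITO, relation, cited_url)


# Placeholder URLs; citesAsDataSource marks the target as a source of data.
triple = cito_triple(
    "http://example.org/kblog/post-a",
    "http://example.org/dataset/1",
    relation="citesAsDataSource",
)
```

The same link without the typed predicate would only say that one resource points at another; the CiTO property records why.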

 

3.3 WP3: Specialist Environments

The k-blog platform and process are designed to be flexible and adaptable to the needs of specialist environments. We will use three main use cases to ensure real world applicability of the software, as well as fulfilling the immediate needs of these communities.

For Deliverable WP3.1, we will add features to support the microarray community. Currently, the microarray community is well serviced in terms of metadata capture (MIAME) and deposition in public repositories (ArrayExpress, GEO). As part of WP2, we will support linking to these datasets through stable URIs. However, these resources deal only with data generation. Post-processing and analysis are largely captured at the publication stage, often in supplementary material.

A substantial amount of this analysis uses BioConductor: a widely used, open-source platform for statistical microarray analysis based on the R statistical programming language. We will extend k-blog with specific support for R and BioConductor. Authors will be able to directly embed code into k-blog papers, along with the figures that result; reviewers and readers will therefore be able to see a computationally precise description of methods and replicate the generation of figures should they choose.

Finally, we will investigate the possibility of publication to a k-blog using only R code and references to public databases, in a process similar to Sweave; figures will be generated on the server, providing guarantees of correctness and precise provenance. The limited scope of this call means this part of WP3.1 will be proof-of-principle only.

For WP3.2, we will focus on the public health community (PHC): a key workforce in delivering quality and effective healthcare by providing timely and accurate public health intelligence (PHI). PHI is a varied environment that performs statistical analyses and produces figures, diagrams and reports to communicate results to the wider health community. However, the PHC operates in small groups with little knowledge networking. The main aim of the k-blog is to improve the availability of health information, data and knowledge, to inform decisions for health protection and care standards as supported by the Quality Improvement Productivity and Prevention initiative. The NWeHealth e-Lab project, hosted at The University of Manchester, provides an environment to bring together research objects into a single location. As elsewhere, textual data forms the key hub that links together all the other forms of knowledge. By linking to e-Lab research objects from a k-blog, this link will be made explicit, available, interpretable and directly valuable to the PHC; as a result WP3.2 is synergistic with the rest of the proposal. This community also brings a set of access control requirements. To support these we will use existing WoP facilities, providing a simple, easy-to-use three-level access model.

 

For WP3.3, we will generate k-blog content about Taverna workflows and methods for building them. Workflows have become a popular way of realizing computational analyses and have become an important form of data. The JISC funded myExperiment project is widely used to disseminate the workflows themselves. Knowledge about issues surrounding workflows is, however, more difficult to produce and disseminate. A k-blog, with its ability to produce short, targeted articles as the need arises and the resources become available for writing, suits the need for Taverna workflow documentation. We will seek k-blogs on Taverna issues such as: the basics of workflow design; how to choose among a set of similar services in producing a workflow; and the testing of workflows. We will implement a light-weight mechanism, using trackbacks, to link between the k-blog and myExperiment.

 

As part of WP3, we will also hold four workshops, at 3-month intervals, each focusing on one particular k-blog and community. These workshops will be of the form previously trialled as part of the Ontogenesis network, and will serve several purposes: requirements gathering and feedback for us, education for the community, and development of content that demonstrates the process to the general readership.

 

3.4 WP4: Integration with Existing Working Practices

For the k-blog process to be acceptable to communities such as those described in WP3, it must fit with existing working practices. Researchers mostly write documents using a word-processor. Fortunately, as the k-blog platform is based on the widely-used WoP, which in turn offers a widely-supported API, this style of working can be readily integrated. It is already possible to author using Word (2007 onward), OpenOffice, Google Docs and LaTeX using integrated or existing technologies, as demonstrated by our previous work at http://ontogenesis.knowledgeblog.org. For Deliverable WP4.1, user-oriented documentation describing these tools will be developed. This documentation will also describe clearly how to present and organise papers in a way which is optimized for the k-blog process. While we expect this documentation to take a significant time-span to produce, refining it as a result of user feedback, it is important to note that a k-blog is already useful and possible.

To take maximal advantage of linking technologies developed in WP2, we will need to integrate with existing technologies for referencing. As Deliverable WP4.2, we will add tooling to enable the use of bibliographic tools such as Endnote, Mendeley, Zotero or BibTeX to insert references that k-blog can directly translate. Largely, this should consist of “styles”, modifying the in-text citation, as the reference plugin of WP2.1 will generate reference lists. As with other deliverables, this tooling will include substantial documentation, developed using the k-blog process.

4. Project Timeline

 

| Name | Start | End | Staff | Notes |
|---|---|---|---|---|
| WP 1 | 02/08/2010 | 30/10/2010 | | |
| WP 1.1 | 02/08/2010 | 31/08/2010 | All | A documented k-blog process |
| WP 1.2 | 01/09/2010 | 30/10/2010 | DS, SC | Implementation with off-the-shelf software |
| WP 2 | 01/11/2010 | 30/04/2011 | | |
| WP 2.1 | 01/11/2010 | 26/02/2011 | SC | COinS metadata on posts |
| WP 2.2 | 01/11/2010 | 29/01/2011 | SC | Client-side, permanently linked versions |
| WP 2.3 | 03/01/2011 | 26/02/2011 | DS | Bi-directional links to other datasets |
| WP 2.4 | 01/03/2011 | 30/04/2011 | PL | Semantic linking with CiTO |
| WP 3 | 01/11/2010 | 30/07/2011 | | |
| WP 3.1 | 01/11/2010 | 30/07/2011 | GM | Specialist environment: Healthcare |
| WP 3.2 | 01/11/2010 | 30/07/2011 | DS | Specialist environment: Microarrays |
| WP 3.3 | 01/11/2010 | 30/07/2011 | RS | Specialist environment: Workflows |
| WP 4 | 02/08/2010 | 30/06/2011 | | |
| WP 4.1 | 02/08/2010 | 30/04/2011 | GM, DS | Authoring documentation and tools |
| WP 4.2 | 02/05/2011 | 30/06/2011 | GM, SC | Referencing documentation and tools |

 

5. Project Management Arrangements

The project will be managed from Newcastle University; the primary management will be from Dr Lord, who will be responsible for:

  • Developing Project Management Plans;
  • Ensuring that the Project technical objectives are met;
  • Prioritising and reconciling conflicting opportunities;
  • Reporting to and collaborating with the JISC Programme Manager;
  • Dissemination of the k-blog platform.

Project progress will be evaluated through scheduled, short, “stand-up” meetings on a weekly basis, conducted face-to-face, via Skype or phone as appropriate. Although most project staff are co-located, primary unscheduled communication will be via public mailing list, ensuring maximum visibility and openness. User consultation will be via public mailing list, as well as through a “dogfooding” k-blog. All project staff have been handpicked; they are highly experienced and self-directed, as outlined elsewhere. All are associated with several other projects and duties (research, research support, teaching and training), and are responsible for managing these independent workloads.

 

5.1 Risks

Staff Risk – as with all projects, loss of staff could negatively impact on this project; however, all staff are on permanent contracts and have long histories in research, so this is less likely. Additionally, by dividing the work between five individuals, we limit the risk should a single person leave.

WoP3 and other dependencies – the project depends on other software, most notably WoP for which a new version (3.0) is now in beta; however the software is widely supported. Other software is replaceable.

Standards Shifting – the project depends on a number of standards and these may change. In this project, we will NOT support standards, but rather use those that support us. Where standards change rapidly, their implementation will be delayed (until they stabilize) or dropped. None of the standards described here is critical to the success of the project.

 

5.2 IPR Position

All code will be developed under open source licences. WoP and RT are licensed under GPL, so code linking to these will be likewise licensed. Code that is separable will be released under LGPL. Code will remain copyright of respective institutions or authors. Any documentation produced by project staff relating to the project will be licensed under a Creative Commons Attribution licence. Licensing of individual k-blogs will be delegated, but permissive licences will be encouraged.

 

5.3 Sustainability

This project is largely based around innovative, novel and leading use of existing software. As such, the sustainability of the majority of the technology base is not dependent on project members but on large companies with established and proven business models. The k-blog process will be cleanly separated from its implementation, ensuring only weak dependencies on underlying software. Where we produce software “glue”, public and widely supported APIs will be used where possible. This will ensure that components are replaceable. All code, including historical versions, will be publicly available. Documents produced by project staff will be publicly available and clearly licensed, so they will be archived through internet “cloud” resources; we are also seeking explicit support for archiving from the British Library.

 

5.4 Staff Recruitment

All staff are already in post.

 

5.5 Key Beneficiaries

Our key beneficiaries are the public health, microarray and workflow communities; as the k-blog process is based around commodity software, these groups can use the basic environment from the first day of the project to generate and share content. As the project progresses, so will the process, the software to support it and the documentation to explain it; at all stages, the k-blog process fulfils a clear and immediate need. While we are specifically targeting these communities, the k-blog process and platform are sufficiently generic that they can support a wide range of research activities.

Although presented here as a single platform, the process and components are separable and can benefit communities independently. In particular, the tools and documentation from WP2 and WP4 will find use within the research blogging community, who find, in particular, the lack of tooling for referencing difficult. Finally, the statement of a peer-review process, and its implementation within RT will be applicable to any peer-review environment regardless of the form of publication. This includes publications published using wiki or other Content Management Systems.

 

5.6 Engagement with Community

We consider the mechanisms for engagement with four kinds of community. Engagement with our core content-generating community is an intrinsic part of this proposal, as described in WP3. Further interaction with more disparate groups will be maintained through personal contacts; each of the five individuals named in this proposal is experienced and embedded in different communities (health care, microarray, ontology, proteomics). Engagement with our core content-consuming community is, again, an intrinsic part of the proposal; all project communications will be via open mailing list or k-blog. Project members are active users of Web 2.0 social technologies; our initial trials as part of Ontogenesis showed this approach to be a highly effective form of dissemination, with minimal effort. Engagement with software users will be via website and direct interaction. All software will be released or advertised via normal channels (website, versioning, and mailing list), including a (Debian) package repository for those wishing to set up their own server. Finally, developer communities will not be specifically targeted, but our open source, continually integrated development plan will be attractive, and we will accept suitably licensed contributions.

All communities will benefit from the open and agile development methodology we will adopt; changes to the environment will be integrated and released rapidly, ensuring continual improvement and facilitating rapid feedback cycles.

 

6. Previous Experience and Project Team

 

Dr. Phillip Lord is a Lecturer in Computing Science at Newcastle University. He has a PhD in yeast genetics from University of Edinburgh, after which he moved into bioinformatics. He is well known for his work on ontologies in biology, as well as his contributions to eScience beginning with his role as an RA on the myGrid project. Since his move to Newcastle, he has been an investigator on three more eScience projects (CARMEN, ONDEX and InstantSOAP), as well as maintaining an active engagement in standards development (OBI, MIGS, MIBBI), and publishing on the fundamentals of ontology design. He was an active participant in the Ontogenesis network, and developed the initial idea for knowledge blogs as part of this. He is an active blogger and developer.

 

Dr. Georgina Moulton is an Education and Development Fellow at The University of Manchester. Since 2005 her main roles have been to co-ordinate the development and delivery of multi-disciplinary bio/health informatics education programmes, and to facilitate the engagement of biological and health communities in a variety of bio and health informatics research projects (e.g., ONDEX, Obesity e-Lab). For 3 years, Georgina was the EPSRC funded Ontogenesis Network Manager, in which role she co-ordinated the activities of the network, expanded it by facilitating the development of new activities, and was involved in the trial k-blog process. More recently her work has included the development and delivery, in conjunction with NHS partners, of an education and development programme tailored to match the needs of North West public health analysts and the wider healthcare workforce.

 

Dr. Daniel Swan has a PhD in developmental biology and continued to work in developmental biology as a post-doctoral researcher before moving into bioinformatics in 2001. Subsequent positions included working for Bart’s and the London Genome Centre and the Centre for Hydrology and Ecology in informatics-driven roles dealing with large, distributed biological datasets generated by large user communities. Currently the manager of the Newcastle University Bioinformatics Support Unit, he leads a small team helping biological researchers generate, capture, store and analyse their digital data. His interdisciplinary background means he has grounding in both computer and biological sciences and is comfortable working on CS-focused projects (CARMEN, InstantSOAP, Bio-Linux) as well as acting in a research capacity analysing high-throughput data.

 

Dr. Simon Cockell has a PhD in Genetics from Leicester University, and refocussed into Bioinformatics with a Masters degree from Leeds in 2005. From there he moved to Newcastle, and the Bioinformatics Support Unit. Since coming to Newcastle, Simon has worked on a range of projects involving large scale analyses (AptaMEMS-ID), data integration (Ondex) and health informatics (MRC Mitochondrial Disease Cohort).

 

Dr. Robert Stevens is a senior lecturer in Bioinformatics in the Bio and Health Informatics group at the University of Manchester. His main areas of research are in the development and use of semantics within the life sciences. This is blended with the use of e-Science platforms to gather and manage the data and knowledge of the life sciences. He was PI on the Ontogenesis network that ran the meetings for the first k-blog. He is or has been a co-investigator on the myGrid and myExperiment grants that will provide both content and technical input to this project. As well as the JISC funded myExperiment project, Stevens was an investigator on the JISC funded CO-ODE project that developed Protégé 4. On the back of this, Stevens has led the OWL training activities at Manchester that have directly fed into the Ontogenesis k-blog. This range of experience makes Stevens an ideal partner to lead the development of content within this project.

 

