This is the latest grant that we have submitted to JISC, in this case for a new application of the knowledgeblog platform. As usual, it is a direct post from Word, so there may be a few presentational issues in it.

The grant is currently under review; I will post the outcome and any feedback (if possible) once I have a result.

Outline Project Description

In this project, we will generate a large body of web content, demonstrating the applicability of commodity blogging technology as a supplement to the University’s existing eprints archive. Using technology pioneered by the JISC-funded Knowledgeblog project, we will publish 100+ scientific articles, from a variety of different word-processing environments, in a structured, web-capable form rather than as PDF. This content will then be augmented to demonstrate the advantages of building on a commodity platform, enabling novel mechanisms of publication.

1. Introduction

The modern publishing industry has been massively affected by the development of the web. However, the impact has been highly varied across different domains. Publications that address news events or encyclopedic knowledge have been very heavily affected; other areas have changed little. The web initially developed from the desires of scientists to share knowledge; in some areas, such as biology, the uptake of web technologies has been little short of extraordinary. It is ironic, therefore, that the publishing of formal academic papers has been affected relatively little by the web. Although contents page listings may have been largely replaced by RSS or email, and papers may be available as HTML, they are still largely constrained by print requirements: packaged as PDFs, poorly linked, with static figures.

An alternative publication mechanism has already been funded by JISC as part of the “Managing Research Data” programme. As part of the Knowledgeblog project, we have investigated a publication tool that integrates well with scientists’ existing work-practices and is based around a commodity blogging engine, namely WordPress. There are a number of tools, such as Open Journal Systems, and organizations, such as Scielo, which allow the web publication of academic articles. While these have large user bases (OJS: 6000 journals; Scielo: 600), WordPress is currently used to drive around 10% of the world’s websites, a user base orders of magnitude larger. WordPress, therefore, performs the basic tasks of publishing articles extremely well: it scales to millions of page hits, enjoys tool support from many word-processing environments and benefits from many augmentations for specialist audiences. We have extended this tool with a few specialised extensions of our own and, as a result, made it more suitable for academic publishing. We have then used this tool as the basis for two journals, aimed at producing educational resources describing ontology technology (http://ontogenesis.knowledgeblog.org) and the JISC-funded Taverna workflow system (http://taverna.knowledgeblog.org).

These two resources are, in effect, “gold open-access”, although they do not require author payment. They present content which has not been presented elsewhere, but was written for the purpose; articles have been through (or are progressing through) a formal review process. While this has provided a useful resource, generating over 15k page views, these resources are designed to be coherent in scope; although this is generally a positive virtue, by definition it allows us to investigate the suitability of the tooling for only a small number of articles and a limited domain.

Newcastle University has a strong history of supporting gold open access publication: it was the site of the first open access law journal in the UK (http://webjcli.ncl.ac.uk/). It also has a large and successful eprints archive (http://eprints.ncl.ac.uk), currently hosting 50k articles or bibliographic records. In this project, we will exploit the eprints archive to provide content, building a substantial knowledge resource. This will demonstrate the suitability of the Knowledgeblog tool-chain as the basis for green open access publication, demonstrate the value of this novel form of publication, and provide vital testing against content “from the wild”, allowing us to extend the suitability of this tool-chain to as many areas of academic discourse as possible.

2. Fit to call

The project call notes that JISC is funding, or has funded, many projects relating to scholarly communication. These include: infrastructural support in the form of institutional repositories; support for open-access; and support for novel mechanisms of publication such as overlay journals. Specifically, theme D (campus-based publishing) is aimed at increasing the capacity of the sector to publish and disseminate research outputs directly. The call also highlights attempts such as the “Beyond the PDF” workshop to move toward more structured forms of knowledge; while, in theory, PDF is capable of supporting relatively rich structuring, in practice, most of the tools which generate files in this format produce a relatively opaque, binary artefact from which it is difficult to extract information, or to repurpose or recast it in any way.

While open-access publishing has made significant strides in the last 10 years, becoming an accepted part of the academic landscape, gold open-access (the publication of original content) still accounts for a minority of academic publications. Green open-access (author publication of content often published elsewhere) now accounts for up to 25% of the literature in some fields.

Institutional repositories such as that run by Newcastle (http://eprints.ncl.ac.uk), or author archiving on a personal website (e.g. http://homepages.cs.ncl.ac.uk/phillip.lord/publications.html), are the most common routes for green open-access publication. While increasing access to academic materials is a very positive step, this form of publication is largely limited to providing access to a PDF. From neither the authors’ nor the readers’ point of view is there significant added value to the publication. For example, our experience is that authors are often equivocal about or uninterested in publication in institutional repositories, as it is “just-one-more-thing” to do, while maintaining a website requires significant technical expertise.

For this grant, academics at Newcastle, supported by the infrastructure provided by the local librarians, will provide an alternative; we will identify authors within Newcastle, take their open-access publications and recast them into a form suitable for WordPress. We will do this with their active permission and engagement, using the tooling we have developed or documented as part of the previously funded JISC “knowledgeblog” project. Where authors wish to, we will support them in performing this work for themselves; where they do not want “just-one-more-thing”, we will build on the existing eprints process and perform this work for them. In general, this can be performed directly using MS Word, LaTeX or other word-processing software, whichever is the authors’ preferred editing environment. In addition, we will use this process to increase the usability of the tooling, increasing both the ability of authors to publish their work directly in this fashion and the likelihood that they will do so. As this proposal is built on existing work from the University eprints archive, library support is implicit within FEC and not specifically or additionally costed.

Once publications are available in this framework, authors and readers will be able to take advantage of the additional features which come either from WordPress directly, or from augmentations provided or assessed by the WebPrints team. For example, authors will be able to see rich content-access statistics, including page views, referrer and incoming-link information. Published articles will be bi-directionally linkable using trackbacks. Authors will be able to add tags, zoomable equations or automatically generated reference lists, depending on their level of technical competence. For viewers, category- and tag-based RSS feeds will be available, and searching and bi-directional linking (again!) will be possible. As a result of the work from the previous knowledgeblog grant, all posts will be tagged with metadata, in various forms, and will be available for formal archiving outside of the University.
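As a rough illustration of the category and tag feeds mentioned above: WordPress sites that use pretty permalinks conventionally expose a feed by appending /feed/ to a category or tag archive address. The Python sketch below builds such addresses. The site address and slugs are hypothetical, and it assumes the default "category" and "tag" permalink bases have not been changed.

```python
from urllib.parse import quote, urljoin


def archive_feed_url(base_url: str, kind: str, slug: str) -> str:
    """Return the conventional WordPress feed URL for a category or tag archive.

    Assumes pretty permalinks with the default "category" and "tag" bases.
    """
    if kind not in ("category", "tag"):
        raise ValueError("kind must be 'category' or 'tag'")
    # Make sure the base ends with a slash so urljoin appends to it
    # instead of replacing its last path segment.
    base = base_url if base_url.endswith("/") else base_url + "/"
    return urljoin(base, f"{kind}/{quote(slug)}/feed/")


if __name__ == "__main__":
    # Hypothetical site address and slugs, for illustration only.
    site = "http://webprints.example.org"
    print(archive_feed_url(site, "category", "bioinformatics"))
    print(archive_feed_url(site, "tag", "ontology"))
```

A reader interested in one discipline could subscribe to a single category feed rather than to the whole archive.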

The publication framework is based around WordPress, which is freely available, scalable, stable and hardened by its very large user base. The system is continually updated, but has a good reputation for maintaining backward compatibility. The authoring framework is based around commodity tools such as Word or LaTeX. Most of the workflow process within Newcastle is pre-existing as part of the eprints service. This project therefore provides a sustainable and novel enhancement to the existing process.

3. Workplan

3.1 WP1 Management, Systems Administration and Set up.

This work package will fulfil the basic management and administrative tasks required for the project. This will include setup of the repository, styling and theming appropriately for the project; definition of a basic workflow for management of documents and metadata; fulfilment of standard JISC reporting requirements.

We request additional funding of 1k as part of this work package for virtual server upgrades (additional disk space), Dropbox space to enable document management, and WordPress anti-comment-spam support.

3.2 WP2 User documentation.

Most of the operational, “how-to” documentation is already available: either at http://process.knowledgeblog.org (developed by the JISC-funded knowledgeblog project); or, as the repository is based on commodity technology, from many publicly available websites.

However, there will be information specific to the WebPrints archive: about copyright, about document management, and about the relationship to the University. For this, we will need to generate some specific documentation.

As the project progresses, we will improve and enhance this documentation based on our experiences, including, for example, statistics on how long author self-deposition takes.

3.3 WP3 Author advertising and Material identification

We will seek active engagement with our user community by linking into the current eprints system. Combined with the Newcastle-specific, internal “myimpact” database (which was designed to capture research outputs for the next REF), this will enable us to identify new publications as they come out. In the first instance, we will select material that has been published in open access journals (or where embargo periods or other conditions allow). We will contact authors individually, inform them of our project, and advise them about the methods for recasting their paper (see WP4).

We will not preselect on the basis of academic quality, only on technical and legal (copyright) grounds. Although the eprints service displays full text as PDF only, the myimpact database in many cases also stores MS Word (or equivalent) formatted data. We will, therefore, prefer papers where this data is available. We will prefer papers which are recent over those which are older. Finally, we will prefer papers which give us a wide spread of authorship and discipline.

Although the focus of this proposal is on the provision of a service for publication of green open access material in a fully web-capable format, we will be happy to receive grey literature, on an author-publication basis.

3.4 WP4 Paper recasting

This work package will take papers selected as part of WP3 and publish them to the WebPrints archive. In most cases, this work will be performed using tooling developed or documented by the previously funded JISC knowledgeblog project.

We will publish articles in three ways:

  • WebPrints team published. All work will be performed by members of the WebPrints team. For each paper, we will write a short report, describing any issues with the publication process and any errors seen (which we will hand-correct). We will gather statistics on the time taken to publish. Papers will be published on an “as-is” basis; that is, we will not seek to enhance the content at this point. We will add metadata in a structured way, which will be accessible from the web-presented version.
  • Author published, WebPrints supported. We will work directly with authors to help them publish their papers. Where possible, we will augment papers and add new features (LaTeX maths support, citation). These papers will be marked as featured and augmented. Again, we will gather statistics on the time taken to publish, broken down by additional functionality.
  • Author published. Authors will publish directly into WebPrints, using either their pre-existing experience or our own user documentation. We will request, but not require, statistical feedback. Publication will be as the author wishes: as-is, or augmented with additional functionality.

All papers will be annotated with standard metadata in a structured form; our previous work means that this metadata will be available from the web presentation of the paper.

3.5 WP5 Repository and process enhancement

For this package, we will focus on two key aspects: tooling for publishing papers, and their presentation once published.

For the presentational issues, in the first instance we will focus on enhancements which do not require support from the article material. For example, we will add metadata to articles, which will allow us to generate metadata headers (COinS, standard meta tags etc.) without further analysis of the article material itself. Likewise, our experience with the knowledgeblog project means that we can support “out-of-the-box”: multiple export formats (including HTML, PDF and ePUB); site-wide indexes (by year, author, subject etc.); comments; trackbacks; and page feeds (including from subsections). Through use of third-party software, we will also be able to add: related papers through textual analysis; tag clouds; twitter backs; automated multi-lingual presentation; and social networking support.
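To make the idea of generating metadata headers from structured article metadata concrete, here is a minimal Python sketch of building a COinS element. COinS embeds an OpenURL ContextObject, in key/encoded-value form, in the title attribute of an empty span with class "Z3988". The function name and the sample metadata are hypothetical, and this is not the implementation used by the knowledgeblog plugins (which run inside WordPress). It only illustrates that nothing beyond the stored metadata is required.

```python
from html import escape
from typing import Sequence
from urllib.parse import urlencode


def coins_span(title: str, authors: Sequence[str], date: str) -> str:
    """Build a COinS span for a journal-style article from basic metadata."""
    pairs = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.atitle", title),
        ("rft.date", date),
    ]
    # One rft.au entry per author.
    pairs += [("rft.au", author) for author in authors]
    # URL-encode the key/value pairs, then escape the result for use
    # inside an HTML attribute (so "&" becomes "&amp;").
    context_object = urlencode(pairs)
    return f'<span class="Z3988" title="{escape(context_object)}"></span>'


if __name__ == "__main__":
    # Hypothetical metadata, for illustration only.
    print(coins_span("An Example Article", ["A. Author", "B. Author"], "2011"))
```

Reference managers that understand COinS can then pick up the citation directly from the web page, without parsing the article text.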

We will also investigate enhancements which require modification of the original content (and therefore increased interaction with authors). From the knowledgeblog project, these will include scalable equation presentation and client-side generated bibliographies. We will also add “custom posts” for supplementary material (spreadsheets, for instance). Finally, through the use of third-party material, we will add enhancements such as syntax highlighting, zoomable maps, slideshows and so forth. This part of the proposal is designed to be open-ended and exploratory; which forms of enhancement we pursue will depend on the types of papers selected and on interactions with the authors. There are currently over 13,000 plugins available for WordPress, which provides us with a considerable resource to build from.

3.6. Timetable

Name                            Begin date  End date  Resources
WP1.1 – Setup Repository        02/05/11    14/05/11  SC, AL, DS
WP1.2 – Document Workflow       02/05/11    14/05/11  PL
WP2.1 – User Documentation      09/05/11    24/05/11  DS, PL
WP2.2 – User Statistics         16/05/11    31/08/11  SC, AL
WP3.1 – Author Engagement       16/05/11    31/08/11  SC, AL, DS, PL
WP4.1 – Paper Recasting         01/06/11    30/09/11  SC, AL, DS, PL
WP5.1 – Repository Enhancement  01/07/11    30/09/11  SC, AL, DS, PL

4. Deliverables

  • A repository of open-access articles in a fully web-capable format. This will act as a supplement to the existing eprints archive at Newcastle. We expect to generate around 100 articles in this form, although this is likely to be an underestimate. We are currently estimating throughput from our experiences with Knowledgeblog, which involved relatively few articles. The process should benefit from high-throughput experience.
  • Further documentation, published on http://process.knowledgeblog.org, describing the process that we have used to set up this repository.
  • Enhancements to tooling, enabling others to publish more easily in this manner.
  • Additional experience and software enhancing the presentation of data held in this form.

5. Project management arrangements

The project will be managed by Dr Lord, who will be responsible for:

  • Developing Project Management Plans;
  • Ensuring that the Project work package objectives are met;
  • Prioritising and reconciling conflicting opportunities;
  • Reporting to and collaborating with the JISC programme manager;
  • Dissemination of research results.

Project progress will be evaluated through scheduled, short, “stand-up” meetings on a weekly basis, conducted face-to-face, via Skype or by phone as appropriate. Primary unscheduled communication will be via a public mailing list, ensuring maximum visibility and openness. We will use other readily available tooling to manage the document process pipeline (Google spreadsheets and Dropbox), and likewise for software development (Google Code). All staff are associated with other projects or service provision (research, teaching, training); they will be individually responsible for managing these workloads, and are highly experienced at doing so.

5.1 Risk Management

Staff risks: the basic organisation of the project has been designed to mitigate staffing issues. All staff are in post and are highly experienced, with long track records at Newcastle. Costs have been split three ways; therefore, even in the unlikely event that one member of the team leaves during the project, it will not cause significant disruption.

Software risks: we are using commodity technology, which is very well proven and supported. None of the software is critical (even our basic blogging engine, WordPress, is replaceable). Therefore, while changes in third-party software might degrade or slow progress, they will not halt it.

Engagement risks: the project requires a level of engagement from Newcastle researchers, which may not materialize. We have minimized this risk by minimizing the effort that engagement requires of the researchers. The project members are well known to many in the University (DS and SC comprise the “Bioinformatics Support Unit” and have worked for many PIs personally). We have active engagement from the library, in particular from Moira Bent (Science Faculty Liaison Librarian) and Paula Fitzpatrick (Digital Libraries).

5.2 IPR position

The bulk of the content handled by this work will come from authors within the University. The current restrictive copyright requirements of many publishers place uncertain limits on what can or cannot be done with this content. For this reason, we will use articles that have been published under, or have become available under, a Creative Commons or other open-access licence.

Project members will release written work (documentation etc.) under a Creative Commons Attribution ShareAlike 3.0 Unported License (CC BY-SA), which allows re-use and modification with attribution, provided that derivative works are shared under the same licence. This is in line with the JISC Model Licence. Software linked to WordPress will be released under the GPL, as required by the WordPress license. Software which is separable will be released under the LGPL. Software linked to other third-party libraries may use other licences if required; these will be limited to Free/Open Source licences.

5.3 Sustainability

This project is largely based around innovative use of existing software. As such, the sustainability of the majority of the technology base is not dependent on project members but on large companies with established and proven business models.

The WebPrints archive will be run from the same server as knowledgeblog.org; this server is being developed and maintained, and will be for the foreseeable future, so the addition of the WebPrints archive will not be a substantial additional cost. However, should this cease to happen, the content of the WebPrints archive will be under a Creative Commons or equivalent permissive licence. This will make it possible for the JISC-funded UK Web Archive to store the website for the future.

Although we will not be able to sustain publication by the WebPrints team past the lifetime of this proposal without further funding, author publication will still be possible; our experience with existing tooling is that this is feasible for many authors, although it requires some level of technical skill, depending on the word-processor package and the complexity of the paper.

5.4 Staff Recruitment

All staff are already in post. Recruitment during the project will therefore be unnecessary.

5.5 Key Beneficiaries

Our immediate beneficiaries are Newcastle University staff, who will have their work published using a novel publication technique. Critically, we will demonstrate the value of this technique to both researchers and librarians within the University, who will be better placed in future to use or support this technology to publish their own or others’ work.

Although presented here as a discrete project, the work fits within the background of the wider blogging community. Our own knowledgeblog project and website will be able to take advantage of software improvements that happen as a result of this work. Additionally, the general academic blogging community will gain a new resource. Increasingly, this community is a critical path for public engagement in the academic process.

5.6 Community Engagement

Community engagement will take place initially by direct contact; we will email authors to ask for their engagement in the publishing process. This should have the secondary effect of advertising the presence of our project. We have active engagement from the library staff, who are well known within the University. In terms of engagement with the resource outside of Newcastle, we will make active use of various web and social networking facilities. Our experience has shown that this can generate significant amounts of engagement in a relatively short period of time. Finally, we will advertise the work through the standard academic channels of conference and journal publication; although effective, this tends to be slow. This is problematic for a short project, hence we consider it to be a secondary means of communication.

6. Budget

Removed for privacy reasons.

7. Project Team

Dr. Phillip Lord is a Lecturer in Computing Science at Newcastle University. He has a PhD in yeast genetics from the University of Edinburgh, after which he moved into bioinformatics. He is well known for his work on ontologies in biology, as well as his contributions to eScience, beginning with his role as an RA on the myGrid project. Since his move to Newcastle, he has been an investigator on three more eScience projects (CARMEN, ONDEX and InstantSOAP), as well as maintaining an active engagement in standards development (OBI, MIGS, MIBBI) and publishing on the fundamentals of ontology design. He is an active participant in the scientific blogging community and developed the initial idea for knowledgeblogs. As well as managing the knowledgeblog project, he is the developer of tools such as “Latextowordpress”, and of WordPress plugins such as “Mathjax-latex” and “Kcite”, all of which improve the usefulness of WordPress for academic communication.

Dr. Daniel Swan has a PhD in developmental biology and continued to work in developmental biology as a post-doctoral researcher before moving into bioinformatics in 2001. Subsequent positions included working for Bart’s and the London Genome Centre and the Centre for Hydrology and Ecology in informatics-driven roles, dealing with large, distributed biological datasets generated by large user communities. Currently the manager of the Newcastle University Bioinformatics Support Unit, he leads a small team helping biological researchers to generate, capture, store and analyse their digital data. His interdisciplinary background means he has a grounding in both computer and biological sciences and is comfortable working on CS-focused projects (CARMEN, InstantSOAP, Bio-Linux) as well as acting in a research capacity analysing high-throughput data. He is currently active within the knowledgeblog project, having been responsible for adding software support for a review process, gravatars, syntax highlighting, and PDF and ePUB exports.

Dr. Simon Cockell has a PhD in Genetics from Leicester University, and refocussed into Bioinformatics with a Masters degree from Leeds in 2005. From there he moved to Newcastle and the Bioinformatics Support Unit. Since coming to Newcastle, Simon has worked on a range of projects involving large-scale analyses (AptaMEMS-ID), data integration (Ondex) and health informatics (MRC Mitochondrial Disease Cohort). He is currently active within the knowledgeblog project, having been responsible for metadata support (including COinS) and navigational support (for both humans and robots), and is a co-author of kcite and mathjax-latex.

Allyson Lister worked for 6 years at the EBI in Cambridge, developing and producing the UniProt/TrEMBL protein database. She is currently focusing on the use of ontologies for the semantic integration of systems biology data in her current job at CISBAN in Newcastle University. Both at the EBI and at Newcastle University, she has developed structured data formats, including UniProt/TrEMBL and SBML. She has also been an early adopter of blog technology as a mechanism for communication of both her own and others’ primary research. Since 2006, she has co-authored a number of posts with other bloggers in the community and has been invited to be a guest author at both the ISCB news and the BioSharing blog. She has published papers highlighting the importance of social networking and live blogging to bioinformatics.

Paola Marchionni of JISC has given her permission to reproduce the feedback from the peer-review of my last JISC grant, which sadly failed. I want to publish it here as part of my desire for open science, rather than as an opportunity to reply which, perhaps unfortunately, the JISC process does not otherwise allow.

I am a little surprised by some of the comments, to be honest. The main criticism was more expected though, which essentially says “it’s not crowd-sourcing if you pay people to develop content”. You have to try these things, but I did think that actually paying for content might be considered to be a little revolutionary. Ah, well, better luck next time.

Markers felt the form of this proposal was “robust”, however there wasn’t enough clarity on the deliverables and especially on how the value of what was being produced would be assessed down stream. They felt there was also some lack of information on how the currently JISC funded K-Blog project, due for completion in July 2011, related to this project and what the impact on its team would be, which seems to be the same team as the one proposed for this project.

The main concerns, however, were around whether this could really qualify as a crowdsourcing or community project – it was felt it was more about disclosing data than community engagement – also considering that the authors of the articles would be paid. There were some doubts about the sustainability of the project beyond the 7 months duration of the funding, as lack of funding would prevent more articles being created and metadata added by the team. One marker also felt that a risk analysis should have taken into account the risk of disparate communities not being aware of the content and using and engaging with it. A more clear identification of the various communities the project aimed to reach and a more targeted strategy for engaging with such communities would have been useful.

Finally, another issue that was raised was that there wasn’t sufficient information on how the partnership with Manchester University would work, either formally or informally, and the dissemination plans could have been stronger, as they relied mainly on the role of K-Blog.

— Paola Marchionni

About

This is the full text of a grant called “Knowledge in Biology” that we submitted to JISC, as a follow-up to our knowledgeblog grant. Unfortunately, this grant was not accepted. This blog post is the direct output of Word; apologies if the conversion is imperfect.

Outline Project Description:

Many disciplines within the sciences are knowledge-rich; of these, biology is an extreme example. In order to make advances, biologists need to be able to access knowledge from both their own and related communities in an easily digestible form. However, the publishing of this knowledge does not fit well with existing scientific communities, as it is often not regarded as “research based”; rather, it is a stored body of grey literature, often not publicly available. In the Knowledge in Biology project, we will engage with disparate communities in disciplines that engage with biologists, as well as the community of biologists themselves. We will generate substantial content describing how “Knowledge in Biology” is both produced and consumed in the pursuit of new discoveries, by commissioning the authorship of this content directly from the funding for this project.

We will leverage the output of the JISC-funded Knowledge Blog platform as a tool for coordination, publication and dissemination of this content. The result will be a publicly accessible, high-impact resource of short, readable and accessible articles describing how to gather, manipulate and synthesise knowledge in biology. This will be of significant value in supporting the multidisciplinary research that is necessary for advances in modern biomedicine.

1. Introduction

This document describes a proposal for a project within the JISC “e-Content” programme call.

Modern biology is a rich, complex, multi-disciplinary field. In particular, practitioners need knowledge about how to access, organise and structure knowledge itself. As a result, members of the community often need to cross the boundaries of traditional societal structures within research. By definition, this is not well supported by the more formal structures that scientists use for the publication and dissemination of knowledge. So while the information exists, it is not accessible; it is hidden from the community on the desks and hard drives of individuals.

One of the difficulties with migrating this community-based knowledge away from grey literature to a more openly accessible, archived and referenceable form is the lack of a formal reward structure. Although scientists may engage in this form of activity from a sense of public duty, this form of documentation is not critical for their career advancement, or for gaining academic credibility, and so it is rarely made a priority. While technological advances have made publication of this material straightforward, the social structure of science has not supported it. As a result, there is a large body of knowledge about how biologists conduct their work that is simply lost to the community, meaning that considerable time and effort is lost recreating this knowledge, only for it to be lost again.

We plan to circumvent this societal barrier using a novel approach: we will directly commission the authoring and reviewing of articles embodying this content. As the knowledge will often be readily available to individual members of the community, and we are aiming for articles which have neither the size nor the complexity of formal research publications, it will be possible to generate a substantial body of content at relatively low cost.

An ideal mechanism for publication of this knowledge has already been funded by JISC as a part of the “Managing Research Data” programme. This is the Knowledge Blog project: a light-weight publication tool that integrates well with scientists’ existing work-practices, based around a commodity blogging engine. This “Knowledge in Biology” project (KiB) will utilize the work from Knowledge Blog, to the benefit of both. This project will gain a technological underpinning at little cost, since Knowledge Blog already exists and will require only a small increase in resources to manage the additional content and traffic; Knowledge Blog will gain substantial content and enormously increased visibility.

6The KiB project will provide a small amount of funding for the management and commissioning of articles, but the majority of the funds will be spent by using individually small amounts of money, crowd-sourcing the development of a novel digital content resource, engaging the community of biomedical researchers, both as authors and reviewers. The content will address key issues relating to knowledge in biology such as, data standards, linked data, knowledge in synthetic biology and statistical approaches to knowledge, as well as “softer” issues such as the use of Web 2.0, the social web, and the blogosphere as tools for the biomedical researcher.

2.1 WP1 – Knowledge Blog (k-blog) maintenance and support

 

The primary purpose of this proposal is to generate significant quantities of digital, community-developed content. The k-blog platform already exists, supported by a previous JISC call. We are not, therefore, proposing to make significant enhancements to either the process or the software in the course of this project. However, the additional load placed on the platform will require a small amount of administrative work in terms of maintenance.

In addition, we will need to provide support to the users of the platform; while k-blog is relatively easy to use, issues do arise with authoring, with formatting or with exceptional requests (for example, multi-media documents).

For articles to be properly citable and maintainable, manual intervention is required to supplement the text with computationally accessible metadata, including DOI assignment. This enables improved archiving and discovery, which increases the value of the resource. As part of WP1, we will annotate documents with this metadata to ensure consistency and to avoid placing the burden on the main authors.

We will install and refine a licensing plugin for the k-blog platform, which clearly displays licence information for each article, based on the author’s selection.

 

2.2 WP2 – Management of publication process

 

Articles in KiB will be produced by crowd-sourcing and by the in-house team (WP4). Our aim is to bootstrap the KiB k-blog so that it reaches a critical mass of articles that will attract both readers and more authors. We will commission articles from specified, expert authors with the incentive of a small payment. To receive the payment, a contributor must both submit an article and review another article.

In preparation for this work, we have compiled a list of topics for KiB and put names against these topics. We have clustered the topics around themes in KiB: the role of semantics in biology; the representation of knowledge in ontologies, terminologies and vocabularies; data integration to create knowledge resources; data and knowledge standards; knowledge technologies such as RDF, linked data, OWL, etc.; text mining; case studies and applications of knowledge in biology. These clusters, and more, will become the categories in the KiB k-blog. The letters of support indicate the significant number of authors who have promised to author an article on one of these topics. We will seek as wide a selection of authors as possible, guided by our advisory committee (see Section 2.8), to help give the KiB k-blog a balanced view on knowledge in biology. A significant part of this WP will be the commissioning of these articles and discussions with authors on this new digital content sourced from the community.

This process will need managing: requests for particular articles (WP2.1); negotiation on topic and scope (WP2.2); managing of the author-guided review process (WP2.3); and enabling payments to be made. This activity will help ensure that the core of the KiB k-blog will be of sufficient quality to attract readers to comment and contribute articles, as well as to simply read and learn.

2.3 WP3 – Outreach and Community Engagement

 

Outreach and community engagement are intrinsic to this project. The presence of a high-quality, organised resource, freely available on the web, will attract readers; likewise, a widely-read resource will be attractive as a publication centre for authors, particularly when supported by funding as part of WP2. The use of a rapid publication framework, available on the web, archived by the British Library and indexed for searching by Google, is therefore our main form of outreach.

However, this process can be augmented. All content will be available under a Creative Commons licence, making it reusable with citation outside of the KiB environment. We will maintain active “Social Web” streams through Twitter. We will solicit articles relating to the use of Twitter and the blogosphere from members of the scientific blogging community; as well as generating content, this will leverage their existing readership, raising awareness of KiB as a resource for both readers and authors. We will maintain a well-advertised mailing list allowing requests for, or offers of, new articles, either commissioned or otherwise.

Finally, we will advertise the resource through normal academic channels of paper and poster presentation. Where possible, we will also propose micro-workshops (aka Birds of a Feather meetings) at suitable meetings/unconferences.

2.4 WP4 – ‘In house’ article authoring

 

The staff on the project will contribute a significant number of articles to the KiB k-blog. Stevens will produce 20 articles; Lord 10 articles and Swan 10 articles (WP4.1). Both Lord and Stevens have already contributed articles to the Ontogenesis k-blog and will extend that work into the wider KiB topics. These topics will include tips for modeling in OWL; using ontologies with linked data; converting data to RDF and linked data; online knowledge resources; using ontologies in over-representation analysis of microarray data; integration strategies; and so on. Some of these in-house articles will act as glue that draws together many of the other articles. For example, an article on the role of knowledge in biology will draw together the need for the k-blog and act as a pathfinder. Where appropriate, we will use tools such as “Anthologize” and “Web Trails” to facilitate these aggregation activities. In-house articles will be reviewed (WP4.2) by an external reviewer, potentially from the pool of contributors sourced in WP2.

2.5 WP5 – Project Management and JISC Requirements

 

Management of the project will use regular weekly teleconferences, to ensure that all aspects are proceeding according to the project plan. In addition, we will fulfil the legal requirements for collaboration agreements and the formal reporting requirements from JISC as part of WP5.

To ensure maximum community and public engagement in this proposal, all appropriate documents will be posted using the k-blog environment in addition to those locations specified by JISC, except where that information is withheld under normal FOI rules.

Finally, we will gather and collate statistics on the use of these articles as measures of impact: directly, in terms of page views from the underlying k-blog platform; indirectly, from incoming links (both those using trackbacks, and those discovered using Web searching tools) and comments; and finally through secondary indicators such as Twitter and email communications. These statistics will also be made publicly available where appropriate.

2.6 Timetable

 

| Name | Start | End | Staff | Notes |
|---|---|---|---|---|
| WP1 | 1/3/2011 | 30/9/2011 | DS, PL | Maintenance of k-blog infrastructure |
| WP2 | 1/3/2011 | 31/7/2011 | | |
| -WP2.1 | 1/3/2011 | 1/4/2011 | ALL | Crowdsourcing of articles |
| -WP2.2 | 1/4/2011 | 31/7/2011 | ALL | Content negotiation and creation |
| -WP2.3 | 1/4/2011 | 30/9/2011 | ALL | Articles reviewed and published |
| WP3 | 1/3/2011 | 30/9/2011 | ALL | Outreach and engagement |
| WP4 | 1/3/2011 | 30/9/2011 | | |
| -WP4.1 | 1/3/2011 | 31/7/2011 | ALL | In-house content generation |
| -WP4.2 | 1/4/2011 | 30/9/2011 | ALL | In-house articles review and publication |
| WP5 | 1/3/2011 | 30/9/2011 | PL | Project management and JISC compliance |

 

2.7 Deliverables

 

A high-quality body of content, consisting of a series of articles from multiple authors, describing different topics fitting within the theme of “Knowledge in Biology”. 40 of these articles will be authored in-house. A further 200 will be sourced with consultancy payment. We anticipate that many others will come from enthusiastic, crowd-sourced authors engaged with the process.

A website, based on the k-blog platform, that delivers this content.

 

2.8 Project management arrangements

 

The project will be managed from Newcastle University; the primary management will be from Dr Lord, who will be responsible for:

 

    - Developing Project Management Plans;

    - Ensuring that the Project work package objectives are met;

    - Prioritising and reconciling conflicting opportunities;

    - Reporting to and collaborating with the JISC Programme Manager;

    - Dissemination of community content.

 

Project progress will be evaluated through scheduled, short, “stand-up” meetings on a weekly basis, conducted face-to-face, via Skype or phone as appropriate. Primary unscheduled communication will be via public mailing list, ensuring maximum visibility and openness. User consultation will be via public mailing list. Close tracking of requests for content and payment of authors is essential, and transparent procedures will be put in place for this. All staff are associated with several other projects and duties (research, research support, teaching and training), and are responsible for managing these independent workloads. All have experience with the k-blog platform and process.

We have formed a small, unpaid advisory committee from recognised experts in the field. They will be invited to give feedback on the topics covered at 2, 4 and 6 months into the project; this will help to ensure an even and representative coverage of the area that is not overly biased by the particular interests of the staff on the project. Mark Musen (Stanford), Chris Rawlings (BBSRC Rothamsted) and David Shotton (Oxford) have all agreed to serve on our advisory committee.

2.9 Risks

 

Staff Risk – as with all projects, loss of staff could negatively impact on this project; however, all staff are on permanent contracts and have long histories in research, so this is less likely. Additionally, the nature of the workload means all staff would be able to cover duties relating to sourcing and generating community content, which limits the risk should a single person leave.

Lack of community engagement – the strength of this proposal depends on contributions from many different authors, generating novel and currently unavailable content. There is, therefore, a risk that the community will not wish to contribute. We have limited this risk by offering to pay people consultancy rates – an unusual reward within academic research; however, we will only need to commit funds following the submission of the content, so should authors not deliver, we will reallocate these funds. Should we still find it hard to solicit contributions, we will increase the rates per article.

Technology dependencies – Content will be disseminated in the form of k-blogs, and thus there is a dependency on the k-blog platform. It is already suitably developed and packaged. The k-blog platform is a publishing framework only; it is not essential for the authoring of articles. This limits the scope of the risk. Content could be published independently of the k-blog platform, with only a small loss in the feature set. Additionally, content could be relocated elsewhere at any time; it would retain its value outside of the k-blog platform. With the archival agreement under the Sustainability section, archives of the original KiB content will always be available.

 

2.10 IPR position

 

It is essential that content is released with as few restrictions as possible on re-use and re-purposing, but authors must be allowed to maintain credit associated with the original work, or they are unlikely to contribute. Project members agree to release their work under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License (CC BY-NC-SA), which allows re-use and modification for non-commercial purposes with attribution. This is in line with the JISC Model Licence. Authors invited to submit articles will be allowed to choose a Creative Commons licence of their own but will be strongly encouraged to use as permissive a licence as possible. Choice is offered to accommodate different institutional policies on published content. Public domain submissions will also be accepted to accommodate US government employees; these submissions will be uncommissioned.

 

2.11 Sustainability

 

To maintain the persistence of the online resources beyond the end of the project, documents produced by project staff and KiB contributors will be publicly available and clearly licensed. The k-blog site and sub-domains are already archived by the UK Web Archive, in which JISC is an active partner. The Digital Curation Centre will be asked to provide strategies for long-term database archiving.

 

2.12 Staff Recruitment

 

All staff are already in post.

 

3 Impact

 

Our key beneficiaries are the community of researchers working to develop knowledge in biology. Specifically, this focuses on groups involved in data standards, linked data, knowledge in synthetic biology and statistical analysis of biological data. The needs of this community are clearly demonstrated by our Ontogenesis experiment, which is currently receiving 1000 page views per month for a small number of articles. Simple question and answer websites such as http://biostar.stackexchange.com/ receive over 2k page views per week; however, there is a gap between this and more formal knowledge.

We will generate statistical information, using the k-blog platform, as a clear metric of impact; for freely available, reusable and web-delivered content, indicators such as page views are well recognised, and are the main form of impact assessment. Both natively, and through tools such as Google Analytics, the k-blog platform can provide comprehensive and detailed feedback on access to individual articles. We will also exploit secondary impact measures, including Twitter, through the appearance of suitable hashtags; comments and trackbacks to articles on KiB; and, finally, links to KiB as provided by web search.

We will seek to increase impact through a number of activities in addition to normal academic channels. First, we will invite contributions from well-known members of the scientific blogging community, which should result in secondary readership. Second, we will invite contributions on relevant topics that have recently become of public interest. Third, we will monitor article popularity; for areas that prove to be of interest or are controversial, we will seek to commission additional content.

 

4 Partnership and dissemination

 

Internal engagement of core project members, and of the wider community of researchers crowd-sourced to supply content, will be via the mailing list, after initial approaches are made. The plans for content generation are further outlined in WP3 and WP4. Content generation will allow further interaction with more disparate groups (content consumers), who will be encouraged to engage through the k-blog process and the project mailing lists. The advisory committee will be able to ensure that our engagement with the content-producing community is representative of the community. The nature of the k-blog process means dissemination is intrinsic to content generation.

Project members are on the existing JISC funded Knowledge Blog grant in the “Managing Research Data” programme. We will approach individuals with funding from this and other programmes, requesting articles describing the value of these projects to biologists. We will, of course, also be pleased if JISC programme managers wish to contribute articles to this knowledge in biology resource.

6 Previous experience of the Project Team

 

Dr Phillip Lord is a Lecturer in Computing Science at Newcastle University. He has a PhD in yeast genetics from the University of Edinburgh, after which he moved into bioinformatics. He is well known for his work on ontologies in biology, as well as his contributions to eScience, beginning with his role as an RA on the myGrid project. Since his move to Newcastle, he has been an investigator on three more eScience projects (CARMEN, ONDEX and InstantSOAP), as well as maintaining an active engagement in standards development (OBI, MIGS, MIBBI), and publishing on the fundamentals of ontology design. He was an active participant in the Ontogenesis network, and is currently leading the JISC funded Knowledge Blog project. He is an active blogger and developer.

 

Dr Robert Stevens is a Reader in Bioinformatics in the Bio and Health Informatics group at the University of Manchester. His main areas of research are in the development and use of semantics within the life sciences. This is blended with the use of e-Science platforms to gather and manage the data and knowledge of the life sciences. He was PI on the Ontogenesis network that ran the meetings for the first Knowledge Blog. He is or has been a co-investigator on the myGrid and myExperiment grants that will provide both content and technical input to this project. As well as the JISC funded myExperiment project, Stevens was an investigator on the JISC funded CO-ODE project that developed Protégé 4. On the back of this, Stevens has led the OWL training activities at Manchester that have directly fed into the Ontogenesis Knowledge Blog. Stevens currently leads content development for the JISC Knowledge Blog grant.

 

Dr Daniel Swan has a PhD in developmental biology and continued to work in developmental biology as a post-doctoral researcher before moving into bioinformatics in 2001. Subsequent positions included working for Bart’s and the London Genome Centre and the Centre for Hydrology and Ecology in informatics-driven roles dealing with large, distributed biological datasets generated by large user communities. Currently the manager of the Newcastle University Bioinformatics Support Unit, he leads a small team helping biological researchers generate, capture, store and analyse their digital data. His interdisciplinary background means he has a grounding in both computer and biological sciences and is comfortable working on CS-focused projects (CARMEN, InstantSOAP, Bio-Linux). He has most recently been involved in the JISC Knowledge Blog grant, providing technical support and engagement with the microarray community.

 

I was delighted recently to discover Greyhole. Essentially, it’s a system that allows you to configure a Samba share at one end, and a bunch of disks at the other. The disks get the data shared between them, with a configurable level of duplication. It’s aimed mainly at the home user, who wants a higher degree of data security than the single drive approach provides, but is not going to go down the expensive and poorly scalable RAID route.

The implementation is fairly straightforward and elegant. The Samba share is provided by a customised Samba virtual file system. This augments the standard process by logging to a spool region (one file per file operation). A daemon consumes these files, stuffing them into a database, then consumes the entries in the database. Essentially, if anything has changed, greyhole rsyncs the change to one or more of the backend disks.
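To make the flow concrete, here is a minimal Python sketch of the same spool, database, replicate pattern. It is not Greyhole's code: the function names, the tab-separated spool format, the SQLite table and the use of `shutil.copy2` in place of rsync are all assumptions made for illustration.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def ingest_spool(spool: Path, db: sqlite3.Connection) -> int:
    """Move every spooled operation (one file per operation) into the task table."""
    count = 0
    for entry in sorted(spool.iterdir()):
        # Hypothetical spool format: "<action>\t<path relative to the share>"
        action, relpath = entry.read_text().strip().split("\t", 1)
        db.execute("INSERT INTO tasks (action, relpath) VALUES (?, ?)",
                   (action, relpath))
        entry.unlink()
        count += 1
    db.commit()
    return count


def process_tasks(db: sqlite3.Connection, share: Path,
                  backends: list[Path], copies: int) -> None:
    """Apply each queued task to the first `copies` backend disks."""
    rows = db.execute("SELECT id, action, relpath FROM tasks ORDER BY id").fetchall()
    for task_id, action, relpath in rows:
        for backend in backends[:copies]:
            target = backend / relpath
            if action == "write":
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(share / relpath, target)  # stand-in for rsync
            elif action == "unlink":
                target.unlink(missing_ok=True)
        db.execute("DELETE FROM tasks WHERE id = ?", (task_id,))
    db.commit()


def demo() -> list[str]:
    """Write one file to the share and report which backends received it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        share, spool = root / "share", root / "spool"
        backends = [root / f"disk{i}" for i in range(3)]
        for directory in (share, spool, *backends):
            directory.mkdir()
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE tasks "
                   "(id INTEGER PRIMARY KEY, action TEXT, relpath TEXT)")
        (share / "results.csv").write_text("a,b\n1,2\n")
        (spool / "0001").write_text("write\tresults.csv")  # what the VFS would log
        ingest_spool(spool, db)
        process_tasks(db, share, backends, copies=2)
        db.close()
        return sorted(b.name for b in backends if (b / "results.csv").exists())


if __name__ == "__main__":
    print(demo())
```

With `copies=2`, the file ends up on two of the three backend directories, which is the "configurable level of duplication" idea in miniature.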

It’s a really nice system. I must admit that PHP wouldn’t have been my first choice, but that is horses for courses. Likewise, the dependency on Samba is unfortunate; I always found it a pig to configure, besides which I’d like to use this internally on a Linux box. I had a discussion with the author, Guillaume Boudreau, who confirmed my initial feeling that the Samba VFS could be easily replaced with another, such as FUSE. I’d like to have a go at doing this work, and it’s very possible; basically, it requires a big merge between Guillaume’s VFS and the FUSE-based loggedfs. If I had written any C, I could probably do it in a day or so, but as it stands, it is likely to take longer.

As well as home usage, though, this could also be good for the researcher. While a small lab could pay for managed storage, this tends to come in at £1000 per TB, per annum. Most labs don’t need 24/7 recovery though, and the data is often write once, read occasionally. Greyhole would work out for 1TB at 200 quid for a low-wattage PC server, 100 quid for two 1TB discs, and, say, 40 quid to power it all for a year (say, 15W for the computer, 10W for the hard drives, and a bit more for networking, adaptors, USB hubs and so on). For lab usage, the drives would probably last 2-3 years at least, while an all solid state computer might last twice this long. More storage space could be added as needed, dropping the cost per TB substantially, although how scalable greyhole is I don’t know.
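The arithmetic can be checked in a few lines of Python. This is only a rough sketch using the figures quoted above (£200 server, £100 for two 1TB discs, £40 a year in power, £1000 per TB per year for managed storage). The function name is hypothetical, and it assumes the two discs hold duplicate copies, giving 1TB of usable space.

```python
MANAGED_GBP_PER_TB_YEAR = 1000.0  # managed storage figure quoted in the post


def greyhole_cost_per_tb_year(server_gbp: float, discs_gbp: float,
                              power_gbp_per_year: float,
                              usable_tb: float, years: float) -> float:
    """Average yearly cost per usable TB over the hardware's lifetime."""
    total = server_gbp + discs_gbp + power_gbp_per_year * years
    return total / (usable_tb * years)


if __name__ == "__main__":
    # Drive lifetime of 2-3 years, as estimated in the post.
    for years in (2, 3):
        cost = greyhole_cost_per_tb_year(200, 100, 40, usable_tb=1, years=years)
        print(f"{years} years: £{cost:.0f} per TB per year "
              f"(managed: £{MANAGED_GBP_PER_TB_YEAR:.0f})")
```

On these figures the home-built option comes to £190 per TB per year over two years and £140 over three, before counting anyone's time to set it up and look after it.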

The general approach could be used more widely, though. As well as JBOD spanning, what about:

Blackhole

The lab runs a local disc for their own data access needs, which is backed up to an institutional data store somewhere off-site. The daemon could be configured to use late night bandwidth, which would only compromise data security slightly.

Whitehole

More in line with my style of science, the local disc would be backed up to a publicly accessible repository. Obviously this would require suitable metadata to describe the status of the data, but everything would be sharable and accessible as it was produced.

Wormhole

Many labs collaborate with one or two others. A wormhole file system would be configured so that data placed on my file share would magically appear, read-only, in one or more places on the internet, using an rsync/ssh pipe. My collaborators’ data would, likewise, appear on my disc.

Plughole

This would replicate the normal scientific “supplementary data” process for releasing data publicly. Essentially, everything on the file system would, after a significant period, be converted into an Excel spreadsheet with no column titles or any additional metadata. This would then be placed in a web accessible location for between 2 and 6 months, before being randomly deleted.

I’m buying a low power consumption PC to try out greyhole in its current form, to see how it goes.

I was entertained by a couple of articles recently, one from PLoS Blogs and one from Ed Yong, both bemoaning the low social status of bloggers, at least in some people’s minds. As the front page of the PLoS blog says:

Blogging is just one of the outlets science journalists use. It’s about time we separate the person from the medium.

Of course, I agree with this. There is some excellent material floating around the blogosphere. But at the same time, there is a subtle irony in all of this. Both of these authors, I think, make a similar confusion about the medium. For instance,

my point is that the world of science blogging is populated with some of the best journalists I know.

PLoS Blogs
— Deborah Blum

At the moment, within science, blogging is still seen as an appropriate place for journalism about science, or in some cases scientists describing their personal experience within science. I don’t denigrate this in any way, but I think to some extent it misses the point. Science blogging should be about scientists. Many of us now use blogging as part of doing science itself; take Allyson Lister’s excellent and extensive meeting or seminar notes. Or Simon Cockell’s experience sharing. Or my own move to just blogging my own papers and grants. And the occasional technological rant.

The blog is not the point here, it is just the tool that we are using to advance our science. This is also the point of my knowledgeblog project; it is not about adding to the blogosphere, it is not about using WordPress for science. It is about better, faster, cheaper scientific communication.

Ironically, it would help to solve Ed Yong’s problem as well. In future, maybe he won’t have to ask for the paper, because it will be on the web, with all the data, for all the world to see.

After this diversion into journalism, this blog will now resume normal service, as a place to describe my science.

I’ve just got around to installing the magnificent kcite plugin that Simon Cockell wrote for knowledgeblog. It’s actually a really simple plugin, but it’s tremendously useful. For instance, I can now cite my own papers on reality (http://dx.doi.org/10.1371/journal.pone.0012258), function (http://dx.doi.org/10.1186/2041-1480-1-S1-S4) or protein classification (http://dx.doi.org/10.1093/bioinformatics/btl208) and all the metadata will be gathered and cited for me in a nice reference list at the end.

Of course, I am used to the good life, and this is still all a bit clunky for me. I wanted support from my text editor. For this blog, I use a tool-chain of Emacs, asciidoc and blogpost. But for references I use reftex mode and bibtex. Now I realise that this is a pretty minority tool-chain, but it seemed to me that it should be possible to get it working. And it is, actually, pretty easy. Very rough and ready, but the lisp is below. Obviously, this will need fiddling with for each user, and I will improve it over time.

But it demonstrates the point, I think. A little bit of glue can produce a pretty good publishing tool chain, relatively quickly.

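;; Enable reftex support whenever an asciidoc buffer is opened.
(add-hook 'adoc-mode-hook
          'phil-asciidoc-reftex-support)

;; When non-nil, citations are formatted by phil-reftex-format-citation
;; instead of reftex's own formatter. Made buffer-local below.
(defvar phil-reftex-citation-override nil)

(defun phil-asciidoc-reftex-support()
  ;; Turn on reftex and switch on the citation override for this buffer.
  (reftex-mode 1)
  (make-local-variable 'phil-reftex-citation-override)
  (setq phil-reftex-citation-override t)
  ;; Bibtex files to search; these paths will need changing for each user.
  (make-local-variable 'reftex-default-bibliography)
  (setq reftex-default-bibliography
        '("~/documents/bibtex/phil_lord_refs.bib"
          "~/documents/bibtex/phil_lord/journal_papers.bib"
          "~/documents/bibtex/phil_lord/conference_papers.bib"
          )))

;; Intercept reftex's citation formatting, but only in buffers where the
;; override is set; everywhere else the original function runs unchanged.
(defadvice reftex-format-citation (around phil-asciidoc-around activate)
  (if phil-reftex-citation-override
      (progn
        (setq ad-return-value (phil-reftex-format-citation entry format)))
    ad-do-it))

;; Build the citation text from the entry's DOI field, wrapped in an
;; asciidoc pass:[] macro so that it reaches the blog untouched.
(defun phil-reftex-format-citation( entry format )
  (let ((doi (reftex-get-bib-field "doi" entry)))
    (format "pass:[(http://dx.doi.org/)%s[/cite\\]]" doi)))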

Bibliography