
This is a paper we wrote for STLR2011; it is also published directly on Knowledgeblog.

Abstract

The web has moved from a minority interest tool to one of the most heavily used platforms for publication. Despite originally being designed by and for academics, it has left academic publishing largely untouched; most papers are available on-line, but in PDF and are most easily read once printed. Here, we describe our experiments with using commodity web technology to replace the existing publishing process; the resource describing ontologies that we have developed with this platform; and, finally, the implications that this may have for publishing in a semantic web framework.

Authors

Phillip Lord Newcastle University Newcastle-upon-Tyne, UK

Simon Cockell Newcastle University Newcastle-upon-Tyne, UK

Daniel C. Swan Newcastle University Newcastle-upon-Tyne, UK

Robert Stevens University of Manchester Manchester, UK

Introduction

The Web was invented around 1990 as a light-weight mechanism for publication of documents, enabling scientists to share their knowledge in the form of hypertext documents. Although scientists, and later most academics, like the rest of society, have made heavy use of the web, it has not had a significant impact on the academic publication process. While most journals now have websites, the publication process is still based around paper documents or electronic representations of paper documents in the form of a PDF. Most conferences still handle submissions in the same way.¹ Books on the web, for example, are often limited to a table of contents.

For the authors (certainly from our personal experience), the process is dissatisfying; book writing is time-consuming, tiring and takes a number of years to come to fruition. If the book has one or a few authors, it tends to reflect only a narrow slice of opinion. Multi-author collected works tend to be even harder work for the editor than writing a book solo. Books do not change frequently; they are therefore out-of-date as soon as they are available. Authors feel a greater pressure for correctness, as they will have to live with the consequences of mistakes for the many years it takes to produce a second edition; most scientists welcome feedback, but being asked to justify something you wish you had not said becomes tiresome, especially if you are waiting to update it.

For the consumer of the material (either a human reader, or a computer), the experience is likewise limited. Books on paper are not searchable, are not easy to carry around, and are rarely cheap and often very expensive to buy. For the computer, the material is hard to understand, or even to parse. Even distinguishing basic structure (where do chapters start, who is the author, where is the legend for a given figure) is challenging.

All of this points to a need to exploit the Web for scientists to publish in a different way than simply replicating the old publishing process. Here, we describe our experiment with a new (to academia!) form of publishing: we have used widely-available and heavily used commodity software (WordPress [7]), running on low-end hardware, to develop a multi-author resource describing the use of ontologies in the life sciences (our main field of expertise). From this experience, we have built on and enhanced the basic platform to improve the author experience of publishing in this manner. We are now extending the platform further to enable the addition of light-weight semantics by authors to their own papers, without requiring authors to directly use semantic web technologies, and within their own tool environment. In short, we believe that this platform provides a ‘cheap and cheerful’ framework for semantic publishing.

The requirements

The initial motivation for this work came from our experience within the bio-ontology community.³ Biomedicine is one of the largest domains for use of ontology technology, producing large and complex ontologies such as the Gene Ontology [28] or SNOMED [27].

As an ontologist, one of the most common questions that one hears is: ‘where is there a book or a tutorial that I can read which describes how to build an ontology?’. Currently, there is some tutorial information on the web and there are some books, but there is no clear answer to the question. Many of the books are collections of research-level papers, or are technologically biased. Many ontologists have learned their craft through years of reading mailing lists, gathering information from the web and by word of mouth. We wished to develop a resource with short and succinct articles, published in a timely manner and freely available.

We also wished, however, to retain the core of academic publishing. This was for pragmatic, principled and political reasons. Consider, for example, Wikipedia, which could otherwise serve as a model. Our own experience suggests that referencing Wikipedia can be dangerous: it can and does change over time, meaning that critical or supportive comments in other articles can be ‘orphaned’. Wikipedia maintains a ‘neutral point-of-view’ which, many are of the opinion, makes it less suitable for areas where knowledge is uncertain and disagreement frequent. Finally, Wikipedia is relatively anonymous in terms of authorship: whether this affects the quality of articles has been a topic of debate [17], but was not our primary concern; pragmatically, the promotion and career structure² for most academics requires a form of professional narcissism; they cannot afford to contribute to a resource for which they cannot claim credit. Of course, our experiences may not be reflective of the body academic overall; there has, for example, been substantial discussion of the issues of expertise on Wikipedia itself [8]. Although the reasons may not be clear, it is clear that academics largely do not contribute to Wikipedia, and that Wikimedia sees this as an issue [16].

We also had an explicit set of non-functional requirements. We needed the resource to be easy to administer and low-cost, as this mirrored our resource availability; authors should be offered an easy-to-use publishing environment with minimal ‘setup’ costs, or they would be unlikely to contribute; readers should see a simple, but reasonably attractive and navigable website, or they would be unlikely to read.

The Ontogenesis experience

Our previous experience with the use of blog software within academia was limited to ‘traditional’ blogging, that is, short pieces about: the process of science (reports about conferences or papers, for example); journalistic articles about other people’s research; or personal blogging, that is, articles by people who just happen to be academics. Although we wished to develop different, more formal content, this experience suggests that many academics find blogging software convenient, straightforward enough and useful.

To test this, we decided to hold a small workshop of 17 domain experts over a two-day period, and task them with generating content, peer-reviewing this content and publishing it as articles on a blog.

Terminology and the Process

Like many communities, the blogosphere has developed its own, sometimes confusing, terminology. To describe the process we adopted, we first describe some of this terminology. A blog is a collection of web pages, usually with a common theme. These web pages can be divided into: posts, which are published (or posted) on an explicit date and then unchanged; and pages, which are not dated and can change. Posts and pages have permalinks: although they may be accessible via several URLs, they have one permalink that is stable and never changes. Posts and pages can be categorised (grouped under a predefined hierarchy) or tagged (grouped using ad hoc words or phrases defined at the point of use). A blog is usually hosted with a blog engine, such as WordPress, which stores content in a database and combines it with style instructions in themes to generate the pages and posts. Most blog engines support extensions to their core functionality with plugins. Most blogs also support comments: short pieces of content added to a post or page by people other than the original authors. Most blog engines also support trackbacks, which are bidirectional links: normally, a snippet from a linking post will appear as a comment in the linked-to post. Trackbacks work both within a single blog and between different distributed blogs. Many blogs support remote posting: as well as using a web form for adding new content, users can also post from third-party applications, through a programmatic interface using a protocol such as XML-RPC, or even by email. Posts and pages are ultimately written in headless HTML (that part of HTML which appears inside the body element), although the different editing environments can hide this fact from the user.
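
To make remote posting and headless HTML concrete, the following is a minimal sketch in Python. It assumes the blog engine accepts the widely implemented metaWeblog.newPost XML-RPC call; the credentials, category name and sample document are placeholders, and nothing is sent over the network: the code only builds the request body that a third-party tool would post.

```python
import re
import xmlrpc.client


def headless_html(document: str) -> str:
    """Return the content inside <body>...</body>, or the input unchanged."""
    match = re.search(r"<body[^>]*>(.*?)</body>", document,
                      flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else document.strip()


def build_new_post_request(username, password, title, html, categories):
    """Serialise a metaWeblog.newPost call without sending it anywhere."""
    post = {
        "title": title,
        "description": headless_html(html),  # the post body, headless HTML
        "categories": categories,
    }
    # Parameters: blog id, username, password, post struct, publish flag.
    params = (1, username, password, post, True)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")


# Placeholder values for illustration only.
request_body = build_new_post_request(
    "alice", "secret",
    "What is an ontology?",
    "<html><head><title>x</title></head>"
    "<body><p>An ontology is ...</p></body></html>",
    ["under review"],
)
```

In practice a tool would hand the same parameters to `xmlrpc.client.ServerProxy` pointed at the blog's XML-RPC endpoint rather than serialising them by hand; the sketch stops short of that so it can be run without a server.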

Our initial process was designed to replicate the normal peer-review process, with a single adjustment, that peer-review was open and not blind: papers would be world-visible once submitted; the identities of reviewers would be known to authors; all reviews would be public. We adopted this approach for pragmatic reasons. WordPress has little support for authenticated viewing and none for anonymisation. The full process was as follows:

  • Authors write their content and publish using whichever tooling they find appropriate.

  • The author posts their content, categorising it as under review.

  • An editor assigns two reviewers.

  • Reviewers publish reviews as posts or comments. Reviews link to articles, resulting in a trackback from article to review.

  • The author modifies the post to address reviews.

  • Once done to the editor’s satisfaction, the post is recategorised as reviewed.

Our expectation was that following this process, articles would not be changed or updated; this is in stark contrast to common usage for wiki-based websites. New articles could, however, be written updating, extending or refuting old ones.
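
The steps above amount to a small state machine, which can be sketched as follows. This is an illustration only, not part of the K-Blog software: the class, its method names and the example URLs used with it are hypothetical, and the two category names are taken from the process description.

```python
from dataclasses import dataclass, field

UNDER_REVIEW = "under review"
REVIEWED = "reviewed"
REQUIRED_REVIEWS = 2  # the initial process assigned two reviewers


@dataclass
class Article:
    title: str
    author: str
    category: str = UNDER_REVIEW  # posts start out categorised as under review
    reviewers: list = field(default_factory=list)
    reviews: dict = field(default_factory=dict)  # reviewer -> review permalink

    def assign_reviewers(self, *names):
        """The editor assigns exactly two reviewers."""
        if len(names) != REQUIRED_REVIEWS:
            raise ValueError(f"expected {REQUIRED_REVIEWS} reviewers")
        self.reviewers = list(names)

    def add_review(self, reviewer, permalink):
        """Record a public review; its link back produces the trackback."""
        if reviewer not in self.reviewers:
            raise ValueError(f"{reviewer} is not an assigned reviewer")
        self.reviews[reviewer] = permalink

    def mark_reviewed(self, editor_satisfied):
        """Recategorise the post once reviews are in and the editor agrees."""
        if len(self.reviews) < REQUIRED_REVIEWS:
            raise RuntimeError("not all reviews have been published")
        if not editor_satisfied:
            raise RuntimeError("editor has not accepted the revisions")
        self.category = REVIEWED
```

Because reviews are ordinary posts or comments, the only state the engine itself needs to hold is the category; everything else in the sketch is bookkeeping that WordPress does not provide out of the box.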

Reflections on the Ontogenesis K-Blog

Our initial meeting functioned to ‘bootstrap’ the Ontogenesis K-Blog. This was useful to acquire a critical mass of content, but also, on this first outing, to explore the K-Blog process and technology. The setup for the day was a vanilla WordPress installation. The day started with a short presentation on the K-Blog manifesto [22] and an overview of the process, including authoring and reviewing. The guidelines to authors were to write short articles on an ontology subject (a list of suggestions was offered and authors also made their own choices) and to produce the article in whatever manner they felt appropriate. There was a certain level of uncertainty among authors as to the K-Blog process (partly because one of the objectives of the meeting was to ‘force out’ the process) and this, naturally, pointed to the need to document the K-Blog process so that authors could have the typical ‘instructions to authors’.

This first meeting produced a set of 20 completed and partially completed articles. Some even had reviews. Even on the day itself there was some external interest seen from Twitter. The first external blog post (outside of those produced by attendees) happened during the meeting [19] with a second shortly after [18].

We also held a second content provision meeting and together these generated a collection of articles that felt like an academic book in terms of content, but generated with considerably less effort. This experience was also sufficient to gather requirements on how to improve the K-Blog idea. A useful K-Blog on the K-Blog process itself was produced by Sean Bechhofer [13]. There is also a K-Blog looking back on the first year of the Ontogenesis K-Blog [23].

Several requirements emerged with respect to authorship. The principle of the short, more or less self-contained article was attractive (though the audience was somewhat self-selecting). Authoring directly in the editor provided by WordPress was felt to be poor by those that tried it. Authoring in a favourite editing tool and then publishing via WordPress worked reasonably well for most authors. There were, however, a variety of issues with the mechanism of this style of publishing, chiefly how to refer to articles that will be, but have not yet been, written. To some extent this was an artefact of the day (many articles being written simultaneously), but authors needed to refer to glossaries and articles in progress.

One stylistic issue was the habit of putting full affiliations at the top of an article. The Ontogenesis theme presents the first few lines of each article when displaying many articles, but in many cases this simply showed the title and author affiliation, where it would be more useful to have the first sentence or so of the article itself.

For the whole K-Blog, a table of contents was felt to be important. This would give an overview of contents and a simple place for navigation about the K-Blog. This raised the issue of attribution; the table of contents needed to expose the authors, including multiple, ordered authors. This is not a surprising need, as the authors’ scientific reputation is involved. In this vein, making K-Blog articles citable by issuing Digital Object Identifiers (DOIs) was requested.

For scientific credibility, the ability to handle citations easily was an obvious requirement. Natively, WordPress has little or no support for styling citations and references. The ability to cite via DOI and, in this field, PubMed identifiers to automatically make links and produce a reference list was felt to be important. Having the Ontogenesis K-Blog articles in PubMed would also be attractive to authors.

The last authorship issue was the mutability of articles. One aim of K-Blog is to enable articles to change in the light of experience and scientific development, as well as a procedural requirement for updates following review. There was felt to be a conflicting need for articles not to change, so that comments and links from other documents work in the longer term.

The last significant issue was the reviewing of articles. The aim was to have this managed by authors choosing reviewers (with editorial oversight). On the Ontogenesis K-Blog day this could work with authors calling across the room for a review. This is not, however, a sustainable approach, and WordPress lacks tracking facilities to manage the reviewing process, whether this is done by an author or an editor. The realisation that such management support is needed is not the greatest insight ever gained, but the requirement is there even in a light-weight publishing mechanism.

Improvements to the technology

Our initial experiment with the Ontogenesis K-Blog suggested a significant number of issues with the use of WordPress for scientific publication. In this section, we describe the extensions that we have made, or adopted, to the publication process, to the documentation and to WordPress itself. Following our initial experience with Ontogenesis, we have started to trial these improvements, including through another workshop which resulted in a new K-Blog [12], describing the scientific workflow engine Taverna [24]; work is also in progress on the use of a K-Blog for bioinformatics [1], and another for public healthcare [3].

Currently, we have 11 plugins extending the basic WordPress environment. For completeness, all of these are shown in Table 1. Our theme is also extended in some places to support the plugins. In general, the plugins are orthogonal and will work independently of each other. One advantage of using WordPress is that many of these plugins are freely available, written and maintained by other authors. Other academic publication environments, such as the Open Journal System [5], exist and are relatively widely used, but WordPress is used to host perhaps 10% of the web, making its plugin ecosystem extremely fertile.

| Plugin | Use | URL |
|---|---|---|
| Co-Authors Plus | Allows K-Blog posts to have more than one author | http://wordpress.org/extend/plugins/co-authors-plus/ |
| COinS Metadata Exposer † | Provides COinS metadata on K-Blog posts (used by Zotero, Mendeley etc) | http://code.google.com/p/knowledgeblog/ |
| Edit Flow | Gives editorial process management infrastructure | http://editflow.org/ |
| ePub Export | Exports K-Blog posts as ePub documents | http://wordpress.org/extend/plugins/epub-export/ |
| KCite * | Automatic processing of DOIs and PMIDs into in-text citations and bibliographies | http://knowledgeblog.org/kcite-plugin |
| Knowledgeblog Post Metadata Plugin * | Exposes generic metadata in post headers | http://code.google.com/p/knowledgeblog/ |
| Knowledgeblog Table of Contents * | Produces a table of contents based on a category of articles. Posts are listed with all authors | http://knowledgeblog.org/knowledgeblog-table-of-contents-plugin |
| Mathjax LaTeX * | Enables use of TeX or MathML in posts, rendered in scalable web fonts | http://knowledgeblog.org/mathjax-latex-wordpress-plugin |
| Post Revision Display | Publicly exposes all revisions of an article after publication | http://wordpress.org/extend/plugins/post-revision-display/ |
| SyntaxHighlighter Evolved | Syntax highlights source code embedded in posts | http://wordpress.org/extend/plugins/syntaxhighlighter/ |
| WP Post to PDF | Allows visitors to download posts in PDF format | http://wordpress.org/extend/plugins/wp-post-to-pdf/ |

Table 1: WordPress plugins employed by K-Blog. Plugins marked with * are written by the authors. Plugins marked with † are modified by the authors.

Reviewing: The initial process was self-managed and required two reviews per article; this was found to be cumbersome. We have addressed this in two ways. First, we have defined a number of different peer-review levels (public review, author review, editorial review [15]), including a light-weight process now being used for Ontogenesis; authors now select their own reviewers, and decide for themselves when articles are complete. Second, we have added software support. Initially, we attempted to use RequestTracker, an open source ticket system, but found the user interface too complex for this purpose. We are now using the Edit Flow plugin for WordPress, which was designed for managing a review process, albeit a hierarchical rather than peer-review process.

Authoring Environment: The standard WordPress editor was found impractical by most authors, even for short articles. WordPress does provide ‘paste from Word’ functionality, but this removes all formatting, which defeats the point. While the lack of a good editing environment could have been a significant problem, our subsequent experimentation has shown that it is possible to post directly from a wide variety of tools, including ‘office’ tools such as Word, Google Docs, LiveWriter and OpenOffice. This is in addition to a variety of blog-specific tools and text formats (such as asciidoc), which are suitable for some users. We have added documentation to a K-Blog (http://process.knowledgeblog.org) to address these. In practice, only LaTeX proved problematic, having no specific support. To address this, we have produced a tool called latextowordpress; this is an adaptation of the plasTeX tool, a Python-based TeX processor, to produce simplified HTML appropriate for WordPress publishing. Our experience with using the tools is that while none are perfect, sometimes requiring ‘tweaking’ of HTML in WordPress, most reduce publishing time to seconds.

Citations: We have addressed the lack of support for citations within WordPress with a plugin called kcite. This allows authors to add citations into documents as shortcodes with either a DOI or PubMed ID (other identifiers can be, and are being, added to kcite). Shortcodes are a commonly used form of markup of the form [tag att="att"]text[/tag]; they are often found where a simplified HTML-like markup is desired. A bibliography is then generated automatically on the web server. Requiring authors to add markup to otherwise WYSIWYG tools is damaging to the user experience. We believe that this is soluble, however, by extending bibliographic tools, by developing a ‘kcite’ style-file or template; we have a prototype of this (using CSL [10]) for Zotero and Mendeley, and another for asciidoc with bibtex. It is also possible to just use native tool support in Word or LaTeX, and convert bibliographies to HTML; the disadvantage with this approach is discussed later.
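
As an illustration of how citation shortcodes can drive an automatically generated bibliography, here is a minimal sketch in Python. It is not the kcite implementation: the `source` attribute, the function names and the identifiers are assumptions made for the example, and where a real plugin would resolve each identifier to full reference metadata, the sketch simply lists the identifiers.

```python
import re

# Hypothetical shortcode shape: [cite source="pubmed"]12345678[/cite]
# A bare [cite]...[/cite] is treated as a DOI.
CITE = re.compile(
    r'\[cite(?:\s+source="(?P<source>\w+)")?\](?P<id>[^\[\]]+)\[/cite\]'
)


def number_citations(text, default_source="doi"):
    """Replace cite shortcodes with [n] markers and collect the identifiers."""
    order = []  # (source, identifier) pairs in order of first appearance

    def replace(match):
        key = (match.group("source") or default_source,
               match.group("id").strip())
        if key not in order:
            order.append(key)
        return f"[{order.index(key) + 1}]"

    return CITE.sub(replace, text), order


def render_bibliography(order):
    """One line per identifier; a real plugin would fetch full metadata."""
    return "\n".join(
        f"[{n}] {source}:{identifier}"
        for n, (source, identifier) in enumerate(order, start=1)
    )


post = ('Ontologies are widely used [cite]10.1000/example.1[/cite]. '
        'See also [cite source="pubmed"]12345678[/cite] '
        'and again [cite]10.1000/example.1[/cite].')
body, cited = number_citations(post)
bibliography = render_bibliography(cited)
```

Repeated citations of the same identifier share one number, so the bibliography stays as short as the set of distinct works cited.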

Archiving and Searching: Archiving is primarily a social, rather than technological, problem. A blog engine is fully capable of storing content in the long term, but authors and readers have to believe that it will do so. As a novel form of academic publishing, K-Blog is not automatically archived as a scientific journal would be. However, we have taken advantage of its web publication; the main K-Blog site is now explicitly archived by the UK Web Archive, as well as implicitly by other web archives. We have enhanced the website with an ‘easy crawl’ plugin, that is, a single web page pointing to all articles classified as reviewed. We now support the (technical) requirements for LOCKSS and PubMed. Simultaneously, this also enhances the searchability of K-Blog, fulfilling the requirements for Google Scholar.
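
The ‘easy crawl’ idea is simple enough to sketch in a few lines of Python. The post records, field names and URLs here are hypothetical stand-ins for whatever the blog engine stores; the point is only that a crawler needs a single page linking to every reviewed article.

```python
from html import escape


def easy_crawl_page(posts, category="reviewed"):
    """Build one HTML list linking to every post in the given category."""
    items = []
    for post in posts:
        if category in post["categories"]:
            link = escape(post["permalink"], quote=True)
            items.append(f'<li><a href="{link}">{escape(post["title"])}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Placeholder data for illustration only.
posts = [
    {"title": "What is an ontology?",
     "permalink": "http://example.org/what-is-an-ontology",
     "categories": ["reviewed"]},
    {"title": "A draft article",
     "permalink": "http://example.org/draft",
     "categories": ["under review"]},
]
page = easy_crawl_page(posts)
```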

Non-repudiability: The K-Blog process does not allow authors to make semantically meaningful changes after an article has been reviewed. Unfortunately, it is hard to define ‘semantically meaningful’ computationally, so we have made no attempt to address this by locking articles; rather, all versions of articles are now accessible to the reader (WordPress provides this facility to the authors by default). This enables community enforcement of a no-change policy.

Multiple Authors: We believe that authoring is best done outside WordPress. This also means that we do not support collaborative authoring; we have made no attempt to add collaborative features to WordPress. However, we did need articles to carry a byline attributing the articles to multiple authors; although not critical to the functioning of a K-Blog, it is socially critical to appease the professional narcissism (see ‘The requirements’) of scientists. Fortunately, this is a common requirement, and a suitable WordPress plugin existed.

Identifiers: WordPress already supports permalinks; although we believe that URLs are entirely fit for purpose technologically while DOIs do little other than introduce complexity [11], K-Blog required DOIs for professional narcissism. We considered becoming a DOI authority, but this proved impractical. Instead, we have used DataCite [2]. This has required a small extension to WordPress to extract appropriate metadata and to store the DOIs once minted.

Metadata: K-Blog now exposes various parts of its metadata in a number of ways; unfortunately, there appear to be a large number of (non-)standards in use, each with its own application. K-Blog currently provides: COinS, enabling integration with Zotero and Mendeley; meta tags for Google Scholar; and Dublin Core tags for no specific reason other than completeness. We are in the process of providing bibtex export (for bibtex!), and a JSON representation to support citeproc-js [14] in the second generation of kcite.
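
To show what exposing the same metadata in several formats looks like, the following sketch in Python emits Google Scholar and Dublin Core meta tags and a COinS span for one post. It is not the K-Blog plugin code: the function names, the choice of fields, the article title, authors, date and URL are all assumptions made for illustration, and the tag names follow common usage of these conventions rather than anything stated in this paper.

```python
from datetime import date
from html import escape
from urllib.parse import urlencode


def meta_tags(title, authors, published, url):
    """Google Scholar and Dublin Core <meta> tags for a post header."""
    pairs = [("citation_title", title), ("DC.title", title)]
    for author in authors:  # one tag per author, in byline order
        pairs += [("citation_author", author), ("DC.creator", author)]
    pairs += [
        ("citation_publication_date", published.strftime("%Y/%m/%d")),
        ("DC.date", published.isoformat()),
        ("DC.identifier", url),
    ]
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in pairs
    )


def coins_span(title, authors, published, url):
    """COinS: an empty span whose title is a URL-encoded ContextObject."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.atitle", title),
        ("rft.date", published.isoformat()),
        ("rft_id", url),
    ] + [("rft.au", author) for author in authors]
    encoded = escape(urlencode(fields), quote=True)
    return f'<span class="Z3988" title="{encoded}"></span>'


# Placeholder values for illustration only.
article = ("What is an ontology?", ["A. Author", "B. Author"],
           date(2011, 1, 1), "http://example.org/what-is-an-ontology")
tags = meta_tags(*article)
coins = coins_span(*article)
```

The duplication between the two functions is the point: each consumer expects its own vocabulary, so the same four facts about a post have to be written out once per convention.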

Mathematics and Presentation: We have also provided several pieces of technology that did not stem from concrete requirements arising from the initial Ontogenesis meeting. We have improved parts of the presentation system by adding, for example, syntax highlighting to code blocks. Additionally, we have created the mathjax-latex plugin, enabling the use of TeX (or MathML) markup in posts that are then rendered in the browser using scalable fonts. WordPress has native math-mode TeX support, but using image fonts which do not scale and have an ugly pixelated display.

Discussion

We have been motivated by a lack of enthusiasm for traditional book publishing to devise another mechanism by which we can achieve the same ends. We wished to avoid the downsides of an ‘all or nothing’ approach to creating a ‘static’ paper document that is read by relatively few people due to price. The K-Blog approach allows authors to publish in a piecemeal fashion; writing only that which they are motivated to write using a mechanism that avoids a third party making arbitrary decisions on formatting with peculiar time-scales.

To avoid all this, the K-Blog is a light-weight publishing process based on commodity blogging software. We have taken an approach of writing short articles around a theme of ‘ontology in biology’; the Ontogenesis K-Blog. At the time of writing we have 26 articles and page viewing numbers that are pleasing (see Figure 1). These statistics are generated by WordPress directly, and represent (an approximation of) ‘real’ page reads, with robot and self-viewing removed. This is confirmed by the ten most read articles (Table 2) that reflect our expectations – ‘What is an ontology’ being first. In this sense, we consider the K-Blog process to be a success, especially when considered against the circulation of an equivalent book.

Figure 1: Monthly page view statistics for the Ontogenesis K-Blog.

| Article | Page views |
|---|---|
| What is an ontology? | 1,737 |
| OWL Syntaxes | 1,246 |
| Ontology Learning | 882 |
| Table of Contents | 740 |
| What is an upper level ontology? | 684 |
| Reference and Application Ontologies | 630 |
| Protege & Protege-OWL | 522 |
| Semantic Integration in the Life Sciences | 517 |
| Automatic maintenance of multiple inheritance ontologies | 469 |
| Ontologies for Sharing, Ontologies for Use | 330 |

Table 2: Most viewed articles for the Ontogenesis K-Blog (totals).

The social processes with K-Blog are largely similar to traditional publishing, with one exception: reviewing is public. While we may have been interested in experimenting with this for principled reasons, in practice we adopted it because we did not know how to support blind, anonymous review with WordPress. Open review is not a new idea: Requests For Comments are common in standards processes; both Nupedia [4] (the fore-runner of Wikipedia) and H2G2 [6] (which predates Nupedia) use public peer-review. It is still, however, unusual in academia. In our experience from Ontogenesis, it raised no worries among our contributors, except that reviewers often wanted to be more involved in the proofing, a role normally played by authors low down the author list; an open review process blurs these lines somewhat.

One open area for discussion is the extent to which authors can, should, and wish to change articles after publication. While the ability to update is inherent in the web, the desire for non-repudiability was considered to be important; the contradiction here appears fundamental, and we do not feel we have reached a good compromise yet. In one sense, our use of the post-revision display plugin solves this problem; even if the article changes, it is still possible to refer to a specific version. However, as with all automated versioning tools, many versions get recorded, often with very fine-grained changes, which makes selecting the ‘right’ version hard or impossible. We could replace this with an explicit versioning tool, similar to a source code versioning system; but these systems are hard to use for those unused to them, as well as being difficult to implement well. An environment like K-Blog, however, does allow rapid publication of and bi-directional linking with articles; combined with typed linking with CiTO, the ability to publish errata, addenda and second editions may be a better solution.

Our experiences with K-Blog, we think, are useful in understanding how semantic web technology can and will impact on the publication and library process. Both from our initial work with Ontogenesis, and subsequent work with http://taverna.knowledgeblog.org, it has become obvious that good tool support is critical. ‘Good’ in this sense can be straightforwardly interpreted as ‘familiar’, which in general can be interpreted as MS Word. Our choice of a blogging engine here was (unexpectedly) well-advised, as this form of publication is already supported by many tools. It is also clear that there are many other tools that could be added; while Ontogenesis has the content, for example, that might be found in an academic book, it does not currently have the presentation of the book. Articles are already available as ePUB, and more recent work has used our Table of Contents plugin to provide a single site-wide ePUB of all articles [25]. Pre-existing tools such as Anthologize [9] may also be useful for adding organised collections of articles gathered from the whole.

This has a direct implication on the addition of further semantics to content. On the positive side, the use of WordPress makes semantic additions plausible in a way that many conventional publishing processes do not. For example, the publication of our (PWL, RS) recent paper [20] required conversion from the LaTeX source to PDF (by latex), to another PDF, to a MS Word file (by hand), to XML before arriving at the final HTML form. This process took many weeks and required multiple interactions between the authors and publisher. It still failed to preserve the semantic use (to humans) of Courier font to highlight in-text ontology terms, requiring post-publication correction. The equivalent blog post [21] gave us nearly instantaneous feedback on the final form, allowing us to check that the semantics was present and correct.

The requirements for semantics have, however, to be light. We have concentrated throughout K-Blog on the ease of delivery of content; even with this focus, it is hard. In most cases, asking for more work, for more semantics than authors are used to giving in papers, is problematic. For example, I (PWL) attempted to add microformat-based markup to Ontogenesis, again identifying ontology terms. So far, all article authors have ignored this markup (including, embarrassingly, myself).

One solution to this issue is to ensure that authors themselves benefit directly from extra semantics. For example, the Mathjax-Latex plugin allows WordPress to present mathematics in TeX or MathML markup in the final document, which is more semantically meaningful than the default WordPress behaviour of rendering an image. From the author’s perspective, it also enables the use of TeX markup in Word, and the end product scales and looks less ugly on the web page.

With Kcite, we allow the user to embed DOIs or PubMed IDs; this can be achieved at no cost to the user, if they already use a bibliography tool, as it can transparently produce citations for them using Kcite shortcodes. Development versions of Kcite already allow easy switching of bibliographic style, which we hope will become an option for the author (rather than the website or publisher as is currently the case), and/or the reader. With this additional information, we can also embed more semantics into the end document at no additional cost to the author, using for example the least specific CiTO term, cites. However, further use of CiTO will require the author to decide which term to use, with relatively little gain to themselves, and may require extension to bibliographic tools if we are to maintain transparency of Kcite shortcodes; even if the tools are present, it is unclear whether authors will use them. We note that semantics useful to domain authors is likely to be domain-specific; mathematicians are more likely to care about maths presentation, but less likely to care about PubMed IDs. We need to be able to extend the publishing model and environment for different journals to cope.

From a technological perspective, we have found the use of shortcodes to be a good mechanism for authors to add semantics. They are simple and relatively easy to understand. In some cases they can be hidden from the user entirely; forcing users to add markup to otherwise WYSIWYG environments such as MS Word is best avoided. Although the direct use of a more standard XML markup would seem more sensible, in practice it requires tool support, as XML markup will be escaped by helpful remote posting tools. Extension of remote posting tools is hard (for tools like MS Word) or impossible (for cloud tools such as Google Docs or LiveWriter). A blogging engine such as WordPress makes it trivial to replace shortcodes with both a presentation format and a machine-interpretable microformat; for example, the development version of Kcite transforms DOI shortcodes ([cite]10.232/43243[/cite]) into in-text citations (Smith et al, (2002)) embedded in a span tag (<span kcite-id="10.232/43243">Smith et al, (2002)</span>) that are subsequently transformed into final presentation form within the browser using JavaScript. The presentation form can also support additional semantic markup such as CiTO [26].
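
The shortcode-to-span step described above can be sketched in a few lines of Python. This is an illustration of the transformation, not the Kcite code: the function name and the `labels` lookup table are hypothetical, standing in for whatever metadata resolution produces the author-year text.

```python
import re
from html import escape

CITE = re.compile(r"\[cite\]\s*([^\[\]]+?)\s*\[/cite\]")


def shortcodes_to_spans(text, labels):
    """Replace [cite]ID[/cite] with <span kcite-id="ID">label</span>."""
    def replace(match):
        identifier = match.group(1)
        # Fall back to the raw identifier when no label is known.
        label = labels.get(identifier, identifier)
        return (f'<span kcite-id="{escape(identifier, quote=True)}">'
                f"{escape(label)}</span>")
    return CITE.sub(replace, text)


html = shortcodes_to_spans(
    "As shown previously [cite]10.232/43243[/cite].",
    {"10.232/43243": "Smith et al, (2002)"},
)
```

Keeping the identifier in an attribute means the human-readable label can be restyled later in the browser without losing the machine-readable link to the cited work.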

Although we believe that additional semantics are a good thing, we will not enforce a requirement for additional semantics on authors. If authors choose not to use Kcite, then this is their choice. We need to show authors that these semantics are useful. Our experience with many (non)standards such as COinS, DOIs, OAI-ORE and LOCKSS is that they are not simple, speaking primarily to publishers or librarians. For a semantic web approach to work, it must focus on authors and readers, as they produce and consume the content. Extracting even light-weight semantics from authors who are ontology experts is hard. For other domains, the situation may be worse.

Current publishing practices make use of semantic web technology impractical; semantics added by authors are unlikely to be represented correctly if the end product is a PDF typeset by hand. Moreover, we can see little point in adding semantics to individual articles if this is done in a bespoke way. With K-Blog, we have focused on providing both content and a full process, with review, using existing tools and workflows, adding semantics secondarily or incidentally where we can. As a result, the level of semantics that we have achieved is light-weight. However, we believe that K-Blog and WordPress, combined with associated tooling, provide all the basic requirements for a publishing process, and that they provide an attractive framework on which to build a semantic web.

Acknowledgements

We would like to acknowledge the contribution of the authors of articles for both the Ontogenesis and Taverna K-Blog, whose feedback was essential for this process. K-Blog is currently funded by JISC.

Bibliography

[1]

Bioinformatics. http://bioinformatics.knowledgeblog.org.

[2]

Datacite. http://datacite.org/.

[3]

Health and Public Health. http://health.knowledgeblog.org.

[4]

Nupedia. http://en.wikipedia.org/wiki/Nupedia.

[5]

Open Journal System. http://pkp.sfu.ca/?q=ojs.

[6]

The Guide to Life, the Universe and Everything. http://www.bbc.co.uk/h2g2/.

[7]

WordPress. http://www.wordpress.org.

[8]

Wikipedia:expert retention, 2008. http://en.wikipedia.org/wiki/Wikipedia:Expert_retention.

[9]

Anthologize, 2010. http://anthologize.org/.

[10]

Citation style language, 2010. http://www.citations-styles.org.

[11]

The problem with DOIs, 2011. http://www.russet.org.uk/blog/2011/02/the-problem-with-dois/.

[12]

The Taverna Knowledgeblog, 2011. http://taverna.knowledgeblog.org.

[13]

Sean Bechhofer. Reflections on blogging a book. Ontogenesis, 2011. http://ontogenesis.knowledgeblog.org/647.

[14]

Frank Bennett. Citeproc-js. https://bitbucket.org/fbennett/citeproc-js/wiki/Home.

[15]

Simon Cockell, Dan Swan, and Phillip Lord. Knowledgeblog types and peer-review levels. Process, 2010. http://process.knowledgeblog.org/archives/19.

[16]

Zoe Corbyn. Wikipedia wants more contributions from academics, 2011. http://www.guardian.co.uk/education/2011/mar/29/wikipedia-survey-academic-contributions.

[17]

Casper Grathwohl. Wikipedia comes of age. The Chronicle of Higher Education, 2011. http://chronicle.com/article/article-content/125899/.

[18]

D. Kell. Metabolomics, food security and blogging a book, 2010. http://blogs.bbsrc.ac.uk/index.php/2010/01/metabolomics-food-security-blogging-book/.

[19]

Jim Logan. What is an ontology? | ontogenesis, 2010. http://ontogoo.blogspot.com/2010/01/what-is-ontology-ontogenesis.html.

[20]

Phillip Lord and Robert Stevens. Adding a little reality to building ontologies for biology. PLoS One, 2010.

[21]

Phillip Lord and Robert Stevens. Adding a little reality to building ontologies for biology, 2010. http://www.russet.org.uk/blog/2010/07/realism-and-science/.

[22]

Phillip Lord and Robert Stevens. The Ontogenesis Manifesto, 2010. http://ontogenesis.knowledgeblog.org/manifesto.

[23]

Phillip Lord and Robert Stevens. Ontogenesis: One year one. Ontogenesis, 2011. http://ontogenesis.knowledgeblog.org/1063.

[24]

Tom Oinn, Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris, Kevin Glover, Carole Goble, Antoon Goderis, Duncan Hull, Darren Marvin, Peter Li, Phillip Lord, Matthew R. Pocock, Martin Senger, Robert Stevens, Anil Wipat, and Chris Wroe. Taverna: lessons in creating a workflow environment for the life sciences: Research articles. Concurr. Comput. : Pract. Exper., 18:1067–1100, August 2006.

[25]

Peter Sefton. Making epub from wordpress (and other) web collections, 2011. http://jiscpub.blogs.edina.ac.uk/2011/05/25/making-epub-from-wordpress-and-other-web-collections/.

[26]

David Shotton. CiTO, the Citation Typing Ontology. Journal of Biomedical Semantics, 1(Suppl 1):S6, 2010.

[27]

M.Q. Stearns, C. Price, K.A. Spackman, and A.Y. Wang. SNOMED clinical terms: overview of the development process and project status. In AMIA Fall Symposium (AMIA-2001), pages 662–666. Henley & Belfus, 2001.

[28]

The Gene Ontology Consortium. Gene Ontology: Tool for the Unification of Biology. Nature Genetics, 25:25–29, 2000.

Josh Brown from JISC has given his permission for me to reproduce the feedback from the peer-review of my last JISC grant, which bounced. A shame, as it would have provided us with an opportunity to test out knowledgeblog on papers from the wild, while also producing a great demonstrator of the advantages of using the web to distribute papers with web technology rather than just dumping a link to a PDF.

With luck, we can rejuvenate this work in another way.

“One bid (Bid no 8: Newcastle University) was flagged by one of the markers as being out of scope, despite receiving good marks and positive comments from the other two markers.

The original terms of the call specifically state that projects must add value to existing peer reviewed journals. Projects seeking solely to create new publications are specifically excluded. (Please review the sections Expected Outputs and Requirements of the call for more detail on these conditions.)

Bid no 8 states:

“we will identify authors within Newcastle, take their open-access publications and recast them into a form suitable for WordPress”

The bid is clearly designed to aggregate content that has been published elsewhere, largely based on content held within Newcastle’s institutional repository. No existing, peer-reviewed scholarly journal is involved in this project.

While the creation of a web-native publishing tool clearly has merit, as identified by the two markers who praised this bid, the funding call is, as stated, intended to add value to existing publications. In the absence of an existing peer-reviewed publication as a partner in this project, the bid is out of scope”

The panel agreed with this analysis, which meant that, despite the fact that the project was viewed unanimously as a very strong proposal on its own merits, we were obliged to decline to fund this project. The requirement for direct partnership with an existing peer-reviewed scholarly journal for all projects in this strand was imposed after lengthy discussion, and for a range of reasons, including sustainability, tight time-frames and so on, and it was felt that this should be upheld.

— Josh Brown

So, to start with a rant.

I have reached a pivotal point in my life. I have decided that I never, ever, ever want to see permalinks with any semantics in them, ever again. And before anyone gets clever, yes, I know that this post has semantics in its permalink.

Recently I was looking through Knowledge Blog and realised that I had made a mistake with the permalink structure. When we created Ontogenesis I used semantic links, that is, permalinks with the title of the article in them, because I thought that they would be more popular with authors and easier to remember. However, I didn’t want name clashes, land grabs or disambiguation of the sort that you get on Wikipedia. So I added in a date as well as a uniquish identifier. I realised quickly that I had managed to combine the worst of both worlds; people wished to change the titles of their articles, and the permalinks no longer fitted. And the links were still hard to remember. So I moved Ontogenesis onto the simple number-based permalink structure that it has today. As a concession to usability, I didn’t use the basic ?p=192 that is the default, but instead the rewritten /192, which is easier. As far as I can tell, WordPress remembers old permalinks; they do not just go away when the overall structure is changed, and links are preserved. They really are as permanent as these things go.
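The difference between the two permalink styles can be shown with a small sketch. This is an illustration only, not WordPress code: the slugify rule, the function names and the example.org base URL are all assumptions made for the example.

```python
import re


def slugify(title):
    """Lower-case a title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def title_permalink(base, title):
    # Derived from the title, so it changes whenever the title is edited.
    return "{}/{}".format(base, slugify(title))


def numeric_permalink(base, post_id):
    # Derived only from the post identifier, so retitling leaves it alone.
    return "{}/{}".format(base, post_id)


BASE = "http://example.org"  # hypothetical site, for illustration only

before = title_permalink(BASE, "What is an Ontology?")
after = title_permalink(BASE, "What is an ontology, really?")
stable = numeric_permalink(BASE, 192)
```

Here before and after differ, which is exactly the mismatch that appears when an author retitles an article, while stable stays the same however many times the title changes.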

But I had not fixed the other knowledgeblog subdomains consistently. Process, which defines and documents the process of knowledgeblog itself, was still set up with the older style identifiers. So I changed it; for example, http://process.knowledgeblog.org/archives/19 became plain http://process.knowledgeblog.org/19. I don’t understand why, as WordPress seemed to maintain the links last time, but apparently this broke an email Dan Swan had sent out advertising our Bioinformatics Write-a-thon.

While I have generally purged semantics from links, WordPress still maintains the “title as link” approach for pages, as opposed to posts. I guess this makes sense, as you generally don’t have that many pages, but in this case it has shot me in the foot. I started to re-create a “Who are we” page for the www main domain of knowledgeblog. This ended up with a URL of http://www.knowledgeblog.org/who-are-we; but then I got distracted and left the job half-done. Moreover, I wanted to use my normal editing environment. So I trashed the page. Today, I created another page with the same name. But this got a URL of http://www.knowledgeblog.org/who-are-we-2. Ugly. WordPress would not let me rename this permalink, so I tried resurrecting the trashed post and changing its content. For reasons that I don’t understand, this didn’t work either and I ended up with http://www.knowledgeblog.org/who-are-we-3. I tried changing this to http://www.knowledgeblog.org/who, which works, but redirects to http://www.knowledgeblog.org/who-are-we-3.
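The -2 and -3 suffixes are what one would expect if a title-based slug must be unique among all pages, including trashed ones. The sketch below models that behaviour under this assumption; it is not WordPress's own code, and unique_slug and the taken set are names invented for the illustration.

```python
def unique_slug(desired, taken):
    """Return desired, or desired-2, desired-3, ... if it is already taken."""
    if desired not in taken:
        return desired
    suffix = 2
    while "{}-{}".format(desired, suffix) in taken:
        suffix += 1
    return "{}-{}".format(desired, suffix)


# Assumption: the trashed page still reserves its slug.
taken = {"who-are-we"}

second = unique_slug("who-are-we", taken)
taken.add(second)
third = unique_slug("who-are-we", taken)
```

Under this model the second attempt becomes who-are-we-2 and the third who-are-we-3, matching the URLs above; a numeric identifier would never need this kind of disambiguation.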

So, WordPress is doing (mostly) the right thing, but it still all worked against me. I don’t understand, however, why WordPress doesn’t allow you to set default permalinks for pages as well as posts. It should do, but as far as I can tell, it does not.

The irony of this is that this is not a new issue. I even wrote a post about Manchester syntax and OBO which largely revolves around this issue. I know about the importance of semantics-free identifiers, and I should have known better than to make a mess of things this way, both on knowledgeblog and indeed on this blog. It just goes to show that handling change is hard and living with a nasty legacy is often the result. I guess that it is a nice example of the advantages and disadvantages of semantics and the compromises that have to be made in any engineering situation.

I haven’t decided yet, but I think I will change the permalink structure of this blog in a few days’ time. I am hopeful that existing links will be maintained, but that all future ones will exist only in numeric form. Fingers crossed, it will all work.

This is just a short introduction to Michael Bell, my PhD student. He’s now in the second year of his PhD, and has been looking at annotation in biological databases. More specifically, we are trying to define quality measures for textual annotation, based around the bulk properties of these databases. It’s related to, but distinct from, my early work on semantic similarity. The question is whether we can judge the quality of sentences, words or records based on how they have been used previously, and how far they have spread.

Michael has now started to blog his work, following on from my own knowledgeblog work, and our general commitment to open science. As part of his work, he is starting to build web delivered tools, as it is a useful way of navigating the complex knowledge space of biological data. So, his website is also part of his work.

A good example of this is a recent blog post, which discusses the creation of word clouds for all historical versions of Swiss-Prot and TrEMBL; because everyone loves a word cloud, it is well worth a look.

This is the latest grant that we have submitted to JISC, in this case for a new application of the knowledgeblog platform. As usual, it is a direct post from Word, so there may be a few presentational issues in it.

The grant is currently under review; I will post the outcome and any feedback (if possible) once I have a result.

Outline Project Description

In this project, we will generate a large body of web content, demonstrating the applicability of commodity blogging technology as a supplement to the University’s existing eprints archive. Through the use of technology pioneered by the JISC-funded Knowledgeblog project, we will publish 100+ scientific articles, from a variety of different word-processing environments, in a structured-web capable form rather than as PDF. This content will then be augmented to demonstrate the advantages of leveraging a commodity platform, enabling novel mechanisms of publication.

1. Introduction

The modern publishing industry has been massively affected by the development of the web. However, the impact has been highly varied across different domains. Publications that address news events or encyclopedic knowledge have been very heavily affected; other areas have changed little. The web initially developed from the desires of scientists to share knowledge; in some areas, such as biology, the uptake of web technologies has been little short of extraordinary. It is ironic, therefore, that the publishing of formal academic papers has been affected relatively little by the web. Although content page listings may have been largely replaced by RSS or email, and papers may be available as HTML, they are still largely constrained by print requirements: packaged as PDFs, poorly linked, with static figures.

An alternative publication mechanism has already been funded by JISC as part of the “Managing Research Data” programme. As part of the Knowledgeblog project, we have investigated using a publication tool, which integrates well with scientists’ existing work-practices, based around a commodity blogging engine, namely WordPress. There are a number of tools such as Open Journal Systems, or organizations like Scielo, which allow the web publication of academic articles. While these have large user bases (OJS: 6000 journals; Scielo: 600), WordPress currently drives around 10% of the world’s websites, a user base orders of magnitude larger. WordPress, therefore, performs the basic tasks of publishing articles extremely well, scales to millions of page hits, enjoys tool support from many word processing environments and benefits from many augmentations for specialist audiences. We have extended this tool with a few specialised extensions of our own and, as a result, made it more suitable for academic publishing. We have then used this tool as the basis for two journals, in this case aimed at producing educational resources describing ontology technology (http://ontogenesis.knowledgeblog.org) and the JISC-funded Taverna workflow system (http://taverna.knowledgeblog.org).

These two resources are, in effect, “gold open-access”, although not requiring author payment. They present content which has not been presented elsewhere, but was written for the purpose; articles have been through (or are progressing through) a formal review process. While this has provided a useful resource, generating over 15k page views, these resources are designed to be coherent in scope; although this is generally a positive virtue, by definition it allows us to investigate the suitability of the tooling for only a small number of articles and a limited domain.

Newcastle University has a strong history in supporting gold open access publication: it was the site for the first open access law journal in the UK (http://webjcli.ncl.ac.uk/). In addition, it has a large and successful eprints repository (http://eprints.ncl.ac.uk), currently hosting 50k articles or bibliographic records. In this project, we will exploit the eprints archive to provide content, building a substantial knowledge resource; this will demonstrate the suitability of the Knowledgeblog tool-chain as the basis for green open access publication, demonstrate the value of this novel form of publication, and provide the vital testing against content “from the wild”, allowing us to extend the suitability of this tool-chain to as many areas of academic discourse as possible.

2. Fit to call

The project call notes that JISC is funding or has funded many projects relating to scholarly communication. These include: infrastructural support in the form of institutional repositories; support for open-access; and support for novel mechanisms of publication such as overlay journals. Specifically, theme D – campus-based publishing – is aimed at increasing the capacity of the sector to publish and disseminate research outputs directly. The call also highlights attempts such as the “Beyond the PDF” workshop to move toward more structured forms of knowledge; while, in theory, PDF is capable of supporting relatively rich structuring, in practice, most of the tools which generate files in this format produce a relatively opaque, binary artefact from which it is difficult to extract information, or to repurpose or recast it in any way.

While open-access publishing has made significant strides in the last 10 years, becoming an accepted part of the academic landscape, Gold open-access – the publication of original content – still accounts for the minority of academic publications. Green open-access – author publication of content often published elsewhere – now accounts for up to 25% of the literature in some fields.

Institutional repositories such as that run by Newcastle (http://eprints.ncl.ac.uk) or author archiving on their website (e.g. http://homepages.cs.ncl.ac.uk/phillip.lord/publications.html) are the most common route for green open-access publication. While increasing access to academic materials is a very positive step, this form of publication is largely limited to providing access to a PDF. From neither the authors’ nor the readers’ point of view is there significant added value to the publication. For example, our experience is that authors are often equivocal about or uninterested in publication in institutional repositories, as it is “just-one-more-thing” to do, while maintaining a website requires significant technical expertise.

For this grant, academics at Newcastle, supported by the infrastructure provided by the local librarians, will provide an alternative; we will identify authors within Newcastle, take their open-access publications and recast them into a form suitable for WordPress. We will do this with their active permission and engagement, using the tooling we have developed or documented as part of the previously funded JISC “knowledgeblog” project. Where authors wish to, we will support them in performing this work for themselves; where they do not want “just-one-more-thing”, we will leverage off the existing eprints process, and perform this work for them. In general, this can be performed directly using MS Word, LaTeX or other word-processing software, whichever is the authors’ preferred editing environment. In addition, we will use this process to increase the usability of the tooling, increasing both the ability of authors to publish their work directly in this fashion and the likelihood that they will do so. As this proposal is built on existing work from the University eprints archive, library support is implicit within FEC and not specifically or additionally costed.

Once publications are available in this framework, authors and readers will be able to take advantage of the additional features which come either from WordPress directly, or from augmentations provided or assessed by the WebPrints team. For example, authors will be able to see rich content-access statistics, including page-views, referrer and incoming link information. Published articles will be bi-directionally linkable using trackbacks. Authors will be able to add tags, zoomable equations or automatically generated reference lists depending on their level of technical competence. For viewers, category and tag based RSS feeds, searching and bi-directional linking (again!) will be available. As a result of the work from the previous knowledgeblog grant, all posts will be tagged with metadata, in various forms, and will be available for formal archiving outside of the University.

The publication framework is based around WordPress, which is freely available, scalable, stable and hardened by its large user base. The system is continually updated, but has a good reputation for maintaining backward compatibility. The authoring framework is based around commodity tools such as Word or LaTeX. Most of the workflow process within Newcastle is pre-existing as part of the eprints service. This project therefore provides a sustainable and novel enhancement to the existing process.

3. Workplan

3.1 WP1 Management, Systems Administration and Set up.

This work package will fulfil the basic management and administrative tasks required for the project. This will include setup of the repository, styling and theming appropriately for the project; definition of a basic workflow for management of documents and metadata; fulfilment of standard JISC reporting requirements.

We request additional funding of 1k as part of this work package for virtual server upgrades (additional disk space), Dropbox space to enable document management, and WordPress anti-comment-spam support.

3.2 WP2 User documentation.

Most of the operational, “how-to” documentation is already available: either at http://process.knowledgeblog.org (developed by the JISC funded knowledgeblog project); or, as the repository is based on commodity technology, from many publicly available websites.

However, there will be information specific to the WebPrints archive: about copyright, about document management, and about the relationship to the university. For this, we will need to generate some specific documentation.

As the project progresses, we will improve and enhance this documentation, based on our experiences, including for example, statistics on how long author self-deposition takes.

3.3 WP3 Author advertising and Material identification

We will seek active engagement with our user community, by linking into the current eprints system. Combined with the Newcastle-specific, internal “myimpact” database (which was designed to capture research outputs for the next REF), this will enable us to identify new publications as they come out. In the first instance, we will select material that has been published in open access journals (or where embargo periods, or other conditions allow). We will contact authors individually, inform them of our project, and advise them about the methods for recasting their paper (see WP4).

We will not preselect on the basis of academic quality, only technical and legal (copyright) grounds. Although the eprints service displays full text as PDF only, the myimpact database in many cases also stores MS Word (or equivalent) formatted data. We will, therefore, prefer papers where this data is available. We will prefer papers which are recent over those which are older. Finally, we will prefer papers which give us a wide spread of authorship and discipline.

Although the focus of this proposal is on the provision of a service for publication of green open access material in a fully web-capable format, we will be happy to receive grey literature, on an author-publication basis.

3.4 WP4 Paper recasting

This work package will take papers selected as part of WP3 and publish them to the WebPrints archive. In most cases, this work will be performed using tooling developed or documented by the previously funded JISC knowledgeblog project.

We will publish articles in three ways:

  • WebPrints team published. All work will be performed by members of the WebPrints team. For each paper, we will write a short report, describing any issues with the publication process, and any errors seen (which we will hand-correct). We will gather statistics on the time taken to publish. Papers will be published on an “as-is” basis; that is, we will not seek to enhance the content at this point. We will add metadata in a structured way, which will be accessible from the web presented version.
  • Author published, WebPrints supported. We will work directly with authors to publish papers and help them. Where possible, we will augment and add new features (LaTeX maths support, citation). These papers will be marked as featured, and augmented. Again, we will gather statistics on the time taken to publish, broken down for additional functionality.
  • Author published. Authors will publish directly into WebPrints, using either their pre-existing experience, or our own user documentation. We will request, but not require, statistical feedback. Publication will be as the author wishes: as-is, or augmented with additional functionality.

All papers will be annotated with standard metadata in a structured form; our previous work means that this metadata will be available from the web presentation of the paper.

3.5 WP5 Repository and process enhancement

For this package, we will focus on two key aspects: tooling for publishing papers and their presentation once there.

For the presentational issues, in the first instance we will focus on enhancements which do not require support from the article material. For example, we will add metadata to articles, which will allow us to generate metadata headers (COinS, standard meta tags etc) without further analysis of the article material itself. Likewise, our experience with the knowledgeblog project means that we can support “out-of-the-box”: multiple export formats (including HTML, PDF and ePUB); site wide indexes (by year, author, subject etc); comments; trackbacks and page feeds (including from subsections). Through use of third-party software, we will also be able to add: related papers through textual analysis; tag clouds; twitter backs; automated multi-lingual presentation and social networking support.
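Generating metadata headers from article metadata alone, as described above, might look something like the following minimal sketch. It is not part of the proposal's tooling: the meta_tags function, the example record and the choice of Dublin Core style names (DC.title, DC.creator, DC.date) are assumptions made for illustration.

```python
import html


def meta_tags(metadata):
    """Render (name, value) pairs as HTML meta elements, one per line."""
    lines = []
    for name, value in metadata:
        lines.append(
            '<meta name="{}" content="{}">'.format(
                html.escape(name, quote=True), html.escape(value, quote=True)
            )
        )
    return "\n".join(lines)


# Hypothetical article record; repeated names give one element per author.
example = [
    ("DC.title", "An example article"),
    ("DC.creator", "A. Author"),
    ("DC.creator", "B. Author"),
    ("DC.date", "2011-05-02"),
]

header = meta_tags(example)
```

Because the function only needs the structured record, the same metadata could in principle feed several header formats without touching the article text itself.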

 

We will also investigate enhancements which require modification of the original content (and therefore increased interaction with authors). From the knowledgeblog project these will include: scalable equation presentation; and client-side generated bibliographies. We will also add “custom posts” for supplementary material (spreadsheets for instance). And, finally, through the use of third-party material, enhancements such as syntax highlighting, zoomable maps, slideshows and so forth. This part of the proposal is designed to be open-ended and exploratory; which forms of enhancement we pursue will depend on the types of papers selected and interactions with the authors. There are currently over 13,000 plugins available for WordPress, which provides us with a considerable resource to build from.

3.6 Timetable

Name | Begin date | End date | Resources
WP1.1 – Setup Repository | 02/05/11 | 14/05/11 | SC, AL, DS
WP1.2 – Document Workflow | 02/05/11 | 14/05/11 | PL
WP2.1 – User Documentation | 09/05/11 | 24/05/11 | DS, PL
WP2.2 – User Statistics | 16/05/11 | 31/08/11 | SC, AL
WP3.1 – Author Engagement | 16/05/11 | 31/08/11 | SC, AL, DS, PL
WP4.1 – Paper Recasting | 01/06/11 | 30/09/11 | SC, AL, DS, PL
WP5.1 – Repository Enhancement | 01/07/11 | 30/09/11 | SC, AL, DS, PL

4. Deliverables

  • A repository of open-access articles in a fully web-capable format. This will act as a supplement to the existing eprints archive at Newcastle. We expect to generate around 100 articles in this form, although this is likely to be an underestimate. We are currently estimating throughput from our experiences with Knowledgeblog, which involved relatively few articles. The process should benefit from high-throughput experience.
  • Further documentation, published on http://process.knowledgeblog.org, describing the process that we have used to set up this repository.
  • Enhancements to tooling, enabling others to publish more easily in this manner.
  • Additional experience and software enhancing the presentation of data held in this form.

5. Project management arrangements

The project will be managed by Dr Lord, who will be responsible for:

  • Developing Project Management Plans;
  • Ensuring that the Project work package objectives are met;
  • Prioritising and reconciling conflicting opportunities;
  • Reporting to and collaborating with the JISC programme manager;
  • Dissemination of research results.

Project progress will be evaluated through scheduled, short, “stand-up” meetings on a weekly basis, conducted face-to-face, via Skype or phone as appropriate. Primary unscheduled communication will be via public mailing list, ensuring maximum visibility and openness. We will use other readily available tooling to manage the document process pipeline (Google spreadsheets, Dropbox), and likewise for software development (Google Code). All staff are associated with other projects or service provision (research, teaching, training); they will be individually responsible for managing these workloads, and are highly experienced at doing so.

5.1 Risk Management

Staff risks – the basic organisation of the project has been designed to mitigate staffing issues. All staff are in post and are highly experienced, with long track records at Newcastle. Costs have been split three ways; therefore, even in the unlikely event that one member of the team leaves during the project, it will not cause significant disruption.

Software risks – we are using commodity technology, which is very well proven and supported. None of the software is critical (even our basic blogging engine, WordPress, is replaceable). Therefore, while changes in third-party software might degrade or slow progress, they will not halt it.

Engagement risks – the project requires a level of engagement from Newcastle researchers, which may not materialize. We have minimized this risk by minimizing the effort the engagement takes on behalf of the researchers. The project members are well known to many in the university (DS and SC comprise the “Bioinformatics Support Unit” and have worked for many PIs personally). We have active engagement from the library, in particular from Moira Bent (Science Faculty Liaison Librarian), and Paula Fitzpatrick (Digital Libraries).

5.2 IPR position

The bulk of the content handled by this work will come from authors within the University. The current restrictive copyright requirements of many publishers place uncertain limits on what can or cannot be done with this content. For this reason, we will use articles that have been published under, or have since become available under, a Creative Commons or other open access licence.

Project members will release written work (documentation etc) under a Creative Commons Attribution ShareAlike 3.0 Unported License (CC BY-SA), which allows re-use and modification with attribution, provided that derivative works are shared under the same licence. This is in line with the JISC Model Licence. Software linked to WordPress will be released under GPL, as required by the WordPress license. Software which is separable will be released under LGPL. Software linked to other third-party libraries may use other licences if required; this will be limited to Free/Open source licences.

5.3 Sustainability

This project is largely based around innovative, novel and leading use of existing software. As such the sustainability of the majority of the technology base is not dependent on project members but large companies with established and proven business models.

34The WebPrints archive will be run from the same server as knowledgeblog.org; this server is being developed and maintained, and will be for the foreseeable future, so the addition of the WebPrints archive will not be a substantial additional cost. However, should this cease to happen, the content of the WebPrints archive will be under a Creative Commons or an equivalent permissive license. This will make it possible for the JISC funded UK Web Archive to store the website for the future.

 

35Although we will not be able to sustain publication by the WebPrints team past the lifetime of this proposal without further funding, author publication will still be possible; our experience with existing tooling is that this is feasible for many authors, although it requires some technical skill, depending on the word-processor package and the complexity of the paper.

5.4 Staff Recruitment

36All staff are already in post. Recruitment during the project will therefore be unnecessary.

5.5 Key Beneficiaries

37Our immediate beneficiaries are Newcastle University staff, who will have their work published using a novel publication technique. Critically, we will demonstrate the value of this form of publication to both researchers and librarians within the University, who will in future be better placed to use or support this technology to publish their own or others' work.

 

38Although presented here as a discrete project, the work fits within the background of the wider blogging community. So, our own knowledgeblog project and website will be able to take advantage of software improvements that will happen as a result of this work. Additionally, the general academic blogging community will gain a new resource. Increasingly, this community is a critical path for public engagement in the academic process.

5.6 Community Engagement

39Community engagement will take place initially by direct contact; we will email authors to ask for their engagement in the publishing process. This should have the secondary effect of advertising the presence of our project. We have active engagement from the library staff, who are well known within the University. In terms of engagement with the resource outside of Newcastle, we will make active use of various web and social networking facilities. Our experience has shown that this can generate significant amounts of engagement in a relatively short period of time. Finally, we will advertise the work through standard academic channels of conference and journal publication; although effective, this tends to be slow. This is problematic for a short project, hence we consider this to be a secondary means of communication.

 

6. Budget

 

Removed for privacy reasons.

7. Project Team

 

40Dr. Phillip Lord is a Lecturer of Computing Science at Newcastle University. He has a PhD in yeast genetics from University of Edinburgh, after which he moved into bioinformatics. He is well known for his work on ontologies in biology, as well as his contributions to eScience beginning with his role as a RA on the myGrid project. Since his move to Newcastle, he has been an investigator on three more eScience projects (CARMEN, ONDEX and InstantSOAP), as well as maintaining an active engagement in standards development (OBI, MIGS, MIBBI), and publishing on the fundamentals of ontology design. He is an active participant in the Scientific Blogging community, and developed the initial idea for knowledgeblogs. As well as managing the knowledgeblog project, he is the developer of tools such as “Latextowordpress”, as well as WordPress plugins such as “Mathjax-latex” and “Kcite”, all of which improve the usefulness of WordPress for academic communication.

 

41Dr. Daniel Swan has a PhD in developmental biology and continued to work in developmental biology as a post-doctoral researcher before moving into bioinformatics in 2001. Subsequent positions included working for Bart’s and the London Genome Centre and the Centre for Hydrology and Ecology in informatics-driven roles dealing with large, distributed biological datasets generated by large user communities. Currently the manager of the Newcastle University Bioinformatics Support Unit, he leads a small team helping biological researchers generate, capture, store and analyse their digital data. His interdisciplinary background means he has grounding in both computer and biological sciences and is comfortable working on CS-focused projects (CARMEN, InstantSOAP, Bio-Linux) as well as acting in a research capacity analysing high-throughput data. He is currently active within the knowledgeblog project, having been responsible for adding software support for a review process, gravatars, syntax highlighting, and PDF and ePUB exports.

 

42Dr. Simon Cockell has a PhD in Genetics from Leicester University, and refocussed into Bioinformatics with a Masters degree from Leeds in 2005. From there he moved to Newcastle, and the Bioinformatics Support Unit. Since coming to Newcastle, Simon has worked on a range of projects involving large scale analyses (AptaMEMS-ID), data integration (Ondex) and health informatics (MRC Mitochondrial Disease Cohort). He is currently active within the knowledgeblog project, having been responsible for metadata support (including Coins), navigational support (for both humans and robots) and is a co-author of kcite and mathjax-latex.

 

43Allyson Lister worked for 6 years at the EBI in Cambridge, developing and producing the UniProt/TrEMBL protein database. She is currently focusing on the use of ontologies for the semantic integration of systems biology data in her current job at CISBAN in Newcastle University. Both at the EBI and at Newcastle University, she developed structured data formats including UniProt/TrEMBL and SBML. She has also been an early adopter of blog technology as a mechanism for communication of both her own and others' primary research. Since 2006, she has co-authored a number of posts with other bloggers in the community and has been invited to be a guest author at both the ISCB news and the BioSharing blog. She has published papers highlighting the importance of social networking and live blogging to bioinformatics.

Paola Marchionni of JISC has given her permission to reproduce the feedback from the peer-review of my last JISC grant, which sadly failed. I want to publish it here as part of my desire for open science, rather than as an opportunity to reply, which, perhaps unfortunately, the JISC process does not otherwise allow.

I am a little surprised by some of the comments, to be honest. The main criticism was more expected though, which essentially says “it’s not crowd-sourcing if you pay people to develop content”. You have to try these things, but I did think that actually paying for content might be considered to be a little revolutionary. Ah, well, better luck next time.

Markers felt the form of this proposal was “robust”, however there wasn’t enough clarity on the deliverables and especially on how the value of what was being produced would be assessed down stream. They felt there was also some lack of information on how the currently JISC funded K-Blog project, due for completion in July 2011, related to this project and what the impact on its team would be, which seems to be the same team as the one proposed for this project.

The main concerns, however, were around whether this could really qualify as a crowdsourcing or community project – it was felt it was more about disclosing data than community engagement – also considering that the authors of the articles would be paid. There were some doubts about the sustainability of the project beyond the 7 months duration of the funding, as lack of funding would prevent more articles being created and metadata added by the team. One marker also felt that a risk analysis should have taken into account the risk of disparate communities not being aware of the content and using and engaging with it. A more clear identification of the various communities the project aimed to reach and a more targeted strategy for engaging with such communities would have been useful.

Finally, another issue that was raised was that there wasn’t sufficient information on how the partnership with Manchester University would work, either formally or informally, and the dissemination plans could have been stronger, as they relied mainly on the role of K-Blog.

— Paola Marchionni