## Archive for the ‘Science’ Category

Today, I was pleased to release version 1.5 of kcite. Following my tradition of being unable to get the WordPress plugin release process to work correctly, shortly afterwards I released 1.5.1, which is the same thing but with the correct metadata.

I’m quite pleased with this release. There have been some underlying changes to the technology, which I will describe in another post, but for now I want to focus on what I feel is a substantial improvement on previous versions in terms of functionality. The previous release added support for client-side rendering, which made things look nicer and will enable more functionality in the future. However, from an authoring perspective it did not provide much advantage.

For the 1.5 release, I wanted to add new forms of identifier. Kcite started with the ability to cite papers by digital object identifier (DOI), as this reference to one of my papers shows. As a bioinformatician, references to PubMed also seemed like a good idea (21414991), even if, in most cases, a DOI could be used instead.
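For illustration, here is a minimal sketch of the general technique a kcite-style tool can use to turn a bare DOI into citation metadata: DOIs support content negotiation, so requesting CSL-JSON from doi.org yields machine-readable metadata from which a reference list can be built. The DOI and function name below are placeholders; this is not kcite's actual code.

```python
# Sketch: resolving a DOI to citation metadata via content negotiation.
# This shows the general approach, not kcite itself; the DOI is a placeholder.
import urllib.request

def csl_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a metadata request for a DOI."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )

req = csl_request("10.1234/example.doi")
print(req.full_url)
print(req.get_header("Accept"))
```

Sending this request (not done here, to keep the sketch self-contained) would return a JSON document with the title, authors and journal, ready for formatting into a bibliography.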

However, for this release, I wanted to expand into two new areas. Kcite has come from the kblog project, where we have been trying to improve the formality of publication using a blog engine, so that it can become a recognised part of the scientific literature. Organisations like arXiv have done much the same thing (with far greater success!) with preprints. It makes sense, then, that kcite can now also link straight through to these. Likewise, we wanted to support the push towards data citation, which we have achieved courtesy of DataCite. It is now possible to reference data sets as well.

And, finally, I have tidied up the presentation. Every item now has a visible URI in the reference list. I have made it visible because these identifiers should be public and present, not just providing an underlying link structure, although they do this as well now. In-text citations link through to the bibliography, but also provide an outlink direct to the resource. Before processing, the underlying text citation displays the URI, which should aid machine interpretability. This link will also be correct, since it is used to gather the metadata for the citation, rather than expecting authors to tack a URI onto a citation that already appears correct to them.

As always, new forms of publication raise questions, and this release is no exception. We have, for instance, found that kblogs are useful for publishing grey literature such as bio-ontologies. But at the moment, rather embarrassingly, there is no way to reference these articles (or indeed this article itself!) with kcite. And, second, there is an issue of provenance. For instance, the notice describing the 1.4 release of kcite is now formatted with 1.5. It has changed since its original publication because I have upgraded kcite, just as the presentation of this article will change with the next release. I am hoping to address both of these issues in the near future.

### Thoughts on a Chimney

While I am currently spending a significant amount of my time promoting the idea that blog technology can, and should, be used for serious scientific material, I thought I would make a post in a different and perhaps more traditional vein: that is, a light-weight idea with no serious research behind it. Years ago now, I created an Energy Wiki full of daft ideas for generating energy. I last revisited this in 2009, with an idea for storing energy at sea. I’d actually forgotten that part of the reason for this was to try out Inkscape, which is part of the reason for this post. I wanted to try a bit of multimedia, that is, a blog post with an image in it. High tech.

So, the idea. One form of renewable is the Solar Updraft Tower, also known as a solar chimney. This works straightforwardly enough: you build a large greenhouse in a desert, with a very large chimney in the middle. The top of the chimney is in cold air, the bottom in hot, and an updraft results; stick a turbine in or at the base of the chimney, and you get energy out.

The problem is that, to work at all efficiently, you need a big temperature differential, and so a tall chimney. This in turn means a wide chimney, both to support a substantial updraft and for mechanical reasons. Tall means 500m or more. The bottom line is that a pretty significant capital expenditure is required, followed by a relatively long pay-back period, which in turn means that the biggest single expense of the project is likely to be interest charges, rather than anything else.
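To put a rough number on why tall means a big differential: the driving pressure of a chimney can be approximated by the stack effect, Δp = g·H·(ρ_out − ρ_in), with air densities from the ideal gas law. The temperatures and height below are illustrative guesses on my part, not engineering data.

```python
# Back-of-envelope stack effect for a solar chimney (illustrative numbers).
# Ideal gas: density = P / (R_air * T), so warm inside air is lighter.
G = 9.81          # m/s^2, gravitational acceleration
P0 = 101_325.0    # Pa, sea-level pressure
R_AIR = 287.05    # J/(kg*K), specific gas constant for dry air

def density(temp_k: float) -> float:
    """Air density at pressure P0 and temperature temp_k (kelvin)."""
    return P0 / (R_AIR * temp_k)

def stack_pressure(height_m: float, t_out_k: float, t_in_k: float) -> float:
    """Driving pressure across the chimney in Pa (crude constant-density model)."""
    return G * height_m * (density(t_out_k) - density(t_in_k))

# A 500 m chimney, 20 C outside, 45 C under the greenhouse:
dp = stack_pressure(500.0, 293.15, 318.15)
print(f"driving pressure ~ {dp:.0f} Pa")
```

With these guessed temperatures the result is a few hundred pascals, and it scales linearly with height in this crude model, which is why the designs call for such tall towers.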

So, my idea is to use an inflatable chimney instead. Initially, I thought about some kind of helium lifting scheme, but then I realised that this makes no sense; why not use hot air, which, after all, is what the whole system is designed to generate? Consider, for instance, the following arrangement:

Essentially, it’s a traditional balloon with a hole in the middle. Obviously the whole system is stackable — a second balloon could be placed on top of the first, and so on. The whole structure could be assembled or disassembled as desired. Unfortunately, though, this would probably take quite a bit of work.

My second thought came from the idea that, while most designs for solar chimneys have the chimney in the middle of the greenhouse, it doesn’t really need to be; a horizontal pipe to the middle would be enough, so the chimney could sit outside the greenhouse. The advantage this brings is that the tower could be raised or lowered in situ, without the risk of it falling on, and damaging, the greenhouse. So my second idea was to build the chimney as two concentric cylinders, with the gap between them serving as the inflatable, buoyant structure. By pleating the cylinders in opposite directions, like so:

the whole structure should concertina up and down. By inflating from the top and deflating from the bottom, it should be possible to raise or lower the entire system, opening and shutting vents from the bottom or top of each section to the inside of the chimney.

One advantage of this system is that, as the chimney gets higher, the temperature differential between the inside and the outside gets greater; so the taller the tower, the more buoyant the sections get. This should help to keep the entire thing as upright as possible, as will the air travelling through the middle, like some gigantic party blower.

Another addition that comes to mind would be inflatable half-toroids around the chimney at regular intervals. With a curved top and a flat underside, each should operate like an aerofoil, lifting the tower up; so, the windier it gets, the greater the lift, which is just what is needed to keep the chimney upright. This should mean that it can operate in relatively high winds.

This kind of system could even work in concert with a fixed chimney — extending the height by, say, 500m, and increasing its efficiency. It could also act as a supplement — operating only on very hot days when the greenhouse has excess capacity. Or, finally, it could operate while the main chimney was being built, meaning that a plant could start generating income earlier, which should reduce the cost of interest payments.

Of course, this all comes with drawbacks: the ongoing running costs are likely to be significant; wind will remain a significant factor regardless; and, finally, inflating the tower will use hot air, which will reduce the efficiency of the whole system. Are these flaws fatal? Well, as I said, this post is light-weight, with no serious research behind it. I have no idea, nor any clear idea of how to work out these costs. Answers on a postcard please.

### Kblog has been compromised

I have been pushing the idea of Kblogs — scientific publishing using commodity software — for a year or so now. Our main site, Knowledgeblog.org, has around 100 articles now, has had about 50k page views (or about 4x that number of raw page hits), and has generated a certain presence on the internet. While this is generally good, the price of fame is that we have moved somewhat up the list of potential hack targets. Unfortunately, this has resulted in two compromises of the machine; the two were probably connected, although we have no evidence linking them at the moment.

The first was through the timthumb zero-day vulnerability. It involved a code injection into a WordPress installation using a thumbnail generator with a dodgy bit of PHP in it. We cleaned the system up as well as we were able and went on from there. Sadly, a couple of days ago, we had a second break-in. This was a more serious and directed attack (the timthumb attack was scripted, and we were one of several thousand sites to be hit). In this case, the machine was root-compromised, and the web server used to gather usernames and passwords in a phishing expedition. We do have backups and all of the content. There were a number of things that we could have done to secure the machine further, at least one of which might have prevented the hack, but there are only so many hours in the day.

So, where does this leave us? Is the whole idea of knowledgeblog broken? Personally, I do not think so. While I have been critical of the cost associated with academic publishing, I am aware that it cannot happen for free. Running and maintaining a web server takes money; it is something that we have been doing on a shoe-string for a while, especially since our JISC money ran out. In the couple of years that we have run knowledgeblog, I think that we have learned and shown a lot. As well as page views and content, we have shown that scientific publishing can be easy for the author; that we can generate attractive articles this way; and that we can start to embed computationally accessible knowledge into these articles. We have shown that we can do peer review, if we need to. We have shown we can archive and preserve for the future. We have shown that knowledgeblog is good for grey literature. We have added DOIs. Multiple authors. Good-looking maths. We even have some preliminary stats on how much publication costs, from Word doc to website.

At the moment, though, we do not have a business model. It is clear that if we are to move this forward, it needs to be run as a service, managed and looked after, something which is neither my expertise nor my desire. The analogy that I have made before with Wikipedia is, I think, a good one; it would be good to move this to foundation status.

The path from here to there is a long one, however. For the moment, we will restore knowledgeblog, and it will re-emerge, although at this time of year, it will take a while. But we look to the future as well.

### The Naivete of Scientists

Although in some disciplines it is relatively uncontentious, the rise of open access publishing has produced a lot of comment in others. In one of my two disciplines, computing science, this form of publication is still in the minority, and still raises comment. For instance, Michel Beaudouin-Lafon has suggested that scientists are highly naive about the costs of publishing. He argues that scientific publishing is intrinsically expensive, and that open access will have negative implications for science as a whole.

Over the years, commercial STM publishing has become a cutthroat business with cutthroat practices and we, the scientific and academic community, are the naive lambs, blinded by the ideals of science for the public good-or simply in need of more publications to advance our careers.

— Michel Beaudouin-Lafon

Personally, I think that “naive” is the wrong word; scientists are often not good at operating in a co-ordinated way. Although we work together in small groups, and sometimes in large ones, in general we are still very much a cottage industry; at any one time the number of scientists working in a distinct discipline is not that large, even on a world-wide basis. Of course, this works pretty well for scientific advance; we are not a production industry, but researchers. No one knows the best way forward, and we need to experiment to find out. But it does mean that we often play second fiddle to those capable of more co-ordinated action; compare, for example, scientists to the medical community with its tightly controlled professional bodies. Or, of course, the STM publishing industry, particularly as it has become concentrated in fewer and fewer competing publishers.

For example, ACM spends several million dollars every year to support the reliable data center serving the Digital Library

— Michel Beaudouin-Lafon

Clearly, it is true that the costs of data centres and storage are not trivial. But the cost of serving data has plummeted over recent years. Scientific papers largely consist of words and figures; these do not take up much space. The laptop I am working on has a copy of my email directory; it’s not complete, but it carries most of my outgoing email since 1994 and a lot of the incoming; this is a lot of words! Yet the total size is now less than 5G, which will fit on a £3 pen drive, or my phone. Now, if the ACM were storing research data, it would be a totally different issue; the costs there are significant, problematic and rising. But they do not.

The ACM might spend several million dollars a year, but the bottom line is that this does not explain the cost of publishing. The Wikimedia Foundation, which supports Wikipedia, spends around 10 million dollars a year, in total, on one of the top ten websites in the world. This is about the daily cost of the whole scientific publishing industry.

The quality of a journal is typically measured by its impact factor

— Michel Beaudouin-Lafon

And a very bad measure of journal quality it is too. As someone who works in two disciplines at once, I constantly get hit by this: my best computing publications have laughable impact factors when compared to my bio publications; when judged against computer scientists, however, my bio publications have such high impact factors that they have to be ignored as outliers.

At $5,000 per publication, my lab is broke.

— Michel Beaudouin-Lafon

It is not clear where the $5,000 figure comes from, as most open access publication costs less than this. But, anyway, this argument makes no sense. Our labs are already paying a vast amount of money for publications; usually this is squirrelled away in overheads, taken from our budgets before we see the money. And, although it doesn’t happen so much in computing, many journals levy significant page charges.

They are the big pharmaceutical labs and the tech firms who publish very little but rely on the publication of scientific results for their businesses. With author-pay, research will pay so that industry can get their results for free. Is this moral?

— Michel Beaudouin-Lafon

Open access on its own is not enough; we also need public disclosure about the process. Perhaps the examples of pharmaceutical companies funding journals directly are unusual; it is not so easy to tell at the moment. In this context, it could be argued that the last thing we need is the pharmaceutical industry paying for the results of science. Of course, conversely, the pharmaceutical industry could argue that it already does pay for the (publicly funded) research, by way of taxation.

While they are interesting, all of these arguments really miss the point: the pharmaceutical industry already gets its results for free, as subscription fees do NOT pay for the research, just its publication. The publishing industry also gets the results it depends on for free, or even at a profit where it levies page charges on the authors. And for every paper that researchers publish for free, they pay more to read someone else’s.

So, we are already in the situation that we are told is not moral.

It is important to understand that the scientific community is largely at fault

— Michel Beaudouin-Lafon

There is some truth in the idea that the scientific community has let itself walk into this situation, but ultimately, I feel, this is like blaming the financial crisis on those receiving subprime mortgages. It is true that it is scientists who submit their best work to expensive closed publishers; but, especially in early and mid “career”, we do this to safeguard our futures.

The problem with the subscription model is not the model but the fees.

— Michel Beaudouin-Lafon

Quite the opposite. Ultimately, I don’t pay the fees, so how much do I really care? But the subscription model prevents re-purposing, limits access and prevents competition. I work at a university as a scientist because I value the ability to swap and discuss my work. I want the general public to be able to access my research. Dissemination of knowledge should be part of my job; I think it is reasonable that I, or my employers, should pay for it.

Which is not to say that the level of fees is fine; it is not. They are far too expensive under any model.

The added value provided by publishers is twofold: reputation (the value of the imprimatur), and archiving (the guarantee that the work will be available forever).

— Michel Beaudouin-Lafon

And this is it? Is this all that we are getting, given the costs? Especially as the reputation comes from the work, not the journal, and archiving should be a rapidly decreasing cost.

Actually, in practice, I think the current publishing industry brings more value; selection of reviewers, sometimes copy-editing and, critically, advertising of the content. But, again, times have changed, and publishing practice in these areas has not.

The only other area in publishing where authors pay to get published is called the vanity press. Do we really want to enter that model?

— Michel Beaudouin-Lafon

This is a low blow, and nor is it true. Many people pay their own publishing costs. The government pays to publish election results; health services pay to publish public health information; companies pay to publish product safety recalls. These are all circumstances where the value to the author of public awareness of their content far exceeds the income they would receive from charging for it. And the biggest example of this is the advertising industry.

Nor is the implication that this will necessarily result in low quality true. Consider the blogosphere: of course there is much junk, but the standard of science blogging is often very high; frankly, whenever self-respecting sources like the BBC start talking about pixie dust, the blogosphere is probably at least as high-standard as the mainstream media.

All this aside, what do I, as a scientist, actually care about? A few things leap to mind:

• Stable location and content.
• Archiving.
• Peer review.
• Discovery and selection.

Open access was built on the basis of replicating the existing publication process. PLoS, for example, did this precisely so that it did not challenge the business model and the publication procedure at the same time. How much of the cost stems from this? I think that we, as authors and readers, should know. How much of the millions the ACM spends on its data centre goes on managing access controls, for example? How much on advertising? How much on booths at meetings?

Open access has opened the door, but now we need to challenge and change the process. Hosting data is not free, nor is archiving. And yet I can find my own website from 2002 and enjoy its gaudy colour scheme all over again. If this blog post is so exciting to the world that the load brings the server down, you will be able to read it on Coral Cache. Peer review is expensive and time-consuming; I know, because I’ve organised enough of it for Bio-Ontologies. But I did not get paid for this, and how many of the real costs of peer review do publishers bear? And discovery and selection? Well, we have Google, and I follow my peers on Twitter.

Author fees are not a solution. […] Finally, nonprofit publishers should take advantage of their unique position to experiment with sustainable evolutions of their publishing models.

— Michel Beaudouin-Lafon

And on this, I could not agree more. Our experiment with Knowledgeblog suggests that we can get 90% of the way there (or 80%, or 70%, depending on who you ask) with commodity software. It’s only a small start, but then I was on the mailing list that saw the first email about the creation of Wikipedia, and that wasn’t long ago.

### Ontogenesis Knowledgeblog: Lightweight Semantic Publishing

This is a paper we wrote for STLR 2011, also published directly on Knowledgeblog.

#### Abstract

The web has moved from a minority interest tool to one of the most heavily used platforms for publication. Despite originally being designed by and for academics, it has left academic publishing largely untouched; most papers are available on-line, but in PDF and are most easily read once printed. Here, we describe our experiments with using commodity web technology to replace the existing publishing process; the resource describing ontologies that we have developed with this platform; and, finally, the implications that this may have for publishing in a semantic web framework.

#### Authors

Phillip Lord, Newcastle University, Newcastle-upon-Tyne, UK

Simon Cockell, Newcastle University, Newcastle-upon-Tyne, UK

Daniel C. Swan, Newcastle University, Newcastle-upon-Tyne, UK

Robert Stevens, University of Manchester, Manchester, UK

#### Introduction

The Web was invented around 1990 as a light-weight mechanism for the publication of documents, enabling scientists to share their knowledge in the form of hypertext documents. Although scientists, and later most academics, like the rest of society, have made heavy use of the web, it has not had a significant impact on the academic publication process. While most journals now have websites, the publication process is still based around paper documents, or electronic representations of paper documents in the form of a PDF. Most conferences still handle submissions in the same way. Books on the web, for example, are often limited to a table of contents.

For the authors (certainly from our personal experience), the process is dissatisfying; book writing is time-consuming, tiring and takes a number of years to come to fruition. If the book has one or a few authors, it tends to reflect only a narrow slice of opinion. Multi-author collected works tend to be even harder work for the editor than writing a book solo. Books do not change frequently; they are therefore out-of-date as soon as they are available. Authors feel a greater pressure for correctness, as they will have to live with the consequences of mistakes for the many years it takes to produce a second edition; most scientists welcome feedback, but being asked to justify something you wish you had not said becomes tiresome, especially if you are waiting to update it.

For the consumer of the material (either a human reader or a computer), the experience is likewise limited. Books on paper are not searchable, are not easy to carry around, and are rarely cheap to buy, more commonly very expensive. For the computer, the material is hard to understand or to parse. Even distinguishing basic structure (where chapters start, who the author is, where the legend for a given figure is) is challenging.

All of this points to a need to exploit the Web for scientists to publish in a different way than simply replicating the old publishing process. Here, we describe our experiment with a new (to academia!) form of publishing: we have used widely-available and heavily used commodity software (WordPress [7]), running on low-end hardware, to develop a multi-author resource describing the use of ontologies in the life sciences (our main field of expertise). From this experience, we have built on and enhanced the basic platform to improve the author experience of publishing in this manner. We are now extending the platform further to enable the addition of light-weight semantics by authors to their own papers, without requiring authors to directly use semantic web technologies, and within their own tool environment. In short, we believe that this platform provides a ‘cheap and cheerful’ framework for semantic publishing.

#### The requirements

The initial motivation for this work came from our experience within the bio-ontology community. Biomedicine is one of the largest domains for the use of ontology technology, producing large and complex ontologies such as the Gene Ontology [28] or SNOMED [27].

As an ontologist, one of the most common questions one hears is: ‘where is there a book or a tutorial that I can read which describes how to build an ontology?’. Currently, there is some tutorial information on the web, and there are some books; but there is no clear answer to the question. Many of the books are collections of research-level papers, or are technologically biased. Currently, many ontologists have learned their craft through years of reading mailing lists, gathering information from the web, and word of mouth. We wished to develop a resource with short and succinct articles, published in a timely manner and freely available.

We wished, also, to retain the core of academic publishing. This was for reasons both pragmatic and principled, as well as political. Consider, for example, Wikipedia, which could otherwise serve as a model. Our own experience suggests that referencing Wikipedia can be dangerous: it can and does change over time, meaning critical or supportive comments in other articles can be ‘orphaned’. Wikipedia maintains a ‘neutral point-of-view’ which, many are of the opinion, makes it less suitable for areas where knowledge is uncertain and disagreement frequent. Finally, Wikipedia is relatively anonymous in terms of authorship: whether this affects the quality of articles has been a topic of debate [17], but was not our primary concern; pragmatically, the promotion and career structure for most academics requires a form of professional narcissism; they cannot afford to contribute to a resource for which they cannot claim credit. Of course, our experiences may not be reflective of the body academic overall; there has, for example, been substantial discussion of the issues of expertise on Wikipedia itself [8]. Although the reasons may not be clear, it is clear that academics largely do not contribute to Wikipedia, and that Wikimedia sees this as an issue [16].

We also had an explicit set of non-functional requirements. We needed the resource to be easy to administer and low-cost, as this mirrored our resource availability; authors should be offered an easy-to-use publishing environment with minimal ‘setup’ costs, or they would be unlikely to contribute; readers should see a simple, but reasonably attractive and navigable website, or they would be unlikely to read.

#### The Ontogenesis experience

Our previous experience with the use of blog software within academia was limited to ‘traditional’ blogging: short pieces about the process of science (reports about conferences or papers, for example); journalistic articles about other people’s research; or personal blogging, that is, articles by people who just happen to be academics. Although we wished to develop different, more formal content, this experience suggested that many academics find blogging software convenient, straightforward enough and useful.

To test this, we decided to hold a small workshop of 17 domain experts over a two-day period, and task them with generating content, conducting peer review of this content, and publishing it as articles on a blog.

##### Terminology and the Process

Like many communities, the blogosphere has developed its own, sometimes confusing, terminology. To describe the process we adopted, we first describe some of this terminology. A blog is a collection of web pages, usually with a common theme. These web pages can be divided into: posts, which are published (or posted) on an explicit date and then unchanged; and pages, which are not dated and can change. Posts and pages have permalinks: although they may be accessible via several URLs, they have one permalink that is stable and never changes. Posts and pages can be categorised – grouped under a predefined hierarchy – or tagged – grouped using ad hoc words or phrases defined at the point of use. A blog is usually hosted with a blog engine, such as WordPress, which stores content in a database and combines it with style instructions in themes to generate the pages and posts. Most blog engines support extensions to their core functionality with plugins. Most blogs also support comments: short pieces of content added to a post or page by people other than the original authors. Most blog engines also support trackbacks, which are bidirectional links: normally, a snippet from a linking post will appear as a comment in the linked-to post. Trackbacks work both within a single blog and between different distributed blogs. Many blogs support remote posting: as well as using a web form for adding new content, users can also post from third-party applications, through a programmatic interface using a protocol such as XML-RPC, or even by email. Posts and pages are ultimately written in headless HTML (that part of HTML which appears inside the body element), although the different editing environments can hide this fact from the user.

Our initial process was designed to replicate the normal peer-review process, with a single adjustment: peer review was open rather than blind. Papers would be world-visible once submitted; the identities of reviewers would be known to authors; and all reviews would be public. We adopted this approach for pragmatic reasons: WordPress has little support for authenticated viewing and none for anonymisation. The full process was as follows:

• Authors write their content using whichever tooling they find appropriate.

• The author posts their content, categorising it as under review.

• An editor assigns two reviewers.

• Reviewers publish reviews as posts or comments. Reviews link to articles, resulting in a trackback from article to review.

• The author modifies the post to address reviews.

• Once done to the editor’s satisfaction, the post is recategorised as reviewed.

Our expectation was that following this process, articles would not be changed or updated; this is in stark contrast to common usage for wiki-based websites. New articles could, however, be written updating, extending or refuting old ones.
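The workflow above can be sketched as a small state machine over post categories. The category names and the class below are illustrative only; they are not the actual WordPress categories or code used by Ontogenesis.

```python
# Minimal sketch of the open peer-review workflow described above, modelled
# as category transitions on a post. Names here are illustrative, not the
# real Ontogenesis categories.
ALLOWED = {
    ("draft", "under_review"),     # author posts and categorises
    ("under_review", "reviewed"),  # editor recategorises after revisions
}

class KblogPost:
    def __init__(self, title: str):
        self.title = title
        self.category = "draft"
        self.reviews: list[str] = []

    def transition(self, new_category: str) -> None:
        """Move the post to a new category, enforcing the review workflow."""
        if (self.category, new_category) not in ALLOWED:
            raise ValueError(f"cannot move {self.category} -> {new_category}")
        self.category = new_category

post = KblogPost("What is an ontology?")
post.transition("under_review")
post.reviews.append("Looks good; clarify the definition in section 2.")
post.transition("reviewed")
print(post.category)  # reviewed
```

Once a post reaches the reviewed category it is treated as frozen, matching the expectation above that articles are updated by writing new ones rather than editing old ones.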

##### Reflections on the Ontogenesis K-Blog

Our initial meeting functioned to ‘bootstrap’ the Ontogenesis K-Blog. This was useful to acquire a critical mass of content, but also, on this first outing, to explore the K-Blog process and technology. The setup for the day was a vanilla WordPress installation. The day started with a short presentation on the K-Blog manifesto [22] and an overview of the process, including authoring and reviewing. The guidelines to authors were to write short articles on an ontology subject (a list of suggestions was offered, and authors also made their own choices) and to produce the article in whatever manner they felt appropriate. There was a certain level of uncertainty among authors as to the K-Blog process (partly because one of the objectives of the meeting was to ‘force out’ the process) and this, naturally, pointed to the need to document the K-Blog process, so that authors could have the typical ‘instructions to authors’.

This first meeting produced a set of 20 completed and partially completed articles. Some even had reviews. Even on the day itself there was some external interest, visible on Twitter. The first external blog post (outside those produced by attendees) appeared during the meeting [19], with a second shortly after [18].

We also held a second content-provision meeting, and together these generated a collection of articles that felt like an academic book in terms of content, but generated with considerably less effort. This experience was also sufficient to gather requirements on how to improve the K-Blog idea. A useful K-Blog on the K-Blog process itself was produced by Sean Bechhofer [13]. There is also a K-Blog looking back on the first year of the Ontogenesis K-Blog [23].

Several requirements emerged with respect to authorship. The principle of the short, more or less self-contained article was attractive (though the audience was somewhat self-selecting). Authoring directly in the editor provided by WordPress was felt to be poor by those who tried it. Authoring in a favourite editing tool and then publishing via WordPress worked reasonably well for most authors. There were, however, a variety of issues with the mechanics of this style of publishing, such as referring to articles that would be, but had not yet been, written. To some extent this was an artefact of the day (many articles being written simultaneously), but authors needed to refer to glossaries and articles in progress.

One stylistic issue was the habit of putting full affiliations at the top of an article. The Ontogenesis theme presents the first few lines when displaying many articles, but in many cases this was simply showing the title and author affiliation, when it would be more useful to have the first sentence or so of the article itself.

For the whole K-Blog, a table of contents was felt to be important. This would give an overview of the contents and a simple place for navigation about the K-Blog. It also raised the issue of attribution; the table of contents needed to expose the authors, including multiple, ordered authors. This is not a surprising need, as the authors’ scientific reputation is involved. In this vein, making K-Blog articles citable by issuing Digital Object Identifiers (DOIs) was requested.

For scientific credibility, the ability to handle citations easily was an obvious requirement. Natively, WordPress has little or no support for styling citations and references. The ability to cite via DOIs and, in this field, PubMed identifiers, automatically making links and producing a reference list, was felt to be important. Having the Ontogenesis K-Blog articles in PubMed would also be attractive to authors.

The last authorship issue was the mutability of articles. One aim of K-Blog is to enable articles to change in the light of experience and scientific development, as well as to meet the procedural requirement for updates following review. There was felt to be a conflicting need for articles not to change, so that comments and links from other documents work in the longer term.

The last significant issue was the reviewing of articles. The aim was to have this managed by authors choosing reviewers (with editorial oversight). On the Ontogenesis K-Blog day this could work, with authors calling across the room for a review; it is, however, not a sustainable approach. WordPress lacks tracking facilities to manage the reviewing process, whether this is done by an author or an editor. The realisation that such management support is needed is not the greatest insight ever gained, but the requirement is there even in a lightweight publishing mechanism.

# Improvements to the technology

Our initial experiment with the Ontogenesis K-Blog suggested a significant number of issues with the use of WordPress for scientific publication. In this section, we describe the extensions that we have made or adopted to the publication process, the documentation, or WordPress itself. Following our initial experience with Ontogenesis, we have started to trial these improvements, including through another workshop which resulted in a new K-Blog [12] describing the scientific workflow engine Taverna [24]; work is also in progress on the use of a K-Blog for bioinformatics [1], and another for public healthcare [3].

Currently, we have 11 plugins extending the basic WordPress environment. For completeness, all of these are shown in Table 1. Our theme is also extended in some places to support the plugins. In general, the plugins are orthogonal and will work independently of each other. One advantage of using WordPress is that many of these plugins are freely available, written and maintained by other authors; other academic publication environments, such as the Open Journal System [5], exist and are relatively widely used, but WordPress is used to host perhaps 10% of the web, making its plugin ecosystem extremely fertile.

| Plugin | Use |
| --- | --- |
| Co-Authors Plus | Allows K-Blog posts to have more than one author |
| COinS Metadata Exposer † | Provides COinS metadata on K-Blog posts (used by Zotero, Mendeley etc.) |
| Edit Flow | Gives editorial process management infrastructure |
| ePub Export | Exports K-Blog posts as ePub documents |
| KCite ∗ | Automatic processing of DOIs and PMIDs into in-text citations and bibliographies |
| Knowledgeblog Post Metadata Plugin ∗ | Exposes generic metadata in post headers |
| Knowledgeblog Table of Contents ∗ | Produces a table of contents based on a category of articles; posts are listed with all authors |
| MathJax-LaTeX ∗ | Enables use of TeX or MathML in posts, rendered in scalable web fonts |
| Post Revision Display | Publicly exposes all revisions of an article after publication |
| SyntaxHighlighter Evolved | Syntax-highlights source code embedded in posts |
| WP Post to PDF | Allows visitors to download posts in PDF format |

Table 1: WordPress plugins employed by K-Blog. Plugins marked with ∗ are written by the authors; plugins marked with † are modified by the authors.

Reviewing: The initial process was self-managed and required two reviews per article; this was found to be cumbersome. We have addressed this in two ways. First, we have defined a number of different peer-review levels (public review, author review, editorial review [15]), including a lightweight process now being used for Ontogenesis; authors now select their own reviewers, and decide for themselves when articles are complete. Second, we have added software support. Initially, we attempted to use RequestTracker, an open-source ticket system, but found the user interface too complex for this purpose. We are now using the EditFlow plugin for WordPress, which was designed for managing a review process, albeit a hierarchical rather than a peer-review process.

Authoring Environment: The standard WordPress editor was found impractical by most authors, even for short articles. WordPress does provide ‘paste from Word’ functionality, but this removes all formatting, which defeats the purpose. While the lack of a good editing environment could have been a significant problem, our subsequent experimentation has shown that it is possible to post directly from a wide variety of tools, including ‘office’ tools such as Word, Google Docs, LiveWriter and OpenOffice. This is in addition to a variety of blog-specific tools and text formats (such as asciidoc), which are suitable for some users. We have added documentation to a K-Blog (http://process.knowledgeblog.org) to address these. In practice, only LaTeX proved problematic, having no specific support. To address this, we have produced a tool called latextowordpress; this is an adaptation of plasTeX, a Python-based TeX processor, to produce simplified HTML appropriate for WordPress publishing. Our experience with using these tools is that while none are perfect, sometimes requiring ‘tweaking’ of HTML in WordPress, most reduce publishing time to seconds.

Citations: We have addressed the lack of support for citations within WordPress with a plugin called kcite. This allows authors to add citations into documents as shortcodes with either a DOI or PubMed ID (other identifiers can be, and are being, added to kcite). Shortcodes are a commonly used form of markup of the form [tag att="att"]text[/tag]; they are often found where a simplified HTML-like markup is desired. A bibliography is then generated automatically on the web server. Requiring authors to add markup to otherwise WYSIWYG tools is damaging to the user experience. We believe that this is soluble, however, by extending bibliographic tools through a ‘kcite’ style-file or template; we have a prototype of this (using CSL [10]) for Zotero and Mendeley, and another for asciidoc with bibtex. It is also possible simply to use native tool support in Word or LaTeX, and convert bibliographies to HTML; the disadvantage of this approach is discussed later.
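The shortcode-to-bibliography flow can be sketched in a few lines. The following is an illustrative Python model only, not the actual kcite PHP implementation, and it assumes DOI-only identifiers: each [cite] shortcode becomes a numbered in-text citation, and a reference list is appended in order of first citation.

```python
import re

# Hypothetical sketch of kcite-style processing (not the real plugin):
# [cite]identifier[/cite] markers become numbered in-text citations,
# and a bibliography is built in order of first appearance.
CITE_RE = re.compile(r"\[cite\](.+?)\[/cite\]")

def process_citations(post_html):
    order = []  # identifiers in order of first citation

    def replace(match):
        ident = match.group(1).strip()
        if ident not in order:
            order.append(ident)
        return f"[{order.index(ident) + 1}]"

    body = CITE_RE.sub(replace, post_html)
    bibliography = "\n".join(
        f"[{i + 1}] https://doi.org/{ident}" for i, ident in enumerate(order)
    )
    return body + "\n\n" + bibliography
```

A repeated identifier reuses its original number, so citing the same DOI twice produces one bibliography entry.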

Archiving and Searching: Archiving is primarily a social, rather than technological, problem. A blog engine is fully capable of storing content in the long term, but authors and readers have to believe that it will do so. As a novel form of academic publishing, K-Blog is not automatically archived as a scientific journal would be. However, we have taken advantage of its web publication; the main K-Blog site is now explicitly archived by the UK Web Archive, as well as implicitly by other web archives. We have enhanced the website with an ‘easy crawl’ plugin, that is, a single web page pointing to all articles classified as reviewed. We now support the (technical) requirements for LOCKSS and PubMed. Simultaneously, this also enhances the searchability of K-Blog, fulfilling the requirements for Google Scholar.
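The ‘easy crawl’ page amounts to a flat list of links over the reviewed category. A minimal sketch in Python (field names are assumptions for illustration, not the actual plugin code):

```python
# Hypothetical sketch of an 'easy crawl' page: one flat HTML list
# linking every post in the 'reviewed' category, so that archiving
# crawlers can discover all reviewed content from a single URL.
def easy_crawl_page(posts):
    links = [
        f'<li><a href="{p["url"]}">{p["title"]}</a></li>'
        for p in posts
        if "reviewed" in p["categories"]
    ]
    return "<ul>\n" + "\n".join(links) + "\n</ul>"
```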

Non-repudiability: The K-Blog process does not allow authors to make semantically meaningful changes after an article has been reviewed. Unfortunately, it is hard to define ‘semantically meaningful’ computationally, so we have made no attempt to address this by locking articles; rather, all versions of articles are now accessible to the reader (WordPress provides this facility to the authors by default). This enables community enforcement of a no-change policy.

Multiple Authors: We believe that authoring is best done outside WordPress. This also means that we do not support multiple-authorship directly; we have made no attempt to add collaborative features to WordPress. However, we did need articles to carry a byline attributing them to multiple authors; although not critical to the functioning of a K-Blog, it is socially critical to appease the professional narcissism (see Section ) of scientists. Fortunately, this is a common requirement, and a suitable WordPress plugin existed.

Identifiers: WordPress already supports permalinks; although we believe that URLs are entirely fit for purpose technologically, while DOIs do little other than introduce complexity [11], K-Blog required DOIs for professional narcissism. We considered becoming a DOI authority, but this proved impractical. Instead, we have used DataCite [2]. This has required a small extension to WordPress to extract appropriate metadata and to store the DOIs once minted.

Metadata: K-Blog now exposes various parts of its metadata in a number of ways; unfortunately, there appear to be a large number of (non-)standards in use, each with its own application. K-Blog currently provides: COinS, enabling integration with Zotero and Mendeley; meta tags for Google Scholar; and Dublin Core tags for no reason other than completeness. We are in the process of providing bibtex export (for bibtex!), and a JSON representation to support citeproc-js [14] in the second generation of kcite.
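As a sketch of this metadata exposure, the following Python fragment emits both Google Scholar (Highwire Press) and Dublin Core meta tags for a post. It is illustrative only; the field names and the function are assumptions, not the Knowledgeblog plugin's actual code.

```python
# Hypothetical sketch: render a post's metadata as Google Scholar
# (Highwire Press) and Dublin Core meta tags for the page header.
def metadata_tags(post):
    tags = [
        # Google Scholar (Highwire Press) tags
        f'<meta name="citation_title" content="{post["title"]}"/>',
        f'<meta name="citation_publication_date" content="{post["date"]}"/>',
        # Dublin Core equivalents
        f'<meta name="DC.title" content="{post["title"]}"/>',
        f'<meta name="DC.date" content="{post["date"]}"/>',
    ]
    # One tag per author, in byline order, for each scheme
    tags += [f'<meta name="citation_author" content="{a}"/>' for a in post["authors"]]
    tags += [f'<meta name="DC.creator" content="{a}"/>' for a in post["authors"]]
    return "\n".join(tags)
```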

Mathematics and Presentation: We have also provided several pieces of technology that did not stem from concrete requirements arising from the initial Ontogenesis meeting. We have improved parts of the presentation system by adding, for example, syntax highlighting to code blocks. Additionally, we have created the mathjax-latex plugin, enabling the use of TeX (or MathML) markup in posts, which is then rendered in the browser using scalable fonts. WordPress has native math-mode TeX support, but it uses image fonts, which do not scale and have an ugly pixelated display.

# Discussion

We have been motivated by a lack of enthusiasm for traditional book publishing to devise another mechanism by which we can achieve the same ends. We wished to avoid the downsides of an ‘all or nothing’ approach to creating a ‘static’ paper document that is read by relatively few people due to price. The K-Blog approach allows authors to publish in a piecemeal fashion, writing only that which they are motivated to write, using a mechanism that avoids a third party making arbitrary decisions on formatting with peculiar time-scales.

To avoid all this, the K-Blog is a lightweight publishing process based on commodity blogging software. We have taken an approach of writing short articles around a theme of ‘ontology in biology’: the Ontogenesis K-Blog. At the time of writing we have 26 articles and page-viewing numbers that are pleasing (see Figure 1). These statistics are generated by WordPress directly, and represent (an approximation of) ‘real’ page reads, with robot and self-viewing removed. This is confirmed by the ten most read articles (Table 2), which reflect our expectations, ‘What is an ontology?’ being first. In this sense, we consider the K-Blog process to be a success, especially when considered against the circulation of an equivalent book.

| Article | Views |
| --- | --- |
| What is an ontology? | 1,737 |
| OWL Syntaxes | 1,246 |
| Ontology Learning | 882 |
| Table of Contents | 740 |
| What is an upper level ontology? | 684 |
| Reference and Application Ontologies | 630 |
| Protege & Protege-OWL | 522 |
| Semantic Integration in the Life Sciences | 517 |
| Automatic maintenance of multiple inheritance ontologies | 469 |
| Ontologies for Sharing, Ontologies for Use | 330 |

Table 2: Most viewed articles for the Ontogenesis K-Blog (totals).

The social processes within K-Blog are largely similar to traditional publishing, with one exception: reviewing is public. While we may have been interested in experimenting with this for principled reasons, in practice we adopted it because we did not know how to support blind, anonymous review with WordPress. Open review is not a new idea: Requests For Comments are common in standards processes; both Nupedia [4] (the forerunner of Wikipedia) and H2G2 [6] (which predates Nupedia) use public peer-review. It is still, however, unusual in academia. In our experience from Ontogenesis, it raised no worries among our contributors, except that reviewers often wanted to be more involved in the proofing, a role normally played by authors low down the author list; an open review process blurs these lines somewhat.

One open area for discussion is the extent to which authors can, should, and wish to change articles after publication. While the ability to update is inherent in the web, the desire for non-repudiability was considered to be important; the contradiction here appears fundamental, and we do not feel we have reached a good compromise yet. In one sense, our use of the post-revision display plugin solves this problem; even if the article changes, it is still possible to refer to a specific version. However, like all automated versioning tools, many versions get recorded, often with very fine-grained changes, which makes selection of the ‘right’ version hard to impossible. We could replace this with an explicit versioning tool, similar to a source code versioning system; but such systems are hard to use for those unused to them, as well as being difficult to implement well. An environment like K-Blog, however, does allow rapid publication of, and bi-directional linking with, articles; combined with typed linking using CiTO, the ability to publish errata, addenda and second editions may be a better solution.

Our experiences with K-Blog, we think, are useful in understanding how semantic web technology can and will impact on the publication and library process. Both from our initial work with Ontogenesis, and subsequent work with http://taverna.knowledgeblog.org, it has become obvious that good tool support is critical. ‘Good’ in this sense can be straightforwardly interpreted as ‘familiar’, which in general means MS Word. Our choice of a blogging engine here was (unexpectedly) well-advised, as this form of publication is already supported by many tools. It is also clear that there are many other tools that could be added; while Ontogenesis has the content, for example, that might be found in an academic book, it does not currently have the presentation of a book. Articles are already available as ePub, and more recent work has used our Table of Contents plugin to provide a single site-wide ePub of all articles [25]. Pre-existing tools such as Anthologize [9] may also be useful for producing organised collections of articles gathered from the whole.

This has a direct implication for the addition of further semantics to content. On the positive side, the use of WordPress makes semantic additions plausible in a way that many conventional publishing processes do not. For example, the publication of our (PWL, RS) recent paper [20] required conversion from the LaTeX source to PDF (by latex), to another PDF, to an MS Word file (by hand), to XML, before arriving at the final HTML form. This process took many weeks and required multiple interactions between the authors and publisher. It still failed to preserve the semantic use (to humans) of Courier font highlighting in-text ontology terms, and required post-publication correction. The equivalent blog post [21] gave us nearly instantaneous feedback on the final form, allowing us to check that the semantics were present and correct.

The requirements for semantics have, however, to be light. We have concentrated throughout K-Blog on the ease of delivery of content; even with this focus, it is hard. In most cases, asking for more work, for more semantics than authors are used to giving in papers, is problematic. For example, I (PWL) attempted to add microformat-based markup to Ontogenesis, again identifying ontology terms. So far, all article authors have ignored this markup (including, embarrassingly, myself).

One solution to this issue is to ensure that authors themselves benefit directly from extra semantics. For example, the Mathjax-Latex plugin allows WordPress to present mathematics in TeX or MathML markup in the final document, which is more semantically meaningful than the default WordPress behaviour of rendering an image. From the author’s perspective, it also enables the use of TeX markup in Word, and the end product scales and looks less ugly on the web page.

With Kcite, we allow the user to embed DOIs or PubMed IDs; this can be achieved at no cost to the user, if they already use a bibliography tool, as it can transparently produce citations for them using Kcite shortcodes. Development versions of Kcite already allow easy switching of bibliographic style, which we hope will become an option for the author (rather than the website or publisher, as is currently the case), and/or the reader. With this additional information, we can also embed more semantics into the end document at no additional cost to the author, using for example the least specific CiTO cites term. However, further use of CiTO will require the author to decide which term to use, with relatively little gain to themselves, and may require extensions to bibliographic tools if we are to maintain the transparency of Kcite shortcodes; even if the tools are present, it is unclear whether authors will use them. We note that semantics useful to domain authors are likely to be domain-specific; mathematicians are more likely to care about maths presentation, but less likely to care about PubMed IDs. We need to be able to extend the publishing model and environment to cope with different journals.

From a technological perspective, we have found the use of shortcodes to be a good mechanism for authors to add semantics. They are simple and relatively easy to understand. In some cases they can be hidden from the user entirely; forcing users to add markup to otherwise WYSIWYG environments such as MS Word is best avoided. Although the direct use of a more standard XML markup would seem more sensible, in practice it requires tool support, as XML markup will be escaped by helpful remote posting tools. Extension of remote posting tools is hard (for tools like MS Word) or impossible (for cloud tools such as Google Docs or LiveWriter). A blogging engine such as WordPress makes it trivial to replace shortcodes with both a presentation format and a machine-interpretable microformat; for example, the development version of Kcite transforms DOI shortcodes ([cite]10.232/43243[/cite]) into in-text citations (Smith et al, (2002)) embedded in a span tag (<span kcite-id="10.232/43243">Smith et al, (2002)</span>), which are subsequently transformed into final presentation form within the browser using Javascript. The presentation form can also support additional semantic markup such as CiTO [26].
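The server-side half of this transformation can be modelled briefly. This Python sketch is illustrative only (not Kcite's actual code): it wraps each shortcode in a span carrying the identifier, leaving the raw identifier as placeholder text for the client-side renderer to replace with a styled citation.

```python
import re

# Hypothetical sketch of the server-side step: each [cite] shortcode
# becomes a span carrying the identifier as a kcite-id attribute; a
# client-side script would later substitute the formatted citation.
def shortcode_to_span(html):
    return re.sub(
        r"\[cite\](.+?)\[/cite\]",
        lambda m: f'<span kcite-id="{m.group(1)}">{m.group(1)}</span>',
        html,
    )
```

Keeping the identifier in an attribute means the machine-interpretable form survives even after the browser rewrites the visible text.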

Although we believe that additional semantics are a good thing, we will not enforce a requirement for additional semantics on authors. If authors choose not to use kcite, then this is their choice; we need to show that it is useful. Our experience with many (non-)standards such as COinS, DOIs, OAI-ORE and LOCKSS is that they are not simple, speaking primarily to publishers or librarians. For a semantic web approach to work, it must focus on authors and readers, as they produce and consume the content. Extracting even lightweight semantics from authors who are ontology experts is hard. For other domains, the situation may be worse.

Current publishing practices make use of semantic web technology impractical; semantics added by authors are unlikely to be represented correctly if the end product is a PDF typeset by hand. Moreover, we can see little point in adding semantics to individual articles if this is done in a bespoke way. With K-Blog, we have focused on providing both content and a full process, with review, using existing tools and workflows, adding semantics secondarily or incidentally where we can. As a result, the level of semantics that we have achieved is lightweight. However, we believe that K-Blog and WordPress, combined with associated tooling, provide all the basic requirements for a publishing process, and that they provide an attractive framework on which to build a semantic web.

# Acknowledgements

We would like to acknowledge the contribution of the authors of articles for both the Ontogenesis and Taverna K-Blogs, whose feedback was essential for this process. K-Blog is currently funded by JISC.

# Bibliography

[1]

Bioinformatics. http://bioinformatics.knowledgeblog.org.

[2]

Datacite. http://datacite.org/.

[3]

Health and Public Health. http://health.knowledgeblog.org.

[4]

Nupedia. http://en.wikipedia.org/wiki/Nupedia.

[5]

Open Journal System. http://pkp.sfu.ca/?q=ojs.

[6]

The Guide to Life, the Universe and Everything. http://www.bbc.co.uk/h2g2/.

[7]

WordPress. http://www.wordpress.org.

[8]

Wikipedia:expert retention, 2008. http://en.wikipedia.org/wiki/Wikipedia:Expert_retention.

[9]

Anthologize, 2010. http://anthologize.org/.

[10]

Citation style language, 2010. http://www.citations-styles.org.

[11]

The problem with DOIs, 2011. http://www.russet.org.uk/blog/2011/02/the-problem-with-dois/.

[12]

The Taverna Knowledgeblog, 2011. http://taverna.knowledgeblog.org.

[13]

Sean Bechhofer. Reflections on blogging a book. Ontogenesis, 2011. http://ontogenesis.knowledgeblog.org/647.

[14]

Frank Bennett. Citeproc-js. https://bitbucket.org/fbennett/citeproc-js/wiki/Home.

[15]

Simon Cockell, Dan Swan, and Phillip Lord. Knowledgeblog types and peer-review levels. Process, 2010. http://process.knowledgeblog.org/archives/19.

[16]

[17]

Casper Grathwohl. Wikipedia comes of age. The Chronicle of Higher Education, 2011. http://chronicle.com/article/article-content/125899/.

[18]

D. Kell. Metabolomics, food security and blogging a book, 2010. http://blogs.bbsrc.ac.uk/index.php/2010/01/metabolomics-food-security-blogging-book/.

[19]

Jim Logan. What is an ontology? | ontogenesis, 2010. http://ontogoo.blogspot.com/2010/01/what-is-ontology-ontogenesis.html.

[20]

Phillip Lord and Robert Stevens. Adding a little reality to building ontologies for biology. PLoS One, 2010.

[21]

Phillip Lord and Robert Stevens. Adding a little reality to building ontologies for biology, 2010. http://www.russet.org.uk/blog/2010/07/realism-and-science/.

[22]

Phillip Lord and Robert Stevens. The Ontogenesis Manifesto, 2010. http://ontogenesis.knowledgeblog.org/manifesto.

[23]

Phillip Lord and Robert Stevens. Ontogenesis: One year on. Ontogenesis, 2011. http://ontogenesis.knowledgeblog.org/1063.

[24]

Tom Oinn, Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris, Kevin Glover, Carole Goble, Antoon Goderis, Duncan Hull, Darren Marvin, Peter Li, Phillip Lord, Matthew R. Pocock, Martin Senger, Robert Stevens, Anil Wipat, and Chris Wroe. Taverna: lessons in creating a workflow environment for the life sciences: Research articles. Concurr. Comput. : Pract. Exper., 18:1067–1100, August 2006.

[25]

Peter Sefton. Making epub from wordpress (and other) web collections, 2011. http://jiscpub.blogs.edina.ac.uk/2011/05/25/making-epub-from-wordpress-and-other-web-collections/.

[26]

David Shotton. CiTO, the Citation Typing Ontology. Journal of Biomedical Semantics, 1(Suppl 1):S6, 2010.

[27]

M.Q. Stearns, C. Price, K.A. Spackman, and A.Y. Wang. SNOMED clinical terms: overview of the development process and project status. In AMIA Fall Symposium (AMIA-2001), pages 662–666. Henley & Belfus, 2001.

[28]

The Gene Ontology Consortium. Gene Ontology: Tool for the Unification of Biology. Nature Genetics, 25:25–29, 2000.

### Feedback on Webprints

Josh Brown from JISC has given his permission for me to reproduce the feedback from the peer-review of my last JISC grant, which bounced. A shame, as it would have provided us with an opportunity to test out knowledgeblog on papers from the wild, while also producing a great demonstrator of the advantages of using the web to distribute papers with web technology, rather than just dumping a link to a PDF.

With luck, we can rejuvenate this work in another way.

“One bid (Bid no 8: Newcastle University) was flagged by one of the markers as being out of scope, despite receiving good marks and positive comments from the other two markers.

The original terms of the call specifically state that projects must add value to existing peer reviewed journals. Projects seeking solely to create new publications are specifically excluded. (Please review the sections Expected Outputs and Requirements of the call for more detail on these conditions.)

Bid no 8 states:

“we will identify authors within Newcastle, take their open-access publications and recast them into a form suitable for WordPress”

The bid is clearly designed to aggregate content that has been published elsewhere, largely based on content held within Newcastle’s institutional repository. No existing, peer-reviewed scholarly journal is involved in this project.

While the creation of a web-native publishing tool clearly has merit, as identified by the two markers who praised this bid, the funding call is, as stated, intended to add value to existing publications. In the absence of an existing peer-reviewed publication as a partner in this project, the bid is out of scope”

The panel agreed with this analysis, which meant that, despite the fact that the project was viewed unanimously as very strong proposal on its own merits, we were obliged to decline to fund this project. The requirement for direct partnership with an existing peer-reviewed scholarly journal for all projects in this strand was imposed after lengthy discussion, and for a range of reasons, including sustainability, tight time-frames and so on, and it was felt that this should be upheld.

— Josh Brown