### Open Access Response to HEFCE

HEFCE is currently asking for feedback on the role of Open Access in the next REF. While I have a number of technical suggestions, I think that the biggest and best contribution that HEFCE could make to the next REF is to state publicly that all journal/conference/venue metadata will be removed from papers before they are sent for review.

It is time that we stopped judging books by their cover. It would be a fantastic contribution if HEFCE could take a lead on this. This is my full response.

## Expectations for Open Access

I feel that one key issue is missing from this document. Scientists still have problems in some areas (including mine of computing science) in that the “high-impact” journals or conferences often provide no or prohibitively expensive open access options. In the past, I have refused to publish in these journals because I wish my work to remain open access and instead published elsewhere. However, this works directly against my own interests in the current REF, as the research will be judged less good. The use of journals as a primary indicator of quality also works against my ability to choose cheaper venues. Few people believe statements that research will not be judged on publication venue; indeed, as an individual academic, I have even been told to directly comment on the venue in my return.

One simple and yet enormous contribution that procedures for the next REF could make to Open Access is not to coerce, but to remove this enormous barrier. This could happen simply and straightforwardly by removing all journal and publication venue metadata from papers when presented to reviewers. Of course, reviewers could work around this (the data is a Google search away), but the message sent by such a step would be enormous.

The general expectations for OA publishing seem reasonable. However, I would add a further specific requirement. Currently, it is very hard to find the location of a green OA copy of any article. Making articles available is not enough; they must be discoverable. Therefore, I would suggest a specific requirement that a primary identifier (DOI, ISSN, ISBN or URL) must be present in the institutional repository, and that this must be visible on the web page and present in computational metadata. Finally, making the paper discoverable is also not enough. There must be computational and human-readable metadata making clear that the contents of the paper are Open Access; without this form of explicit statement, the only safe course of action for readers is to assume the copyright default position that you cannot use the material.

## Institutional Repositories

Despite the significant investment, our experience is that few people ever retrieve data from institutional repositories. Partly, this is because it is difficult to link between articles on a journal website and articles in institutional repositories. As a second problem, institutional repositories provide an inconsistent experience, both for computational and human access. For instance, the presentation of identifiers such as DOIs is inconsistent. Even when present, DOIs are often inaccurate, containing syntactic errors which prevent their usage.

Ultimately, institutional repositories would be much better if there were a single infrastructure maintained at a national (or international) level. In fact, a strong exemplar for this already exists in the form of arXiv. The ability to update the repository could be devolved to individual institutions. An authentication framework for this is already in place through JE-S.

Linking between institutional repositories and subject repositories unfortunately is likely to be difficult from a social perspective; there are many subject repositories and the institutional repositories are not likely to link to them well, because they are not experts in these repositories. This might be more plausible in a single national repository.

The better solution is to enable authors of papers to perform this linking. Scientists who actually care about the links working and pointing to the correct place are best placed to do this. This could be supported in the REF by making linking to data, software or other subject repositories an explicit criterion; this happens in some disciplines (for example, in bioinformatics a clear statement of if and where software is available, and under what conditions, is often asked for by reviewers).

## Approach to Exceptions

If exceptions are to be for a transitional period, then any exceptions given should be marked with a “sell-by” date, after which they should no longer be considered valid.

It is worth reiterating that embargoes really only benefit the publishers; ensuring that the REF framework allows academics to choose their publication venue more freely, rather than effectively requiring them to publish in selected “high-impact” venues would enable them to choose venues with short, or no embargo period. The most effective mechanism for achieving this would be to remove all publication venue information from future REF returns. The research would be judged on the basis of the research, and not the publication venue.

## Open Data

There is more complexity behind the requirement for open data than for open access, particularly where the data needs to remain confidential for reasons of data protection. Having said all of this, there are many disciplines (again bioinformatics is an obvious example) where the majority of data is open. Making a decision now to rule this out of scope, for a REF which may be a significant distance in the future, seems premature.

### The evil a space can do

Recently, I was contacted by a Kcite user who had found an interesting problem. They had cut and pasted a DOI from an American Society of Microbiology article [webcite], and then used this in a blog post. But it was not working. The user actually did identify the problem, which was a strange character in the DOI.

So, I decided to investigate a bit further. Looking at the source for the page, the DOI appears mostly fine; it is not formatted according to CrossRef display guidelines, but they are hardly alone in this.

 10.1128/​AAC.01664-10 

However, looking a bit further into this, at the bytes of this source, we see this:

    00006260: 2020 2020 2020 2020 203c 7370 616e 2063
    00006280: 3130 2e31 3132 382f e280 8b41 4143 2e30  10.1128/...AAC.0
    00006290: 3136 3634 2d31 300a 2020 2020 2020 2020  1664-10.

The byte sequence “e2808b” is the UTF-8 encoding of U+200B, “zero width space”. The first time I saw this, my initial inclination was to suggest that it is the publishers being a pain and trying to prevent automatic harvesting of DOIs.

Actually, I suspect that this is not the case, as the DOI is in the page metadata:

 

It is also present in multiple other locations, in their social bookmarking widgets. And there it is unmolested by spaces. So, why have they done this? The answer, I think, is that they display their DOI in a widget which is “cleverly” written to appear static on the screen (well, sort of, but this is a different story). And their widget is not wide enough; the space is non-joining, so it allows them to control where the line break will happen. Nonetheless, this piece of insanity prevents cutting and pasting of the DOI, and worse, does so in a way which is very hard to detect, for humans at least. This kind of error even gets into institutional repositories, which significantly hinders their usefulness. A quick check suggests this is ubiquitous for the American Society of Microbiology website. Consider:

The CrossRef display guidelines are a little bit ambiguous here. Technically, as the zero-width space cannot be seen, it could be considered within the guidelines. I shall write to them to find out.
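Invisible characters of this kind are easy to catch mechanically before a pasted DOI is used. The following is a minimal Python sketch, not anything that Kcite or CrossRef actually runs: the function names `find_invisible` and `clean_doi` are hypothetical, and it assumes that every character in the Unicode "format" category (Cf), which includes the zero width space, can safely be dropped from a DOI.

```python
import unicodedata


def find_invisible(text):
    """Return (index, character name) for each format-category (Cf) character."""
    return [
        (i, unicodedata.name(ch, "U+%04X" % ord(ch)))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def clean_doi(text):
    """Drop format characters (assumption: none are meaningful in a DOI)."""
    kept = (ch for ch in text if unicodedata.category(ch) != "Cf")
    return "".join(kept).strip()


# The DOI as pasted from the page, with U+200B after the slash.
pasted = "10.1128/\u200bAAC.01664-10"

print(find_invisible(pasted))  # [(8, 'ZERO WIDTH SPACE')]
print(clean_doi(pasted))       # 10.1128/AAC.01664-10
```

Reporting the position of the offending character, rather than silently stripping it, makes it easier to tell a publisher exactly what is wrong with their page.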

In case this article sounds overly pious, I have to raise my hand here in shame, as I have used the same technique for different purposes. An article that I published yesterday on inline citations for kcite uses zero-width joiners to break up a short-code, so that it is displayed rather than interpreted. If the example is cut and pasted from the article into a new WordPress post, it will not work because of it. I will fix this soon, using Unicode entities for the brackets instead.

Update

Thanks to some swift action by Geoff Bilder, CrossRef’s display guidelines have now been updated. While it will take a while, the knock-on effects of this change will be significant.

### Why Metadata Must be Useful

Adding metadata to an article could be done by many people. This could be the author, and in the ideal world, this would be the author. They know most about the content and are best placed to put the most knowledge into it. But, we have to answer the question, why would they do this? We have previously argued that semantic metadata must be useful to the people who are producing it. For this, we need tools that extract and consume this metadata.

I discovered a nice example of this recently while reading an interesting paper from Yimei Zhu and Rob Proctor, investigating how PhD students use various tools to communicate. I was interested in citing this paper. The paper can be found on the web at the Manchester escholar site [webcite]. What metadata is in this page? Well, our Greycite tool is designed for this purpose; unfortunately, it suggests that there is very little in the way of metadata.

I contacted the escholar helpdesk, and they confirmed that there really is no embedded metadata; Greycite has not just missed it. The strange thing is that, several days later, I managed to get to a very similar page [webcite]. It has a different layout and colour scheme, but it’s clearly the same. The bibliographic metadata fields (not easily extractable, sadly!) appear identical. However, investigating the metadata in this page, we see a very different story. It is full of Dublin Core. It’s not ideally laid out, but it is all that we need for citation.

Unfortunately, there is no link between the two, nor do I know why Manchester has these two different pages; perhaps one is designed to replace the other. And, of course, from the point of view of the reader, there is no reason why they would suspect that one contains metadata and the other does not.
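Whether a page carries this kind of metadata can be checked with very little code. The following is a minimal Python sketch and not how Greycite itself works: the class and function names are hypothetical, the sample HTML is invented for illustration (it is not the markup of either Manchester page), and it assumes the common convention of embedding Dublin Core as `<meta>` elements whose `name` starts with `DC.` or `DCTERMS.`.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect (name, content) pairs from Dublin Core <meta> elements."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        # Assumption: Dublin Core fields are named "DC.*" or "DCTERMS.*".
        if name.lower().startswith(("dc.", "dcterms.")):
            self.fields.append((name, attr.get("content") or ""))


def dublin_core(html):
    """Return every Dublin Core field found in an HTML document."""
    parser = DublinCoreParser()
    parser.feed(html)
    return parser.fields


# Invented sample page; a real repository page would be fetched instead.
sample = """<html><head>
<title>Example record</title>
<meta name="DC.title" content="An example title">
<meta name="DC.creator" content="A. N. Author">
<meta name="viewport" content="width=device-width">
</head><body></body></html>"""

for name, content in dublin_core(sample):
    print(name, "=", content)
```

A page like the first one described above would simply return an empty list, which is exactly the signal a reader never gets to see.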

The point here is not to criticise Manchester library services. Instead, it is to raise the question, why are the two locations so different in terms of their metadata? My suspicion is that the real answer is simple: very few people have noticed, and no one really cares. It might be argued that metadata must be correct to be useful. The evidence suggests that the inverse is true: metadata must be useful to be correct.

There is a chicken-and-egg situation here; why write the tools to operate over metadata when no one is using the metadata? Fortunately, with kcite we have had a gradual path: first we used DOIs, then PubMed IDs, then arXiv, and now any URI at all. And with Greycite, we have used a lot of heuristics, and quite a few metadata formats. While it has been a significant amount of work, metadata is now making our lives easier. This is the way that it must be.

Update

Typographical correction.

### Splitting a Mercurial Repository

The Mercurial repository for KnowledgeBlog has been starting to show the strain for a while now. Firstly, when it was created we were all new to Mercurial; for instance, it contains the trunk directory, which is really a Subversion metaphor. The second problem is that it is a single large repository, which maps to the development directory on my hard drive; there is now a lot of experimental software on my hard drive which I don’t want in a public environment, so I am now faced with either an enormous .hgignore or more “untracked” files than tracked. Not ideal.

At the same time, I have more recently moved mostly toward using git. Actually, I still think Mercurial is nicer than git; the interface to the commands is cleaner, and the functionality is not that different. However, there is a fantastic UI, magit, for Emacs, while the equivalent for Mercurial is not as good. This is important to me. So, I wanted to try and address both of the issues at the same time: splitting the repository up, and moving to git.

The process for achieving this turned out to be relatively simple; Mercurial comes with a fantastic extension called convert. This is actually a general purpose extension to convert from other version control systems into Mercurial; however, it will also convert one hg repo to another. It has the ability to both filter the existing repo and rename locations at the same time. To create my new repository I used these commands:

    mkdir mathjax-latex-hg
    cd mathjax-latex-hg
    ## create filemap.txt
    hg init
    hg convert --filemap filemap.txt devel-hg-old/ .

which create a new Mercurial repository, and convert the data from the old, tangled repository. The filemap.txt file contains a couple of lines only:

    include trunk/plugins/mathjax-latex
    rename trunk/plugins/mathjax-latex .

These filter for just the mathjax-latex plugin and move all its files to top level. This is the only part of the process that needs changing to export different parts of the repo, as I have done four times now. This now gives me a Mercurial repository in the right shape. Now, we create a new git repo, and import the untangled Mercurial repo into git. Again, reasonably straightforward. hg-fast-export is the name of the command on Ubuntu, which is more sensible than the original fast-export, which is both overly generic and a hostage to the future.

    cd ..
    git init mathjax-latex-git
    cd mathjax-latex-git
    hg-fast-export -r ../mathjax-latex-hg
    git checkout HEAD

Finally, the repo needs to be made publicly available, in this case on GitHub. And all is complete.

    git remote add origin git@github.com:phillord/mathjax-latex.git
    git push -u origin master
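Since the filemap is the only thing that changes between exports, it is the natural piece to generate when the process is repeated. The following Python sketch is purely illustrative and was not part of the original process: `make_filemap` is a hypothetical helper, and it assumes that each export keeps exactly one subdirectory and moves it to the top level, as in the mathjax-latex case.

```python
def make_filemap(subdir):
    """Build a convert filemap that keeps only `subdir` and moves it to top level."""
    subdir = subdir.strip("/")
    return "include {0}\nrename {0} .\n".format(subdir)


# Reproduces the two-line filemap.txt used for mathjax-latex.
print(make_filemap("trunk/plugins/mathjax-latex"), end="")
```

Writing the returned string to filemap.txt in each new, empty repository is then the only per-plugin step before running hg convert.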

Of course, mathjax-latex does not actually need updating, because it is feature complete and working. However, the WordPress plugin page now includes a nasty warning, so I probably need to update it just to avoid this. Bit of a pain, especially as the only way of doing this involves updating the Subversion repository, which I don’t actually use.

### Is Peer Review the Future?

Today, I received an email from a journal, asking me if I would review a paper. The paper in question is by, among others, Iddo Friedberg, and can be read on arXiv. I’ve known Iddo Friedberg for a while; he was an early user of my semantic similarity work, for protein function prediction, and was also the editor for our paper on realism in ontology development. I would have liked to review this paper, and I feel a little bad because I know these things are important for the careers of the scientists.

So, why did I decline? Well, nice and simple; the page charges are just too high. There is no real justification for this, as it can be done much more cheaply; £200 or so seems reasonable. Moreover, I think it is bad for science because it is one of the factors that cause authors to think very carefully, and often save up work for “a bigger publication”. This can delay publication for years after the work has happened. Scientists have to think carefully about their research, and their work; thinking about whether to publish now or later is one piece of baggage that we could do without.

The real irony of the situation, though, is that the peer review for this paper is happening and, indeed, has already happened. The paper concerns bias in Gene Ontology annotations of protein functions. Iddo posted his work to the various Gene Ontology mailing lists; unsurprisingly, the GO annotation team saw the paper, and Rachael Huntley responded. The academic debate has started, and is in full swing. Others may see and contribute. And, frankly, the quality of the discussion going on there, and the depth of the analysis, is higher than I would have given. No journal has been involved; it happened because there is a mailing list which the scientists in question used.

The current peer-review system does not add value; it is my peers and the scientific debate that do this. And this can, and will, happen regardless of the journals; indeed, in this case, why don’t the journal editors just read the mailing list?

So why do scientists, including myself, continue to publish in this way? It can often be difficult, particularly where there are no open access options available. We have to; it’s part, indeed the main part, of our assessment. As I have said before, this is now the only reason I publish in this way.

Having said this, I do have my doubts. I feel somewhat guilty toward Iddo Friedberg, for instance. There is also a degree of hypocrisy in this: I will still submit to journals (for my own sake, of course, but also for my PhD students); will people, perhaps, wish to not review my articles? What would happen if everybody thought like this? (Here, I can use the Yossarian defence: then I’d be a damn fool to think any different.) If I set the bar at £200, then who will I review for? Well, I do review for conferences and workshops where I can. Still, I feel that this is not enough; people review my work, I should review theirs. So, I state here that, subject to some time constraints, I will happily review work that is posted either to the web in this form, or to sites such as arXiv. Reviews will be posted here, on this blog.

I have my doubts; but open access is not enough. Publication must get lighter, faster and much, much cheaper. I would welcome alternative courses of action.

Many thanks to Simon Cockell and James Malone who peer-reviewed this post, and provided helpful comments. I am also grateful to Iddo Friedberg who gave me permission to use the story about his paper in this way. The opinions expressed here are, however, my own.

Update

In response to feedback from Mike Taylor, it is worth pointing out that I do not review for paywall journals, and have not for quite a while.

### Testing Times for Tawny

Tawny OWL, my library for building ontologies, is now reaching a nice stage of maturity; it is possible to build ontologies, reason over them and so forth. We have already started to use the programmable nature of Tawny, trivially with disjoints, as well as allowing the ontology developer to choose the identifiers that they use to interact with the concepts. However, I wanted to explore further the usefulness of a programmatic environment.

One standard facility present in most languages is a test harness, and Clojure is no exception in this regard. Tawny already comes with a set of predicates for testing superclasses, both asserted and inferred, which provides a good basis for unit testing. So, this example using my test Pizza ontology essentially tests the definition of CheesyPizza; it should hold up in both positive and negative cases.

    (deftest CheesyShort
      (is (r/isuperclass? p/FourCheesePizza p/CheesyPizza))
      (is (r/isuperclass? p/MargheritaPizza p/CheesyPizza))
      (is (not (r/isuperclass? p/MargheritaPizza p/FourCheesePizza))))

While this is nice, it is not enough in some cases where I wanted to test that things do not happen. For this I introduce a new macro, with-probe-entities, which adds “probe classes” into the ontology; that is, classes which are there only for the purpose of a test. In this case, I test the definition of VegetarianPizza to see whether MargheritaPizza reasons correctly as a subclass. Additionally, though, I also check that a subclass of both VegetarianPizza and CajunPizza (which contains sausage) is inconsistent. This test could be more specific, as it tests for general coherency, although I do check for this independently. The with-probe-entities macro cleans up after itself: all entities (which can be of any kind, not just classes) are removed from the ontology afterwards, so independence of testing is not compromised.

    (deftest VegetarianPizza
      (is (r/isuperclass? p/MargheritaPizza p/VegetarianPizza))
      (is (not
           (o/with-probe-entities
             [c (o/owlclass "probe" :subclass p/VegetarianPizza p/CajunPizza)]
             (r/coherent?)))))

Of course, a natural consequence of the addition of tests is the desire to run them frequently; moreover, the desire to run them in a clean environment. The solution to this turns out to be simple. Travis-CI integrates nicely with GitHub, so the addition of a simple YAML file of this form enables continuous integration of both the Pizza ontology and the environment (such as Tawny, for instance).

    language: clojure
    lein: lein2
    jdk:
      - openjdk7

The output of this process is available for all to read, along with the tests for my mavenized version of Hermit, and also Tawny itself. This is not the first time that ontologies have been continuously integrated; however, the nice advantage of this is that I have not had to install anything. It even works against external ontologies: so we have both GO and OBI. Currently, these work against static versions of GO and OBI. I could automate this process from the respective repositories of these projects, by pulling with git-svn and pushing again to GitHub.

All in all, though, the process of recasting ontology building as a programming task is turning out to be an interesting experience. Much of the tooling that enables collaborative ontology building just works. It holds much promise for the future.