Abstract

Semantic publishing can enable richer documents with clearer, computationally interpretable properties. For this vision to become reality, however, authors must benefit from this process, so that they are incentivised to add these semantics. Moreover, the publication process that generates final content must allow and enable this semantic content. Here we focus on author-led or "grey" literature, which uses a convenient and simple publication pipeline. We describe how we have used metadata in articles to enable richer referencing of these articles and how we have customised the addition of these semantics to articles. Finally, we describe how we use the same semantics to aid in digital preservation and non-repudiability of research articles.

• Phillip Lord
• Lindsay Marshall

Plain English Summary

Academic literature makes heavy use of references: effectively, links to other, previous work that supports, or contradicts, the current work. This referencing is still largely textual, rather than using hyperlinks as is common on the web. As well as being time-consuming for the author, it is also difficult to extract the references computationally, as they are formatted in many different ways.

Previously, we have described a system which works with identifiers such as ArXiv IDs (used to reference this article above!), PubMed IDs and DOIs. With this system, called kcite, the author supplies the ID, and kcite generates the reference list, leaving the ID underneath, where it is easy to extract computationally. The data used to generate the reference comes from specialised bibliographic servers.

In this paper, we describe two new systems. The first, called Greycite, provides similar bibliographic data for any URL; it is extracted from the URL itself, using a wide variety of markup and some ad-hoc tricks, which the paper describes. As a result it works on many web pages (we predict about 1% of the total web, or a much higher percentage of “interesting” websites). Our second system, kblog-metadata, provides a flexible system for generating this data. Finally, we discuss ways in which the same metadata can be used for digital preservation, by helping to track articles as and when they move across the web.

This paper was first written for the Sepublica 2013 workshop.

The pain of foreign names

Our original intention with Greycite was to build a tool which could provide bibliographic metadata for any URL, to support our own kcite referencing tool. While it still fulfils this function, it has also turned out to be a useful, general-purpose tool for investigating the metadata in web pages, and this reveals some interesting results. We discovered a nice example of this recently while adding RIS support.

The paper in question comes from EMBO reports. At first sight, the RIS for this page taken from Greycite looks reasonable.

TY - ELEC
UR - http://www.nature.com/embor/journal/v14/n3/full/embor201311a.html
Y2 - 2013-04-18 12:54:54
TI - The economics of creative research
JO - EMBO reports
PY - 2012
DA - 2012-02-08
DO - 10.1038/embor.2013.11
AU - Cou|[eacute]|e, Ivan
ER -
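One reason RIS is attractive for this purpose is that its tag–value layout is trivial to process. A minimal sketch of a parser, tolerant of the spacing above (illustrative only, not Greycite's actual implementation):

```python
# Minimal RIS parser: each line is "TAG - value"; an "ER" line ends a record.
# Illustrative sketch only, not Greycite's actual code.

def parse_ris(text):
    records, current = [], {}
    for line in text.splitlines():
        tag, sep, value = line.partition(" - ")
        tag = tag.strip()
        if not sep or len(tag) != 2:
            continue  # not a RIS tag line
        if tag == "ER":
            records.append(current)
            current = {}
        else:
            # Repeated tags (e.g. multiple AU lines) accumulate in a list.
            current.setdefault(tag, []).append(value.strip())
    return records

ris = """TY - ELEC
TI - The economics of creative research
AU - Cou|[eacute]|e, Ivan
ER - 
"""
records = parse_ris(ris)
```

Even this toy parser recovers the broken author string verbatim, which is exactly the behaviour discussed below: the format is fine; the data inside it is not.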

However, something strange is going on with the author; poor old Ivan Couée’s name has been rather broken. So, why is this happening? Looking at the underlying HTML, the first thing that hits you is a lot of space; there are over 50 empty lines at the beginning of the file. Still, this is only a problem for people strange enough to be reading the HTML.

However, eventually we get to the metadata: first Dublin Core, and then what we describe as Google Scholar metadata (since this is where we found it). And there we have it; Greycite is reporting the metadata exactly as it is. The author’s name is represented with |[eacute]| as a letter.

<meta name="dc.language" content="en" />
<meta name="dc.rights" content="&#169; 2012 Nature Publishing Group" />
<meta name="dc.title" content="The economics of creative research" />
<meta name="dc.creator" content="Ivan Cou|[eacute]|e" />
<meta name="dc.identifier" content="doi:10.1038/embor.2013.11" />
<meta name="dc.date" content="2012-02-08" />

<meta name="citation_publisher" content="Nature Publishing Group" />
<meta name="citation_authors" content="Ivan Cou|[eacute]|e" />
<meta name="citation_title" content="The economics of creative research" />
<meta name="citation_date" content="2012-02-08" />
<meta name="citation_volume" content="14" />
<meta name="citation_issue" content="3" />
<meta name="citation_firstpage" content="222" />
<meta name="citation_doi" content="doi:10.1038/embor.2013.11" />
<meta name="citation_journal_title" content="EMBO reports" />
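Pulling these tags out of a page is straightforward; a sketch of the kind of extraction Greycite performs, using only the standard library (Greycite's real extraction is considerably more elaborate, with many more vocabularies and fallbacks):

```python
from html.parser import HTMLParser

# Collect Dublin Core ("dc.*") and Google-Scholar-style ("citation_*")
# <meta> tags from an HTML page. Illustrative sketch only.

class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name, content = a.get("name", ""), a.get("content")
        if content and (name.startswith("dc.") or name.startswith("citation_")):
            # Some fields (e.g. creators) may legitimately repeat.
            self.metadata.setdefault(name, []).append(content)

html = '''<meta name="dc.creator" content="Ivan Cou|[eacute]|e" />
<meta name="citation_journal_title" content="EMBO reports" />'''
extractor = MetaExtractor()
extractor.feed(html)
```

As with the RIS output, the extractor faithfully reports whatever the publisher put in the page, broken name and all.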

As far as we can tell this is an error; HTML entities or extended character sets would be entirely valid, but |[eacute]| does not appear to be a valid representation. Interestingly enough, there also appears to be some slightly buggy code behind the PRISM metadata, which we are sure should not look like this.

<meta name="prism.issn" content="ERROR! NO ISSN" />
<meta name="prism.eIssn" content="ERROR! NO EISSN" />

Our guess is that the problem lies at the point of website generation rather than deeper in the bowels of the publishing system; grabbing the metadata for this article from CrossRef by content negotiation shows the correct name.

{"volume":"14",
"issue":"3",
"DOI":"10.1038/embor.2013.11",
"URL":"http://dx.doi.org/10.1038/embor.2013.11","title":"The economics of
creative research",
"container-title":"EMBO reports",
"publisher":"Nature Publishing Group",
"issued":{"date-parts":[[2012,2,8]]},
"author":[{"family":"Couée","given":"Ivan"}],
"editor":[],"page":"222-225",
"type":"article-journal"}

We empathise with the publishers here. Getting character sets correct is the bane of everyone’s life; given the state of computing when multi-lingual character sets appeared, we guess this is not an example of premature optimisation, but an example of an optimisation you wish had never happened. The world would be an easier place if everything had used Unicode from the start.

The current metadata for this paper can be seen on Greycite, in summary or in detail. Hopefully, it will be updated in time!

This post was written by Phillip Lord and Lindsay Marshall

Update

Spelling mistake corrected, bibliography added. Thanks to Christian Perfect for bug report.


Overlays over arXiv

Much has been said about overlay journals. The idea is simple: the journal essentially becomes a selector, a channel, with the paper itself being hosted elsewhere, such as on arXiv.

This holds a certain amount of attraction for me; I already post my new papers on arXiv, and I have been posting them here also. This works well, but is hampered by technology. Mostly I write papers in LaTeX, and I have written tools to make these suitable for WordPress; these work well enough to publish an entire thesis. However, the process of doing this is not slick. For instance, when trying to publish one of my own papers, I had problems because I used a theorem environment. While PlasTeX is a nice tool, the key problem is that it is fundamentally a different interpreter from TeX. Eventually, perhaps, LuaTeX will get an HTML backend, but until this happens the system will always fail in some cases.

So, I wanted to investigate whether it was possible to build overlay functionality into a personal publication framework, such as the WordPress installation I host these articles on. It turns out that, combined with the tools that I have written for manipulating metadata, it is relatively simple to do so; my first attempt is now available for my OWLED 2013 paper. The title, authors (just me in this case), date, abstract and PDF link all come directly from arXiv. Full text is not available from arXiv; anyway, it would suffer from all the issues described earlier, and in the end the PDF is probably the best representation of this paper. I have supplemented this with a plain English summary, something that I have wanted to do for years, but have not managed to start. If the reviewers will allow me to do so, I will also attach these when they become available.

The code for this is not quite ready to release yet; however, it will potentially work over any eprints repository, and I have connected it up to Greycite also, so it can be used over any source that Greycite can interpret.

All a little clunky, but I think that this is the future. The journal is dead; long live the article.

Update

Fixed DOI.


Archiving of Scientific Material

In this article, I consider the practical issues with archiving scientific material placed on the web; I describe the motivation for doing this, give the background, and consider the various mechanisms for doing so.

As part of our work on knowledgeblog, we have been investigating ways of injecting the formal technical aspects of the scientific publication process into this form of publication. The reasons for this are myriad: if scientists can control the form, they can innovate in their presentation however they choose; the publication process itself becomes very simple and straightforward (as opposed to the authoring, which is as hard as it ever was). Finally, it means that scientists can publish as they go, as I have done and am doing with my work on Tawny-OWL. This latter point has many potential implications. Firstly, it makes science much more interactive; scientists can publish things that they are not yet clear on, early ideas, and can (and I do) get feedback on these early. Secondly, it should help to overcome publication bias, as it is much lighter-weight than the current publication process; scientists are more likely to publish negative results if the process is easy and not expensive. And, lastly, it can help to establish provenance for the work: if every scientist published in this way, scientific fraud would be much harder, as a fraudulent scientist would have to produce a coherent, faked set of data from the early days of the work.

However, to achieve this, posts must remain available; the scientific record needs to be maintained. Now, this should not be an issue. I write this blog in Asciidoc, and rarely use images, so the source is quite small. In fact, since I moved to WordPress in 2009, it totals about 725k; it would fit on a floppy, which is a crushing blow to my ego. So, how easy is it to archive your content?

The difficulty here is that there is no obvious person to do this. Like many universities, mine gives me access to an eprints archive. Unfortunately, this is mainly used for REF, and has no programmatic interface. The university also has a LOCKSS box. However, this is not generally available for the work that University staff have produced, but only for the journals that the University has bought; so I would have to give my work away to a paywall publisher, or pay heavily to an open access publisher, to use this.

Another possibility would be to use Figshare. Now, I have some qualms about Figshare anyway; it appears to be a walled garden, the Facebook of science. Others, however, do not worry about this and are using Figshare. Carl Boettiger, for instance, archives his notebook on Figshare. But there is a problem: consider the 2012 archive; it is a tarball, with Markdown files inside. I know what to do with this, but many people will not. And it is only weakly linked to the original publication. Titus Brown had the same idea, claiming the added value of DOIs, something I find dubious. Again, though, there is the same problem: Figshare archives the source. The most extreme example of this comes from Karthik Ram, who has published a Git repository; unsurprisingly, it is impossible to interact with as a repo.

Figshare likes to make great play of the fact that it is backed by CLOCKSS, a set of distributed copies maintained by research libraries. Now, it might seem sensible for CLOCKSS to offer this service (at a price, of course) to researchers. Perhaps they do, but the website reveals nothing about this; and, although I tried, they did not respond to emails either. Rather like DOIs, the infrastructure is built around scale; in short, you need a publisher or some other institution involved. All very well, but this contradicts the desire for a light-weight publication mechanism. There is a second problem with CLOCKSS: it is a dark archive, that is, its content only becomes available to the public after a “trigger event” such as the publisher going bust or the website going down. Now, data which is on the web and, critically, archived by someone other than the author essentially becomes non-repudiable and time-stamped: I can prove (to a lower bound) when I said something, and you can prove that I said something even if I wish I hadn’t. In a strict sense, this is true if the data is in CLOCKSS; but in a practical sense, it is not, as checking when and what I said becomes too much of a burden to be useful.

So, we move onto web archiving. The idea of web archiving is attractive to me for one main reason; it is not designed for science. It is a general purpose, commodity solution, rather like a blog engine. If one thing scientific publication needs more than anything, it is to move the technology base away from bespoke and toward commodity.

One of the most straightforward solutions for web archiving is WebCite; the critical advantage it has is that it provides an on-demand service. I have been using it for a while to archive this site; Greycite now routinely submits new items there, if we can extract enough metadata from them. The archiving is quick and effective. The fly in the ointment is that WebCite has funding issues and is threatened with closure at the end of 2013. The irony is that it claims it needs $25,000 to continue. Set against the millions put aside for APCs, or the thousands NPG claims is necessary to publish a single paper, or the millions that the ACM spends supporting its digital library, this is small beer, and it shows the lack of seriousness with which we take web archiving. I hope it survives; if it does, Gunther Eysenbach, who runs it, tells me that they plan to expand the services they offer. It may yet become the archiving option of choice.

I have been able to find no on-demand alternative to WebCite. However, there are several other archives available. I have been using the UK Web Archive for a while now. I first heard about this service, irony of ironies, on the radio. Since I first used it to archive knowledgeblog, and later this site, the process has got a lot easier. No longer do I need to send signed physical copyright permission; this became electronic (email, I think), and it now appears that the law is changing to allow them to archive more widely (the BBC covered this in a story, categorised under “entertainment and arts” and largely focused on Stephen Fry’s tweets), although this will be a dark archive. Currently, this journal has been archived only once; from my other sites, it appears that they work on a six-month cycle. So, while this provides good digital preservation, it is a less good solution from the perspective of non-repudiability; there is a significant gap before the archive happens, and a slightly longer one until the archive is published.

The UKWA is, as the name suggests, UK specific. Another solution is, of course, archive.org, which might be considered the elephant in the room for web archiving. Unlike the UKWA, they do not take submissions, but just crawl the web (although I suspect that the UKWA will now start doing this also). Getting onto a crawl can, therefore, be rather hit-and-miss. Frustratingly, they do have an “upload my data” service, which you can access through a logged-in account, but not an “archive my URL” service. Again, a very effective resource from a digital preservation perspective, but with similar problems to the UKWA from the point of view of non-repudiability: the archives take time to appear; in my experience, somewhat longer than with the UKWA. I have also contacted their commercial wing, http://archive-it.org. Their software and the crawls they offer could easily be configured to do the job, but unfortunately they are currently aimed very much at the institutional level: their smallest package provides around 100GB of storage, while this blog can be archived in around 130MB (without deduplication, which would save a lot); even a fairly prolific blogger comes in at around 250MB. The price, unfortunately, reflects this; although, again, it is on a par with my yearly publication costs, so is well within an average research budget.

Finally, Greycite now generates pURLs: two-step resolution URLs which work rather like DOIs (or, actually, DOIs operate like pURLs, since, as far as I am aware, pURLs predate the web infrastructure for DOIs). These resolve directly to the website in question. With a little support, Greycite can track content as and if it moves around the web; even if this fails, and an article disappears, Greycite will redirect to the nearest web archive.
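The two-step resolution described here can be sketched very simply. The mapping table and URLs below are hypothetical, and the fallback behaviour is only an illustration of the idea, not Greycite's actual resolver:

```python
# Sketch of pURL-style two-step resolution: a stable identifier maps to a
# current location; if the article has vanished, fall back to a web archive.
# Table, URLs and archive prefix are all hypothetical examples.

LOCATIONS = {
    "purl/1234": "http://www.example.org/blog/2012/some-article",
}
ARCHIVE_PREFIX = "http://web.archive.org/web/"

def resolve(purl, alive=lambda url: True):
    """Return the URL a pURL should redirect to, or None if unknown."""
    target = LOCATIONS.get(purl)
    if target is None:
        return None
    if alive(target):
        return target
    # Article gone: redirect to an archived copy of the last known URL.
    return ARCHIVE_PREFIX + target
```

The point of the indirection is that LOCATIONS can be updated if content moves, while the pURL printed in a reference list stays fixed.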

In summary, there is no perfect solution available at the moment, but there are many options, and in many cases archiving will happen somewhat magically. As we have found with many other aspects of author self-publishing on the web, it is possible to architecturally replicate many of the guarantees provided by the scientific publication industry through the simple use of web technology. Tools like Greycite and kblog-metadata are useful in uncovering the archives that are already there, and in linking these together with pURLs. Taken together, I have a reasonable degree of confidence that this material will be available in 10 or 50 years’ time. Whether anyone will still be reading it, well, that is a different issue entirely.


Open Access Response to HEFCE

HEFCE is currently asking for feedback on the role of Open Access in the next REF. While I have a number of technical suggestions, I think that the biggest and best contribution that HEFCE could make to the next REF is to state publicly that all journal/conference/venue metadata will be removed from papers before they are sent for review.

It is time that we stopped judging books by their cover. It would be a fantastic contribution if HEFCE could take a lead on this. This is my full response.

Expectations for Open Access

I feel that one key issue is missing from this document. Scientists still have problems in some areas (including my own of computing science) in that the “high-impact” journals or conferences often provide no, or prohibitively expensive, open access options. In the past, I have refused to publish in these journals because I wish my work to remain open access, and have instead published elsewhere. However, this works directly against my own interests in the current REF, as the research will be judged as less good. The use of journals as a primary indicator of quality also works against my ability to choose cheaper venues. Few people believe statements that research will not be judged on publication venue; indeed, as an individual academic, I have even been told to comment directly on the venue in my return.

One simple and yet enormous contribution that the procedures for the next REF could make to Open Access is not to coerce, but to remove this enormous barrier. This could happen simply and straightforwardly by removing all journal and publication venue metadata from papers when they are presented to reviewers. Of course, reviewers could work around this (the data is a Google search away), but the message sent by such a step would be enormous.

The general expectations for OA publishing seem reasonable. However, I would add a further specific requirement. Currently, it is very hard to find the location of a green OA copy of any article. Making articles available is not enough; they must be discoverable. Therefore, I would suggest a specific requirement that a primary identifier (DOI, ISSN, ISBN or URL) must be present in the institutional repository, and that this must be visible on the web page and present in computational metadata. Finally, making the paper discoverable is also not enough. There must be computational and human-readable metadata making clear that the contents of the paper are Open Access; without this form of explicit statement, the only safe course of action for readers is to assume the copyright default position: that you cannot use the material.

Institutional Repositories

Despite the significant investment, our experience is that few people ever retrieve data from institutional repositories. Partly, this is because it is difficult to link between articles on a journal website and articles in institutional repositories. As a second problem, institutional repositories provide an inconsistent experience, both for computational and human access. For instance, the presentation of identifiers such as DOIs is inconsistent; even when present, DOIs are often inaccurate, containing syntactic errors which prevent their usage.
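Many of these syntactic errors could be caught with a simple check before the identifier is stored. A loose sketch of such a check; the pattern is a pragmatic approximation of DOI syntax, not the full CrossRef recommendation:

```python
import re

# A loose syntactic check for DOIs of the kind institutional repositories
# often get wrong. A DOI has the shape "10.<registrant>/<suffix>"; this
# pattern is an approximation for illustration, not a complete validator.

DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(s):
    return bool(DOI_PATTERN.match(s.strip()))
```

Even this rough check rejects common repository mistakes such as a stray "doi:" prefix or a truncated identifier with no suffix.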

Ultimately, institutional repositories would be much better if there were a single infrastructure maintained at a national (or international) level. In fact, a strong exemplar for this already exists in the form of arXiv. The ability to update the repository could be devolved to individual institutions. An authentication framework for this is already in place through Je-S.

Linking between institutional repositories and subject repositories unfortunately is likely to be difficult from a social perspective; there are many subject repositories and the institutional repositories are not likely to link to them well, because they are not experts in these repositories. This might be more plausible in a single national repository.

The better solution is to enable the authors of papers to perform this linking. Scientists, who actually care about the links working and pointing to the correct place, are best placed to do this. This could be supported by making linking to data, software or other subject repositories an explicit criterion in the REF; this already happens in some disciplines (for example, in bioinformatics, reviewers often ask for a clear statement of whether and where software is available, and under what conditions).

Approach to Exceptions

If exceptions are to be for a transitional period, then any exceptions given should be marked with a “sell-by” date, after which they should no longer be considered valid.

It is worth reiterating that embargoes really only benefit the publishers. Ensuring that the REF framework allows academics to choose their publication venue more freely, rather than effectively requiring them to publish in selected “high-impact” venues, would enable them to choose venues with short or no embargo periods. The most effective mechanism for achieving this would be to remove all publication venue information from future REF returns; the research would then be judged on the basis of the research, and not the publication venue.

Open Data

There is more complexity behind the requirement for open data than for open access, particularly where the data needs to remain confidential for reasons of data protection. Having said this, there are many disciplines (again, bioinformatics is an obvious example) where the majority of data is open. Making a decision now to rule this out of scope, for a REF which may be a significant distance in the future, seems premature.

The evil a space can do

Recently, I was contacted by a Kcite user who had found an interesting problem. They had cut-and-pasted a DOI from an American Society for Microbiology article [webcite], and then used this in a blog post. But it was not working. The user had actually identified the problem, which was a strange character in the DOI.

So, I decided to investigate a bit further. Looking at the source for the page, the DOI appears mostly fine; it is not formatted according to the CrossRef display guidelines, but they are hardly alone in this.

 10.1128/​AAC.01664-10 

However, looking a little deeper, at the bytes of the source, we see this:

 00006260: 2020 2020 2020 2020 203c 7370 616e 2063           <span c
 00006280: 3130 2e31 3132 382f e280 8b41 4143 2e30  10.1128/...AAC.0
 00006290: 3136 3634 2d31 300a 2020 2020 2020 2020  1664-10.

The byte sequence e2 80 8b is the UTF-8 encoding of “zero width space” (U+200B). The first time I saw this, my initial inclination was to suggest that the publishers were being a pain and trying to prevent automatic harvesting of DOIs.

Actually, I suspect that this is not the case, as the DOI also appears, intact, in the page metadata.

It is also present in multiple other locations, in their social bookmarking widgets, and there it is unmolested by spaces. So, why have they done this? The answer, I think, is that they display their DOI in a widget which is “cleverly” written to appear static on the screen (well, sort of, but this is a different story), and their widget is not wide enough; the zero-width space permits a line break, so it lets them control where the line will wrap. Nonetheless, this piece of insanity prevents cutting-and-pasting of the DOI, and worse, does so in a way which is very hard for humans to detect. This kind of error even gets into institutional repositories, where it significantly hinders their usefulness. A quick check suggests this is ubiquitous on the American Society for Microbiology website.
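The character really is invisible: the pasted string looks identical to the clean DOI but compares unequal. A small cleaning step, of the kind a tool like Kcite could apply, catches it (the list of characters stripped here is my own choice, not an exhaustive one):

```python
# U+200B (zero width space) survives cut-and-paste but breaks DOI
# resolution. Strip it, and a few similarly invisible characters,
# before using a pasted DOI. The character list is illustrative.

INVISIBLES = "\u200b\u200c\u200d\ufeff"  # ZWSP, ZWNJ, ZWJ, BOM

def clean_doi(doi):
    return "".join(ch for ch in doi if ch not in INVISIBLES).strip()

pasted = "10.1128/\u200bAAC.01664-10"    # what the user actually copied
assert pasted != "10.1128/AAC.01664-10"  # looks identical, compares unequal
```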

The CrossRef display guidelines are a little ambiguous here. Technically, as the zero-width space cannot be seen, it could be considered within the guidelines. I shall write to them to find out.

In case this article sounds overly pious, I have to raise my hand here in shame, as I have used the same technique for different purposes. An article that I published yesterday on inline citations for kcite uses zero-width joiners to break up a short-code, so that it is displayed rather than interpreted. If the example is cut-and-pasted from the article into a new WordPress post, it will not work because of this. I will fix this soon, using Unicode entities for the brackets instead.

Update

Thanks to some swift action by Geoff Bilder, CrossRef’s display guidelines have now been updated. While it will take a while, the knock-on effects of this change will be significant.