Archive for the ‘Professional’ Category

Tawny-OWL (http://www.russet.org.uk/blog/2214) is a library which enables the programmatic construction of OWL (http://www.russet.org.uk/blog/2366). One of the limitations of tawny as it stood was that it did not implement numeric, semantics-free identifiers (http://www.russet.org.uk/blog/2040); tawny builds identifiers from the Clojure symbols used to describe the class. So, in my pizza ontology, for instance, PizzaTopping gets an IRI ending in PizzaTopping. Semantics-free identifiers have some significant advantages; the principal one is that they establish an identity for an object which can persist even if its properties (the labels, for instance) change, as I have described previously (http://www.russet.org.uk/blog/1908).

However, semantics-free identifiers do not come for free; they also have significant disadvantages, mainly that they make the life of developers harder and code less readable (http://www.russet.org.uk/blog/2040). I’ve previously suggested solutions to this problem when it afflicts OWL Manchester syntax (http://www.russet.org.uk/blog/1470).

With tawny, the IRIs that are used to identify concepts can easily be separated from the Clojure symbols that are used to identify them; the initial link between them was simply one of convenience. So supporting numeric IRIs was possible with very little adjustment: the core owl.clj required only that one fixed function call become a call to a first-class function.

One of the purposes of tawny is to enable a more agile development methodology than we have at present, so clearly I did not want the developer to have to manage this process by hand. Moreover, as recent discussions on the OBI mailing list have shown, the co-ordination of identifiers can be a significant difficulty. As James Malone has recently described, the URIgen tool offers a solution to this problem (http://jamesmaloneebi.blogspot.co.uk/2013/04/keeping-it-agile-secret-to-fitter.html). Simon Jupp, who is the primary developer of URIgen, kindly discussed the details with me, which has helped me form my ideas about a suitable workflow, and I have borrowed heavily from URIgen (and the Protégé plugin) for this.

While I will probably implement a URIgen client for tawny in the future, my initial approach uses a slightly different idea. In general, with tawny, I have been advocating using standard software development tools, instead of specific ontology ones (http://www.russet.org.uk/blog/2366); rather than co-ordinating developers through the use of a centralised server, it seems to me to make more sense to use whatever version control system the developers already have. To that end, I have implemented a file-based system for storing identifiers; given that most bio-ontologies remain under 50,000 terms in size, I think that this is plausible, especially as it is simple in tawny to modularise the source (if not the ontology, which remains a hard research problem). In this case, I have used a properties file, since it is a simple and human-readable format.

This works as follows. First, we define a new ontology with an iri-gen frame, which uses the obo-iri-generate function. Of course, this is generic, so it is possible to use arbitrary strategies for generating an IRI.

(defontology pizzaontology
  :iri "http://www.ncl.ac.uk/pizza-obo"
  :prefix "piz:"
  :comment "An example pizza using OBO style ids"
  :versioninfo "Unreleased Version"
  :annotation (seealso "Manchester Version")
  :iri-gen tawny.obo/obo-iri-generate
  )
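
Since the generator is an ordinary function, other strategies are easy to imagine. The sketch below is illustrative only: name-iri, uuid-iri and make-iri are hypothetical names, and the two-argument signature is an assumption rather than the contract tawny expects for :iri-gen.

```clojure
;; Hypothetical sketch: these names and the (base, term) signature are
;; assumptions for illustration, not tawny's actual :iri-gen contract.

(defn name-iri
  "Semantic strategy: build the IRI from the symbol name."
  [base term]
  (str base "#" term))

(defn uuid-iri
  "Semantics-free strategy: ignore the name and use a random UUID."
  [base _term]
  (str base "#" (java.util.UUID/randomUUID)))

(defn make-iri
  "Delegates to whichever strategy function is passed in."
  [strategy base term]
  (strategy base term))

(make-iri name-iri "http://www.ncl.ac.uk/pizza-obo" "PizzaTopping")
;; => "http://www.ncl.ac.uk/pizza-obo#PizzaTopping"
```

Swapping name-iri for uuid-iri changes the identifiers without touching the code that calls make-iri, which is the property the iri-gen frame relies on.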

Next, we need to restore the mapping between names and IRIs. We need to do this before we create any classes. In the first instance, this file will be empty, and will contain no mappings; this is not problematic.

(tawny.obo/obo-restore-iri "./src/tawny/obo/pizza/pizza_iri.props")

Now, we define concepts, properties and so forth as normal.

(defclass CheeseTopping
  :label "Cheese Topping")
(defclass MeatTopping
  :label "Meat Topping")

The difference in how the IRI is created should be transparent to the developer at this point. Behind the scenes we are using this logic.

(defn obo-iri-generate-or-retrieve
  [name remembered current]
  (or (get remembered name)
      (get current name)
      (str obo-pre-iri "#"
           (java.util.UUID/randomUUID))))

Or, in English: if the name (“CheeseTopping”) has been stored in our properties file, use that IRI; if the name has already been used in the current session, use that IRI; failing that, create a random UUID. I have used a UUID rather than auto-minting new identifiers because tawny is programmatic; it is very easy to create 1000 concepts where you meant to create 10, which would result in a lot of new identifiers. It makes more sense to mint permanent identifiers explicitly, as part of a release process.
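
To make the lookup order concrete, here is the same function called against two small maps. This is a sketch under assumptions: the value of obo-pre-iri is a placeholder (it is not shown in this post), and the zero-padded PIZZA_000001 form and the contents of both maps are invented for illustration.

```clojure
;; Placeholder value; the real obo-pre-iri is not shown in the post.
(def obo-pre-iri "http://example.org/pre-iri")

(defn obo-iri-generate-or-retrieve
  [name remembered current]
  (or (get remembered name)   ; 1. stored in the properties file
      (get current name)      ; 2. already used in this session
      (str obo-pre-iri "#"    ; 3. otherwise a temporary UUID-based IRI
           (java.util.UUID/randomUUID))))

;; Invented example data.
(def remembered {"CheeseTopping" "http://www.ncl.ac.uk/pizza-obo/PIZZA_000001"})
(def current    {"MeatTopping" (str obo-pre-iri "#session-uuid")})

(obo-iri-generate-or-retrieve "CheeseTopping" remembered current)
;; => "http://www.ncl.ac.uk/pizza-obo/PIZZA_000001"
(obo-iri-generate-or-retrieve "MeatTopping" remembered current)
;; => "http://example.org/pre-iri#session-uuid"
(obo-iri-generate-or-retrieve "FishTopping" remembered current)
;; => a fresh IRI under obo-pre-iri ending in a random UUID
```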

This also works for programmatic use of tawny, regardless of whether concepts are added to the local namespace. This code creates many classes all at once, but does not add them to the namespace. Their IDs will still be stored.

(doseq [n (map #(str "n" %) (range 1 20))]
  (owlclass n))

Finally, we need to store the IRIs we have created. Both full IDs and UUIDs are stored; so new classes will get a random UUID, but it will persist over time, providing some interoperability with external users who can use the short-term identifier in the knowledge that it may change.

(tawny.obo/obo-store-iri "./src/tawny/obo/pizza/pizza_iri.props")
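
The store and restore steps amount to round-tripping a map of names to IRIs through a Java properties file. The following is a minimal sketch of that idea using java.util.Properties; store-iris and restore-iris are hypothetical names, not the tawny.obo functions, and unlike the real files this sketch makes no attempt to sort the entries.

```clojure
(require '[clojure.java.io :as io])
(import 'java.util.Properties)

(defn store-iris
  "Writes a map of name -> IRI to a properties file (hypothetical helper)."
  [file name->iri]
  (let [props (Properties.)]
    (doseq [[k v] name->iri]
      (.setProperty props k v))
    (with-open [w (io/writer file)]
      (.store props w "name to IRI mappings"))))

(defn restore-iris
  "Reads the mappings back; a missing file simply yields no mappings."
  [file]
  (let [f (io/file file)]
    (if (.exists f)
      (with-open [r (io/reader f)]
        (into {} (doto (Properties.) (.load r))))
      {})))
```

Returning an empty map for a missing file mirrors the first-run case described above, where there are no mappings yet.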

At the same time, we report obsolete terms. These are terms with permanent identifiers which are present in the properties file, but which have not been created in the current file. Currently, these are just printed to screen, but I could generate classes and place them under an “obsolete” superclass.

(tawny.obo/obo-report-obsolete)
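
The check itself is a set difference restricted to permanent identifiers. A minimal sketch, assuming that temporary IRIs can be recognised by a known prefix; pre-iri, permanent? and obsolete-terms are hypothetical names and the prefix value is a placeholder.

```clojure
(require '[clojure.string :as str])

;; Placeholder prefix for temporary, UUID-based IRIs (an assumption).
(def pre-iri "http://example.org/pre-iri")

(defn permanent?
  "An IRI is treated as permanent if it is not a temporary one."
  [iri]
  (not (str/starts-with? iri pre-iri)))

(defn obsolete-terms
  "Names with a permanent IRI in the stored mappings that were not
  created in the current session, sorted for stable output."
  [remembered current]
  (sort (for [[term iri] remembered
              :when (and (permanent? iri)
                         (not (contains? current term)))]
          term)))
```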

Finally, at release point, a single function is called to generate the new IDs. This is done numerically, starting from the largest ID. If there are multiple developers, this step has to be co-ordinated, or it is going to break; but this is little different from a release point of any software project.

(tawny.obo/obo-generate-permanent-iri "./src/tawny/obo/pizza/pizza_iri.props" "http://www.ncl.ac.uk/pizza-obo/PIZZA_")
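
The numbering step can be sketched as follows: find the largest numeric suffix already in use under the permanent prefix, then hand out the following numbers to every name that still has a temporary IRI. mint-permanent is a hypothetical name, and the six-digit zero padding and the alphabetical order of assignment are assumptions made for this example.

```clojure
(require '[clojure.string :as str])

(defn mint-permanent
  "Replaces every non-permanent IRI with the next free numeric IRI."
  [prefix name->iri]
  (let [permanent? #(str/starts-with? % prefix)
        ;; Largest number already minted, or 0 if there is none.
        max-id (->> (vals name->iri)
                    (filter permanent?)
                    (map #(Long/parseLong (subs % (count prefix))))
                    (reduce max 0))
        ;; Names still carrying a temporary IRI, in a stable order.
        pending (sort (for [[term iri] name->iri
                            :when (not (permanent? iri))]
                        term))
        fresh (map #(format "%s%06d" prefix %)
                   (iterate inc (inc max-id)))]
    (merge name->iri (zipmap pending fresh))))
```

Because the next number depends on the current maximum, two developers running this independently would mint the same identifiers, which is why the step needs co-ordinating.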

I think this workflow makes sense, but only use in practice will show for sure. If the requirement for co-ordination over minting of real IDs is problematic, then URIgen would provide a nice solution. I can also see problems with my use of props files; I have sorted them numerically, which makes them easier to read (and predictably ordered), but this has the disadvantage that changes are likely to happen near the end, which is likely to result in conflicts. While these would be relatively simple conflicts, merging is necessarily painful. This could be avoided by storing permanent IDs in one file, and UUIDs in per-developer files.

This is the last feature I am planning to add to the current iteration of tawny; I want to complete the documentation for all functions (this has already been done for owl.clj, but not the other namespaces), and the tutorial. For the 0.12 cycle, I plan to make tawny complete for OWL2 (basically, this means adding datatypes).

This article describes a SNAPSHOT of tawny, available on github (https://github.com/phillord/tawny-owl). All the examples shown here come from (yet another!) version of the pizza ontology, also available on github (https://github.com/phillord/tawny-obo-pizza).

Bibliography


Abstract

Semantic publishing can enable richer documents with clearer, computationally interpretable properties. For this vision to become reality, however, authors must benefit from this process, so that they are incentivised to add these semantics. Moreover, the publication process that generates final content must allow and enable this semantic content. Here we focus on author-led or "grey" literature, which uses a convenient and simple publication pipeline. We describe how we have used metadata in articles to enable richer referencing of these articles and how we have customised the addition of these semantics to articles. Finally, we describe how we use the same semantics to aid in digital preservation and non-repudiability of research articles.

  • Phillip Lord
  • Lindsay Marshall

Plain English Summary

Academic literature makes heavy use of references; effectively links to other, previous work that supports or contradicts the current work. This referencing is still largely textual, rather than using a hyperlink as is common on the web. As well as being time-consuming for the author, it is also difficult to extract the references computationally, as the references are formatted in many different ways.

Previously, we have described a system which works with identifiers such as ArXiv IDs (used to reference this article above!), PubMed IDs and DOIs. With this system, called kcite, the author supplies the ID, and kcite generates the reference list, leaving the ID underneath which is easy to extract computationally. The data used to generate the reference comes from specialised bibliographic servers.

In this paper, we describe two new systems. The first, called Greycite, provides similar bibliographic data for any URL; it is extracted from the URL itself, using a wide variety of markup and some ad-hoc tricks, which the paper describes. As a result it works on many web pages (we predict about 1% of the total web, or a much higher percentage of “interesting” websites). Our second system, kblog-metadata, provides a flexible system for generating this data. Finally, we discuss ways in which the same metadata can be used for digital preservation, by helping to track articles as and when they move across the web.

This paper was first written for the Sepublica 2013 workshop.

Our original intention with Greycite (http://greycite.knowledgeblog.org) was to build a tool which can provide bibliographic metadata for any URL, to support my own kcite referencing tool (http://knowledgeblog.org/kcite-plugin). While it still fulfils this function, it also turns out to be a useful, general-purpose tool for investigating the metadata in various web pages. And this reveals some interesting results. We discovered a nice example of this recently while adding RIS support.

The paper in question comes from EMBO reports (10.1038/embor.2013.11). At first sight, the RIS for this page taken from Greycite looks reasonable.

TY - ELEC
UR - http://www.nature.com/embor/journal/v14/n3/full/embor201311a.html
Y2 - 2013-04-18 12:54:54
TI - The economics of creative research
JO - EMBO reports
PY - 2012
DA - 2012-02-08
DO - 10.1038/embor.2013.11
AU - Cou|[eacute]|e, Ivan
ER -

However, something strange is going on with the author; poor old Ivan Couée’s name has been rather broken. So, why is this happening? Looking at the underlying HTML the first thing that hits you is a lot of space; there are over 50 empty lines at the beginning of the file; still, this is only a problem for people strange enough to be reading the HTML.

However, eventually we get to the metadata: first Dublin Core, and then what we describe as Google Scholar (since this is where we found it). And there we have it; greycite is reporting the metadata as it is. The author’s name is represented with |[eacute]| as a letter.

<meta name="dc.language" content="en" />
<meta name="dc.rights" content="&#169; 2012 Nature Publishing Group" />
<meta name="dc.title" content="The economics of creative research" />
<meta name="dc.creator" content="Ivan Cou|[eacute]|e" />
<meta name="dc.identifier" content="doi:10.1038/embor.2013.11" />
<meta name="dc.date" content="2012-02-08" />

<meta name="citation_publisher" content="Nature Publishing Group" />
<meta name="citation_authors" content="Ivan Cou|[eacute]|e" />
<meta name="citation_title" content="The economics of creative research" />
<meta name="citation_date" content="2012-02-08" />
<meta name="citation_volume" content="14" />
<meta name="citation_issue" content="3" />
<meta name="citation_firstpage" content="222" />
<meta name="citation_doi" content="doi:10.1038/embor.2013.11" />
<meta name="citation_journal_title" content="EMBO reports" />

As far as we can tell this is an error; HTML entities or extended character sets are entirely valid, but |[eacute]| does not appear to be a valid representation. Interestingly enough, there also appears to be some slightly buggy code in the PRISM metadata, which I am sure should not be this.
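
For comparison, valid forms of the letter would be the character é itself, or an HTML character reference such as &eacute; or &#233;. A consumer that wanted to work around the broken form could map it back, as in the sketch below. This is purely a hypothetical illustration, not something greycite does (it reports the metadata as found); entity->char and repair-entities are invented names and the table covers only a few entities.

```clojure
(require '[clojure.string :as str])

;; Tiny, deliberately incomplete lookup table (illustrative only).
(def entity->char {"eacute" "é" "egrave" "è" "uuml" "ü"})

(defn repair-entities
  "Replaces |[name]| with the matching character; unknown names are
  left untouched rather than guessed at."
  [s]
  (str/replace s #"\|\[(\w+)\]\|"
               (fn [[whole entity]]
                 (get entity->char entity whole))))

(repair-entities "Ivan Cou|[eacute]|e")
;; => "Ivan Couée"
```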

<meta name="prism.issn" content="ERROR! NO ISSN" />
<meta name="prism.eIssn" content="ERROR! NO EISSN" />

My guess is that the problem is at the point of website generation rather than deeper in the bowels of the publishing system; grabbing the metadata for this article from CrossRef by content negotiation (http://www.crossref.org/CrossTech/2011/04/content_negotiation_for_crossr.html) shows the correct name.
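
Content negotiation here means asking the DOI resolver for a machine-readable format through the HTTP Accept header. The sketch below shows one way such a request could be made from Clojure; doi-request and fetch-metadata are hypothetical names, and the choice of the CSL JSON media type and of the https://doi.org/ resolver address are assumptions about how the request is made, not details taken from this post.

```clojure
;; CSL JSON media type used for DOI content negotiation.
(def csl-json "application/vnd.citationstyles.csl+json")

(defn doi-request
  "Pure description of the request; performs no network access."
  [doi]
  {:url (str "https://doi.org/" doi)
   :headers {"Accept" csl-json}})

(defn fetch-metadata
  "Performs the request and returns the response body as a string.
  Needs network access, so it is not called in this sketch."
  [doi]
  (let [{:keys [url headers]} (doi-request doi)
        ^java.net.URLConnection conn (.openConnection (java.net.URL. url))]
    (doseq [[k v] headers]
      (.setRequestProperty conn k v))
    (with-open [in (.getInputStream conn)]
      (slurp in))))

(doi-request "10.1038/embor.2013.11")
;; => {:url "https://doi.org/10.1038/embor.2013.11",
;;     :headers {"Accept" "application/vnd.citationstyles.csl+json"}}
```

The record that CrossRef returned for this article follows.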

{"volume":"14",
 "issue":"3",
 "DOI":"10.1038/embor.2013.11",
 "URL":"http://dx.doi.org/10.1038/embor.2013.11",
 "title":"The economics of creative research",
 "container-title":"EMBO reports",
 "publisher":"Nature Publishing Group",
 "issued":{"date-parts":[[2012,2,8]]},
 "author":[{"family":"Couée","given":"Ivan"}],
 "editor":[],
 "page":"222-225",
 "type":"article-journal"}

We empathise with the publishers here. Getting character sets correct is the bane of everyone’s life; given the state of computing when multi-lingual character sets appeared, we guess it is not an example of premature optimisation, but an example of an optimisation you wish had never happened. The world would be an easier place if everything had used Unicode from the start.

The current metadata for this paper can be seen on greycite or in detail. Hopefully, this will be updated in time!

This post was written by Phillip Lord and Lindsay Marshall

Update

Spelling mistake corrected, bibliography added. Thanks to Christian Perfect for bug report.

Bibliography

Much has been said about overlay journals (http://gowers.wordpress.com/2013/01/16/why-ive-also-joined-the-good-guys/). The idea is simple; the journal essentially becomes a selector, a channel, with the paper itself being hosted elsewhere, such as arXiv.

This holds a certain amount of attraction for me; I already post my new papers on arXiv. I have been posting them here also (http://www.russet.org.uk/blog/1713). This works well, but is hampered by technology. Mostly I write papers in LaTeX, and I have written tools to make these suitable for WordPress (http://www.russet.org.uk/blog/1740); these work well enough to publish an entire thesis (http://themindwobbles.wordpress.com/2013/01/02/phd-thesis-table-of-contents/). However, the process of doing this is not slick (http://themindwobbles.wordpress.com/2012/06/14/converting-a-latex-thesis-to-multiple-wordpress-posts/). For instance, when trying to publish one of my own papers, I have had problems as I used a theorem environment (10.1186/2041-1480-1-S1-S4). While PlasTeX is a nice tool, the key problem is that it is fundamentally a different interpreter from TeX. Eventually, perhaps, LuaTeX will get an HTML backend, but until this happens the system will always fail in some cases.

So, I wanted to investigate whether it was possible to build overlay functionality into a personal publication framework, such as the WordPress installation I host these articles on. Well, it turns out that, combined with the tools that I have written for manipulating metadata (http://knowledgeblog.org/kblog-metadata), it is relatively simple to do so; my first attempt at this is now available for my OWLED 2013 paper (http://www.russet.org.uk/blog/2366). The title, authors (just me in this case), date, abstract and PDF link all come directly from arXiv. Full text is not available from arXiv; in any case, it would suffer from all the issues described earlier, and in the end the PDF is probably the best representation of this paper. I have supplemented this with a plain English summary, something that I have wanted to do for years, but have not managed to start. If the reviewers will allow me to do so, I will also attach the reviews when they become available.

The code for this is not quite ready to release yet: however, it will potentially work over any eprints repository, and I have connected it up to Greycite also (http://greycite.knowledgeblog.org), so it can be used over any source that greycite can interpret.

All a little clunky, but I think that this is the future. The Journal is dead, Long Live the article.

Update

Fixed DOI.

Bibliography


Abstract

The Tawny-OWL library provides a fully-programmatic environment for ontology building; it enables the use of a rich set of tools for ontology development, by recasting development as a form of programming. It is built in Clojure, a modern Lisp dialect, and is backed by the OWL API. Used simply, it has a similar syntax to OWL Manchester syntax, but it provides arbitrary extensibility and abstraction. It builds on existing facilities for Clojure, which provides a rich and modern programming tool chain, for versioning, distributed development, build, testing and continuous integration. In this paper, we describe the library, this environment and its potential implications for the ontology development process.

  • Phillip Lord

Plain English Summary

In this paper, I describe some new software, called Tawny-OWL, that addresses the issue of building ontologies. An ontology is a formal hierarchy, which can be used to describe different parts of the world, including biology which is my main interest.

Building ontologies in any form is hard, but many ontologies are repetitive, having many similar terms. Current ontology building tools tend to require a significant amount of manual intervention. Rather than look to creating new tools, Tawny-OWL is a library written in a full programming language, which helps to redefine the problem of ontology building to one of programming. Instead of building new ontology tools, the hope is that Tawny-OWL will enable ontology builders to just use existing tools that are designed for general purpose programming. As there are many more people involved in general programming, many tools already exist and are very advanced.

This is the first paper on the topic, although it has been discussed before here.

This paper was written for the OWLED workshop in 2013.


Reviews

Reviews are posted here with the kind permission of the reviewers. Reviewers are identified or remain anonymous at their option. Copyright of the review remains with the reviewer and is not subject to the overall blog license.

Review 1

The given paper is a solid presentation of a system for supporting the development of ontologies – and therefore not really a scientific/research paper.

It describes Tawny OWL in a sufficiently comprehensive and detailed fashion to understand both the rationale behind as well as the functioning of that system. The text itself is well written and also well structured. Further, the combination of the descriptive text in conjunction with the given (code) examples make the different functionality highlights of Tawny OWL very easy to grasp and appraise.

As another big plus of this paper, I see the availability of all source code which supports the fact that the system is indeed actually available – instead of being just another description of a “hidden” research system.

The possibility to integrate Tawny OWL in a common (programming) environment, the abstraction level support, the modularity and the testing “framework” along with its straightforward syntax make it indeed very appealing and sophisticated.

But the just said comes with a little warning: My above judgment (especially the last comment) are highly biased by the fact that I am also a software developer. And thus I do not know how much the above would apply to non-programmers as well.

And along with the above warning, I actually see a (more global) problem with the proposed approach to ontology development: The mentioned “waterfall methodologies” are still most often used for creating ontologies (at least in the field of biomedical ontologies) and thus I wonder how much programmatic approaches, as implemented by Tawny OWL, will be adapted in the future. Or in which way they might get somehow integrated in those methodologies.

In this article, I consider the practical issues with archiving of scientific material placed on the web; I will describe the motivation for doing this, the background and consider the various mechanisms for doing so.

As part of our work on knowledgeblog (http://knowledgeblog.org), we have been investigating ways of injecting the formal technical aspects of the scientific publication process into this form of publication. The reasons for this are myriad: if the scientist can control the form, they can innovate in their presentation how they choose; the publication process itself becomes very simple and straight-forward (as opposed to the authoring, which is as hard as it ever was). Finally, it means that scientists can publish as they go, as I have done and am doing on my work with Tawny-OWL. This latter point has many potential implications: firstly, it makes science much more interactive, as scientists can publish things that they are not clear on, or early ideas, and can (and I do) get feedback on this early; secondly, it should help to overcome publication bias as it is much lighter-weight than the current publication process. Scientists are more likely to publish negative results if the process is easy and not expensive. And, lastly, it can help to establish provenance for the work; if every scientist published in this way, scientific fraud would be much harder, as a fraudulent scientist would have to produce a coherent, faked set of data from the early days of the work.

However to achieve this, posts must still be available. The scientific record needs to be maintained. Now, this should not be an issue. I write this blog in Asciidoc (http://process.knowledgeblog.org/167), and rarely use images, so the source is quite small. In fact, since I moved to WordPress in 2009 (http://www.russet.org.uk/blog/1175), it totals about 725k; so it would fit on a floppy, which is a crushing blow to my ego. So, how easy is it to archive your content?

The difficulty here is that there is no obvious person to do this. Like many universities, mine provides access to an eprints archive. Unfortunately, this is mainly used for REF, and has no programmatic interface. The university also has a LOCKSS box. However, this is not generally available for the work that the University staff has produced, but for journals that the University has bought; so I have to give my work away to a paywall publisher, or pay lots to an open access publisher to access this.

Another possibility would be to use Figshare. Now, I have some qualms about Figshare anyway; it appears to be a walled garden, the Facebook of science. Others, however, do not worry about this and are using Figshare. Carl Boettiger (http://www.carlboettiger.info/), for instance, archives his notebook on Figshare. But there is a problem: consider the 2012 archive; it is a tarball, with Markdown files inside; I know what to do with this, but many people will not. And it is only weakly linked to the original publication link. Titus Brown had the same idea (http://ivory.idyll.org/blog/posting-blog-entries-to-figshare.html), claiming the added value of DOIs, something I find dubious (http://www.russet.org.uk/blog/1849). Again, though, the same problem; Figshare archives the source. The most extreme example of this comes from Karthik Ram, who has published a Git repository; unsurprisingly, it is impossible to interact with as a repo.

Figshare likes to make great play of the fact that it is backed by CLOCKSS, a set of distributed copies maintained by some research libraries. Now, it might seem sensible that CLOCKSS would offer this service (at a price, of course) to researchers. Perhaps they do. But the website reveals nothing about this. And, although I tried, they did not respond to emails either. Rather like DOIs, the infrastructure is built around scale; in short, you need a publisher or some other institution involved; all very well, but this contradicts the desire for a light-weight publication mechanism. There is a second problem with CLOCKSS; it is a dark archive, that is, its content only becomes available to the public after a “trigger event”: the publisher going bust, the website going down and so on. Now data which is on the web and, critically, archived by someone other than the author essentially becomes non-repudiable and time-stamped. I can prove (to a lower-bound) when I said something. And you can prove that I said something even if I wish I hadn’t. In a strict sense, this is true if the data is in CLOCKSS; but in a practical sense, it is not, as checking when and what I said becomes too much of a burden to be useful.

So, we move onto web archiving. The idea of web archiving is attractive to me for one main reason; it is not designed for science. It is a general purpose, commodity solution, rather like a blog engine. If one thing scientific publication needs more than anything, it is to move the technology base away from bespoke and toward commodity.

One of the most straight-forward solutions for web archiving is WebCite; the critical advantage that this has is that it provides an on-demand service. I have been using it for a while to archive this site; greycite (http://greycite.knowledgeblog.org) now routinely submits new items here, if we can extract enough metadata from them. The archiving is quick and effective. The fly-in-the-ointment is that WebCite has funding issues and is threatened with closure at the end of 2013. The irony is that it claims it needs $25,000 to continue. Set against the millions put aside for APCs (http://www.rcuk.ac.uk/media/news/2012news/Pages/121108.aspx), or the thousands NPG claims is necessary to publish a single paper (http://www.guardian.co.uk/science/2012/jun/08/open-access-research-inevitable-nature-editor), or the millions that ACM spends supporting its digital library (http://www.russet.org.uk/blog/1924), this is small beer, and it shows the lack of seriousness with which we take web archiving. I hope it survives; if it does, Gunther Eysenbach, who runs it, tells me that they plan to expand the services they offer. It may yet become the archiving option of choice.

I have been able to find no on-demand alternative to WebCite. However, there are several other archives available. I have been using the UK Web Archive for a while now. I first heard about this service, irony of ironies, on the radio. Since I first used it to archive knowledge blog and later used it to archive this site, the process has got a lot easier. No longer do I need to send signed physical copyright permission; first it was electronic (email I think). It now appears that the law is changing to allow them to archive more widely (the BBC covered this in a story, categorized under “entertainment and arts” and which is largely focused on Stephen Fry’s tweets), although this will be a dark archive. Currently, this journal has been archived only once; from my other sites, it appears that they have a six month cycle. So, while this provides good digital preservation, it is a less good solution from the perspective of non-repudiability; there is a significant gap before the archive happens, and a slightly longer one till the archive is published.

The UKWA is, as the name suggests, UK specific. Another solution is to use, of course, archive.org, which might be considered to be the elephant in the room for web archiving. Unlike the UKWA, they don’t take submissions, but just crawl the web (although I suspect that the UKWA will start doing this also now). Getting onto a crawl can, therefore, be rather hit-and-miss. Frustratingly, they do have an “upload my data” service, which you can access through a logged in account; but not an “archive my URL” service. Again, a very effective resource from a digital preservation perspective, but with similar problems to the UKWA from the point-of-view of non-repudiability. The archives take time to appear; in my experience, somewhat longer than the UKWA. I have also contacted their commercial wing, http://archive-it.org. Their software and the crawls that they offer could easily be configured to do the job, but unfortunately, they are currently aimed very much at the institutional level: their smallest package provides around 100Gb of storage; this blog can be archived in around 130Mb (this is without deduplication which would save a lot); even a fairly prolific blogger comes in at around 250Mb. The price, unfortunately, reflects this. Although, again, it is on a par with my yearly publication costs, so is well within an average research budget.

Of course, these solutions are not exclusive; with greycite we have started to add tools to support these options. For instance, kblog-metadata (http://knowledgeblog.org/kblog-metadata) now supports an “archives” widget which is in use on this page; this links directly through to all the archives we know about. For individual pages, these are deep links, so you can see archived versions of each article straight-forwardly. The archive locations come from greycite, which discovers them by probing; we may move later to using Mementos. greycite itself archives metadata about webpages, so we link to this also. As a side effect, this also means that each article is submitted to greycite, which in turn causes archiving of the page through WebCite. Likewise, archive locations are returned within the BibTeX downloads, which is useful for those referencing sites.

Finally, greycite now generates pURLs — these are two-step resolution URLs which work rather like DOIs (or actually DOIs operate like pURLs, since as far as I am aware, pURLs predate the web infrastructure for DOIs). These resolve directly to the website in question. With a little support greycite can track content as and if it moves around the web; even if this fails, and an article disappears, greycite will redirect to the nearest web archive.
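
The two-step resolution with an archive fallback can be sketched as a lookup followed by a liveness check. Everything in this sketch is an assumption made for illustration: resolve-purl is a hypothetical name, the shape of the registry is invented, the liveness check is passed in as a function, and "nearest" is modelled simply as the first archive in the list. It is not a description of how greycite is implemented.

```clojure
(defn resolve-purl
  "Returns the live location for a pURL, the first known archive if
  the location is no longer live, or nil for an unknown pURL."
  [registry live? purl]
  (when-let [{:keys [location archives]} (get registry purl)]
    (if (live? location)
      location
      (first archives))))

;; Invented example registry: pURL -> current location and archives.
(def registry
  {"purl/1" {:location "http://blog.example.org/post"
             :archives ["http://archive.example.org/post-copy"]}})

(resolve-purl registry (constantly true) "purl/1")
;; => "http://blog.example.org/post"
(resolve-purl registry (constantly false) "purl/1")
;; => "http://archive.example.org/post-copy"
```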

In summary, there is no perfect solution available at the moment, but there are many options; in many cases, archiving will happen somewhat magically. As we have found with many other aspects of author self-publishing on the web, it is possible to architecturally replicate many of the guarantees provided by the scientific publication industry through the simple use of web technology. Tools like greycite and kblog-metadata are useful in uncovering the archives that are already there, and linking these together with pURLs. Taken together, I have a reasonable degree of confidence that this material will be available in 10 or 50 years’ time. Whether anyone will still be reading it, well, that is a different issue entirely.

Bibliography