Archive for the ‘Tech’ Category

Well, I am pleased to say that we have now released the new version of kcite. It’s been a while in coming; I had the difficult bit of the code working about 5 months ago, but then got caught up in teaching. Kcite is our bibliography manager, which enables citations such as this one (doi:10.1371/journal.pone.0012258), using DOI or PubMed IDs.

Kcite now uses the marvellous citeproc-js to render the bibliography on the client. The main advantage of this for this release is that the bibliography formatting is slightly more regular than before. We’ve also switched to name-author style as the default. There is also a disadvantage, which is that the browser has to do lots of JavaScript execution client-side. I’ve made efforts to ensure that this is not too onerous: on my desktop, I have been rendering 200-300 item bibliographies, which is much more than most people will use in practice.

In future versions, however, I feel the use of citeproc-js will really come into its own. We should be able to enable the user to select their own citation style (currently this is the choice of the authors, which makes little sense). We can also add any semantics to the HTML that we choose; CiTO will come properly, for instance. I can also clean up the “unresolved” and “timed out” references. However, first thing on the list is to make the callback for the bibliographic data asynchronous. Client-side this should be easy, as we are already using jQuery. Server-side requires rewrite rules, which I haven’t done before, but I think should not be too hard.

On a separate track, now that I have kcite on what I think is a stable technological footing, I can start to extend it in other ways, the most obvious being additional forms of identifiers, critically including WordPress posts with kcite enabled. I’m also pleased that CrossRef have recently added the ability to grab metadata in citeproc format (JSON), which means I can skip an integration step.
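As a rough illustration of what grabbing citeproc JSON for a DOI can look like, here is a minimal sketch. It assumes the DOI resolver honours content negotiation with the `application/vnd.citationstyles.csl+json` media type; the function names are invented for this example, and it is not a description of how kcite itself fetches metadata.

```python
import json
import urllib.parse
import urllib.request

# Media type assumed to select citeproc (CSL) JSON via content negotiation.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_csl_request(doi):
    """Build a request asking the DOI resolver for citeproc (CSL) JSON."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": CSL_JSON})


def fetch_csl(doi, timeout=10):
    """Fetch and decode the metadata for one DOI; needs network access."""
    request = build_csl_request(doi)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_csl("10.1371/journal.pone.0012258")` should, if the assumption holds, return a dictionary that can be handed to a citeproc processor more or less as it stands.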

However, before all of that, we need to restore kblog. We’ve taken the opportunity to move it to a better technological footing, and have started to prepare the new machine that it will be hosted on. This has taken a long time, due to a busy start to the (academic) year. Hopefully, getting hacked is not something we will repeat soon.

The current release of kcite is 1.4.1. This fixes two bugs: one reported by Carl Boettinger (so that now the JavaScript only loads when necessary), and another I found while writing this post, which made editors appear as authors.

Bibliography

While I am currently spending a significant amount of my time promoting the idea that blog technology can be, and should be, used for serious scientific material, I thought I would make a post in a different and perhaps more traditional vein: that is, a light-weight idea, with no serious research behind it. Years ago now, I created an Energy Wiki full of daft ideas for making energy. I last revisited this in 2009, with an idea for storing energy at sea. I’d actually forgotten that part of the reason for this was to try out Inkscape, which is part of the reason for this post. I wanted to try a bit of multi-media, that is, a blog post with an image in it. High tech.

So, the idea. One form of renewable is the Solar Updraft Tower, also known as a solar chimney. This works straightforwardly enough: you build a large greenhouse in a desert, with a very large chimney in the middle. The top of the chimney is in cold air, the bottom in hot, and an updraft results; stick a turbine in or at the base of the chimney, and you get energy out.

The problem is that, to work at all efficiently, you need a big temperature differential, and so a tall chimney. This in turn means a wide chimney, both to support a substantial updraft, and for mechanical reasons. Tall means 500m or more. The bottom line of this is that a pretty significant capital expenditure is required, followed by a relatively long pay-back period, which in turn means that the biggest single expense of the project is likely to be interest charges, rather than anything else.
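To put a rough number on why height matters, a commonly quoted idealised estimate for the chimney alone is efficiency = g × H / (c_p × T0), where H is the chimney height and T0 the ambient temperature at the base. The sketch below simply evaluates that expression. The formula, the constants and the 300 K ambient temperature are assumptions brought in for illustration, not results from this post.

```python
G = 9.81           # gravitational acceleration, m/s^2
CP_AIR = 1005.0    # specific heat capacity of air, J/(kg K), assumed constant


def chimney_efficiency(height_m, ambient_k=300.0):
    """Idealised fraction of the collected heat that the chimney converts
    into kinetic energy of the updraft (assumed textbook estimate)."""
    return G * height_m / (CP_AIR * ambient_k)


for height in (100, 500, 1000):
    print(height, "m:", round(100 * chimney_efficiency(height), 2), "%")
```

On this estimate the efficiency grows linearly with height, and even at 500m it is only of the order of a percent or two, which is why everything about these designs pushes towards very tall chimneys.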

So, my idea is to use an inflatable chimney instead. Initially, I thought about some kind of helium lifting scheme, but then I realised that this makes no sense; why not use hot air, which, after all, is what the whole system is designed to generate? Consider, for instance, the following organisation:

Inflatable Chimney

Essentially, it’s a traditional balloon with a hole in the middle. Obviously the whole system is stackable: a second balloon could be placed on top of the first, and so on. The whole structure could be assembled or disassembled as desired. Unfortunately, though, this would probably take quite a bit of work.

My second thought came from the idea that, while most designs for solar chimneys have the chimney in the middle of the greenhouse, it doesn’t really need to be. A horizontal pipe to the middle would be enough. The chimney could be outside of the greenhouse. The advantage that this brings is that the tower could be raised or lowered in situ, without the risk of it falling on, and damaging, the greenhouse. So my second idea was to build the chimney as two cylinders, with the gap between them serving as the inflatable, buoyant structure. By pleating the cylinders in opposite directions like so:

Concertina Chimney

the whole structure should concertina up and down. By inflating from the top and deflating from the bottom, it should be possible to raise or lower the entire system by opening and shutting vents at the bottom or top of each section to the inside of the chimney.

One advantage of this system is that, as the chimney gets higher, the temperature differential between the inside and the outside gets greater, which should mean that the taller the tower, the more buoyant the sections get; this should help to keep the entire thing as upright as possible, as will the air travelling through the middle, like some gigantic party blower.

Another addition that comes to mind would be to add inflatable half-toroids around the chimney at regular intervals. With a curve on the top, and a flat bottom side, the entire thing should operate like an aerofoil, lifting the tower up; so, the windier it gets, the greater the lift, which is just what is needed to keep it as upright as possible. This should mean that the chimney can operate in relatively high wind levels.

This kind of system could even work in concert with a fixed chimney, extending the height by 500m, say, and increasing its efficiency. It could also act as a supplement, operating only on very hot days when the greenhouse has excess capacity. Or, finally, it could operate while the main chimney was being built, meaning that a plant can start generating income earlier, which should reduce the cost of interest payments.

Of course, this all comes with drawbacks: the ongoing running costs are likely to be significant; wind will remain a significant factor regardless; and, finally, inflating the tower will use hot air, which will reduce the efficiency of the whole system. Are these flaws significant? Well, as I said, this post is light-weight with no serious research behind it. I have no idea, nor any really clear idea about how to work out these costs. Answers on a postcard please.

I have been pushing the idea of Kblogs, scientific publishing using commodity software, for a year or so now. Our main site, Knowledgeblog.org, has got around 100 articles now, has had about 50k page views (or about 4x the number of raw page hits) and has generated a certain presence on the internet. While this is generally good, the price of fame is that we have moved somewhat up the list of potential hack targets. Unfortunately, this has resulted in two compromises on the machine; they were probably not disconnected, although we have no evidence to link the two at the moment.

The first was through the timthumb zero-day vulnerability. It involved a code injection into a WordPress installation using a thumbnail generator with a dodgy bit of PHP in it. We cleaned the system up as well as we were able and went from there. Sadly, a couple of days ago, we had a second break-in. This was a more serious and directed attack (the timthumb attack was scripted, and we were one of several thousand sites to be hit). In this case, the machine has been root compromised, and the web server used to gather usernames and passwords in a phishing expedition. We do have backups and all of the content. There were a number of things that we could have done to secure the machine further, at least one of which may have prevented the hack, but there are only so many hours in the day.

So, where does this leave us? Is the whole idea of knowledgeblog broken? Personally, I do not think so. While I have been critical of the cost associated with academic publishing, I am aware that it cannot happen for free. Running and maintaining a web server takes money; it is something that we have been doing on a shoe-string for a while, especially since our JISC money ran out. In the couple of years that we have run knowledgeblog, I think that we have learned and shown a lot. As well as page views and content, we have shown that scientific publishing can be easy for the author; that we can generate attractive articles this way; that we can start to embed computationally accessible knowledge into these articles. We have shown that we can do peer-review, if we need to. We have shown we can archive and preserve for the future. We have shown that knowledgeblog is good for grey literature. We have added DOIs. Multiple authors. Good looking maths. We even have some preliminary stats on how much publication costs from Word doc to website.

At the moment, though, we do not have a business model. It is clear that if we are to move this forward, it needs to be run as a service, managed, and looked after, something which is neither my expertise nor my desire. The analogy that I have made earlier with Wikipedia is, I think, a good one; it would be good to move this into a foundation status.

The path from here to there is a long one, however. For the moment, we will restore knowledgeblog, and it will re-emerge, although at this time of year, it will take a while. But we look to the future as well.

In a typically thoughtful post, Peter Sefton discusses the advantages and disadvantages of WordPress as an authoring environment. I thought I would clarify my feelings on this a little.

Our experience on Knowledge Blog suggests to us that the WordPress environment is very poor for editing, something we have previously expressed in our process documentation.

I should be clear that this is in the context of knowledgeblog. Academics have their own way of working, and normally are used to this. They use tools which fit with their lives. For example, Google Docs is a good tool but, basically, useless if you do most of your paper writing on a plane. The same will be true for tools such as Annotum, if it ever appears. It is hard to beat Word and email (or frequently Dropbox nowadays).

Of course, there are other ways; for example WordPress offers “A complete revision history of the document is maintained with the ability to roll-back to earlier versions”. But, then, so does Word with Dropbox. And the WordPress facilities are in no way comparable to the versioning that you get with LaTeX or asciidoc and Subversion or Git. Although, in practice, I rarely use versioning when authoring, and Dropbox’s poor man’s roll-back is enough.

The only clear advantage of using WordPress tools is that you don’t need a two-stage publication process. But the general idea behind blogs is that publication does not happen often; it happens once, and then the post remains. This is in contrast to a Wiki, where using external editing tools is impractical at best. And the situation is very similar to current publication, where PDF is the common medium.

My conclusion: there are lots of people, lots of use cases, and lots of requirements. I don’t say that authoring must be independent from the publication environment; I do say that the publication environment must not require a single authoring tool. Fortunately, for the tools that we have created for kblog, we can afford to be agnostic. They will work integrated with WordPress editing also. Still, I just spent 10 minutes longer making this post than I needed to, to stop the shortcodes in Peter’s quote below from being kcite’d (check the source for the trick!), which was harder because I use asciidoc. There are going to be problems. Supporting a heterogeneous environment is painful. I wish there were a perfect solution, but there are just a set of messy compromises.

Peter also makes a second point about our plugins (and others): that is, that they are non-standard.

There are similar issues/risks with stuff like WordPress shortcodes such as KCite from KnowledgeBlogs. It’s a great tool for authors, allowing them to cite things in a rational way:

DOI Example – [cite source=’doi’]10.1021/jf904082b[/cite]

PMID example – [cite source=’pubmed’]17237047[/cite]

But it’s proprietary to a particular processing environment.

There is a risk of creating a new form of the proprietary lock-in we had up until recently (and arguably we still have) with document formats like Microsoft’s .doc.

— Peter Sefton

It’s a fair point, and one which I agree with. The last thing that we need is hundreds of independent shortcode or other syntaxes; I mean, imagine what a nightmare it would be if every single Wiki engine and text conversion tool used their own, almost identical, but slightly different and incompatible syntax. Hmmm.

We chose to use shortcodes for two highly pragmatic reasons. First, WordPress has nice support for them. Building a shortcode handler is nice and simple and does not require us to build regexps (the first version did it by hand for one reason or another, and the regexps were painful). The second reason stems from our desire for a decoupled authoring environment. Shortcodes pass through the HTML publishing step without escaping; to use XML or HTML compliant mechanisms would require us to change, for example, the HTML export mechanism of Word. Not somewhere we wished to go.

In practice, however, I don’t think that this is a major problem, if the code is written carefully. With Mathjax-latex, the shortcodes are translated into MathJax syntax, then MathJax does the rest. The development version of kcite works this way: the shortcodes are translated into a span-tag based microformat, then the bibliography tools operate on the client to format the bibliography. So long as the code is crafted reasonably, it should not be dependent on WordPress.
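The translation step described above can be sketched outside WordPress in a few lines. This is only an illustration of the idea: the `kcite` class name and the `data-source` attribute are made up here rather than taken from the plugin, and it uses a regular expression where the plugin relies on the WordPress shortcode handler.

```python
import html
import re

# Matches [cite source='doi']IDENTIFIER[/cite], with single or double quotes.
CITE_RE = re.compile(
    r"\[cite\s+source=['\"](\w+)['\"]\](.*?)\[/cite\]", re.DOTALL
)


def shortcodes_to_spans(text):
    """Rewrite cite shortcodes as spans (hypothetical microformat) that a
    client-side script could later find and turn into a bibliography."""

    def replace(match):
        source = html.escape(match.group(1), quote=True)
        identifier = html.escape(match.group(2).strip())
        return '<span class="kcite" data-source="{}">{}</span>'.format(
            source, identifier
        )

    return CITE_RE.sub(replace, text)


print(shortcodes_to_spans("See [cite source='doi']10.1021/jf904082b[/cite]."))
```

Once the output is plain HTML of this kind, nothing downstream needs to know that the author typed a WordPress shortcode, which is the sense in which the approach need not be tied to one processing environment.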

I was delighted recently to discover Greyhole. Essentially, it’s a system that allows you to configure a Samba share at one end, and a bunch of disks at the other. The disks get the data shared between them, with a configurable level of duplication. It’s aimed mainly at the home user, who wants a higher degree of data security than the single drive approach provides, but is not going to go down the expensive and poorly scalable RAID route.

The implementation is fairly straight-forward and elegant. The Samba share is provided by a customised Samba virtual file system. This augments the standard process by logging to a spool region (one file per file operation). A daemon consumes these files, stuffing them into a database, then consumes the entries in the database. Essentially, if anything has changed, greyhole rsyncs the change to one or more of the backend disks.
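The spool, database and copy pipeline described above can be mimicked in miniature as follows. This is a toy sketch of the general pattern, not Greyhole’s code: the tab-separated spool format, the table layout and the rule of copying to the first few backends are all assumptions made for the example, and it uses plain file copies where Greyhole uses rsync.

```python
import shutil
import sqlite3
from pathlib import Path


def load_spool(spool_dir, db):
    """Move every spooled operation into the database, in file-name order."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS tasks "
        "(id INTEGER PRIMARY KEY, action TEXT, path TEXT)"
    )
    for spool_file in sorted(Path(spool_dir).iterdir()):
        # Assumed format: one operation per file, "action<TAB>relative/path".
        action, path = spool_file.read_text().strip().split("\t", 1)
        db.execute("INSERT INTO tasks (action, path) VALUES (?, ?)", (action, path))
        spool_file.unlink()  # the operation now lives in the database
    db.commit()


def run_tasks(db, share_dir, backends, copies=2):
    """Replay queued operations onto the first `copies` backend directories."""
    rows = db.execute("SELECT id, action, path FROM tasks ORDER BY id").fetchall()
    for task_id, action, path in rows:
        for backend in backends[:copies]:
            target = Path(backend) / path
            if action == "write":
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(Path(share_dir) / path, target)
            elif action == "unlink" and target.exists():
                target.unlink()
        db.execute("DELETE FROM tasks WHERE id = ?", (task_id,))
    db.commit()
```

Decoupling the logging from the copying in this way means the file operation itself only has to write a tiny spool entry, while the slower duplication work happens later in the daemon.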

It’s a really nice system. I must admit that PHP wouldn’t have been my first choice, but that is horses for courses. Likewise, the dependency on Samba is unfortunate; I always found it a pig to configure, besides which I’d like to use this internally on a Linux box. I had a discussion with the author, Guillaume Boudreau, who confirmed my initial feeling that the Samba VFS could be easily replaced with another, such as FUSE. I’d like to have a go at doing this work, and it’s very possible; basically, it requires a big merge between Guillaume’s VFS and the FUSE-based loggedfs. If I had written any C, I could probably do it in a day or so, but as it stands, it is likely to take longer.

As well as home usage, though, this could also be good for the researcher. While a small lab could pay for managed storage, this tends to come in at £1000 per TB, per annum. Most labs don’t need 24/7 recovery though, and the data is often write once, read occasionally. Greyhole would work out for 1TB at 200 quid (for a low-wattage PC server), plus 100 quid for two 1TB discs, which would cost, say, 40 quid to power for a year (say, 15W for the computer, 10W for the hard drives, and a bit more for networking, adaptors, USB hubs and so on). For lab usage, the drives would probably last 2-3 years at least, while an all solid state computer might last twice this long. More storage space could be added as needed, dropping the cost per TB substantially, although how scalable greyhole is I don’t know.
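The arithmetic behind those figures is easy to check. In the sketch below, the 30W total draw and the price of 15p per kWh are assumptions picked to show how a figure of roughly 40 quid a year can arise; the hardware and managed storage prices are the ones quoted above.

```python
HOURS_PER_YEAR = 24 * 365


def annual_power_cost(watts, pounds_per_kwh):
    """Cost in pounds of running a constant load for one year."""
    return watts * HOURS_PER_YEAR / 1000 * pounds_per_kwh


def greyhole_cost(years, server=200, discs=100, power_per_year=40):
    """Hardware bought once, plus power for each year of use."""
    return server + discs + years * power_per_year


def managed_cost(years, terabytes=1, per_tb_per_year=1000):
    """Managed storage billed per TB per year."""
    return years * terabytes * per_tb_per_year


# 15W computer + 10W drives + a bit more, assumed to total 30W.
print(round(annual_power_cost(30, 0.15), 2))
print(greyhole_cost(3), "versus", managed_cost(3))
```

Over three years this comes to 420 quid against 3000 for the managed option, before counting anybody’s time to look after the box.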

The general approach could be used more widely, though. As well as JBOD spanning, what about:

Blackhole

The lab runs a local disc for their own data access needs, which is backed up to an institutional data store somewhere off-site. The daemon could be configured to use late night bandwidth, which would only compromise data security slightly.

Whitehole

More in line with my style of science, the local disc would be backed up to a publicly accessible repository. Obviously this would require suitable metadata to describe the status of the data, but everything would be sharable and accessible as it was produced.

Wormhole

Many labs collaborate with one or two others. A wormhole file system would be configured so that data placed on my file share would magically appear, read-only, in one or more places on the internet, using an rsync/ssh pipe. My collaborators’ data would, likewise, appear on my disc.

Plughole

This would replicate the normal scientific “supplementary data” process for releasing data publicly. Essentially, everything on the file system would, after a significant period, be converted into an Excel spreadsheet with no column titles or any additional metadata. This would then be placed in a web accessible location for between 2 and 6 months, before being randomly deleted.
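Of these, the Wormhole variant is the easiest to picture in code, since it is little more than an rsync invocation over ssh. The sketch below only builds and runs that command; the function names and the example host are hypothetical, and it assumes rsync and ssh keys are already set up at both ends.

```python
import subprocess


def wormhole_command(local_dir, remote):
    """Build an rsync-over-ssh command mirroring local_dir to remote."""
    return [
        "rsync",
        "-az",                         # archive mode, compress in transit
        "--chmod=Fa-w",                # files arrive without write permission
        "-e", "ssh",                   # use ssh as the transport
        local_dir.rstrip("/") + "/",   # trailing slash: copy the contents
        remote,
    ]


def push(local_dir, remote):
    """Run the sync; needs rsync installed and ssh access to the remote."""
    subprocess.run(wormhole_command(local_dir, remote), check=True)
```

A call such as `push("/srv/share", "me@example.org:/srv/wormhole/")` from a daemon or cron job would then make new files appear, read-only, at the collaborator’s end.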

I’m buying a low power consumption PC to try out greyhole in its current form, to see how it goes.

I’ve just got around to installing the magnificent kcite plugin that Simon Cockell wrote for knowledgeblog. It’s actually a really simple plugin, but it’s tremendously useful. For instance, I can now cite my own papers on reality (doi:10.1371/journal.pone.0012258), function (doi:10.1186/2041-1480-1-S1-S4) or protein classification (doi:10.1093/bioinformatics/btl208) and all the metadata will be gathered and cited for me in a nice reference list at the end.

Of course, I am used to the good life, and this is still all a bit clunky for me. I wanted support from my text editor. For this blog, I use a tool-chain of Emacs, asciidoc and blogpost. But for references I use reftex mode and bibtex. Now I realise that this is a pretty minority tool-chain, but it seemed to me that it should be possible to get it working. And it is, actually, pretty easy. Very rough and ready, but the lisp is below. Obviously, this will need fiddling with for each user, and I will improve it over time.

But it demonstrates the point, I think. A little bit of glue can produce a pretty good publishing tool chain, relatively quickly.

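;; Turn on reftex support whenever an asciidoc buffer is opened.
(add-hook 'adoc-mode-hook
          'phil-asciidoc-reftex-support)

;; When non-nil (set buffer-locally below), citations are formatted
;; as kcite shortcodes rather than in reftex's usual style.
(defvar phil-reftex-citation-override nil)

(defun phil-asciidoc-reftex-support()
  (reftex-mode 1)
  (make-local-variable 'phil-reftex-citation-override)
  (setq phil-reftex-citation-override t)
  ;; Bibtex files that reftex searches in this buffer.
  (make-local-variable 'reftex-default-bibliography)
  (setq reftex-default-bibliography
        '("~/documents/bibtex/phil_lord_refs.bib"
          "~/documents/bibtex/phil_lord/journal_papers.bib"
          "~/documents/bibtex/phil_lord/conference_papers.bib")))

;; Intercept reftex-format-citation: use the kcite formatter in
;; buffers where the override is set, and the original otherwise.
(defadvice reftex-format-citation (around phil-asciidoc-around activate)
  (if phil-reftex-citation-override
      (progn
        (setq ad-return-value (phil-reftex-format-citation entry format)))
    ad-do-it))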

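(defun phil-reftex-format-citation( entry format )
  ;; Wrap the entry's DOI in a kcite shortcode inside asciidoc's
  ;; pass:[] macro; the shortcode's closing brackets are escaped so
  ;; that they do not end the macro early.
  (let ((doi (reftex-get-bib-field "doi" entry)))
    (format "pass:[[cite source='doi'\\]%s[/cite\\]]" doi)))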

Bibliography