An Exercise in Irrelevance - Identifiers for Science

While it’s not a major problem, the inability to uniquely and reliably identify a particular scientist is a niggle; a few years ago, I was distressed to find that I was scheduled to give a talk at an eScience conference about security; anyone who knows me will understand how implausible this was. I hadn’t considered the possibility that there was another Phillip Lord in eScience. It’s not that common a name.

So, what would we want from such an ID system? I think that the basic requirements would be:

the IDs should be unique; one ID only ever refers to one scientist.
the reverse should also be true; one scientist should not need to change their ID.
the ID should be printable, so that it can appear in papers.
the ID should be usable with a resolution system.

I think that this is it. I would say, also, that there are some softer requirements. Firstly, I think that the IDs should be useful to the scientist (above and beyond being able to link all their papers and research results); this would give them more immediate feedback, so that they would find the system to be a good thing, rather than a burden. Secondly, the system should be familiar and easy to use. Finally, as an anti-requirement, the system need not be secure; that is, it would be possible for someone to pretend to be me; this is not to say we couldn’t layer a secure identification system on top of the IDs.

So I thought about what form the ID would take. My first thought was just to layer the system on top of the first name and surname of the scientist. This has the big advantage, of course, that it makes the system easy to use; scientists already know their own names (mostly) and so does everyone else. People will remember the IDs easily. The problem is, of course, that people’s names are not fixed; women, particularly, are likely to change their names, and once the link between name and identifier is broken the advantage is lost.

My second thought was that we could use identifiers chosen by the scientist; this is not a bad idea; of course, it’s harder for humans to link between the ID and (other) scientists, but in time you would come to know IDs for most of the people in your domain. However, this form of identifier is also likely to become broken over time: firstly, many scientists will just want to choose their names, so we have the same problem as before; secondly, some scientists will just want to change their IDs; while peanutbutter or DullHunk might work now, it is possible that the owners of these names will come to regret them like the “Phil loves Newcastle United” tattoo that I don’t have on my forehead.

In the end, I’ve come to the conclusion that only a semantics-free identifier actually makes any sense. This is clearly the least memorable route, but even here it’s not too bad; I know my NI number by heart because I use it a lot (or used it a lot at one point in time). In practice, most scientists read stuff on the web, so this could be resolved to show the full name automatically; in most cases, with papers for instance, it would be augmented with a standard name anyway.

So, what form of ID do we want? Well, the simplest form would be a six-letter code. This gives roughly 300,000,000 alternatives; if we add in numbers this rises to a little over 2 billion. Probably more than enough for scientists now and into the future. The system could be extended if the name space ran out. However, I think we could improve the system by adding an extra letter to make 7; this would now mean that we could ensure that no two scientists had an ID with only a single edit difference; essentially, one letter would be redundant. Finally, we could add a final letter to make a checksum: basically, treat the letters as base 26, multiply them, divide by 26, take the remainder and use this as the last letter. This would allow an easy validation step. We might also want to do a dictionary-based block on some names; pity the poor scientist who ended up as NOBRAINS or other far worse 8 letter IDs.
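To make this a bit more concrete, here is a rough sketch in Python of how such IDs might be built and checked. It is only an illustration under some assumptions of mine: the function names and the one-entry blocklist are made up, and the checksum is a plain sum of the letter values modulo 26 rather than the multiplication described above, because a product collapses to zero whenever any letter maps to 0. A plain sum still catches any single mistyped letter, though not two letters swapped.

```python
import string

ALPHABET = string.ascii_uppercase  # letter values: A=0 ... Z=25

# Hypothetical blocklist; a real one would come from a dictionary.
BLOCKED = {"NOBRAINS"}


def check_letter(body):
    """Check letter: sum of the letter values modulo 26 (assumed scheme)."""
    total = sum(ALPHABET.index(ch) for ch in body)
    return ALPHABET[total % 26]


def make_id(body):
    """Append the check letter to a 7-letter body, giving an 8-letter ID."""
    if len(body) != 7 or any(ch not in ALPHABET for ch in body):
        raise ValueError("body must be 7 uppercase letters")
    return body + check_letter(body)


def is_valid(candidate):
    """True if the ID is well formed, not blocked, and its check letter matches."""
    if len(candidate) != 8 or any(ch not in ALPHABET for ch in candidate):
        return False
    if candidate in BLOCKED:
        return False
    return check_letter(candidate[:7]) == candidate[7]


def too_close(body, existing):
    """True if body differs from an already issued body in fewer than two positions."""
    return any(
        sum(a != b for a, b in zip(body, other)) < 2
        for other in existing
    )


# Size of the name space for a six-character code.
print(26 ** 6)  # 308915776 with letters only
print(36 ** 6)  # 2176782336 with letters and digits
```

A registry would call too_close before issuing a new body, so that a single typing error can never turn one scientist's ID into another's; is_valid is the cheap check that any tool could run without contacting the registry at all.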

As it stands, I don’t think that this would place too much load on scientists, but it would also not appeal to people; the big win would come when they could use these IDs to make their daily life easier. This could be achieved by sticking an authentication protocol on top, OpenID being the obvious one, although the IDs are generic enough that any authentication system could be stuffed on the end; as the IDs are not going to change over the life of a scientist this should reduce the management load of yet another identifier. Potentially, we could log in to eduroam, various academic tools, wikis and the like all with a single ID. At the last RIN/DCC meeting, many people argued that they need username/password registration; I suggested that this was a significant pain and barrier to reuse; this is true, but the barrier gets a lot lower if the registration process either disappears or every scientist gets to reuse the same ID.

Technologically, I don’t think that this would take a lot of effort to set up. Socially, the demands would be huge; for it to work, the basic technology is not enough; we would need to put in infrastructure to make sure key tools supported the system; JeS and Shibboleth would be obvious first points of contact; adding an OpenID provider would support less formal resources (such as project Wikis), but collaborating with Wikipedia and paying them to add support would help.

In some sense, I look forward to the day that I cease to be Phil Lord and become ADSJWOSK.