Plain English Summary
There are many database resources which describe biological entities such as proteins, and genes available to the researcher. These are used by both biologists and medics to understand how biological systems work which has implications for many areas. These databases store information of various sorts, called annotation: some of this is highly organised or structured knowledge; some is free text, written in English.
The quantity of this material available means that having a computation method to check the annotation is desirable. The structured knowledge is easier to check because it is organised. The free text knowledge is much harder.
Most methods of analysing free text are based around “normal” English; biological annotation uses a highly specialised form of English, heavily controlled and with many jargon words. In this paper, we exploit this specialised form to infer provenance, to understand when sentences were first added to the database, and how they change over time. By analysing these patterns of provenance, we were able to identify patterns which are indicative of inconsistency or erroneous annotation.