Lepron project part 4: pseudowords with entropy of 3.5 bits/letter from trigrams
by Stephen Hewitt | Published | Last updated
It is possible to create usable pseudowords with more than 3 bits of information per letter using a trigram letter model of English. Software demonstrating this is presented here under a GPL free software licence. The first twenty pseudowords generated when the software is run are shown in Table 1. This list is representative and words on it have not been subjectively selected.
As discussed in previous articles, such pseudowords have potential as:
- pronounceable passwords of quantifiable strength
- parts of a globally unique name that can prove itself to belong to a particular cryptographic public key
- a more usable representation of a public key fingerprint than hexadecimal.
|Pseudoword||Letters||Entropy (bits)||Bits per letter|
The trigram model and information content
A trigram model means that as the pseudoword is constructed the choice for the next letter depends on the previous two letters. The model is trained using a corpus or dictionary of words. For each ordered pair of previous two letters the model counts the occurrence of the third letter at every position in every word in the training set. Later, when producing a pseudoword, it mimics this occurrence probability of the third letter.
The implementation here uses 26 letters plus one additional symbol (represented here as ‘0’) to represent a word boundary. It represents either a character before the start of the word or a character after the end of the word and it is clear from the context which, because trigrams here do not span words.
So for example, during training the trigrams extracted and counted from the word ‘cabbage’ would be: 00c, 0ca, cab, abb, bba, bag age, ge0
Suppose during training the counts accumulated for the letters following ‘ab’ are:
‘0’=51,‘a’=153,‘b’=143,‘c’=0,‘d’=33... and so on.
Then the word ‘cabbage’ in the training set would have contributed 1 to towards the count of ‘b’=143
Let T be the total of the counts of all possible letters following ‘ab’. Then in this example
T = 51 + 153 + 143 + 0 + 33 ...
These counts mean that whenever the two previous letters in a pseudoword are ‘ab’ then the probability of ‘b’ for the next letter should be 143/T.
Therefore according to Shannon's information theory the information gained in this context by discovering that the next letter is ‘b’ is log2(T/143) bits. The entropy values in Table 1 were derived on this principle.
In the experimental software here, these bits of information that determine the next letter in the pseudoword are obtained by calling a random number generator and so generating random pseudowords, but in a real application like those listed above, an arithmetic or other encoder would instead take bits from the binary string being encoded. For each word in Table 1 the entropy value shows how many bits from the string would be encoded in this way by an optimal encoder.
The value of entropy per letter in the table is the total entropy of the word divided by the number of letters. From a human usability point of view this gives an estimate of how many letters are needed to convey n bits of information.
From an encoding point of view it is not the whole story because some of the information carried by the pseudoword comes from where the it ends and is not associated with any particular letter. The information that the word has ended is not obtained by reading the last letter of the word, but only by trying to read beyond it. In effect there is an end-of-word marker after the last letter, so a pseudoword of n letters has n + 1 symbols.
In concrete terms, if a long binary string were to be encoded by a sequence of readable pseudowords there would be spaces between the words. These spaces would also be necessary information because it is not possible in general to tell where one of these pseudowords ends and the next starts without them. So a naive interpretation of the value for bits per letter would overestimate the coding efficiency.
Pure trigrams plus a length heuristic
The model used here has been extended from the pure trigram model just described, in order to produce pseudowords of more predictable length. The extension consists of faking the count of the end of word (EOW) symbol being used by the generator.
To avoid very short words, for the first four characters the pseudoword generator pretends that every count for an EOW symbol in its model is zero.
To encourage words not to be very long, once six characters have been generated the generator pretends that the every count for an EOW symbol is twice what it actually was in the training set. And with each letter that is still added to the pseudoword, it keeps doubling the factor by which it inflates every EOW count.
Before the introduction of this heuristic the generator produced for example yerferbinecinceptiantpatestabionings (119.97 bits) and suggelingindefersubia (73.13 bits).
There is scope for optimisation of this heuristic. Experiments on it have not been performed. For example it could:
- allow more letters before it starts to increase the probability of the word ending
- cut off less sharply
- be based on syllables rather than letters
- be based on the entropy of the word so far rather than letters
Comparison with a more primitive model
Table 2 shows twenty randomly-selected pseudowords generated by the simple model presented in Part 2 using the same training set as the trigram model in Table 1. Described in detail in Part 2, essentially it generates pseudowords with alternating segments of consonants and vowels extracted from the training set, with every permutation of the segments equally likely. It can be tuned by selecting the n most frequent segments for each of its segment categories. With the tuning used here there were a total of 598,400,000 possible permutations making the entropy of every word log2(598400000) or ~29.16 bits.
|Pseudoword||Letters||Entropy (bits)||Bits per letter|
GPL free software to download
Here is the experimental software that produced the pseudowords in Table 1. It consists of the following files with the following SHA256 hashes at the following locations on this website.
c0f6d92a8b91eec0f692fae2c6400fba63f8980b51f0322aa70ff904412ab194 /download/train_trigram.cc 2e8bf16d0e0e013a8dcf3f2a3f95b322ea9534e86a6c7981cd49cfc6b1001a2e /download/trichars.gh f53f2174400b301d169e76d224f719c29fba978851e534c4071e63624d6ee0e9 /download/trigram_generate.cc
There are two programmes.
- Programme train_trigram counts the trigrams from a list of words on stdin and prints C declarations representing the counts of the trigrams on stdout. You redirect these declarations to a header file
- Programme trichar_generate then includes the header file
trichars.ghduring compilation. When run it uses the model defined by these counts to print random pseudowords using the library function random() to supply the entropy that they use.
The training set
train_trigram is intended to work on any list of words but
used here was generated from a training set derived as follows.
A BNC-derived word list was available from https://kilgarriff.co.uk/BNClists/all.al.gz in 2021. (After gunzip this became all.al with SHA256 ec9f4afad08c779521c521231dff04dfc621c971d55e735d2f4d7bf0abd2e4eb).
This file starts
100106029 !!WHOLE_CORPUS !!ANY 4124 1 !*?* unc 1 602 % nn0 113 1 %/100 unc 1 3 %/day unc 1
Some ad-hoc cleaning followed, removing tokens that had a count of only one, and then various parts of speech:
cat all.al | "grep -v "^1 " |fgrep -v -e "&" -e "_" -e "np0" -e "-" -e "(" -e " crd " -e " ord " -e "'" -e "/" -e "%" -e "." -e = -e " unc " -e " zz0 " |cut -f2 -d\ |grep -v "[0-9]" |sort -u
After this cleaning the list has 115,149 entries. It still contains tokens such as ‘aahah’ and it is possible that an improvement in subjective quality of the pseudowords could be obtained by more attention to cleaning.
This list was then combined with the list 256v1.sw published in Part 2. The subjective quality of the pseudowords seemed higher with the combined list than with the 8,167 words in 256v.sw alone.
- British National Corpus
- End of word. This term is used here for a conceptual symbol that follows the last character in a pseudoword and shows that the word has ended.
- GNU general public license