Innovation and design

Lepron project part 4: pseudowords with entropy of 3.5 bits/letter from trigrams

by Stephen Hewitt | Published 9 December 2022 | Last updated 11 December 2022

It is possible to create usable pseudowords with more than 3 bits of information per letter using a trigram letter model of English. Software demonstrating this is presented here under a GPL free software licence. The first twenty pseudowords generated when the software is run are shown in Table 1. This list is representative and words on it have not been subjectively selected.

As discussed in previous articles, such pseudowords have potential as:

pronounceable passwords of quantifiable strength
parts of a globally unique name that can prove itself to belong to a particular cryptographic public key
a more usable representation of a public key fingerprint than hexadecimal.

Table 1: The first twenty pseudowords generated by a letter trigram model plus an heuristic to limit the length of the words. The entropy is the theoretical number of bits needed to encode the word in this model. The entropy per letter is that entropy divided by the number of letters but see the text for a caveat about the interpretation of this.
Pseudoword	Letters	Entropy (bits)	Bits per letter
mensiothfur	11	38.06	3.46
heratorre	9	28.61	3.18
seduvrappla	11	37.06	3.37
diseteros	9	29.14	3.24
bedrable	8	24.63	3.08
homer	5	17.49	3.50
jewarals	8	30.70	3.84
mochipor	8	29.14	3.64
slatante	8	26.67	3.33
coridal	7	23.74	3.39
messed	6	16.44	2.74
distem	6	21.98	3.66
atertions	9	24.24	2.69
boils	5	20.68	4.14
reurs	5	18.00	3.60
schizerair	10	34.54	3.45
liffer	6	20.43	3.40
boupartnat	10	41.32	4.13
voider	6	19.80	3.30
nestes	6	17.36	2.89
AVERAGE	7.65	26.00	3.40

The trigram model and information content

A trigram model means that as the pseudoword is constructed the choice for the next letter depends on the previous two letters. The model is trained using a corpus or dictionary of words. For each ordered pair of previous two letters the model counts the occurrence of the third letter at every position in every word in the training set. Later, when producing a pseudoword, it mimics this occurrence probability of the third letter.

The implementation here uses 26 letters plus one additional symbol (represented here as ‘0’) to represent a word boundary. It represents either a character before the start of the word or a character after the end of the word and it is clear from the context which, because trigrams here do not span words.

So for example, during training the trigrams extracted and counted from the word ‘cabbage’ would be: 00c, 0ca, cab, abb, bba, bag age, ge0

Suppose during training the counts accumulated for the letters following ‘ab’ are:

‘0’=51,‘a’=153,‘b’=143,‘c’=0,‘d’=33... and so on.

Then the word ‘cabbage’ in the training set would have contributed 1 to towards the count of ‘b’=143

Let T be the total of the counts of all possible letters following ‘ab’. Then in this example

T = 51 + 153 + 143 + 0 + 33 ...

These counts mean that whenever the two previous letters in a pseudoword are ‘ab’ then the probability of ‘b’ for the next letter should be 143/T.

Therefore according to Shannon's information theory the information gained in this context by discovering that the next letter is ‘b’ is log₂(T/143) bits. The entropy values in Table 1 were derived on this principle.

In the experimental software here, these bits of information that determine the next letter in the pseudoword are obtained by calling a random number generator and so generating random pseudowords, but in a real application like those listed above, an arithmetic or other encoder would instead take bits from the binary string being encoded. For each word in Table 1 the entropy value shows how many bits from the string would be encoded in this way by an optimal encoder.

The value of entropy per letter in the table is the total entropy of the word divided by the number of letters. From a human usability point of view this gives an estimate of how many letters are needed to convey n bits of information.

From an encoding point of view it is not the whole story because some of the information carried by the pseudoword comes from where the it ends and is not associated with any particular letter. The information that the word has ended is not obtained by reading the last letter of the word, but only by trying to read beyond it. In effect there is an end-of-word marker after the last letter, so a pseudoword of n letters has n + 1 symbols.

In concrete terms, if a long binary string were to be encoded by a sequence of readable pseudowords there would be spaces between the words. These spaces would also be necessary information because it is not possible in general to tell where one of these pseudowords ends and the next starts without them. So a naive interpretation of the value for bits per letter would overestimate the coding efficiency.

Pure trigrams plus a length heuristic

The model used here has been extended from the pure trigram model just described, in order to produce pseudowords of more predictable length. The extension consists of faking the count of the end of word (EOW) symbol being used by the generator.

To avoid very short words, for the first four characters the pseudoword generator pretends that every count for an EOW symbol in its model is zero.

To encourage words not to be very long, once six characters have been generated the generator pretends that the every count for an EOW symbol is twice what it actually was in the training set. And with each letter that is still added to the pseudoword, it keeps doubling the factor by which it inflates every EOW count.

Before the introduction of this heuristic the generator produced for example yerferbinecinceptiantpatestabionings (119.97 bits) and suggelingindefersubia (73.13 bits).

There is scope for optimisation of this heuristic. Experiments on it have not been performed. For example it could:

allow more letters before it starts to increase the probability of the word ending
cut off less sharply
be based on syllables rather than letters
be based on the entropy of the word so far rather than letters

Comparison with a more primitive model

Table 2 shows twenty randomly-selected pseudowords generated by the simple model presented in Part 2 using the same training set as the trigram model in Table 1. Described in detail in Part 2, essentially it generates pseudowords with alternating segments of consonants and vowels extracted from the training set, with every permutation of the segments equally likely. It can be tuned by selecting the n most frequent segments for each of its segment categories. With the tuning used here there were a total of 598,400,000 possible permutations making the entropy of every word log₂(598400000) or ~29.16 bits.

Table 2: For comparison twenty randomly selected pseudowords generated by the simple model presented in Part 2, here using the same training set as the trigram model in Table 1. It uses alternating segments of consonants and vowels extracted from the training set, with every permutation of the segments equally likely.
Pseudoword	Letters	Entropy (bits)	Bits per letter
thefuncings	11	29.16	2.65
crossotunts	11	29.16	2.65
clemassobly	11	29.16	2.65
spebancoll	10	29.16	2.92
spamperrame	11	29.16	2.65
blosonily	9	29.16	3.24
tuvethit	8	29.16	3.64
strusuketh	10	29.16	2.92
shinotintly	11	29.16	2.65
frazommow	9	29.16	3.24
frimmekoc	9	29.16	3.24
stactucuk	9	29.16	3.24
crossonguld	11	29.16	2.65
tensivirn	9	29.16	3.24
yampufonts	10	29.16	2.92
lirmiwuse	9	29.16	3.24
blecanecs	9	29.16	3.24
crerachiw	9	29.16	3.24
cugundeps	9	29.16	3.24
slurmuphusly	12	29.16	2.43
AVERAGE	9.85	29.16	2.96

GPL free software to download

Here is the experimental software that produced the pseudowords in Table 1. It consists of the following files with the following SHA256 hashes at the following locations on this website.

c0f6d92a8b91eec0f692fae2c6400fba63f8980b51f0322aa70ff904412ab194  /download/train_trigram.cc
2e8bf16d0e0e013a8dcf3f2a3f95b322ea9534e86a6c7981cd49cfc6b1001a2e  /download/trichars.gh
f53f2174400b301d169e76d224f719c29fba978851e534c4071e63624d6ee0e9  /download/trigram_generate.cc

There are two programmes.

Programme train_trigram counts the trigrams from a list of words on stdin and prints C declarations representing the counts of the trigrams on stdout. You redirect these declarations to a header file trichars.gh.
Programme trichar_generate then includes the header file trichars.gh during compilation. When run it uses the model defined by these counts to print random pseudowords using the library function random() to supply the entropy that they use.

The training set

The programme train_trigram is intended to work on any list of words but the trichars.gh used here was generated from a training set derived as follows.

A BNC-derived word list was available from https://kilgarriff.co.uk/BNClists/all.al.gz in 2021. (After gunzip this became all.al with SHA256 ec9f4afad08c779521c521231dff04dfc621c971d55e735d2f4d7bf0abd2e4eb).

This file starts

100106029 !!WHOLE_CORPUS !!ANY 4124
1 !*?* unc 1
602 % nn0 113
1 %/100 unc 1
3 %/day unc 1

Some ad-hoc cleaning followed, removing tokens that had a count of only one, and then various parts of speech:

After this cleaning the list has 115,149 entries. It still contains tokens such as ‘aahah’ and it is possible that an improvement in subjective quality of the pseudowords could be obtained by more attention to cleaning.

This list was then combined with the list 256v1.sw published in Part 2. The subjective quality of the pseudowords seemed higher with the combined list than with the 8,167 words in 256v.sw alone.

Terms

BNC: British National Corpus
EOW: End of word. This term is used here for a conceptual symbol that follows the last character in a pseudoword and shows that the word has ended.
GPL: GNU general public license

Lepron project part 4: pseudowords with entropy of 3.5 bits/letter from trigrams

The trigram model and information content

Pure trigrams plus a length heuristic

Comparison with a more primitive model

GPL free software to download

The training set

Terms

Related articles

Authenticating a public key from just a name - a feasibility study with GPL code

An idea for public key authentication from a name without certificates or central authority

Lepron, a project to develop pronounceable pseudowords for representing binary strings - Part 1: introduction and goals

Lepron project part 2: towards public key authentication without central authority using names made of pseudowords

Lepron project part 3: A first attempt at pronounceable passwords using dice for quantified strength

How to remember a provably strong password: a new way using ‘constrained choice’

How to remember a 128-bit key using ‘constrained choice’

External links