Innovation and design

Lepron project part 4: pseudowords with entropy of 3.5 bits/letter from trigrams


It is possible to create usable pseudowords with more than 3 bits of information per letter using a trigram letter model of English. Software demonstrating this is presented here under a GPL free software licence. The first twenty pseudowords generated when the software is run are shown in Table 1. This list is representative and words on it have not been subjectively selected.

As discussed in previous articles, such pseudowords have potential as:

Table 1: The first twenty pseudowords generated by a letter trigram model plus an heuristic to limit the length of the words. The entropy is the theoretical number of bits needed to encode the word in this model. The entropy per letter is that entropy divided by the number of letters but see the text for a caveat about the interpretation of this.
Pseudoword Letters Entropy (bits) Bits per letter
mensiothfur 11 38.06 3.46
heratorre 9 28.61 3.18
seduvrappla 11 37.06 3.37
diseteros 9 29.14 3.24
bedrable 8 24.63 3.08
homer 5 17.49 3.50
jewarals 8 30.70 3.84
mochipor 8 29.14 3.64
slatante 8 26.67 3.33
coridal 7 23.74 3.39
messed 6 16.44 2.74
distem 6 21.98 3.66
atertions 9 24.24 2.69
boils 5 20.68 4.14
reurs 5 18.00 3.60
schizerair 10 34.54 3.45
liffer 6 20.43 3.40
boupartnat 10 41.32 4.13
voider 6 19.80 3.30
nestes 6 17.36 2.89
AVERAGE 7.65 26.00 3.40

The trigram model and information content

A trigram model means that, as the pseudoword is constructed, the choice of the next letter depends on the previous two letters. The model is trained using a corpus or dictionary of words: for each ordered pair of preceding letters it counts how often each possible third letter follows that pair, at every position in every word of the training set. Later, when producing a pseudoword, it mimics those occurrence probabilities for the third letter.

The implementation here uses 26 letters plus one additional symbol (represented here as ‘0’) for the word boundary. This symbol stands either for a position before the start of the word or for one after its end; which is meant is always clear from the context, because trigrams here do not span words.

So for example, during training the trigrams extracted and counted from the word ‘cabbage’ would be: 00c, 0ca, cab, abb, bba, bag, age, ge0
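To make the counting concrete, here is a minimal sketch in C++ of how the trigrams of one training word could be extracted and counted, using ‘0’ as the boundary symbol. It is an illustration only, not the downloadable train_trigram.cc, and the type and function names are invented for this sketch.

#include <cstddef>
#include <iostream>
#include <map>
#include <string>

// Count of each third letter, keyed by the two preceding letters.
// For example counts["ab"]['b'] is the number of times 'b' follows "ab".
using TrigramCounts = std::map<std::string, std::map<char, long>>;

// Add the trigrams of one training word to the counts.
// '0' stands for the word boundary, as described in the text.
void count_word(const std::string& word, TrigramCounts& counts)
{
    const std::string padded = "00" + word + "0";        // e.g. "00cabbage0"
    for (std::size_t i = 0; i + 2 < padded.size(); ++i) {
        const std::string context = padded.substr(i, 2);  // previous two letters
        const char next = padded[i + 2];                  // letter that follows them
        ++counts[context][next];
    }
}

int main()
{
    TrigramCounts counts;
    count_word("cabbage", counts);   // 00c, 0ca, cab, abb, bba, bag, age, ge0

    for (const auto& [context, followers] : counts)
        for (const auto& [letter, n] : followers)
            std::cout << context << letter << " : " << n << '\n';
}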

Suppose during training the counts accumulated for the letters following ‘ab’ are:

‘0’=51,‘a’=153,‘b’=143,‘c’=0,‘d’=33... and so on.

Then the word ‘cabbage’ in the training set would have contributed 1 towards the count of ‘b’=143.

Let T be the total of the counts of all possible letters following ‘ab’. Then in this example

T = 51 + 153 + 143 + 0 + 33 ...

These counts mean that whenever the two previous letters in a pseudoword are ‘ab’ then the probability of ‘b’ for the next letter should be 143/T.

Therefore according to Shannon's information theory the information gained in this context by discovering that the next letter is ‘b’ is log2(T/143) bits. The entropy values in Table 1 were derived on this principle.
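Continuing the ‘ab’ example, a small sketch of that probability and information calculation might look like this; the counts are the illustrative figures above, not values from the real training set.

#include <cmath>
#include <iostream>
#include <map>

int main()
{
    // Illustrative counts of the letters following "ab" (not real training data).
    std::map<char, long> followers = {
        {'0', 51}, {'a', 153}, {'b', 143}, {'c', 0}, {'d', 33}
        // ... and so on for the remaining letters
    };

    long T = 0;                                   // total of all counts after "ab"
    for (const auto& entry : followers)
        T += entry.second;

    const long count_b = followers['b'];
    const double p = double(count_b) / T;                 // probability of 'b' next
    const double bits = std::log2(double(T) / count_b);   // information, in bits

    std::cout << "p('b' after \"ab\") = " << p << '\n';
    std::cout << "information        = " << bits << " bits\n";
}

The entropy of a whole pseudoword is then the sum of such terms over every symbol drawn, including the end-of-word symbol discussed below.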

In the experimental software here, the bits of information that determine each next letter of the pseudoword come from a random number generator, so the pseudowords generated are random. In a real application like those listed above, an arithmetic or other encoder would instead take those bits from the binary string being encoded. For each word in Table 1 the entropy value shows how many bits of that string an optimal encoder would encode in this way.
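As a sketch of the generation step (again not the downloadable trigram_generate.cc), the next symbol can be drawn with probability proportional to the trained counts, for example with a simple weighted random choice:

#include <iostream>
#include <map>
#include <random>

// Draw the next symbol after a two-letter context, with probability
// proportional to the trained counts; '0' means end of word.
char draw_next(const std::map<char, long>& followers, std::mt19937& rng)
{
    long T = 0;                                    // total of the counts
    for (const auto& entry : followers)
        T += entry.second;

    std::uniform_int_distribution<long> dist(0, T - 1);
    long r = dist(rng);                            // a random point in [0, T)
    for (const auto& entry : followers) {
        if (r < entry.second)
            return entry.first;                    // this letter owns that slice
        r -= entry.second;
    }
    return '0';                                    // not reached if T > 0
}

int main()
{
    std::mt19937 rng(std::random_device{}());
    // Illustrative counts for the context "ab" (not real training data).
    const std::map<char, long> followers = {{'0', 51}, {'a', 153}, {'b', 143}, {'d', 33}};
    std::cout << "next symbol after \"ab\": " << draw_next(followers, rng) << '\n';
}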

The value of entropy per letter in the table is the total entropy of the word divided by the number of letters. From a human usability point of view this gives an estimate of how many letters are needed to convey n bits of information.

From an encoding point of view this is not the whole story, because some of the information carried by the pseudoword comes from where it ends and is not associated with any particular letter. The information that the word has ended is not obtained by reading the last letter of the word, but only by trying to read beyond it. In effect there is an end-of-word marker after the last letter, so a pseudoword of n letters has n + 1 symbols. For example, ‘homer’ carries 17.49 bits over 5 letters (3.50 bits per letter), but over its 6 symbols that is only about 2.92 bits per symbol once the end marker is counted.

In concrete terms, if a long binary string were encoded as a sequence of readable pseudowords there would be spaces between the words. These spaces are themselves necessary information, because in general it is not possible to tell where one of these pseudowords ends and the next starts without them. So a naive interpretation of the bits-per-letter value would overestimate the coding efficiency.

Pure trigrams plus a length heuristic

The model used here has been extended from the pure trigram model just described, in order to produce pseudowords of more predictable length. The extension consists of faking the counts of the end-of-word (EOW) symbol that the generator uses.

To avoid very short words, for the first four characters the pseudoword generator pretends that every count for an EOW symbol in its model is zero.

To discourage very long words, once six characters have been generated the generator pretends that every count for an EOW symbol is twice what it actually was in the training set, and with each further letter added to the pseudoword it doubles the factor by which it inflates every EOW count.
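A sketch of that adjustment follows. The helper name is hypothetical, and reading ‘the first four characters’ as meaning fewer than four letters generated so far is an assumption.

#include <iostream>

// Return the count the generator should pretend the EOW symbol has, given
// its real count in the training set and the number of letters generated
// so far. A sketch of the heuristic described in the text.
long adjusted_eow_count(long real_count, int letters_so_far)
{
    if (letters_so_far < 4)
        return 0;                       // no very short words: EOW impossible
    if (letters_so_far < 6)
        return real_count;              // pure trigram behaviour
    // From six letters onwards the EOW count is doubled, and the factor
    // doubles again with each further letter: x2, x4, x8, ...
    const long factor = 2L << (letters_so_far - 6);
    return real_count * factor;
}

int main()
{
    for (int letters = 0; letters <= 10; ++letters)
        std::cout << letters << " letters generated -> EOW count "
                  << adjusted_eow_count(51, letters) << '\n';
}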

Before the introduction of this heuristic the generator produced for example yerferbinecinceptiantpatestabionings (119.97 bits) and suggelingindefersubia (73.13 bits).

There is scope for optimisation of this heuristic, although experiments on it have not been performed. For example it could:

Comparison with a more primitive model

Table 2 shows twenty randomly-selected pseudowords generated by the simple model presented in Part 2, using the same training set as the trigram model in Table 1. The model is described in detail in Part 2; essentially it generates pseudowords from alternating segments of consonants and vowels extracted from the training set, with every permutation of the segments equally likely. It can be tuned by selecting the n most frequent segments for each of its segment categories. With the tuning used here there were a total of 598,400,000 possible permutations, making the entropy of every word log2(598,400,000), or ~29.16 bits.

Table 2: For comparison twenty randomly selected pseudowords generated by the simple model presented in Part 2, here using the same training set as the trigram model in Table 1. It uses alternating segments of consonants and vowels extracted from the training set, with every permutation of the segments equally likely.
Pseudoword Letters Entropy (bits) Bits per letter
thefuncings 11 29.16 2.65
crossotunts 11 29.16 2.65
clemassobly 11 29.16 2.65
spebancoll 10 29.16 2.92
spamperrame 11 29.16 2.65
blosonily 9 29.16 3.24
tuvethit 8 29.16 3.64
strusuketh 10 29.16 2.92
shinotintly 11 29.16 2.65
frazommow 9 29.16 3.24
frimmekoc 9 29.16 3.24
stactucuk 9 29.16 3.24
crossonguld 11 29.16 2.65
tensivirn 9 29.16 3.24
yampufonts 10 29.16 2.92
lirmiwuse 9 29.16 3.24
blecanecs 9 29.16 3.24
crerachiw 9 29.16 3.24
cugundeps 9 29.16 3.24
slurmuphusly 12 29.16 2.43
AVERAGE 9.85 29.16 2.96
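For illustration only, here is a sketch of this style of generator. The segment lists and the number of categories are placeholders, not the tuned lists from Part 2, so the entropy it prints will not match the 29.16 bits above.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <random>
#include <string>
#include <vector>

int main()
{
    // Alternating consonant and vowel segment categories. These lists are
    // placeholders for illustration, not the tuned lists used in Part 2.
    const std::vector<std::vector<std::string>> categories = {
        {"th", "cr", "sp", "bl"},   // initial consonant segments
        {"e", "o", "i", "u"},       // vowel segments
        {"ss", "mp", "nc"},         // medial consonant segments
        {"o", "a", "i"},            // vowel segments
        {"ngs", "ll", "nts"}        // final consonant segments
    };

    std::mt19937 rng(std::random_device{}());
    std::string word;
    double bits = 0.0;              // log2 of the number of equally likely words
    for (const auto& segments : categories) {
        std::uniform_int_distribution<std::size_t> pick(0, segments.size() - 1);
        word += segments[pick(rng)];
        bits += std::log2(double(segments.size()));
    }

    std::cout << word << "  (" << bits << " bits)\n";
}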

GPL free software to download

Here is the experimental software that produced the pseudowords in Table 1. It consists of the following files, listed with their SHA256 hashes and their locations on this website.

c0f6d92a8b91eec0f692fae2c6400fba63f8980b51f0322aa70ff904412ab194  /download/train_trigram.cc
2e8bf16d0e0e013a8dcf3f2a3f95b322ea9534e86a6c7981cd49cfc6b1001a2e  /download/trichars.gh
f53f2174400b301d169e76d224f719c29fba978851e534c4071e63624d6ee0e9  /download/trigram_generate.cc

There are two programmes: train_trigram, which builds the trigram counts from a list of words, and trigram_generate, which produces the pseudowords from those counts.

The training set

The programme train_trigram is intended to work on any list of words, but the trichars.gh used here was generated from a training set derived as follows.

A BNC-derived word list was available from https://kilgarriff.co.uk/BNClists/all.al.gz in 2021. (After gunzip this became all.al with SHA256 ec9f4afad08c779521c521231dff04dfc621c971d55e735d2f4d7bf0abd2e4eb).

This file starts

100106029 !!WHOLE_CORPUS !!ANY 4124
1 !*?* unc 1
602 % nn0 113
1 %/100 unc 1
3 %/day unc 1

Some ad-hoc cleaning followed, removing tokens that had a count of only one and then tokens tagged with various parts of speech:

cat all.al | "grep -v "^1 " |fgrep -v -e "&" -e "_" -e "np0" -e "-" -e "(" -e " crd " -e " ord " -e "'" -e "/" -e "%" -e "." -e = -e " unc " -e " zz0 " |cut -f2 -d\ |grep -v "[0-9]" |sort -u

After this cleaning the list has 115,149 entries. It still contains tokens such as ‘aahah’ and it is possible that an improvement in subjective quality of the pseudowords could be obtained by more attention to cleaning.

This list was then combined with the list 256v1.sw published in Part 2. The subjective quality of the pseudowords seemed higher with the combined list than with the 8,167 words in 256v1.sw alone.

Terms

BNC
British National Corpus
EOW
End of word. This term is used here for a conceptual symbol that follows the last character in a pseudoword and shows that the word has ended.
GPL
GNU General Public License
