Lepron, a project to develop pronounceable pseudowords for representing binary strings - Part 1: introduction and goals
by Stephen Hewitt | Published 2 July 2021

Introduction
This article introduces Lepron, a project to develop a system for representing arbitrary binary strings as pronounceable pseudowords. This open-source project is for the benefit of anyone who wants to use it or develop it further.
By pseudowords, I mean strings of letters that look like English words but aren't, such as sneldorton, jelmockenot and hermasile. They have the morphology of English words and a foreigner with limited vocabulary might not be able to distinguish between them and true words.
Lewis Carroll wrote in his 1871 novel Through the Looking-Glass:
'Twas brillig and the slithy toves
did gyre and gimble in the wabe
all mimsy were the borogoves
and the mome raths outgrabe.
“Borogoves” and the words like it are pseudowords.
The object is to define a bijective mapping between all possible values of a binary string and sequences of pronounceable English pseudowords, and to develop open-source software that can perform the mapping in both directions.
This is the first in a planned series of articles about this ongoing work. The series will be written from the perspective of my current knowledge, which is imperfect and evolving during the project.
One of the first things I learned from internet searches is that pseudowords are already a well-known concept amongst linguists. See for example [WUGGY]. It is possible that some of the research on this topic will be useful.
Motivations
There are three motivations for this project.
- The specific goal of pronounceable passwords of quantified strength.
- The specific goal of developing the idea in the Clarion article An idea for public key authentication from a name without certificates or central authority, which itself was inspired by initial experiments creating pseudowords for Lepron.
- The more vague idea that a more human-friendly representation of cryptographic hashes would be useful in authentication and mitigating censorship.
Hashes may in general be useful for authenticating information from the internet or elsewhere, as explained in more detail below.

The more general motivation is based on the following premises:
Authentication of information from the web can be useful for overcoming censorship as explained below.
A human-friendly representation of cryptographic hash values, which are binary strings, is likely to be important and useful for authentication, perhaps in ways currently unforeseen.
The current representation of hash values using hexadecimal, an arbitrary mixture of letters and numbers, as shown in Fig. 1 and 2, is very poor from a usability point of view. If these strings ever have to be viewed by people - and the two figures illustrate situations when they currently are presented to people - then it would be useful to improve on this.
The poor usability of mixed letters and numbers can be deduced partly by intuition but it is also backed by research. One recent paper is [BLANCHARD], which reports better usability for consonant-vowel-consonant constructions for passwords than for mixed case or alphanumeric.
There can be a distinction between cases where numbers and letters are arbitrarily mixed as if they are part of the same character set, and cases where numbers are restricted to particular positions and letters are restricted to certain other positions. An example of the latter is the British postcode, for example "CB1 2AD". Typically, these postcodes start with one or two letters, then have numbers, then finish with two letters.
Hexadecimal represents the worse of these two cases because both numbers and letters can appear interchangeably at any position.
Mitigating censorship
For the purpose of helping people to authenticate information from the web, one method works by comparing the cryptographic hash of a file with the known authentic cryptographic hash. This is a way in which a large amount of data can be authenticated by verifying a relatively short string.
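The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the choice of SHA-256 and the function names `sha256_hex` and `matches_known_hash` are mine and are not part of Lepron. The known hash is assumed to arrive as a hexadecimal string from a source you already trust.

```python
import hashlib
import hmac


def sha256_hex(path, chunk_size=65536):
    """Return the SHA-256 hash of a file as lowercase hexadecimal."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_known_hash(path, known_hex):
    """True if the file's hash equals the known authentic hash."""
    expected = known_hex.strip().lower()  # tolerate upper case and stray spaces
    return hmac.compare_digest(sha256_hex(path), expected)
```

However large the file is, the string a person has to check is always 64 hexadecimal characters for SHA-256. That short string is the part that a pseudoword representation would replace.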
This kind of authentication typically serves the purpose of someone checking that a file is what they expected to get or that a public cryptographic key really belongs to the person they think it does. For example, Figure 1 shows a web browser displaying a 20-byte cryptographic hash from the certificate of the public key used to secure the gov.uk website. The widely-used OpenPGP standard for email encryption and signing also specifies key fingerprints. Figure 2 shows the Guardian newspaper website listing some PGP fingerprints of its reporters. These can be printed on business cards or verified in person.
(Incidentally in both of these figures the cryptographic hash shown is SHA1. Nothing in this article is meant to imply that using SHA1 is a good idea.)
A current use-case for visual comparison of a certificate fingerprint is described in [BRINKMAN]. The context is that you are using a browser inside an organisation that you do not entirely trust. The idea is that you check that there is no current MITM attack on the HTTPS connection of the web browser by comparing the certificate fingerprint reported by the browser (in hex, as in Fig. 1) with the certificate fingerprint reported by a website such as grc.com, which helpfully provides the service of independently fetching the certificate of the website and displaying its fingerprint.
(By the way, this is not a reliable solution to the problem.)
The general ways in which this kind of authentication can mitigate censorship are as follows. If you can authenticate a file using a hash then it does not matter how that file came to you. It could come over a peer-to-peer network like BitTorrent, or via email or from an unknown untrusted website, or be passed on a memory stick.
In the internet as it has largely existed so far, the website is an authenticator of its own contents. You believe that a page was written by a particular author because it is on their website. But censorship can remove a website and that is why alternative ways of authentication can help.
The second kind of authentication, using public key cryptography, goes like this. Suppose that the file that has come to you contains a news report from a reporter that you trust. Further suppose that the reporter has digitally signed the report using public key cryptography. So the signature is either part of the same file or comes with the file. Now if you already have the public key of the reporter you can verify the signature and you know you can trust the report. The provenance of the file - the history of how it got to you - does not matter.
Furthermore, the public key itself could also be included in the file you receive. As long as you already know the authentic fingerprint of the public key you can first authenticate the public key using the fingerprint and then use the public key to authenticate the report.
Pronounceable passwords of quantified strength
The idea here is that to form a password of known strength you generate a random binary string of your chosen length. For example if you want 128-bit strength, then you generate a string 128 bits long. Its representation as pseudowords is then your pronounceable password.
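As a minimal sketch of the first step, the random binary string can be drawn from the operating system's cryptographic random source using Python's standard `secrets` module. The function name `random_key` and the restriction to whole octets are my assumptions for illustration. The hexadecimal rendering at the end only stands in for the pseudoword representation, which does not yet exist at this stage of the project.

```python
import secrets


def random_key(bits=128):
    """Return a uniformly random octet string of the requested strength."""
    if bits <= 0 or bits % 8 != 0:
        raise ValueError("bits must be a positive multiple of 8")
    return secrets.token_bytes(bits // 8)


key = random_key(128)
# Hexadecimal is used here only as a placeholder representation.
print(key.hex())
```

Because every bit is chosen at random and none by the user, the strength of the resulting password is exactly the length of the string in bits, whatever representation is used to display it.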
The principle is the same as the one in my 2018 article on memorable passwords [PW] documenting a method I called “constrained choice”. In both systems the password strength is guaranteed by an underlying binary string that can be regenerated from the password. The difference here is that there is no user choice. The binary string and the passwords have a bijective mapping, whereas in [PW] there was a surjective mapping, which gave some choice in the password (albeit constrained) to form a memorable mnemonic.
A particular implementation of this principle called the “Letter Pair” method in the article resulted in the following example:
bxwgftivbvmcmwplploygmctglbwmrpe
Years of personal experience using passwords like this (of 125-bit strength), including for my main email account, confirm that they are memorable but suggest that a pronounceable password may have one particular advantage: faster typing.
As a reasonably competent touch typist I have to slow down and concentrate on each letter when I enter this kind of password in a way that I do not have to when I enter a password made of pseudowords or real words.
When I invent words like karaltone and tolponte, I can type them much faster than high-entropy sequences like “bxwgftiv”, and I can remember and recognise them better in the short term. I could, for example, compare something on the screen with something on paper in front of me more quickly, more reliably and with less mental effort.
When pronounceable pseudowords have been developed, it will be possible to empirically compare the total typing time of both kinds of passwords. The pronounceable password of equal entropy will have more letters but I suspect it may turn out to be easier and faster to type.
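The difference in length can be estimated with simple arithmetic. The sketch below takes the 32-letter example above at its stated 125-bit strength. It also assumes a pseudoword density of 3.5 bits per letter, a figure borrowed from the title of part 4 of this series listed under Related. Whether the final scheme reaches that density is not established here, and the helper name `letters_needed` is hypothetical.

```python
import math


def letters_needed(bits, bits_per_letter):
    """Smallest whole number of letters that can carry the given number of bits."""
    return math.ceil(bits / bits_per_letter)


letter_pair_example = "bxwgftivbvmcmwplploygmctglbwmrpe"
strength_bits = 125  # strength stated in the text for this example

# Density of the Letter Pair example: 125 bits over 32 letters.
letter_pair_density = strength_bits / len(letter_pair_example)
print(len(letter_pair_example), letter_pair_density)  # 32 3.90625

# Assumed pseudoword density of 3.5 bits per letter.
print(letters_needed(strength_bits, 3.5))  # 36
```

Under these assumptions the pronounceable password would be about four letters longer, before counting any spaces between words.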
Tentative specification of requirements
The object is to define a bijective mapping between all possible values of a binary string and sequences of pronounceable English pseudowords, and to develop open-source software that can perform the mapping in both directions.
By pronounceable is meant that an English speaker would find it easy to form some kind of pronunciation, allowing them to mentally say it inside their head. It is not intended that the word can be unambiguously communicated in speech. The motive for making it pronounceable is to make it more memorable and recognisable by reading. For example treanolto might sound the same as treenolto, even though they are different words. (Distinguishing by pronunciation was a goal of an earlier article, specifically intended for communicating cryptographic key fingerprints by speech [FINGER].)
It is also explicitly not a requirement that all speakers would necessarily decide to pronounce a particular word in the same way. For example the pseudoword frought could be pronounced the same as the true English word fraught or it could be pronounced to rhyme with the English word drought.
By bijective is meant that every binary string maps to a unique sequence of pseudowords and that every valid sequence of pseudowords maps back to a unique binary string. It is not a requirement that every sequence of valid pseudowords is a valid sequence of pseudowords. However, the decoder must map every valid sequence to its binary string and must flag an error for every invalid sequence.
In other words the mapping really must be bijective. This is important because it is going to be used to compare binary strings. In using this representation we want to be sure that two binary strings are equal if and only if their two corresponding pseudoword sequences are equal.
The length of the binary string is unbounded, but must be a multiple of 8, so a more detailed description might say octet string rather than binary string. The length should also not have to be known in advance when mapping from or to the corresponding pseudoword sequence.
Word breaks in the pseudoword sequence are significant. Of course, not every sequence of letters that looks like a pseudoword is in fact a valid pseudoword, and the mapping software must recognise an invalid pseudoword and halt decoding with an error message.
The pseudowords must be of reasonable length. The optimum length is perhaps something to be evaluated, as it is a trade-off between the number of words and the length of each word. But the length should not significantly exceed the length of long English words. The motivation is that it must not tax the ability of an English speaker to hold the word in short-term memory.
Note that in other languages the maximum acceptable length might be different. Software to map from pseudoword to binary must recognise illegal pseudowords and refuse the decoding with an error.
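To make these requirements concrete, here is a deliberately simple toy codec. It is not the Lepron scheme and its words are not meant to look like good English. Every name and table in it is an assumption of mine for illustration: each octet becomes one consonant-vowel-consonant syllable, and syllables are grouped into words of three, with only the last word allowed to be shorter. It shows the properties listed above: octet strings of any length, significant word breaks, and a decoder that rejects anything the encoder could not have produced.

```python
ONSETS = "bdfghjklmnprstvz"  # 16 initial consonants encode the high 4 bits
VOWELS = "aeio"              # 4 vowels encode the middle 2 bits
CODAS = "lnrs"               # 4 final consonants encode the low 2 bits
SYLLABLES_PER_WORD = 3       # assumed word size: 3 syllables, 9 letters


def encode(data: bytes) -> str:
    """Map an octet string to space-separated toy pseudowords."""
    syllables = [
        ONSETS[b >> 4] + VOWELS[(b >> 2) & 3] + CODAS[b & 3] for b in data
    ]
    words = [
        "".join(syllables[i:i + SYLLABLES_PER_WORD])
        for i in range(0, len(syllables), SYLLABLES_PER_WORD)
    ]
    return " ".join(words)


def decode(text: str) -> bytes:
    """Invert encode(); raise ValueError for anything encode() cannot emit."""
    if text == "":
        return b""
    words = text.split(" ")
    out = bytearray()
    for position, word in enumerate(words):
        count, remainder = divmod(len(word), 3)
        is_last = position == len(words) - 1
        # Only the final word may hold fewer than SYLLABLES_PER_WORD syllables.
        length_ok = remainder == 0 and (
            count == SYLLABLES_PER_WORD
            or (is_last and 1 <= count < SYLLABLES_PER_WORD)
        )
        if not length_ok:
            raise ValueError(f"invalid pseudoword: {word!r}")
        for i in range(0, len(word), 3):
            onset, vowel, coda = word[i:i + 3]
            try:
                value = (
                    (ONSETS.index(onset) << 4)
                    | (VOWELS.index(vowel) << 2)
                    | CODAS.index(coda)
                )
            except ValueError:
                raise ValueError(f"invalid pseudoword: {word!r}") from None
            out.append(value)
    return bytes(out)


print(encode(bytes([0x12, 0xAB, 0xCD, 0xEF])))  # darpisson vos
```

In this toy, "bal" is a valid word on its own, but "bal bal" is rejected because a short word may only come last. This is an instance of the point that a sequence of valid pseudowords need not be a valid sequence. The toy carries 8 bits in every 3 letters, so a practical scheme would need to be denser as well as more English-looking.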
References
Links to these are below
- [BLANCHARD] Consonant-Vowel-Consonants for Error-Free Code Entry, Nikola Blanchard, Leila Gabasova, Ted Selker, HCI International, July 2019, Orlando, USA
- [BRINKMAN] Use Fingerprints to determine the authenticity of an Internet website, Martin Brinkmann, Ghacks, 27 July 2013
- [FINGER] A simple way to represent cryptographic key fingerprints, Stephen Hewitt, Cambridge Clarion, 7 June 2020
- [PW] How to remember a provably strong password: a new way using ‘constrained choice’, Stephen Hewitt, Cambridge Clarion, July 2018
- [WUGGY] Wuggy: A multilingual pseudoword generator, Emmanuel Keuleers and Marc Brysbaert, 2010
Related
- An idea for public key authentication from a name without certificates or central authority May 2021 Stephen Hewitt
- Lepron project part 2: towards public key authentication without central authority using names made of pseudowords August 2021 An open source project by Stephen Hewitt
- Lepron project part 3: A first attempt at pronounceable passwords using dice for quantified strength September 2021 An open source project by Stephen Hewitt
- Lepron project part 4: pseudowords with entropy of 3.5 bits/letter from trigrams Possible applications include pronounceable passwords and usable public key authentication
- An idea for human-friendly hex strings in cryptographic key fingerprints August 2022 Stephen Hewitt
- How to remember a provably strong password: a new way using ‘constrained choice’ July 2018, Stephen Hewitt. The 2nd Clarion data privacy article
- How to remember a 128-bit key using ‘constrained choice’ August 2018, Stephen Hewitt. The 3rd Clarion data privacy article
- A design for passwords with system-assigned randomness and user choice July 2023 Stephen Hewitt
- A simple way to represent cryptographic key fingerprints 7 June 2020, Stephen Hewitt. The 5th Clarion data privacy article
- Empirical explorations of faster Fermat factorisation, part 1 February 2022, A technical article - optimisations for a simple algorithm
External links
- Ghacks article: Use Fingerprints to determine the authenticity of an Internet website Martin Brinkmann, Ghacks, 27 July 2013
- HAL archives: Consonant-Vowel-Consonants for Error-Free Code Entry Nikola Blanchard, Leila Gabasova, Ted Selker, HCI International, July 2019, Orlando (PDF SHA256 91c33b61cde964aa427aaa803e78a5978cd205e046ac134c9d46d6505095dbca)
- Behavior Research Methods article: Wuggy: A multilingual pseudoword generator Emmanuel Keuleers and Marc Brysbaert, Behavior Research Methods, 2010 (PDF SHA256 19713a1f426b824b743c55d8c1d38703f56526da54b8990bc58a43642524ee59)