Lepron, a project to develop pronounceable pseudowords for representing binary strings - Part 1: introduction and goals

by Stephen Hewitt | Published 2 July 2021

Fig. 1: Chrome web browser on Windows in 2021 displaying a certificate fingerprint as a hexadecimal string ("4ad4c7421"...). The purpose of the Lepron project described here is to create a more human-friendly representation for such strings.

Introduction

This article introduces Lepron, a project to develop a system for representing arbitrary binary strings as pronounceable pseudowords. This open-source project is for the benefit of anyone who wants to use it or develop it further.

By pseudowords, I mean strings of letters that look like English words but aren't, such as sneldorton, jelmockenot and hermasile. They have the morphology of English words and a foreigner with limited vocabulary might not be able to distinguish between them and true words.

Lewis Carroll wrote in his 1871 novel Through the Looking-Glass:

'Twas brillig and the slithy toves
did gyre and gimble in the wabe
all mimsy were the borogoves
and the mome raths outgrabe.

“Borogoves” and words like it are pseudowords.

The object is to define a bijective mapping between all possible values of a binary string and sequences of pronounceable English pseudowords, and to develop open-source software that can perform the mapping in both directions.

This is the first article in a planned series of articles about this on-going work. The series will be written from the perspective of my current knowledge which is imperfect and evolving during the project.

One of the first things I learned from internet searches is that pseudowords are already a well-known concept amongst linguists. See for example [WUGGY]. It is possible that some of the research on this topic will be useful.

Motivations

There are three motivations for this project.

  1. The specific goal of pronounceable passwords of quantified strength.

  2. The specific goal of developing the idea in the Clarion article An idea for public key authentication from a name without certificates or central authority, which itself was inspired by initial experiments creating pseudowords for Lepron.

  3. The more vague idea that a more human-friendly representation of cryptographic hashes would be useful in authentication and mitigating censorship.

Hashes may in general be useful for authenticating information from the internet or elsewhere, as explained in more detail below.

Fig. 2: The email address and PGP key fingerprint of journalists in an alphabetical list on the Guardian web site at https://www.theguardian.com/pgp in May 2021. The fingerprint is shown as a hexadecimal string.

The more general motivation is based on the following premises:

  1. Authentication of information from the web can be useful for overcoming censorship as explained below.

  2. A human-friendly representation of cryptographic hash values, which are binary strings, is likely to be important and useful for authentication, perhaps in ways currently unforeseen.

  3. The current representation of hash values using hexadecimal, an arbitrary mixture of letters and numbers, as shown in Fig. 1 and 2, is very poor from a usability point of view. If these strings ever have to be viewed by people - and the two figures illustrate situations when they currently are presented to people - then it would be useful to improve on this.

The poor usability of mixed letters and numbers can be deduced partly by intuition but it is also backed by research. One recent paper is [BLANCHARD], which reports better usability for consonant-vowel-consonant constructions for passwords than for mixed case or alphanumeric.

There can be a distinction between cases where numbers and letters are arbitrarily mixed as if they are part of the same character set, and cases where numbers are restricted to particular positions and letters are restricted to certain other positions. An example of the latter is the British postcode, for example "CB1 2AD". Typically, these postcodes start with one or two letters, then have numbers, then finish with two letters.

Hexadecimal represents the worse of these two cases because both numbers and letters can appear interchangeably at any position.

Mitigating censorship

One way to help people authenticate information from the web is to compare the cryptographic hash of a file with its known authentic hash. In this way a large amount of data can be authenticated by verifying a relatively short string.
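The comparison described above can be sketched in a few lines of Python. This is only an illustration, not part of Lepron: the function names sha256_hex and is_authentic are invented here, SHA-256 is an arbitrary choice of hash, and the known hash is assumed to arrive as a hexadecimal string.

```python
import hashlib
import hmac
import io


def sha256_hex(stream) -> str:
    """Hash a binary stream in chunks, so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(65536), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_authentic(stream, known_hex: str) -> bool:
    """Return True if the stream's SHA-256 hash equals the known authentic hash."""
    # Hexadecimal is case-insensitive, so normalise before comparing.
    return hmac.compare_digest(sha256_hex(stream), known_hex.strip().lower())


if __name__ == "__main__":
    # io.BytesIO stands in for a file opened with open(path, "rb").
    data = b"a file obtained from an untrusted source"
    known = hashlib.sha256(data).hexdigest()
    print(is_authentic(io.BytesIO(data), known))
```

However large the file, the person only has to verify the short string passed as known_hex, which is the string a more human-friendly representation would replace.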

This kind of authentication typically serves the purpose of someone checking that a file is what they expected to get or that a public cryptographic key really belongs to the person they think it does. For example, Fig. 1 shows a web browser displaying a 20-byte cryptographic hash from the certificate of the public key used to secure the gov.uk website. The widely-used OpenPGP standard for email encryption and signing also specifies key fingerprints. Fig. 2 shows the Guardian newspaper website listing some PGP fingerprints of its reporters. These can be printed on business cards or verified in person.

(Incidentally in both of these figures the cryptographic hash shown is SHA1. Nothing in this article is meant to imply that using SHA1 is a good idea.)

A current use-case for visual comparison of a certificate fingerprint is described in [BRINKMAN]. The context is that you are using a browser inside an organisation that you do not entirely trust. The idea is that you check that there is no current MITM attack on the HTTPS connection of the web browser by comparing the certificate fingerprint reported by the browser (in hex, as in Fig. 1) with the certificate fingerprint reported by a website such as grc.com, which helpfully provides the service of independently fetching the certificate of the website and displaying its fingerprint.

(By the way, this is not a reliable solution to the problem.)

The general ways in which this kind of authentication can mitigate censorship are as follows. If you can authenticate a file using a hash then it does not matter how that file came to you. It could come over a peer-to-peer network like BitTorrent, or via email or from an unknown untrusted website, or be passed on a memory stick.

In the internet as it has largely existed so far, the website is an authenticator of its own contents. You believe that a page was written by a particular author because it is on their website. But censorship can remove a website and that is why alternative ways of authentication can help.

The second kind of authentication, using public key cryptography, goes like this. Suppose that the file that has come to you contains a news report from a reporter that you trust. Further suppose that the reporter has digitally signed the report using public key cryptography. So the signature is either part of the same file or comes with the file. Now if you already have the public key of the reporter you can verify the signature and you know you can trust the report. The provenance of the file - the history of how it got to you - does not matter.

Furthermore, the public key itself could also be included in the file you receive. As long as you already know the authentic fingerprint of the public key you can first authenticate the public key using the fingerprint and then use the public key to authenticate the report.

Pronounceable passwords of quantified strength

The idea here is that to form a password of known strength you generate a random binary string of your chosen length. For example if you want 128-bit strength, then you generate a string 128 bits long. Its representation as pseudowords is then your pronounceable password.
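As a minimal sketch of the first step only, the random binary string could be generated as follows. The function name random_octets is invented for this illustration, and the mapping from the octets to pseudowords is not shown because it is the part still to be developed.

```python
import secrets


def random_octets(strength_bits: int = 128) -> bytes:
    """Return a cryptographically random octet string of the chosen strength."""
    if strength_bits <= 0 or strength_bits % 8 != 0:
        raise ValueError("strength must be a positive multiple of 8 bits")
    # secrets draws from the operating system's secure random source.
    return secrets.token_bytes(strength_bits // 8)


if __name__ == "__main__":
    octets = random_octets(128)
    print(len(octets) * 8, "bits")
```

The strength of the resulting password is then the length of this string in bits, whatever the pseudoword representation looks like.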

The principle is the same as the one in my 2018 article on memorable passwords [PW] documenting a method I called “constrained choice”. In both systems the password strength is guaranteed by an underlying binary string that can be regenerated from the password. The difference here is that there is no user choice. The binary string and the passwords have a bijective mapping, whereas in [PW] there was a surjective mapping, which gave some choice in the password (albeit constrained) to form a memorable mnemonic.

A particular implementation of this principle called the “Letter Pair” method in the article resulted in the following example:

bxwgftivbvmcmwplploygmctglbwmrpe

Years of personal experience using passwords like this (of 125-bit strength), including for my main email account, confirm that they are memorable but suggest that a pronounceable password may have one particular advantage: faster typing.

As a reasonably competent touch typist I have to slow down and concentrate on each letter when I enter this kind of password in a way that I do not have to when I enter a password made of pseudowords or real words.

When I invent words like karaltone and tolponte I can type them much faster, and remember and recognise them in the short term better, than high entropy sequences like “bxwgftiv”. This means that I could, for example, compare something on the screen with something on paper in front of me more quickly, more reliably and with less mental effort.

When pronounceable pseudowords have been developed, it will be possible to empirically compare the total typing time of both kinds of passwords. The pronounceable password of equal entropy will have more letters but I suspect it may turn out to be easier and faster to type.

Tentative specification of requirements

The object is to define a bijective mapping between all possible values of a binary string and sequences of pronounceable English pseudowords, and to develop open-source software that can perform the mapping in both directions.

By pronounceable is meant that an English speaker would find it easy to form some kind of pronunciation, allowing them to say the word inside their head. The motive for making it pronounceable is to make it more memorable and recognisable by reading. It is not intended that the word can be unambiguously communicated in speech. For example treanolto might sound the same as treenolto, even though they are different words. (Distinguishing by pronunciation was a goal of an earlier article, specifically intended for communicating cryptographic key fingerprints by speech [FINGER].)

It is also explicitly not a requirement that all speakers would necessarily decide to pronounce a particular word in the same way. For example the pseudoword frought could be pronounced the same as the true English word fraught or it could be pronounced to rhyme with the English word drought.

By bijective is meant that every binary string maps to a unique sequence of pseudowords and that every valid sequence of pseudowords maps back to a unique binary string. It is not a requirement that every sequence of valid pseudowords is itself a valid sequence. However, the decoder must map every valid sequence to its binary string and must flag an error for every invalid sequence.

In other words the mapping really must be bijective. This is important because it is going to be used to compare binary strings. In using this representation we want to be sure that two binary strings are equal if and only if their two corresponding pseudoword sequences are equal.

The length of the binary string is unbounded, but must be a multiple of 8, so a more detailed description might say octet string rather than binary string. The length should also not have to be known in advance when mapping from or to the corresponding pseudoword sequence.

Word breaks in the pseudoword sequence are significant. Of course, not every sequence of letters that looks like a pseudoword is in fact a valid pseudoword, and the mapping software must recognise an invalid pseudoword and halt decoding with an error message.

The pseudowords must be of reasonable length. The optimum length is perhaps something to be evaluated, as it is a trade-off between the number of words and the length of each word. But the length should not significantly exceed the length of long English words. The motivation is that it must not tax the ability of an English speaker to hold the word in short-term memory.

Note that in other languages the maximum acceptable length might be different. Software to map from pseudoword to binary must recognise illegal pseudowords and refuse the decoding with an error.
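To make these requirements concrete, here is a toy codec that satisfies the formal ones: bijectivity, octet strings of any length, significant word breaks and an error for every invalid input. It is emphatically not the Lepron mapping, which has yet to be designed. The letter tables, the consonant-vowel-consonant syllable per octet, the two-octet word size and the names encode and decode are all assumptions invented for this sketch, and its words are far less English-like than the project intends.

```python
INITIALS = "bdfghjklmnprstvz"  # 16 choices carry the top 4 bits of an octet
VOWELS = "aeio"                # 4 choices carry the next 2 bits
FINALS = "lnrs"                # 4 choices carry the bottom 2 bits
OCTETS_PER_WORD = 2            # every word except possibly the last holds 2 octets


def encode(data: bytes) -> str:
    """Map an octet string to a sequence of toy pseudowords."""
    syllables = [INITIALS[b >> 4] + VOWELS[(b >> 2) & 3] + FINALS[b & 3] for b in data]
    words = ["".join(syllables[i:i + OCTETS_PER_WORD])
             for i in range(0, len(syllables), OCTETS_PER_WORD)]
    return " ".join(words)


def decode(text: str) -> bytes:
    """Map a valid sequence back to its octet string, or raise ValueError."""
    words = text.split(" ") if text else []
    out = bytearray()
    for index, word in enumerate(words):
        is_last = index == len(words) - 1
        # Word breaks are significant: only the last word may be short.
        if len(word) != 3 * OCTETS_PER_WORD and not (is_last and len(word) == 3):
            raise ValueError(f"invalid pseudoword sequence at {word!r}")
        for i in range(0, len(word), 3):
            c1, v, c2 = word[i:i + 3]
            try:
                out.append(INITIALS.index(c1) << 4 | VOWELS.index(v) << 2 | FINALS.index(c2))
            except ValueError:
                raise ValueError(f"invalid pseudoword {word!r}") from None
    return bytes(out)


if __name__ == "__main__":
    sample = bytes([0x4A, 0xD4, 0xC7])
    print(encode(sample))
    print(decode(encode(sample)) == sample)
```

Because encode is deterministic and decode accepts only strings that encode can produce, two octet strings are equal if and only if their word sequences are equal. Neither function needs to know the length in advance, since each octet is handled independently as it arrives.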

References

[BRINKMAN]
Use Fingerprints to determine the authenticity of an Internet website, Martin Brinkmann, Ghacks, 27 July 2013
[PW]
How to remember a provably strong password: a new way using ‘constrained choice’, Stephen Hewitt, Cambridge Clarion, July 2018
[FINGER]
A simple way to represent cryptographic key fingerprints, Stephen Hewitt, Cambridge Clarion, 7 June 2020
[WUGGY]
Wuggy: A multilingual pseudoword generator, Emmanuel Keuleers and Marc Brysbaert, 2010
