Innovation and design

Lepron project part 2: towards public key authentication without central authority using names made of pseudowords

by | Published

Background

This is the second article in a planned series of technical articles about Lepron, an open source project to develop a system for representing arbitrary binary strings as pronounceable pseudowords. This project is for the benefit of anyone who wants to use it or develop it further.

The first article [LEPRON] introduced three motivations for the project and this article concentrates on the second of these, which was to enable the authentication of cryptographic public keys from a (long) name without needing a central authority.

The difficulty this addresses has has sometimes been called “Zooko's Triangle” and the idea is explained in the earlier Clarion article [GLOBALNAMES].

Introduction

This article explores the idea of using a grammar for the encoding specification defined in [GLOBALNAMES]. It presents the software developed so far and explains what it does and why, followed by a discussion of results so far and ideas for further work.

The software is available for download under GPL (Gnu Public Licence). It comes with no warranty. This version is not intended for practical use, but for research and development. It consists of about 560 lines of C++.

It is available to download from this website at /download/lepron-r1.tgz (SHA256 27362660ae42214839bdc3bf73905f01eff482acf3cff3ce08c01165969e07e9).

What it does

It maps every number in the range of 0 to 652,800,000 - 1 (slightly more than 29 bits) to a different pronounceable pseudoword.

The number is passed on the command line and the programme prints the pseudoword. In the example that follows the binary string is represented by the decimal number 4839232 and the resulting pseudoword is “chinapinge”.

~ # ./encode 4839232
sizeof of unsigned long 4
encode: first grammar (only words starting with a consonant) follows:
encode: 598400000 is your range
encode: 4839232 encoded = chinapinge
encode: second grammar (words starting with a consonant or vowel) follows:
encode: 652800000 is your range
encode: 4839232 encoded = chinapinge
~ #

The programme can be tuned and rebuilt to change the value of this range. It does not yet convert the pseudoword back to the number.

In fact, in more detail, it performs two different mappings, but in its current configuration the first one is just a subset of the second, with range restricted to 598,400,000 different values.

Why it does it

The above is the external behaviour of the prototype. In internal behaviour it is also a prototype for an approach to the specification of the bijective mapping from number to pseudoword. The concept of this specification was introduced in [GLOBALNAMES] and was called the Encoding Specification. The goal is to develop a formalism for this specification that is general and could be used with pseudowords in any language or for multiple different mappings in the same language.

This goal goes beyond the original goal of Lepron which was to invent one useful mapping for English pseudowords. However the general framework will be needed for the goal of authenticating a public key from a name introduced in [GLOBALNAMES].

Details of the approach in this release are below in the section Encoding Specification.

How it works

This prototype takes a simplistic approach to forming words, which offers obvious scope for improvement. It works by treating the pseudoword as essentially a sequence of alternating clusters of consonants and vowels. It generates different permutations of these clusters to generate different pseudowords.

The clusters themselves are derived from a corpus or dictionary of real words.

In more detail, the consonant clusters are classified into a start cluster, which occurs at the start of the word, a middle cluster, which occurs between vowel clusters, and an end cluster, which occurs at the end of the word and always includes any trailing vowels on the end of the word.

There are two executable programmes in this release. The first programme (lepron.cc) accepts a list of real words and from it derives the vowel clusters and consonant clusters. It counts the occurrences of each cluster and can be tuned to keep only the N most frequent of each cluster type. Currently this is the only statistical technique used.

Then the second programme accepts this information (in a compile-time header file, terminals.gh) and simply joins these different clusters together to form a pseudoword in the manner just described. The current implementation always makes pseudowords of the following form:

<start><vowel><middle><vowel><middle><vowel><end>

The selection of the particular cluster within each of these cluster types is determined by the number being encoded.

The scope for obvious improvement is that generation does not not yet use n-gram frequencies or other statistical methods with these clusters. This means, for example, that every cluster of consonants is equally likely to follow a particular cluster of vowels, regardless of which vowel cluster it is.

The Encoding Specification

The earlier article [GLOBALNAMES] stated “A major open question is the encoding specification.”

The experimental approach here is to regard encoding and decoding the name as a parsing task. The name is a sentence in a language. The idea is that the language will be defined by a grammar within the encoding specification. Parsing the name will extract the the original binary string. Conversely, generating a name corresponds to generating a sentence in the language.

The grammar is a CFG (context free grammar). The type of CFG allowed is restricted in a very simple and severe manner. The restriction is that recursion or repetition is not allowed. This means the grammar is less powerful than a regular grammar on the Chomsky hierarchy. It also means that the language has a finite number of sentences and that the length of a sentence is bounded.

The restriction is easy to understand and enforce. For example, it could be implemented in a BNF (Backus-Naur Form) grammar specification by requiring that every non-terminal used on the right hand side of a rule had already been defined on the left-hand side of a rule further up the page.

The grammar will be written into the encoding specification using some formal notation, such as BNF or a derivative of it. This part is not implemented yet. In the current software, the grammar exists only in abstract form, represented by data structures built within the programme.

Design motivations

As explained in [GLOBALNAMES], checking an encoding specification means verifying that all of the 2128 (or more) possible different values of the key fingerprint map to different names. Checking the encoding specification is critical to the security of the system.

Enumerating the names to check this is intractable, so the system needs to be able to prove that the specification would produce 2128 different names, without actually producing them. A grammar is one possible solution to this.

If the grammar is unambiguous and if it can be demonstrated that every different value of the fingerprint causes a unique selection of nodes (non-terminals or terminals) in the parse tree, then it follows that each value of the fingerprint produces a unique name.

Possible benefits of this approach are

So the idea here is that in generating a pseudoword the choice of each optional production rule from the grammar depends on the value of the number being encoded. This happens without any explicit specification of numbers in the grammar. The way this is done is explained below.

Encoding from a supertree that defines the grammar

Figure 1 illustrates what will be called here a supertree for a grammar. A supertree is the superposition of every possible parse tree in the language.

                   pseudoword
                        |
   ------------------------------------------
   |         |             |         |      |
   V         V             V         V      V
{start} vowel_middle vowel_middle {vowel} {end}
   |         |             |         |      |
   V         |             |         V      V
 -----  ---------        --------  -----  -----
 |||||  |       |        |      |  |||||  |||||
        V       V        V      V
     {vowel} {middle} {vowel} {middle}
        |       |        |       |
        V       V        V       V
      -----  -------   -----  -------
      |||||  |||||||   |||||  |||||||

                pseudoword
                    |
  --------------------------------------
  |        |            |         |    |
  V        V            V         V    V
start vowel_middle vowel_middle vowel end
  |        |            |         |    |
  V        |            |         V    V
  ch   ---------     --------     i   nge
       |       |     |      |
       V       V     V      V
     vowel  middle vowel  middle
       |       |     |      |
       V       V     V      V
       i       n     a      p

{} = choice node: children are a set of options

Fig 1. A simple grammar's supertree, containing
every possible parse tree, with one of the
parse trees it generates.

The grammar shown in Figure 1 is the simpler of two grammars used in this release, defined in grammar.cc. The following is its BNF-style representation.

  1. <start> ::= "wr"| "str"| "sn"| "sw"| "z"| "th" ... (etc)
  2. <middle> ::= "k"| "rn"| "ng"| "ck"| "z"| "mm" ... (etc)
  3. <vowel> ::= "u"| "o"| "i"| "e"| "a"
  4. <vowel_middle> ::= <vowel> <middle>
  5. <end> ::= "nge"| "ft"| "sk"| "wn"| "be"| "na"| "gy" ... (etc)
  6. <pseudoword> ::= <start> <vowel_middle> <vowel_middle> <vowel> <end>

Parsing or generation consists of conceptually pruning the supertree until only the required parse tree remains.

The supertree is defined in the following way. It has terminal and non-terminal nodes like a traditional parse tree, but every node must be either a sequence node or a choice node. Note that the prohibition of sequence and choice in the same node or rule imposes a constraint on the way the grammar is written but it is only a cosmetic one that does not alter its expressive power.

A sequence node has an ordered list of children that always appear in the parse tree in the same order as in the supertree.

A choice node has an unordered set of children of which one, and only one, must be used in the parse tree of a particular sentence. The choice nodes are the points at which the pruning is done.

In this grammar all the choice nodes are terminals, but this does not have to be the case. The second grammar example in grammar.cc includes a non-terminal choice node, that decides whether to produce a word starting with a vowel or a word starting with a consonant.

Arithmetic or range encoding is used to map from a number to a name. The grammar supertree allows a conceptually simple approach to this encoding as follows.

Every node counts the number of different terminal strings that can appear under it in a parse tree. This number is called its range. In the case of a sequence node, this number is the product of the ranges of all its children. In the case of a choice node, it is the sum of the ranges of all its children. The reasons for this are obvious.

Encoding then consists of a node accepting an integer between 0 and R - 1 inclusive, where R is its range, and producing in response the terminal string corresponding to the number.

A non-terminal node does this by delegating the task to its children. The way in which it does this depends on whether it is a sequence node or a choice node, as follows.

The encoding at a sequence node

A sequence node splits the number into the appropriate range for each of its children in turn and passes that integer to the child. It extracts the numbers for its children by treating them as digits in a variable-base integer.

For example, suppose a sequence node has two children, one with a range of 3 and one with a range of 10. That means that its own range is 30 and during encoding it will be asked to encode an integer between 0 and 29. Let the number it is asked to encode be N. Then it treats N as if it were a two-digit number in a variable-base number system as follows:

N = A*3 + B

where A (0-9) is the integer passed to the child with range 10 and B (0-2) is the integer passed to the child with range 3. The split for the children is performed with division and/or modular arithmetic. In this example:

B = N (mod 3)

A = N/3 ignoring remainder.

The encoding at a choice node

A choice node uses range encoding to choose one of its children and passes that child the integer which it calculates is the offset into that child's range. Suppose for example that a choice node has three children, which have ranges of 5, 10 and 17 respectively. Then the range of the parent node will be 32 and during encoding it will be passed an integer in the range 0 to 31 inclusive.

Conceptually the parent node is able to produce 32 different terminal strings and it is responsible for defining a bijective map of these on to the integers 0 to 31. The parent achieves this by mapping the first 5 integers to the first child, the second 10 integers to the second child and the last 17 integers to the third child.

So for example if it is passed a value of N = 27 to encode, it calculates that N is beyond the range of the both the first and second child (since N > 14). It therefore allocates the encoding to the third child, and the offset into that child's range will be 27 - 14. The parent therefore passes the third child an integer of value 13 to encode.

A conceptual nicety of representing the grammar supertree

Figure 2 shows the internal data structures used in the current programme (grammar.cc) to represent the grammar as a supertree.

           pseudoword
               |
  ---------------------------
  |        |     |     |    |
  V        V     V     |    V
start   vowel_middle   |   end
              |        |
           -------     |
           |     |     |
           |     V     |
           \   middle  /
            \         /
             \       /
              \     /
               \   /
                \ /
                | |
                V V
               vowel

Fig 2. The programme's representation
of the supertree of the grammar
(omitting terminals)

It is apparent that that this structure does not directly represent a tree. It cannot fit the definition of a tree because the node vowel has two incoming edges and also because there are multiple edges between a pair of nodes: pseudoword and vowel_middle.

These conceptual difficulties can be resolved if it is viewed as a compressed tree. From this point of view, the reason for two edges to vowel_middle node from word is compression. In the same way that a compression algorithm might replace the second occurrence of a duplicate piece of text with a reference to the first occurrence, the grammar tree is compressed so that when a second child occurs that is identical to another child then it is simply replaced with a reference to the first child. This grammar supertree could be uncompressed before using it, but in reality that would make no difference to the way the programme functions because the nodes in the tree do not store any information specific to a particular parse. Uncompressing would only mean making identical copies of nodes like vowel and vowel_middle.

In the programme of course there has been no actual attempt to compress. This representation has been derived directly from the grammar rules. There is a direct correspondence between each node of the (conceptually compressed) grammar supertree and each rule of the grammar. Each node represents the left side of a rule. And where the right side of a rule refers to a node defined earlier in the grammar, a directed arc is made from the left node to that earlier node.

An alternative way of viewing this might be to argue that a child in the grammar supertree is a type definition whereas the corresponding child in the parse tree is an instantiation of that type. The reason that type definitions look superficially similar to instantiations is that the definition of a type is the definition of what it contains, in the same way that the type definition for a structure in a programming language is a statement of its members.

Results

As noted in [GLOBALNAMES] the lepron project as a whole has shown that usable pseudowords can carry enough information to make a 128-bit name feasible, despite a simplistic method for producing them.

The article included the example words “foupouren tottymouge scourivynd flemiongunge”.

Although these are usable, the quality (in my subjective view) has improved since restricting the vowel segment to only the 5 most frequent in the dictionary. which is the default in the software released here.

For reference, to change this value in the code, see the following.

--- a/lepron.cc
+++ b/lepron.cc
@@ -293,7 +293,7 @@ int main(int argc, char* argv[])
transfer2vector(end_fragments, end_fragvec);
trim_to_top_n(start_fragvec, 44);
-    trim_to_top_n(vowel_fragvec, 5);
+    trim_to_top_n(vowel_fragvec, 9);
trim_to_top_n(middle_fragvec, 40);
trim_to_top_n(end_fragvec, 68);

In the current software, changing the restriction back from the 5 most frequent vowel segments to the 9 most frequent vowel segments increases the number of words (those starting with a consonant) from 652,800,000 to 3,489,868,800 - from slightly more than 29 bits to slightly more than 31 bits.

But it also means that words like “yonyckionge” are generated which (again in my subjective opinion) is not of sufficient quality.

The software here did not attempt to improve the method of production but to demonstrate a parsing approach as a framework, which it has done.

Further work

The aim now is to improve the quality of the words by using statistical techniques and information such as n-gram frequencies. and after this to investigate whether and how these can be included in the parsing framework described in this article.

Clearly a grammar can produce any finite set of words. In the worst case there could be a grammar rule for each word. The question is whether the number of grammar rules will be tractable, and (possibly) whether deducing them will be tractable.

So for example consider the toy grammar

  1. <start> ::= "r"| "str"| "sn"| "y"
  2. <vowel> ::= "u"| "o"| "i"| "e"| "a" |"y"
  3. <end> ::= "nge"| "ft"| "s"| "n"| "b"| "na"
  4. <pseudoword> ::= <start> <vowel> <end>

This grammar produces words like “strynge”, “ron” and “yaft”. Suppose you want to avoid the potential double “y” causing a word like “yywn” but allow all the other words.

So you could reformulate the grammar something like the following:

  1. <start_without_y> ::= "r"| "str"| "sn"
  2. <start> ::= <start_without_y> | "y"
  3. <vowel_without_y> ::= "u"| "o"| "i"| "e"| "a"
  4. <vowel> ::= <vowel_without_y> | "y"
  5. <word_starting_y> ::= "y" <vowel_without_y> <end>
  6. <word_not_starting_y> ::= <start_without_y> <vowel> <end>
  7. <end> ::= "nge"| "ft"| "s"| "n"| "b"| "na"
  8. <pseudoword> ::= <word_starting_y> | <word_not_starting_y>

The more complex grammar can perhaps be generated automatically from a simple grammar with some additional constraints or annotations. There may be something to be learned from feature and unification grammars. So in the above example the first grammar would be annotated in some way with the restriction “no double y” and the system would split the existing grammar rules and produce new rules like the above.

This might not be tractable if each restriction keeps doubling the number of rules involved, so that growth is exponential with respect to the number of restrictions.

Next steps

  1. Investigate The Generalized Smallest Grammar Problem by Payam Siyari and Matthias Gallé, 2016. The abstract says “We investigate the reasons and propose an extended formulation that seeks to minimize non-recursive grammars, instead of straight-line programs.”

  2. Use bi-gram frequencies from the dictionary (bi-grams of the vowel and and consonant clusters, not of letters) to generate pseudowords, and investigate their quality.

  3. Is it possible to generate a grammar which explicitly or implicitly includes this statistical information and consequently generates better pseudowords?

References

Links to these are below

[GLOBALNAMES]
An idea for public key authentication from a name without certificates or central authority, Stephen Hewitt, Cambridge Clarion, 14 May 2021
[LEPRON]
Lepron, a project to develop pronounceable pseudowords for representing binary strings - Part 1: introduction and goals, Stephen Hewitt, Cambridge Clarion, 2 July 2021

Related articles

External links