Authenticating a public key from just a name - a feasibility study with GPL code

This article presents prototype GPL open-source software called Lepron that demonstrates the concept of authenticating a cryptographic public key from a long public name without needing trust. Lepron generates a public name from a natural language word model, which is chosen by the creator of the name. The article presents word models for English and Italian, with sample generated names, allowing evaluation of their human usability.

It discusses the further work that is needed before real-life use of the system. This work includes elucidating the security properties of the concept in general and the current implementation in particular.

The concept

The concept is that the long name itself defines a cryptographic public key fingerprint, as illustrated in Figure 1. This concept was described in an earlier article [GLOBALNAMES] and is briefly recapitulated here.


               Key file
          +-----------------+
          | Nonce           |
          | Associated info |
          | Grammar (PCFG)---->--+
          | Public key      |    |
          +-----------------+    |
                    |            V
                    V            |
            hash(Key file)       |
              |            +-----+
              |            |
              V            V
   generate(Fingerprint, PCFG)
       |
       |
       V          Public name
 +---------------------------------------------+
 | Benjamin Egesianky Rehippy Depiraund Hakuss |
 +---------------------------------------------+

Figure 1. A long public name of words and pseudowords representing the fingerprint of a cryptographic public key. Production rules in an unambiguous PCFG are driven by the numeric value of the hash to generate a name unique to that hash value.

A cryptographic hash as a key fingerprint is widely used, but what is new here is the mapping of this numeric value to a readable name, with the definition of the mapping itself included in the hashed data. This mapping is provided by the PCFG (probabilistic context-free grammar). The grammar converts an input numeric value to a name of words and pseudowords by using range encoding to select production rules.

A new name is created starting with the public key in a key file and then deriving the name.

public key --> public key file --> name

The concept relies on the properties of a cryptographic hash to make it computationally intractable to go in the reverse direction, to start with a name and find a public key file that generates it.

To authenticate a key, a user takes a supplied key file and follows this same process, shown in Figure 1, to regenerate the name, also performing validity checks on the PCFG. The process is to calculate a cryptographic hash of the key file and then to use it as a numeric input to the PCFG which generates a corresponding name as output. If the PCFG is confirmed valid then the public key in the key file is considered authentic for the name generated.
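
To make the authentication sequence concrete, the following is a minimal Python-style sketch of the process just described. The helper names (extract_pcfg, check_pcfg_valid, hash_key_file, generate_name) are hypothetical stand-ins for illustration only; they are not functions from the Lepron sources.

def authenticate(key_file_bytes, known_name):
    pcfg = extract_pcfg(key_file_bytes)           # hypothetical: obtain the PCFG referenced by the key file
    if not check_pcfg_valid(pcfg):                # the validity checks on the PCFG discussed below
        return False
    fingerprint = hash_key_file(key_file_bytes)   # cryptographic hash of the whole key file
    name = generate_name(fingerprint, pcfg)       # range encoding drives the production rules
    return name == known_name                     # the public key is authentic for known_name only on a match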

To be valid the PCFG needs two properties, which must be checked during authentication:

These two properties together mean that the name is generated by only one value of the cryptographic hash, resulting in the following proposition.

The security proposition

Suppose the key fingerprint input to the PCFG consists of n bits. In the random oracle model [ROM], given a particular target name that has been generated from a key file, for each hashing operation on a different key file there is at most a 1/2^n chance of finding a collision with the target name.

Justification

In the random oracle model it is not possible to predict anything about the hash image before querying the hash function. Therefore, before querying the hash function, all 2^n values of the key fingerprint are equally likely. At most one of these values will generate the target name if the PCFG is valid under the definition given above. Therefore for each hashing operation, there is at most a 1/2^n chance of colliding with the target name.

Further work

Further study of the security provided by the system is needed for two reasons. First, in the real world random oracles do not exist, and proving security only with a random oracle is potentially problematic [REVISITED], [CAREFUL]. Second, in this prototype n = 128, and further study is needed to understand whether this offers adequate security.

Both reasons are discussed in more detail in the further work section below.

Motivation and use-cases

The motivating use-case is representing a long-term identity of a person or an organisation. The public key would be a signing key, perhaps restricted to signing other public keys for every-day use.

One use case would be for a journalist or an editor to digitally sign articles. Digital signatures could provide an end-to-end authenticity guarantee from a reporter to a reader who has the reporter's long name. This could help to defend against misinformation and censorship because if the reader can authenticate a file using a digital signature then it does not matter where the file came from. It could come via email, over a peer-to-peer network, from an untrusted website, or be passed on a memory stick.

A human-readable name can be distributed widely, on business cards, letter headings and sign posts, appear on office buildings, slides and plaques and name badges at public talks and videos and be inserted into public records such as registered trademarks or registered companies. Making a public key public knowledge makes key equivocation - where a purported authority for a key sometimes shows a victim a different one - more difficult.

A globally unique name could even be used in a website address, for example:

https://stephen-olyndforbra-clanivelle-cloged-hewitt.org.uk

If a website with a long name became well-known before it was censored, then the name would already be well distributed amongst its readers, being in browser histories and caches, links from other websites, private emails and so on.

The threat model

The threat model is that the adversary can supply a false key file but they cannot manipulate the published name. The system succeeds in its purpose if an adversary has a negligible chance of supplying a false key file that induces someone who knows the long public name to use the wrong public key.

The system does not attempt to prevent a potential adversary from creating two key files with different keys that both map to the same arbitrary name created by that adversary. This is the so-called birthday “paradox”. For this prototype implementation with n = 128, this can be achieved with ~2^64 operations, which is tractable. However the ability of a would-be adversary to do this does not affect the security of anyone else's name, provided it was created independently.

Choosing a name

A new name can be created by its owner only in the same sequence as it is authenticated, starting with the public key in the key file and then deriving the name.

A nonce is included in the key file so that the owner can find a name by mining. This means repeatedly incrementing the nonce and generating another candidate name until an acceptable one is found.
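
A rough sketch of this mining loop, reusing the hypothetical hash_key_file() and generate_name() helpers from the sketch above, might look like the following; the acceptance test is whatever the owner chooses, here a name starting "Stephen" and ending "Hewitt".

def mine(key_file, grammar):
    nonce = 0
    while True:
        key_file.set_nonce(nonce)                          # hypothetical mutator for the nonce field
        fingerprint = hash_key_file(key_file.to_bytes())
        candidate = generate_name(fingerprint, grammar)
        if candidate.startswith("Stephen ") and candidate.endswith(" Hewitt"):
            return nonce, candidate                        # an acceptable name has been found
        nonce += 1                                         # otherwise increment the nonce and try again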

The names in Figure 2 were generated in this way while accepting only those starting "Stephen" and ending "Hewitt". The prototype mining implementation presented here has not been optimised for speed and was able to test about 20,000 names/second on each core of a desktop PC. Each core took on average ~2.6 hours to find one of these names.

Stephen Elpead Oplipuffress Saftmer Hewitt
Stephen Lardadifes Thbother Nousedfo Hewitt
Stephen Taphick Limficintapoc Wardet Hewitt
Stephen Nette Tibecona Ettle Bacimpop Hewitt
Stephen Olyndforbra Clanivelle Cloged Hewitt
Stephen Ortiness Sworgy Rughwascuisiz Hewitt
Stephen Unmatidned Pezempay Eloselibe Hewitt
Stephen Womeds Bortaxod Hinfluritimmi Hewitt
Stephen Abneack Rowerompons Serriandla Hewitt
Stephen Addepodge Vitersplea Imathumes Hewitt
Stephen Heampripollar Lakagrop Chewave Hewitt
Stephen Knoilessimp Puslonam Exismerta Hewitt
Stephen Pokcuoute Wafflewables Welloul Hewitt
Stephen Zingeing Psturfer Lodsts Jihop Hewitt
Stephen Businsed Prexpleber Lizuchestra Hewitt
Stephen Cellifting Inarkieblop Etchboad Hewitt
Stephen Colvilarflic Inenes Buifiestmag Hewitt
4b5fb098aa9264d849ef263992295094
Figure 2. Names generated by a prototype English word model, selected to start with “Stephen” and end with “Hewitt”. Each name represents a 128-bit fingerprint and the shortest name is 42 characters long. For comparison, the hex fingerprint of one of the names is at the bottom. This is the first part of a list, sorted by length, of about 90 names that were found in about 72 hours on a desktop PC.
Abacca De Abaccasin Lusca Papassu Coretti Coranga
Menuorto Batta Leoni Maiolotte Prottozzo Palberro Manierti
Cassi Berrilini Silanceste Dinini Croni Spagnardi La Morasasa
Perrieri De Capuccara Belfiesta Uggiusco Damorinova Molego
Massini Naroste Tello Lucchi Gustini Abrubini Sansecchi Leo Gusca
Mantoruna De Corrafedri Burconi Leoni Vierra Santurlau
Vitinico Ferrasti Ratole Mellaro Lodini Bottini Ziti Giudi
Parardia Batti Zanotti Marice Sileni Balvala Romasqualazz
Zaroni Oruso Pizzi Frati Colondis Bevito Sacci De Roscarco
Alettini Maglippa Savalenti Rimorti Sanotti Caspi Biso Bovecci
Bosti Lanovatti Felli Belonego Magli Lantino Carli Di Aporeli
Bolfiermin Berra Valmi Marchescosi Gobbi Mazzi Ballo Rioli
Minini Nonevanna Cottano Essolino Bosicchi Trani Baggi Freoni
Aldicicia La Valti Facciasa Lazzanzo Letta Teliccia
Beltino Bagge Bericchi Zanevaleo Spadda Luccapolpini Altrubila
Catelli Baffo Parbini Nentanne Vionti Lumbertefana Barosca
Figure 3 Unfiltered names generated by a word trigram model trained on 1,000 Italian surnames. Each line represents a 128-bit fingerprint as in Figure 2, but here with lower average entropy per letter, reflected in a longer line length. Empirical trials showed an average of 2.18 bits/character, and an average of 7.5 words per 128-bit name.

Related work

Rivest and Lampson proposed a system where “principals” were public signature keys [SDSI]. Each principal would have its own namespace and could issue a certificate asserting a binding between a public key and a name in its own namespace. Further, this could be done with arbitrary levels of indirection using names from the namespaces of others. Elsewhere this has been called a “petname” system, although that word does not appear in the document.

Wachs, Schanzenbach and Grothoff wrote a paper presenting the GNU Name System, “a censorship-resistant, privacy-preserving and decentralized name system designed to provide a secure alternative to DNS” [GNS]. The paper is mostly concerned with DNS (although DNS is not in the title) but they also wrote “As GNS can bind names to any kind of cryptographically secured token, it can double in some respects as an alternative to some of today's Public Key Infrastructures, in particular X.509 for the Web”. There is also an IETF RFC [RFC9498], dated November 2023, by Schanzenbach, Grothoff and Fix.

These approaches are related to Lepron in that they attempt to authenticate a public key from a name, but Lepron differs from these in that it does not require the cooperation of large numbers of people or reference to any external system, although that might also be helpful.

In 2014 there were closely related discussions on the “messaging” email list at moderncrypto.org, according to its archives available in 2024, for example at https://moderncrypto.org/mail-archive/messaging/2014/000080.html.

The prototype implementation

In the prototype implementation here, the grammar is in its own file, separate from the key file, as shown in Figure 4. Authentication of a public key against a name therefore requires both a grammar file and a key file.


       Key file             Grammar file
 +-----------------+    +------------------+
 | Nonce           |    | Word PCFG        |
 | Associated info |    | %%               |
 | Public key      |    | Phrase Grammar   |
 | Hash(Grammar)<-------| %%               |
 | Length(Grammar) |    | Word permutation |
 +-----------------+    +------------------+
           |                         |
           V                         |
   hash(Key file)                    |
    |                                |
    |                                |
    V  Key fingerprint (128 bits)    |
 +--------------------------------+  |
 |4b5fb098aa9264d849ef263992295094|  |
 +--------------------------------+  |
                  |                  |
                  |          +-------+
                  |          |
                  V          V
    generate(Fingerprint, Grammar)
       |
       V
 +----------------------------------------------+
 | Stephen Olyndforbra Clanivelle Cloged Hewitt |
 +----------------------------------------------+

Figure 4. Authentication of a public key by regeneration of its owner's name, illustrated with a real name generated by the prototype implementation from the key file olynd.key and the grammar file trigram.gram.

One motivation for the split into two files in the prototype implementation is that hashing the key file is faster than hashing the grammar file, because the key file is smaller. For example the demonstration grammar file is ~1.7 MB while the key file is ~1.3 KB. This makes name mining more tractable on a desktop CPU. Another motivation, for a widely-used system, is that many key files might use the same grammar file, so a large grammar file would need to be distributed only once rather than with every key.

The names of these files are passed on the command line to a program called namekey, which performs validity checks and, if all is well, prints a long name. The public key in the key file is considered authentic for the name printed.

Figure 5 shows namekey generating a name from the example files in this implementation.

~/lepron # ./namekey olynd.key trigram.gram
Lepron namekey proof-of-concept experimental test version v0
Do not use this demonstration prototype for real-world security
Copyright 2023 Stephen Hewitt www.cambridgeclarion.org
This is free software with absolutely no warranty
grammar_hash_hex 0c00a1a3352668ded548b0c3cf0d858e922a5de5
debug key file says grammar length =1702050
key file says grammar_hash_hex 0c00a1a3352668ded548b0c3cf0d858e922a5de5
Ngrammar says  grammar is acyclic
lepron double hash = 4b5fb098aa9264d849ef2639922950949bf5a461
raw = 4b5fb098aa9264d849ef263992295094
ms 64bits, ls 64bits 5431253845375345880 5327518912909758612
debug grammar nencode taking on board initial 62 bits
Input raw/commission 1357813461343836470 / 4611686018427387904
Output raw/commission 12491357686156 / 20886353735421; raw change = -1357800969986150314
hewitt  = 17.752 bits (2.536 bits/char)
----------------------------
debug node nencode taking on board 17 bits
Input raw/commission 1637267234639848695 / 2737616156809101312
Output raw/commission 50 / 340; raw change = -1637267234639848645
olyndforbra  = 52.838 bits (4.403 bits/char)
----------------------------
debug node nencode taking on board 49 bits
Input raw/commission 28471001354948756 / 191402984163246080
Output raw/commission 12176836 / 79500184; raw change = -28471001342771920
clanivelle  = 31.165 bits (2.833 bits/char)
----------------------------
Input raw/commission 12176836 / 79500184
Output raw/commission 85 / 244; raw change = -12176751
cloged  = 18.314 bits (2.616 bits/char)
----------------------------
Input raw/commission 85 / 244
Output raw/commission 0 / 1; raw change = -85
stephen  = 7.931 bits (0.991 bits/char)
----------------------------
name size including spaces between words 45
name entropy = 128.000 bits (2.844 bits/char)
Generated name = [hewitt#olyndforbra#clanivelle#cloged#stephen##]
Permuted name = [stephen#olyndforbra#clanivelle#cloged#hewitt#]
Authentication successful.  The key file is valid for the following name:
Stephen Olyndforbra Clanivelle Cloged Hewitt
~/lepron #
Figure 5. Namekey generating a name from the example files included in this implementation.

Also in this implementation is an executable called demogram, which exists purely to demonstrate concepts relating to the grammar. Rather than hashing the key file, demogram accepts a fake fingerprint value on the command line, together with another argument giving the number of bits in it, limited to a maximum of 62. Below, demogram is used to demonstrate some points in this article itself.

Another executable, verigram, is similar. It also accepts a fake fingerprint instead of hashing the key file, but this is always 128 bits, like the real fingerprint. The fake fingerprint is passed on the command line as two decimal numbers, representing the most significant and least significant 64 bits.

The key file

Lepron certificate version = v0.01
nonce = 1994143800002
The intent is to demonstrate authenticity of the following public key
in representing the long name that is derived from it by
cryptographic hashing of this file and the grammar file below,
which defines an unambiguous mapping from the hash value to the name.
The format of the following public key is specified by RFC7468 Section 13.
-----BEGIN PUBLIC KEY-----
MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEA19AsDpPXA0EhTwiNk1sn
bwEkGOz/hjD1GHxeyOINHtk8agvntmRKGzq1bh+YZDxWtsig/xcmXSSZFTRL9ZUH
Qi89Thn6BOBISACtLwTtkSOJXKkwEqzxlC9mO9jSGfdQxdtYezlZSUNk3as1vH8u
piM51LtVCjzQSvbQfGL/edo13ZJQCK7Iso32Cz/U9BM6CIpTbtS5SQ/BSRyyO2c3
8b4Ptir36W0EP/0rj2aNoGvmLG0ttW702tc5m2XrnGZyko+74UsImQdHBMe3Pnql
YoruTgarku2yyJeOLx27/7kqZ9Q6lxfAeminCEruaSq34gHF0YVXsSg+0iqiYqit
h+bwY55bYMihJ8cNjzWkV52R4GC/Lvy6CuIAyxclks6jxKZFv4kV+d/2xyjndNXD
fa88PCZmItgaRjAiYrx6nuK+RlFsEXkLv+slcKGLuzNcXEdYw+74/Ta1/Qyeh8ef
uisIl5Epfz2Bx/cNxIxM7nZWStMHh7BOyQBit3WicLDtQb//S5023UC/dIRwOYY6
UJuCNx0a0HskIo6WkriCWyf+A7a2OGKL0XNFVzBhGo3xXWc9KelSEOApdovMZQcd
K8fdvDYvDFsiyP3HZkqDXGF4bDfVC+AiLOcYXsuYqsgBN20ZupMsuMt1ZPPR8AXZ
GccQC1BHjfGk+UzofOQu5isCAwEAAQ==
-----END PUBLIC KEY-----
RIPEMD160('L'|RIPEMD160(gramar)|grammar) = 0c00a1a3352668ded548b0c3cf0d858e922a5de5
length(grammar) = 1702050
Figure 6. A key file in the prototype implementation. This example generates the name “Stephen Olyndforbra Clanivelle Cloged Hewitt”. In the final design Lepron will include additional fields, discussed in the text. Fields shown here are the nonce, the public key and the cryptographic hash of the grammar file. Lepron's security proposition is that the contents of this file are protected and authenticated by the owner's use of the corresponding name.

Figure 6 shows an example of a key file that generates the name “Stephen Olyndforbra Clanivelle Cloged Hewitt”.

The grammar file

The grammar file defines a mapping from number to name. It is called a grammar file in this implementation because most of its contents consist of a grammar partitioned into two sections.

The motivation for choosing a grammar and specifically a CFG for the formalism is the hope that it is flexible enough to allow the description of models for many languages, and to do so in ways that do not necessarily need to be foreseen.

For example, the grammar file olynd.gram contains three different types of word model. These are a trigram letter model, a hand written grammar for short words and a list of around 102,000 names.

In the current implementation, a name is defined as a sequence of words. The first section of the grammar generates individual words under control of the second section which assembles the words into a sequence. A final, third section of the file specifies permutations which change the order of the words from the order in which they are generated to the order in which they appear in the name. The motivation for this will become apparent.

pseudoword = start middle end #;
start = { b "br" d t p };
middle = { a e i o u };
end = { d n "ll" s "rd" t};
lion="lion";
tabby="tiddles";
cheetah="cheetah";
cat = {lion(2) tabby(5) cheetah(3)};
lastname =  cat #;
%%
name = lastname pseudoword maybemore;
maybemore = {{ #() pseudoword_eow }};
pseudoword_eow = pseudoword #;
%%
(2 1)(2 3 1)
Figure 7. A toy example of a grammar file showing three sections separated by ‘%%’.
~/lepron # ./demogram toy.gram 7 5
Ngrammar says  grammar is acyclic
ls 64bits 7
debug grammar nencode taking on board initial 5 bits
Input raw/commission 7 / 32
Output raw/commission 7 / 8; raw change = -0
lion  = 2.000 bits (0.400 bits/char)
----------------------------
Input raw/commission 7 / 8
Output raw/commission 0 / 1; raw change = -7
pad  = 3.000 bits (0.750 bits/char)
----------------------------
name size including spaces between words 9
name entropy = 5.000 bits (0.556 bits/char)
Generated name = [lion#pad##]
Permuted name = [pad#lion#]
Name generation successful.  The passed raw value generated the following name:
Pad Lion
~/lepron #
Figure 8. The demonstration application demogram generating a name using the grammar shown in Figure 7.

The general syntax of the grammar file

The grammar file describes a particular name grammar using another grammar that might therefore be called a meta-grammar. It is important to avoid confusion between the name grammar and the meta-grammar.

The syntax of the grammar file is described here informally, mostly using examples. A more formal definition of this meta-grammar is the YACC code that is used to parse the grammar file in this prototype software, yargram.y. This description assumes familiarity with basic parsing terminology such as “nonterminal”.

The first section of the grammar is a probabilistic context free grammar (PCFG) or weighted CFG. The second section of the grammar can also be annotated with weights, but their meaning is different and it is not a true PCFG.

The type of CFG in both sections of the grammar is restricted: no recursion or repetition is allowed. The language generated by such a grammar is therefore finite: it contains a finite number of sentences and the length of each sentence is bounded. A further simplification is the prohibition of null productions: no nonterminal may expand to nothing.

The first two sections of the grammar file are essentially a list of grammar rules, in a variant of BNF form. Some examples of these rules are illustrated here with equivalent rules in the well-known syntax of YACC.

At least one nonterminal defined in the first section must represent a word. A rule in the second section of the grammar can include such a word-generating nonterminal. In contrast a rule in the first section cannot include anything from the second section. The result is that in any parse tree, nonterminals from the first section appear under nonterminals from the second section, but not vice versa. The motivation for this restriction is to avoid ambiguity in the grammar, explained later.

In the current implementation the terminals are restricted to the 26 lower-case letters of the alphabet (a-z) and the special symbol ‘#’. A future implementation, able to generate names in all the languages of the world, would need to allow additional terminals in the grammar, presumably using Unicode. A word is defined as one or more letters followed by #, for example “lion#”. The string generated by the second section must consist of one or more such words followed by #, for example “lion#pad#bad##”.

Permutations from the generated order of words to the order in the name are specified by a list in the third section of the file. Each list of numbers in brackets is a permutation for a name containing the same number of words as the number of numbers in the brackets.

For example, in Figure 7 “(2 3 1)” applies to names consisting of 3 words. Its effect is to put the second generated word first, the third generated word second and the first generated word last. So the generated example “lion#pad#bad##” would be permuted to “pad bad lion”.

The final name is defined as the capitalised version of this string, so in this case it would be “Pad Bad Lion”.
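
A short sketch of this permutation step, assuming a generated string that uses '#' as the word terminator as above (variable names are illustrative, not from the Lepron sources):

def apply_permutation(generated, permutations):
    words = generated.rstrip('#').split('#')               # "lion#pad#bad##" -> ["lion", "pad", "bad"]
    perm = permutations[len(words)]                        # the permutation is chosen purely by word count
    reordered = [words[i - 1] for i in perm]               # entries are 1-based positions of generated words
    return ' '.join(w.capitalize() for w in reordered)     # capitalise each word to form the final name

print(apply_permutation("lion#pad#bad##", {2: (2, 1), 3: (2, 3, 1)}))   # prints "Pad Bad Lion"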

As a notational convenience, each of the terminals is represented in the grammar file unquoted as itself. These correspond to YACC's “literal” terminals in single quotes.

For example the sequence rule:

abc = a b c;

is equivalent to YACC:

abc : 'a' 'b' 'c';

As another convenience, a string of letters in double quotes, such as "abc", is a shorthand for a sequence of those letters. So this example could be equivalently written as:

abc = "abc";

With the exception of a single lower-case letter, any unquoted string of letters is a nonterminal, which must be defined on the LHS of exactly one rule.

The grammar file first section

In the first section, each rule is either a sequence rule or a choice rule. In both cases, the RHS of the rule consists of an ordered list of terminals and nonterminals. The difference is that the choice rule encloses the list in {} curly brackets as the following examples show.

A sequence rule
year = spring summer autumn winter;
A choice rule
season = {spring summer autumn winter};
A choice rule with weights
bird = {owl(1) robin(4) thrush(2) sparrow(3)};

During word generation sequence and choice nonterminals are expanded differently. The nonterminal defined on the LHS of a sequence rule is always expanded to the fixed sequence on the RHS. The nonterminal defined on the LHS of a choice rule is expanded to one element chosen from the list on the RHS. The choice is made according to the numeric value being encoded, using a form of range encoding described later.

The sequence rule equivalent in YACC
year : spring summer autumn winter;
The choice rule equivalent in YACC
season : spring | summer | autumn | winter;

The range encoding is the motivation for the second form of choice rule shown, where every element in the RHS is annotated with a weight. The meaning of the weight value is essentially the relative frequency with which that element will be chosen during expansion of this rule. So in this example sparrow would be chosen 3 out of 10 times, and robin 4 out of 10 times.

Choice rules like the first one with no weight annotation have implicit weights. The weight assigned to each element is the count of the different strings that could be generated under that symbol. For example in the toy grammar in Figure 7 we have the following rule.

pseudoword = start middle end #;

Now for example

end = { d n "ll" s "rd" t};

This means that there are 6 different possible strings that could be generated when end is expanded. So the weight assigned to end in the rule would be 6. The same applies to the other symbols, which means that in Figure 7 the first rule is exactly equivalent to the following:

pseudoword = start(5) middle(5) end(6) #(1);

Note that the number of strings generated by a nonterminal is counted disregarding any weight annotations encountered while expanding it.
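
The implicit weights can be computed by a simple recursive count of the distinct strings each symbol can generate, ignoring any weight annotations. The following sketch assumes a hypothetical rule table mapping each nonterminal to ('seq', [...]) or ('choice', [...]); terminals are any symbols absent from the table.

def count_strings(symbol, rules):
    if symbol not in rules:
        return 1                                    # a terminal such as 'b' or '#' is one string
    kind, rhs = rules[symbol]
    counts = [count_strings(s, rules) for s in rhs]
    if kind == 'seq':
        total = 1
        for c in counts:
            total *= c                              # a sequence multiplies the counts of its parts
        return total
    return sum(counts)                              # a choice sums the counts of its alternatives

For the toy grammar, start and middle each count 5 and end counts 6, so the rule behaves exactly like the weighted version shown above.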

Range encoding in the PCFG for a word

The PCFG in the first section of the grammar file works in the following way. As for any CFG, generation of a word proceeds by expanding nonterminals, replacing each with the RHS of its defining rule. Throughout this process, the range encoder maintains and modifies two integer values, which determine the choices made in production rules. One of these is the “commission”, so called because it can be considered as the number of values that the current rule has been “commissioned” to handle. The commission is the number of different suffixes that should be available to complete the string of terminals generated so far.

In other words, it is the number of cases that the PCFG still needs to handle. The second integer, called the “raw” value, is a zero-based integer denoting the particular case being encoded within this set. Hence:

0 <= raw < commission

At every point in the grammar where there is a choice to be made - ie at the expansion of every choice nonterminal - the PCFG works by partitioning the commission into ranges of values. The size of each partition is proportional to the weight marked in the grammar for that choice.

For example consider the following grammar rule.

cat = {lion(2) tabby(5) cheetah(3)};

Suppose the commission was 1000 at the point of expanding this rule. Then the weights would be scaled up to lion=200, tabby=500, cheetah=300.

Then the value of raw would be used to decide the choice of cat. A value of raw in the range 0-199 would be a lion, a value in the range 200-699 would be a tabby and a value in the range 700-999 would be a cheetah.

Suppose raw = 750. Then the cat would be expanded to a cheetah and the new value of raw would be 750 - 700 = 50 and the new value of commission would be the range of the cheetah = 300. If instead raw = 100 then the cat would be expanded to a lion, the new value of raw would be 100 and the new value of commission would be the range of the lion = 200.

The full scaling rule is more complicated than this, because it also has to cope when the commission is not an exact multiple of the sum of the weights in the choice rule. The full rule is that the weights are scaled up by the largest integer that does not make their scaled sum exceed the commission. The scaled range of the first nonterminal in the rule is then increased by adding to it any excess of the commission over the sum of the scaled ranges.

For example, suppose the commission had been 32. In this case the scaling factor would be 3 and the sum of the scaled ranges would be 30. The excess commission would be 2, which is added to the lion. The final ranges would therefore be lion = 8, tabby = 15 and cheetah = 9.

Handling the remainder in this way distorts the probability language model of the grammar but it does not affect security. This distortion is ignored in this implementation because the commission is usually very much larger than the integers in the weights which makes the remainder added relatively small.
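
The partitioning and selection can be sketched in a few lines of Python. The sketch follows the scaling rule as described above (it does not attempt to model the low-commission case discussed next) and reproduces the numbers from the cat example; the function names are illustrative, not from the Lepron sources.

def scaled_ranges(commission, weights):
    scale = commission // sum(weights)              # the largest integer scaling that fits
    ranges = [w * scale for w in weights]
    ranges[0] += commission - sum(ranges)           # any excess commission goes to the first element
    return ranges

def choose(raw, commission, weights):
    low = 0
    for index, size in enumerate(scaled_ranges(commission, weights)):
        if raw < low + size:
            return index, raw - low, size           # chosen element, new raw, new commission
        low += size

print(scaled_ranges(1000, [2, 5, 3]))   # [200, 500, 300]
print(scaled_ranges(32, [2, 5, 3]))     # [8, 15, 9]
print(choose(750, 1000, [2, 5, 3]))     # (2, 50, 300)  i.e. cheetah
print(choose(100, 1000, [2, 5, 3]))     # (0, 100, 200) i.e. lion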

Another effect related to partitioning at low values of commission happens when the commission is smaller than the sum of the weights on the RHS of a choice rule. The current range encoding allows this, meaning that choices towards the RHS of the rule might be unreachable. However, this situation is the motivation for the “raising” rule explained later.

By this repeated partitioning the remaining commission is reduced at every choice node until it becomes 1 and there are no more cases to consider and generation of the name is complete. Note that the raw value itself is never rescaled. It is reduced to zero purely by successive subtractions.

A more formal definition of what expansion of a nonterminal or terminal means will now be given. This definition is recursive, meaning that expansion is defined in terms of itself.

To expand a nonterminal defined by a choice rule, a symbol is chosen from the RHS following the procedure just defined, with the corresponding modification of the raw and commission values, and then the chosen symbol is expanded.

To expand a nonterminal defined by a sequence rule every symbol on the RHS of the rule is expanded in turn, from left to right.

To expand a terminal, the value of the terminal is appended to the string generated so far. The raw and commission values are not changed.

Generation of a word means expansion of the nonterminal that represents the word.

Range encoding a word with 64-bit arithmetic

Although the key fingerprint is 128 bits, the current implementation manages with arithmetic of just 64 bits. Since 64-bit arithmetic is widely supported by current compilers, there is no dependency on a large-number library such as GMP.

This is achieved by limiting the commission that is used to generate each word to 62 bits and breaking the fingerprint down by bit-shifting rather than arithmetic. Before generation of a new word starts, the raw value is topped up to 62 bits by shifting in more bits from the fingerprint (while there are any left).
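
One way to realise this top-up, written to be consistent with the "taking on board ... bits" lines and the raw/commission values printed in Figure 5, is sketched below. It assumes generation starts with raw = 0 and commission = 1 and that the 128-bit fingerprint is consumed most significant bits first; this is an illustration, not code from the Lepron sources.

def top_up(raw, commission, fingerprint, bits_left):
    LIMIT = 1 << 62                                     # the per-word commission is limited to 62 bits
    k = 0
    while k < bits_left and (commission << (k + 1)) <= LIMIT:
        k += 1                                          # how many fingerprint bits can be taken on board
    if k == 0:
        return raw, commission, fingerprint, bits_left
    take = fingerprint >> (bits_left - k)               # the k most significant remaining bits
    fingerprint &= (1 << (bits_left - k)) - 1           # strip them from the fingerprint
    return (raw << k) | take, commission << k, fingerprint, bits_left - k

Starting from the fingerprint 0x4b5fb098aa9264d849ef263992295094 this reproduces the initial values in Figure 5: 62 bits are taken on board, giving raw = 1357813461343836470 and commission = 2^62.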

A note on range encoding and parsing

It is possible to parse a name to derive its corresponding fingerprint. Although parsing is not necessary for authenticating a name, the interaction of range encoding and parsing is explained here to clarify the range encoding algorithm. A parser that works in the way described here has been implemented, essentially by auto-generation of a YACC grammar from the word grammar but it is not included in the prototype software here.

For a particular word, given the value of the commission at the start of the word, the parse of the word will reveal both the commission at the end of the word and the reduction in raw value at the end of the word. However, the absolute raw value cannot be determined. (At this point a lower and upper bound for its value could be calculated but this is pointless, as it can be fully determined later anyway).

At the start of parsing, the initial commission at the start of the name is known, but the raw value (the fingerprint) is completely unknown. But at the end of the name the raw value must be exactly zero.

Therefore recovery of the original fingerprint requires first a forward traversal and then a reverse traversal of the complete name. The forward traversal parses the name word by word, rediscovering the commission at the end of each word and the change in raw value. The subsequent reverse traversal rediscovers the raw value at the start of each word by adding back the known decrease caused by the word to the known value at the end of the word.

During forward traversal the commission at the start of a word will be more than the value at the end of the previous word if more fingerprint bits are shifted in. The number of bits shifted depends on only the commission and the number of bits remaining in the fingerprint, both of which are known at every point.

On reaching the end of the name, the parsing range coder should check that the commission is exactly 1. This check will detect a name that might be syntactically correct under the CFG but still invalid. Such would be a name that could never have been generated by the range encoder, either because the encoder would have run out of commission before the end of the last word or because there is still residual commission (above 1) after the end of the last word. Conceptually, the CFG itself knows nothing of the weights and probabilities assigned in the PCFG. These can be thought of as instructions for a generator about choices it should make. These instructions are in addition to the rules of the CFG, so that the set of legal names is a subset of the language generated by the CFG.

On reaching the end of the name, the raw value is known to be 0, so by adding back the decrease in raw caused by the last word, the raw value at the start of the last word can be found. This can be repeated backwards along the name.

At the start of each word, the number of bits that had been shifted into raw from the fingerprint is known because it was discovered in the forward traversal. So the corresponding number of bits can be shifted out from the raw value to the fingerprint, the reverse of the bit shifting that happened during generation. In this way the fingerprint is regenerated during the reverse traversal of the name. At each word, the reduction in raw value is known from the forward traversal so this is added back to raw to get the raw value at the start of the word.

The “raising” rule in range encoding

In the first section grammar, the RHS of a rule may be prefixed with the special symbol ‘^’. This symbol is an instruction, when expanding the nonterminal on the LHS, to raise the current commission value to match the count of every possible string generated by the expansion. The raw value is not changed. If the commission at this point is greater than the count of possible strings, then name generation is aborted with an error.

The motivation for this construction is that it is expected to be used on the last word generated in a name, where the intent is to use all the remaining commission in that word. It is also expected that no choice rule under this nonterminal will be annotated with weights. These conditions mean that every value of raw maps to a single string generated by the expansion of this nonterminal.

In order to consume all the remaining commission every possible value of raw has to generate a unique string in the expansion. With the range encoding rules as defined in the section above, this might not happen unless the commission is equal to the number of strings because of the way that remainders are handled when scaling up the ranges in a choice node. Essentially the range encoding produces suboptimal allocation of the cases between the elements of a choice rule, not reflecting the true number of strings that can be generated by each element.
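
A short sketch of the raising step, reusing the hypothetical count_strings() from the earlier sketch: the commission is raised to the number of distinct strings the nonterminal can generate, raw is left unchanged, and generation is aborted if the commission already exceeds that count.

def raise_commission(commission, symbol, rules):
    total = count_strings(symbol, rules)        # every distinct string the expansion can produce
    if commission > total:
        raise ValueError("commission exceeds the string count: abort name generation")
    return total                                # the new commission; raw is not modified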

pseudoword = start middle end #;
start = { b "br" "cr" d f t p };
middle = { a e i o u };
end = t;
%%
name = pseudoword#;
%%
(1 2)
Figure 9. A simple grammar file flaw.gram where the range encoding algorithm described in the section above fails to produce a string for some of the 32 values of a 5-bit input, even though the grammar can generate 7 x 5 = 35 different strings. This happens because cases are misallocated between the initial letters in the choice rule start. The letter ‘b’ is allocated 8 cases and each of the other letters 4 cases, even though they can each handle exactly 5.
~/lepron # ./demogram flaw.gram 3 5
Ngrammar says  grammar is acyclic
ls 64bits 3
debug grammar nencode taking on board initial 5 bits
Input raw/commission 3 / 32
Output raw/commission 3 / 4; raw change = -0
bat  = 3.000 bits (0.750 bits/char)
----------------------------
error 49 residual commission after all generated is not 1
~/lepron #
Figure 10. Running demogram with the grammar in Figure 9 and command line arguments raw = 3 and a 5-bit commission, to demonstrate a failure in range encoding in the absence of a raising rule.

Figure 9 shows an example grammar where the nonterminal pseudoword can generate 35 different strings, yet range encoding fails to generate a string for 4 of the 32 cases in a 5-bit fingerprint. Essentially this is caused by suboptimal allocation of the cases between the elements of the earlier choice rules, because of the way remainders are handled when the commission is not a multiple of the total weight of the choice rule.

Figure 10 shows demogram running with the grammar file from Figure 9, a value of raw = 3 and 5 bits of commission. This value is one of the 4 values of raw that fail to generate a name using this grammar. The following is an explanation of the failure shown. In this case, when expanding the nonterminal start the commission is 32 but the range of start is only 7 (each of its 7 elements generates one string and so has a range of 1). The range encoding algorithm therefore scales up these counts by 4, the largest integer that does not make the scaled total exceed the commission. Their scaled total is then 28 and the excess of the commission is 32 - 28 = 4. This excess is assigned to the first element in the list, 'b', so its scaled range becomes 4 + 4 = 8. This means that raw values 0 - 7 will select 'b' as the first letter of the word.

In these 8 cases, range encoding then reaches the next nonterminal to be expanded, middle, with a commission value of 8. The total range of middle is only 5, since each of its choices has a count of 1, but it cannot be scaled up by more than 1 without exceeding 8. The excess, 8 - 5 = 3, is therefore added to the first element on the RHS, ‘a’, making its scaled range 4.

In expanding middle, raw values in the range 0 to 3 select ‘a’ as the second letter in the word and the commission is reduced to the range of ‘a’, which is 4. In this example raw = 3, so this is what happens.

Next, the expansion of end causes no change in commission or raw, as it is not a choice rule. After this, in Figure 10, demogram prints “Output raw/commission 3 / 4; raw change = -0” as the raw and commission values at the completion of the word. Since this is the only word in the grammar, the name is complete. The check at the end of name generation that all the commission has been used now fails, because the commission is 4, and demogram prints the error message “error 49 residual commission after all generated is not 1”.

Now changing the grammar so the first rule is instead a raising rule

pseudoword = ^start middle end #;

solves this problem: raising the commission to equal the number of strings generated by the rule means that at every subsequent choice rule the commission is a multiple of the total range of the rule, so the misallocation caused by remainders does not happen.

~/lepron # ./demogram raising.gram 3 5
Ngrammar says  grammar is acyclic
ls 64bits 3
debug grammar nencode taking on board initial 5 bits
Input raw/commission 3 / 32
Output raw/commission 0 / 1; raw change = -3
bot  = 5.000 bits (1.250 bits/char)
----------------------------
name size including spaces between words 4
name entropy = 5.000 bits (1.250 bits/char)
Generated name = [bot##]
Permuted name = [bot#]
Name generation successful.  The passed raw value generated the following name:
Bot
~/lepron #
Figure 11. Running demogram again with command line arguments raw = 3 and a 5-bit commission, but with the grammar from Figure 9 modified to make pseudoword into a “raising” rule.

In this example the commission is raised to 35 and at the expansion of start the range of each element on the RHS will be scaled up not by 4 but by 5, making 7 x 5 = 35. So regardless of which element is chosen on the RHS it will always have a range of 5. This commission then always matches the range of the next nonterminal to be expanded, middle.

The raising rule in effect pretends that the commission is more than it actually is. In this example it means that there are imaginary cases (raw = 32 to 34) that will actually never occur. This might affect the language model and reduce the entropy carried by the word but clearly cannot harm security.

Grammar ambiguity detection

As explained above, it is fundamental to the security concept that only one hash value maps to a particular name. The hash value determines the parse tree, so it is equally fundamental that only one parse tree can generate the name.

By definition, an ambiguous CFG has more than one parse tree for some string. In the general case, it is not always decidable whether a CFG is ambiguous. For Lepron, the prohibition on recursion and the restriction to a finite language should mean it is always decidable. Ambiguity detection is nevertheless implemented only for the particular name that is generated: the checking does not establish whether the CFG is globally unambiguous, but only whether there is more than one parse for the name being checked.

The weights in the PCFG are ignored during ambiguity checking.

The current implementation works as follows. The grammar is represented as a DAG where each node is either a terminal or a nonterminal in the grammar. An arc (or edge) from node1 to node2 means that node2 occurs on the RHS of the grammar rule defining node1. Conceptually each arc is labelled with the position of its destination node in the RHS. Terminals have incoming arcs but no outgoing arcs.

Ambiguity detection is implemented only for a particular name. It is simplified by the prohibition on null productions in the grammar. A node in the DAG can answer a query like the following: “Tell me the length of every parse you can generate that matches the current name starting at character position 3”.

The response is a set of lengths for the given starting position. There can be no duplicates in this set because if more than one parse of the same length is found, then this already means an ambiguity and checking stops. The response is memoised, meaning it is cached at this node. Any subsequent query to this node for the same starting position will be answered from the cache.

For a terminal node, the answer is simple. For example if the node represents letter 't' then the node looks at the requested character position in the name (in this example 3) and if it is a 't' then it returns a parse set with one member of length 1. Otherwise it returns an empty set.

For a nonterminal node the answer requires related queries to the nodes at the ends of its outgoing arcs. For example, given the query above, a sequence node would first pass the same query to the leftmost node of its RHS. If that node returned a set with a single length, say 4, then the node would add that length to the starting position and query the second leftmost node: “Tell me the length of every parse you can generate that matches the current name starting at character position 7”.
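
The following is a sketch of this memoised check, assuming the same hypothetical rule table as in the earlier sketches. Each call answers the query described above: the set of parse lengths a symbol can produce against the name from a given position, with a duplicate length treated as an ambiguity.

class Ambiguous(Exception):
    pass

def parse_lengths(symbol, pos, name, rules, memo):
    if (symbol, pos) in memo:
        return memo[(symbol, pos)]                          # answered from the cache at this node
    if symbol not in rules:                                 # a terminal matches a single character
        result = {1} if pos < len(name) and name[pos] == symbol else set()
    else:
        kind, rhs = rules[symbol]
        if kind == 'choice':
            result = set()
            for alternative in rhs:
                for length in parse_lengths(alternative, pos, name, rules, memo):
                    if length in result:
                        raise Ambiguous((symbol, pos))      # two parses of the same span
                    result.add(length)
        else:                                               # a sequence threads positions left to right
            result = {0}
            for part in rhs:
                extended = set()
                for done in result:
                    for length in parse_lengths(part, pos + done, name, rules, memo):
                        if done + length in extended:
                            raise Ambiguous((symbol, pos))
                        extended.add(done + length)
                result = extended
    memo[(symbol, pos)] = result
    return result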

Incidentally there is no obvious way of turning this implementation into a parser. At the time of generating a parse set, each node does not know which parse, if any, will be used in a final parse of the complete name. Then later, when a successful final parse is found at the sentential node, it is not obvious how to discover which parses were included in it from nodes further down the DAG. Fortunately parsing is not necessary for name authentication, since we start with the hash value and generate the name, rather than going back the other way from the name.

The grammar file second section

The second section of the grammar is the phrase-generating part, which orchestrates the selection and expansion of word nonterminals in the first section in a defined order. Decisions made by the second section are based only on the remaining commission.

The total commission as defined above is 2^128 at the start of generation of a name. This is so because this prototype design uses 128 bits for the key fingerprint, derived from a truncated hash of the key file. Now, an ngram-based word generated by a PCFG as described above may use a variable amount of commission. This means that a variable number of words may be needed in a name to use all the commission.

In addition, different kinds of words might be wanted, generated by different nonterminals in the first section. For example, when most of the commission has already been used then it might be preferable to generate a short final word rather than a long one. If there is a particular nonterminal in the first section for generating a short word, then the second section can select this at the appropriate point.

The function of the second section of the grammar file is to make these decisions and its choices are controlled only by the remaining commission.

The second section may contain sequence rules identical in syntax and meaning to those in the first section. But instead of the first section's choice rules it has commission rules. A commission rule has double curly brackets around its RHS.

maybemore3 = {{ #() firstname_eow() more3 }};

The syntax is that every element within the RHS must be followed by round brackets except the last which must not have brackets. The brackets may optionally contain a weight. The default weight assigned to empty brackets is the count of all possible strings that could be generated when that symbol is expanded.

For example “#()” is identical in meaning to “#(1)” because the terminal # represents one string.

A choice is made at a commission rule, but it is a choice based on the current value of the commission rather than the current value of raw. The choice is to expand the symbol with the lowest weight that is not lower than the current value of the commission. If all of the weights are lower than the commission, then the rightmost symbol is expanded. This rightmost symbol, without brackets, is considered not to have a defined weight within this rule (regardless of whether the same nonterminal has an implicit or defined weight elsewhere).

In the “maybemore3” example above, if the commission had been reduced to 1 (ie already all used) then the '#' symbol would be chosen (ending the name) rather than a nonterminal that generates another word.
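
A small sketch of this selection rule: pick the element with the lowest weight that is still at least the current commission, falling back to the final unweighted element. (The weights here are the explicit or implicit values described above; the function name is illustrative.)

def choose_by_commission(commission, weighted, final_symbol):
    # weighted: list of (symbol, weight) pairs for every bracketed element on the RHS
    candidates = [(weight, symbol) for symbol, weight in weighted if weight >= commission]
    if not candidates:
        return final_symbol                  # all weights are lower than the commission
    return min(candidates)[1]                # the lowest qualifying weight wins

# In the maybemore3 rule above, a commission of 1 selects '#' (weight 1) regardless of the
# (implicit) weight of firstname_eow, ending the name.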

Grammar ambiguities in the second section do not matter

This section will prove that it is not possible for the second section grammar to generate the same name from two different key fingerprint values f1 and f2, regardless of its own parse ambiguities.

The argument essentially says that the generation path through section 2 can diverge for f1 and f2 only once the partially generated f1 and f2 names are already irreconcilably different.

Recall that a word nonterminal defined in the first grammar section and used in the second section must:

  1. generate exactly one word when it is expanded and
  2. have only one parse tree for each different word

The argument considers a name generation path through the grammar of the second section. Along this path, decision points - forks in the path - are only at the commission rules. And in a commission rule the choice of path depends only on the current value of commission.

Now either the paths through the second section grammar are the same for f1 and f2, in which case the second section cannot introduce any ambiguity, or, if they are different, the first point where they diverge must be at a commission node. When their paths diverge, the commission at this point must be different for the f1 case and the f2 case. At this point the same number of words must have been generated in both cases, because the first section is constrained to generate one word for each nonterminal expansion and so far section 2 has specified the same sequence of nonterminal expansions.

Now if the commission is different at this point between the f1 and the f2 cases, it follows that at least one of the words already generated must be different between the f1 and f2 cases. We know this because the section 1 PCFG cannot use a different amount of commission unless it has generated a different parse tree and a different parse tree must generate a different word because the section 1 PCFG is unambiguous.

Now since both partially generated names have the same number of words, neither of them is a prefix of the other, and it is not possible to make them the same by appending more words. They are irreconcilably different.

So both in the case where the paths diverge and the case where they do not, it is not possible to generate the same name for two different fingerprints.

The grammar file third section

As noted the purpose of the third section is to specify a permutation of the words in the name from the order in which they were generated to the order they appear in the name. An empty permutation list is not legal syntax in the current implementation, but it is always possible to have a list that has no effect, for example:

(1 2)

This is a list with only one permutation. That permutation applies only to names with two words, and it keeps the words in the same order, so it effectively does nothing.

The design motivation for permutations can now be explained. The words are generated by the first two sections with monotonically declining value of commission and the last word generated might be chosen to be a short word, as previously noted, to match a low remaining commission, without making the name longer than it needs to be. But we might not want the short word as the last word in the name. So the third section of the grammar file provides a way to move it to a different position.

Note that the permutation that is in effect depends solely on the number of words in the name, so it is unambiguous and can always be deterministically reversed.

The meaning of probability in a PCFG

In NLP ambiguous CFGs are sometimes used with probabilities assigned to different parses of the same string. In contrast, the CFG here is unambiguous. It has probabilities for different productions during generation of the parse tree, but it is deterministic in that each string in its language has only one parse tree.

A trigram model in a PCFG

The trigram letter model can be thought of as a state machine where the state is determined by the last two letters emitted. The probability of emitting a particular letter next depends only on the current state and the letter.

This can be modelled in a PCFG with a nonterminal corresponding to each state [JOHNSON]. In this implementation, a pair of nonterminals rather than a single nonterminal is used for each state, because of the restriction that every nonterminal must be defined by either a choice rule or a sequence rule.

Lepron does not allow recursion in the definition of a nonterminal, which means every character position in the word requires its own set of states. If there are 26 letters, then there are potentially up to 26 x 26 states at each character position. In fact, in this implementation there are fewer than this because letter combinations that never occur are not included in the grammar.

Figure 12 shows an example of a parse tree. A nonterminal representing a state is named after the last two letters emitted, with a trailing number that increases with character position in the word.


    pseudoword
        |
       _d1
       /  \
     'd'  d1
           |
         _do1
         /  \
       'o'  do1
             |
           _ov2
           /  \
         'v'  ov2
               |
             _ve3
             /  \
           'e'  ve3
                 |
               _es4
               /  \
             's'  es4
                   |
                  '#'
Figure 12. A parse tree of the string “doves#” in a PCFG for a trigram letter model of English.

A list of grammar rules used in this parse tree is shown in Figure 13. (These are from the grammar file trigram.gram, edited to delete some of the nonterminals not used in this example.)

pseudoword = {_a1(3610) _b1(3822) _c1(6523) _d1(4321)};
d1 = {_da1(277) _de1(1568) _dh1(5) _di1(1421) _dj1(3) _do1(429) _dr1(346)};
_d1 = d d1;
_do1 = o do1;
_ov2 = v ov2;
_ve3 = e ve3;
_es4 = s es4;
do1 = { _ol2(67) _om2(81) _on2(63) _oo2(61) _op2(18) _or2(66) _ov2(15)};
ov2 = {_va3(45) _ve3(632) _vi3(81) _vo3(15) _vr3(1) _vu3(5) _vv3(1)};
ve3 = {_ea4(7) _ec4(12) _ed4(6) _es4(80) _et4(46) _ew4(2) _ex4(11)};
es4 = {#(6440) _sa5(46) _sb5(10) _sc5(262)};
Figure 13. A grammar fragment that can produce the parse tree in Figure 12.

The nonterminals “_do1” and “do1” are an example of a pair representing a state. The nonterminal “_do1” can be thought of as a transition which emits 'o' and then enters state “do1”. This nonterminal/state is labelled “do1” because the last two letters emitted were ‘d’ then ‘o’.

The do1 choice nonterminal models what happens next. Since we know we have just emitted an ‘o’, all the states immediately following have labels starting with “o”. The weight on each of them in the grammar corresponds to its probability, so in this case _ov2(15) means there was a 15/371 chance of emitting ‘v’ as the next letter after this “do”.

A trigram letter model of English was introduced in a previous article [TRIGRAM], which included software for training the model. The trigram-related part of the example grammar file is derived from this with some minor changes. The rules are auto-generated, although the auto-generation code is not included here.
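
Since that code is not included, the following is only a rough sketch of how rules in the style of the fragment above might be produced from trigram counts. It assumes counts[(a, b)][c] holds how often letter c followed the pair (a, b) in training data, with '#' marking end of word; the handling of the first one or two character positions (the single-letter states and the top-level pseudoword rule) is omitted.

def trigram_rules(counts, max_position=14):
    lines = []
    for position in range(1, max_position):
        for (a, b), nexts in counts.items():
            state = f"{a}{b}{position}"
            lines.append(f"_{state} = {b} {state};")        # the transition: emit b, then enter the state
            choices = " ".join(
                f"#({n})" if c == '#' else f"_{b}{c}{position + 1}({n})"
                for c, n in sorted(nexts.items())
            )
            lines.append(f"{state} = {{{choices}}};")       # a weighted choice over possible next letters
    return "\n".join(lines)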

Further work

There is scope for further work both on this implementation and on the concept itself. The following is an incomplete list.

Although this implementation can authenticate the contents of the key file, there are no tools yet to extract and use the public key. The key can instead be extracted with a text editor.

Security and real world hash functions

In the real world, random oracles [ROM] do not exist. Further work is required to investigate what proven security guarantees can be made using real-world hash functions and what hash functions and constructions using them would be required.

In the prototype there are two uses of a hash function, one that generates a digest of the grammar file, which is then included in the key file, and one that generates a digest of the key file, which is then used to generate the name. In this implementation they are the same function and other possibilities remain to be investigated.

For example, the hash function digesting the grammar file could perhaps be a provably-secure number theoretic construction. Even if the digest value was large, this might not matter because it is going to be hashed again by the second function acting on the key file.

The hash construction used in this prototype for both the grammar file and the key file is as follows:

hash value(x) = RIPEMD160('L'|RIPEMD160(x)|x)

Currently this arrangement lacks a formal analysis or justification. Further work is required to find a formally-motivated construction.
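
As a concrete reference point, the construction above can be written in a few lines of Python using hashlib, assuming the local OpenSSL build exposes RIPEMD-160 (on some systems it is only available via the legacy provider). This mirrors the formula and the 128-bit truncation seen in the namekey output; it is not code taken from the Lepron sources.

import hashlib

def lepron_hash(data: bytes) -> bytes:
    inner = hashlib.new('ripemd160', data).digest()
    return hashlib.new('ripemd160', b'L' + inner + data).digest()

def key_fingerprint(key_file: bytes) -> bytes:
    return lepron_hash(key_file)[:16]        # truncated to 128 bits, as shown by namekey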

As an example of potential dangers, the theoretical “herding” or “Nostradamus” attack on MD (Merkle-Damgård) hashes described in [HERD] would be directly applicable if Lepron used a simple MD hash with a digest narrow enough for the adversary to find collisions. The weakness of an MD hash is that once two different files produce the same internal hash value (ie collide) they will remain in collision if they are extended by the same block appended to each of them. The so-called “MD strengthening” design appends a special last block, containing the length of the pre-image, to the pre-image. This removes a collision in the final digest only if the two files have different lengths; in this attack the files can be kept the same length.

The paper describes a prepared set of hash values as a “diamond” but “funnel” seems more appropriate. The mouth of the funnel is 2^k wide and, once an intermediate hash value collides with any of its values, a specific sequence of blocks that the adversary prepared earlier and now appends will funnel it to the same final hash image.

In the case of Lepron, the adversary could prepare such a funnel (say as the last part of the grammar file) and then, knowing the final hash digest, put that value into the key file and calculate the resulting key fingerprint, then write a PCFG tailored to map that fingerprint to the target name. The tailored PCFG would be the “prefix” of the paper. There is then some additional work to join the prefix to the mouth of the funnel as described in the paper. In fact the same funnel could be used to attack multiple target names by writing multiple PCFGs with only the additional work of joining each to the funnel.

Additional fields in the key file

The following proposed additions to the key file are motivated by increased security. As discussed in [GLOBALNAMES], with n = 128 an adversary may have only a 1/2^128 chance of finding a collision with a specific name on each hashing operation, but if there were (say) 2^30, a billion or so, names in existence then finding a collision with any one of them would have a probability of 1/2^98 for each hashing operation (by the union bound, 2^30 x 1/2^128 = 1/2^98). With a powerful potential adversary, this might not have as good a margin of safety as we would like.

The following are possible additional fields in the key file.

With the exception of “key usage” the motivation for these fields is that if most people or organisations usually write something then an attacker either has to leave them blank, arousing suspicion, or fill them in, drastically reducing the number of potential collision targets for which the key file is plausible. The idea is analogous to salting a password before hashing.

International support

The system needs to support UTF-8 for writing names in languages other than English. In the current implementation, the alphabet is restricted to ASCII a-z. Many languages using the Roman alphabet need diacritics. Another approach to consider, for these languages only, is simply that the diacritics are removed when the name is authenticated. Names that differ only in diacritics would be considered to be the same name.

Unicode will potentially introduce the well-known problem of ambiguities caused by different Unicode characters having the same or similar appearance. The problem is that what looks like the same name could then represent multiple possible hash values, breaking the security principle defined above, that a name represents only one hash value. It is already a known problem [UNICODE] and a solution will have to be decided.

The current implementation is bound to the concept of words. Further work is needed to understand whether this can work with Chinese and languages which perhaps do not have a pronounced concept of words. Alternatively it might be possible to remove the concept of words from the system and leave it to be an artefact of particular grammars.

Range encoding revisited

The current implementation treats range coding in a weighted and unweighted CFG in the same way. The only difference is that before range coding starts, weights are assigned to the unweighted CFG based on the string count under each nonterminal, whereas weights are already explicitly marked in the PCFG. During range coding the coder is oblivious to whether the grammar was originally weighted or not. However, there might be advantages in treating the two cases differently during range coding.

This might remove the need for the “raising” rule. It might also allow the PCFG to use the arithmetic encoding of [WNC] or related techniques, which might allow it to be faster and require less arithmetic precision.
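For reference, the weight-assignment step described above for an unweighted CFG, counting the strings under each nonterminal, could be sketched as follows. The representation and names are invented for this example, and the grammar is assumed to be unambiguous and non-recursive so that the counts are finite; the count of each alternative would then become its weight.

// Illustrative sketch: derive weights for an unweighted CFG by counting the
// number of distinct strings each nonterminal can generate. Not Lepron code.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Each alternative is a sequence of symbols; a symbol starting with '_' is a
// nonterminal reference, anything else is a terminal.
using Alternative = std::vector<std::string>;
using Grammar = std::map<std::string, std::vector<Alternative>>;

uint64_t count_strings(const Grammar& g, const std::string& symbol,
                       std::map<std::string, uint64_t>& memo) {
    auto rule = g.find(symbol);
    if (rule == g.end()) return 1;            // terminal: exactly one string
    auto m = memo.find(symbol);
    if (m != memo.end()) return m->second;

    uint64_t total = 0;
    for (const Alternative& alt : rule->second) {
        uint64_t product = 1;                 // strings from one alternative
        for (const std::string& s : alt) product *= count_strings(g, s, memo);
        total += product;                     // alternatives are disjoint choices
    }
    memo[symbol] = total;
    return total;
}

int main() {
    // A tiny invented grammar: _word -> _onset _vowel, with 3 onsets and 2 vowels.
    Grammar g = {
        {"_word",  {{"_onset", "_vowel"}}},
        {"_onset", {{"b"}, {"d"}, {"st"}}},
        {"_vowel", {{"a"}, {"o"}}}};
    std::map<std::string, uint64_t> memo;
    // Prints 6: the single alternative of _word contributes 3 x 2 strings.
    std::cout << count_strings(g, "_word", memo) << "\n";
    return 0;
}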

Language models

There is scope for further work on the language models. A more linguistically motivated model than letter trigrams could be investigated, such as one based on theories about syllables. This would yield information about whether a PCFG is a good formalism for expressing these models.

There is scope to investigate improvements to the trigram model too. For example the English word grammar in this implementation is based on dictionary frequencies rather than corpus frequencies and whether this is the best choice is unexplored.

New languages can be tried. Initial experiments with German in the trigram model did not seem impressive. If this is confirmed, it might be because the trigram model does not train well when it is oblivious to boundaries in compound words.

This implementation includes small hand-written grammars for short words, one in Italian and one in English. These were developed by trial and error and stepwise improvement. The approach was to write some initial simple rules, generate random words, investigate the words that looked wrong and add more detailed rules to exclude them. The potential of this trial-and-error method has not been fully explored. There might be scope to develop tools that assist with this.

For personal names, it may be that a grammar different from the one for general pseudowords would work, without having to resort to a list of names. This would have the benefit of higher entropy for the name.

A variable number of bits

In the current implementation, there are exactly 2^128 possible fingerprint values. An alternative would be to specify a minimum. An ngram-based word generated by a PCFG as described above takes a variable amount of commission and therefore uses a variable number of bits. For this reason the grammars in this implementation are designed to allow a variable number of words and to make the last generated word a non-ngram-based one that has enough cases to always use up all the remaining commission.

However, an alternative might be to allow the last word instead to be an ngram-based word and to continue to its unpredictable end, taking more bits from the hash value as needed, even when the resulting name uses more than 128 bits. So a name might use exactly 128 bits from the hash or it might use more.

A disadvantage might be that the names on average would be longer. It remains to be investigated whether a system with optional use of more than 128 bits would have more security than one with a fixed 128 bits.

Free open source GPL software to download

The complete source licensed under GPLv3 is available to download. It is experimental software that has been only lightly tested. It is intended for experimenting with names, grammars and for further development. It comes with absolutely no warranty.

It is available in the zip file download/lepron-r3.zip. This file has SHA256 hash 0bece19460ce683a27444bfa5177f6a792830e08d1d83972139c029623e64a2e. The hash of each individual file is listed in the included file sha256sum.txt, which itself has hash 0b890e62d2d082abab2a216abfe006ae9dd3fa6a103d9dc799674105924a263c.

The source code builds seven separate command-line applications: namekey, newkeyfile, miner, countri, pring, demogram and verigram. On GNU/Linux they are all built by typing “make” at the command line, which uses the supplied makefile.

The main application is namekey, which generates the authentic name, given a key file. The others support this function with actions that include creating a new key file, creating a new name, creating new grammar rules, and exercising a grammar file.

They have no dependencies on external libraries.

demogram

This is for experimenting or testing a grammar or partial grammar. It behaves like verigram (qv) except that it accepts a number specifying the number of bits for the commission (max 62). This is useful for testing the grammar rules for a single word. Figure 7 shows an example of its use.

namekey

It requires two file names on the command line, first the key file and second the corresponding grammar file. It checks the hash digest and length of the passed grammar file against what is specified in the key file. It then generates the name using the hash digest of the key file as numeric input to the PCFG in the grammar file. Figure 5 shows namekey generating a name from the example files in this implementation.

newkeyfile

A simple utility for convenience in creating a new key file. You pass it the file name of the proposed grammar on the command line and it prints out a corresponding key file initialised with the hash digest and length of that grammar file and a long nonce of all zeros. In this implementation, the public key included is my key but the key (and everything else) can be changed by editing the file.
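An illustrative invocation, redirecting the printed key file so that it can be used with the other tools (file names as in the miner example below), might be:

~/lepron # ./newkeyfile trigram.gram > new.key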

miner

This is for mining new names. It is essentially the same application as namekey with some simple changes to make it repeatedly generate names, checking each one quickly to see whether it is one that is wanted. A limitation in this implementation is that the target name has to be hard coded by editing the source. This requires calculating the range of numeric values of hash digest that correspond to the desired words in the name. The current implementation is hard coded to search for names of the form shown in Figure 2.

Apart from this it has minimal changes from namekey. The changes are

One of its limitations is that, while rewriting the nonce, it assumes the nonce area in the file is big enough; if it is not, it will write past the end of the nonce, overwriting the next part of the key file. Its source code is the same as namekey except that grampy.cc is replaced by grampy2.cc and namekey.cc by miner.cc.

For the first word in the name, it starts to check whether there is a match with the target word as soon as the hash digest is calculated, and aborts immediately if there is not. But if there is a match it decodes the rest of the hash against the grammar in the same way as namekey and then checks the last word generated.

The names in Figure 2 correspond to the following nonces in order: 3603410530003, 40610284855567000000, 34987502047000000000000, 33438895027000000000000, 1994143800002, 37415798817000000000000, 40525395097070000000, 40831628885270000000, 5463685230003, 3201406230003, 40515059107870000000, 13594615587000000000000, 40484207054870000000, 40216141010570000000, 40714831601700000000, 5171143290002, 3038540540003.

Some of these were found by slightly different versions of miner. The version here produces nonces with the last digit 7.

The nonce produced by miner is a constantly incrementing count. The command line option -p allows a prefix of a fixed sequence of digits to be prepended to the count. The intent of this is to allow multiple instances of miner to be run in parallel without any chance of them duplicating nonce values. A multi-core CPU could then be fully utilised by running a miner instance for each core. The different instances are given different values for the prefix so an invocation might be for example:

~/lepron # ./miner -p 406 new.key trigram.gram > found.log

pring

This prints out an ngram-style grammar which can be used as part of a grammar file as understood by namekey, verigram and demogram. All the probability information for the ngrams comes from the header file 3counts.gh, which is included at compile time. When run, it takes no arguments and always prints the same thing; to get something different you have to recompile it with a different header file.

Starting with just a list of 1,000 Italian surnames found on the internet, the makefile uses countri and pring to create italian.gram, the ngram-based Italian grammar which generated the words in Figure 3. The header file 3counts.gh is generated by countri by counting trigrams in the original list of Italian names. This header file is then compiled into pring and pring then prints out a set of grammar rules that is included in the first section of a generated grammar file, italian.gram. (Note that italian.gram also includes a hand-written Italian word grammar from grammar.tail. This is used to generate short words that mop up low values of remaining commission after the trigram generation. It is responsible for the words such as “di”.)
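As an illustration of the kind of counting involved (this sketch is not the countri source; word-start contexts are omitted for brevity and the names are stand-ins), the following counts, for each two-letter context in a word list, how often each next letter or the end-of-word marker '#' seen in Figure 12 follows it. The raw counts become the weights of the generated rules.

// Illustrative sketch only: count letter trigrams in a word list, i.e. for
// each two-letter context, how often each next letter (or the end marker '#')
// follows it. Word-start contexts are ignored in this simplification.
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Stand-in for the list of surnames read by countri.
    std::vector<std::string> names = {"rossi", "russo", "ferrari", "esposito"};

    // counts[context][next]: context is the previous two letters.
    std::map<std::string, std::map<char, unsigned>> counts;
    for (const std::string& w : names) {
        std::string padded = w + "#";            // mark the end of the word
        for (std::size_t i = 2; i < padded.size(); ++i)
            counts[padded.substr(i - 2, 2)][padded[i]]++;
    }

    // Print the raw counts, which would become rule weights.
    for (const auto& ctx : counts)
        for (const auto& nxt : ctx.second)
            std::cout << ctx.first << " -> " << nxt.first
                      << " : " << nxt.second << "\n";
    return 0;
}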

verigram

This is for experimenting or testing a grammar or partial grammar. It behaves like namekey except that it does not use a key file and instead accepts a value representing the hash digest of a key file. This value is passed on the command line as two numbers, representing the most significant and least significant 64 bits.

Terminology

CFG
Context free grammar
Commission
Defined in this article to be the remaining number of fingerprint values that are possible at any particular point in range-encoding
DAG
Directed acyclic graph
DNS
Domain name system
Fingerprint
A cryptographic hash or digest of a public key and related information
GMP
GNU Multiple Precision Arithmetic Library
GNS
GNU name system
GPL
GNU General Public License
LHS
Left hand side
MD
Merkle-Damgård
NLP
Natural language processing
PCFG
Probabilistic context free grammar
PKI
Public key infrastructure
RHS
Right hand side
RSA
Rivest Shamir Adleman
SDSI
Simple distributed security infrastructure - from the title of a document by Rivest and Lampson, see [SDSI].
YACC
Yet another compiler compiler

References

[CAREFUL]

Careful with Composition: Limitations of Indifferentiability and Universal Composability, Thomas Ristenpart, Hovav Shacham and Thomas Shrimpton, 2011.

In December 2023 a PDF file was available on eprint.iacr.org linked from page https://eprint.iacr.org/2011/339 with SHA256 hash 40b3adb366e29673cf2b2ec8b1dd0984ad51e6fc51dc7fce376f483b569b8f57.

[GLOBALNAMES]

An idea for public key authentication from a name without certificates or central authority, Stephen Hewitt, Cambridge Clarion, 14 May 2021

[GNS]

A Censorship-Resistant, Privacy-Enhancing and Fully Decentralized Name System, Matthias Wachs, Martin Schanzenbach and Christian Grothoff, 13th International Conference on Cryptology and Network Security (CANS), 2014

In August 2023 a PDF file was at https://git.gnunet.org/bibliography.git/plain/docs/gns2014wachs.pdf with SHA256 hash 007d9afde4ad7b640eea79650847500bbb21396be229b1a4fa24a753e2bc901e.

In December 2023 an abstract was available at https://link.springer.com/chapter/10.1007/978-3-319-12280-9_9

[HERDING]

Herding Hash Functions and the Nostradamus Attack, John Kelsey, Tadayoshi Kohno, 2006

In August 2023 a PDF file was available at https://eprint.iacr.org/2005/281.pdf with SHA256 hash 10cba27b9a6cc62dfb55000a64432d9bf39635255dcf81dd9289d15c8cd73d9d.

[JOHNSON]

Probabilistic Context-Free Grammars and Beyond, Mark Johnson, 2007, slide 37/87

In November 2023 a PDF file of slides was available on web.science.mq.edu.au, linked from page https://web.science.mq.edu.au/~mjohnson/Talks.htm, with SHA256 hash 85ce0ddeaa4c51b3aec67b713111b7c1ee4386cb79dae5fbc54ec833eca79799.

[LEPRON]

Lepron, a project to develop pronounceable pseudowords for representing binary strings - Part 1: introduction and goals, Stephen Hewitt, Cambridge Clarion, 2 July 2021

[REVISITED]

The Random Oracle Methodology, Revisited, Ran Canetti, Oded Goldreich, Shai Halevi

In 2023 there were various versions of this paper freely available on the web, dated with different years. The earliest date was 1998.

A version including two dates, 2000 and 2008, was in a PDF file at https://arxiv.org/pdf/cs/0010019.pdf, linked from page https://arxiv.org/abs/cs/0010019 and with SHA256 hash ac238abae217b63a0924fa9ae6407c86e2856cf1eab2e0cb0891014809a75dee.

[RFC9498]

The GNU Name System, M. Schanzenbach, C. Grothoff, B. Fix, November 2023

In December 2023 a text file was available at https://www.rfc-editor.org/rfc/rfc9498.txt with SHA256 hash 6c640e3705df9284eabddeb45f811ee5e56ec28f80f12ccc599d8c7c895ec799.

[ROM]

Random oracles are practical: a paradigm for designing efficient protocols, Mihir Bellare and Phillip Rogaway, December 1993

In December 2023 a PDF file was at https://dl.acm.org/doi/pdf/10.1145/168588.168596, linked from page https://dl.acm.org/doi/10.1145/168588.168596 with SHA256 hash 61f2224f2af5bfaaa586a6be0d17b9b2164410506e224b692bf7312ad9e0afdc.

[SDSI]

SDSI - A Simple Distributed Security Infrastructure, Ronald L. Rivest and Butler Lampson, dated 1996. “This paper is a working document defining SDSI version 1.1.”

In August 2023 a PDF file was at https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/acrobat-1.pdf with SHA256 hash e7c3fa9a0a868f80062f2bd3d37186caff60955677fea0ad605e06edbee33609.

[TRIGRAM]

Lepron project part 4: pseudowords with entropy of 3.5 bits/letter from trigrams, Stephen Hewitt

[UNICODE]

Unicode Technical Report #36 Unicode Security Considerations, Editors Mark Davis, Michel Suignard, 19 September 2014.

In March 2024 this was at https://www.unicode.org/reports/tr36/

[WNC]

Arithmetic coding for data compression, Ian Witten, Radford Neal, and John Cleary, Communications of the ACM, June 1987

In March 2024 a PDF file was linked from https://dl.acm.org/doi/10.1145/214762.214771 with SHA256 hash 1c52a8ca5af56efbbaecf10f8fde0a1e9da03ed58f0f3685d2d6140803d2f149.
