Empirical explorations of faster Fermat factorisation, part 3
by Stephen Hewitt  Published  Last updated
Introduction
This is the third in a series of technical articles presenting original work towards an opensource, free software implementation of the Fermat factorisation algorithm optimised for speed.
See the Related section below for the other articles in the series.
The first articles introduced the Fermat algorithm, some terminology and a simple approach of of trying every alternate integer as a big presquare. Part 2 reported that the publicly available software called YAFU was over 330 times as fast as the software based on this simple approach.
This article analyses a filtering algorithm used by YAFU that contributes to its speed. It reports experiments on this filtering algorithm. It then explores how discoveries from these experiments can be applied to optimise the speed of Fermat factorisation where this filtering is used.
It demonstrates this experimentally by returning to YAFU. For a chosen example of a number to factorise, factorisation time was reduced by a factor of more than 4 by changing only a single constant in the YAFU code.
A maths paper available on the web in 2022 stated:
Given the increasing power of computers coupled with symbolic mathematics packages such as Sage, there has been a growing interesting in analyzing empirical data to formulate conjectures. For instance, the journal Experimental Mathematics is devoted to this approach. This paper gives two examples that show statistical thinking can generate fruitful mathematical leads that led to theorems.
[BILISOLY]
The empirical approach here means trying to discover patterns from experimental data, rather than from mathematical reasoning, at least initially. This article reports the resulting conjectures or empirical rules without attempting to prove them.
For the purposes of practical fast factorisation, an alternative to proof may be to verify the conjectures exhaustively for the full range over which they will be used.
Terminology
 big square
 Successful factorisation finds b^{2}  s^{2} = N where b^{2} is called here the big square. See Part 1.
 GMP
 GNU Multiple Precision Arithmetic Library
 LSB
 least significant bit
 LSD
 least significant digit
 N
 The number to factorise, of the form pq where p and q are large primes; possibly the modulus of an RSA public key.
 presquare
 The root x of any square x^{2} will be called the presquare.
 RR
 reduction ratio, defined in this article.
 RSA
 Rivest Shamir Adleman
 small square
 Successful factorisation finds b^{2}  s^{2} = N where s^{2} is called here the small square. See Part 1.
 YAFU
 Yet Another Factorisation Utility. See Part 2.
The basic idea, filtering out impossible integers
We know that there are patterns of LSBs that cannot be those of a square. For example, the optimisation of Part 2 relied on the mathematical idea that bit 1 cannot be set in a square. But there are other bit patterns that can be ruled out as the LSBs of a square if we consider more bits. In fact the optimisation of Part 2 will be shown to be just a special case where the number of bits is restricted to 2.
In the GMP Library manual's description of its algorithm for identifying perfect squares is the following:
A significant fraction of nonsquares can be quickly identified by checking whether the input is a quadratic residue modulo small integers. mpz_perfect_square_p first tests the input mod 256, which means just examining the low byte. Only 44 different values occur for squares mod 256, so 82.8% of inputs can be immediately identified as nonsquares. On a 32bit system similar tests are done mod 9, 5, 7, 13 and 17, for a total 99.25% of inputs identified as nonsquares.
[GMP]
The basic idea is to use the same kind of filtering but earlier in the process. Rather than, in effect, generating the next big presquare, squaring it, subtracting N, and only then finding the result rejected as a square on the grounds stated above by GMP, we can perform some preprocessing before the main search loop in order to avoid the work of generating that big square at all.
This is possible because for a given N, any k LSBs of the small square candidate depend only on the k LSBs of the large presquare. This follows from the fact that
 The k LSBs of the candidate for the small square are determined entirely by subtracting the k LSBs of N from the k LSBs of the big square.
 In turn, the k LSBs of the big square are determined entirely by the k LSBs of the big presquare, as in the lemma of Part 2.
This means that for any given N there is a fixed mapping from the k LSBs of the big presquare to the k LSBs of the small square candidate.
This fixed mapping means that if we can identify a pattern of k LSBs in a small square candidate that cannot be the LSBs of a square then we can rule out the corresponding LSB pattern in the big presquare that generated it. And this can be done once before the search starts and then used repeatedly in the search every time a potential big presquare has the same LSB pattern.
The explanation above is written in terms of bits and LSBs but clearly the arguments extend to other bases besides base 2, as the GMP manual implies.
Least significant digits (LSDs)
Rather than referring to “quadratic residues modulo small integers” [GMP], this article will often refer to the least significant digits of a square expressed in a base of a small integer. Of course these ideas are equivalent.
The Reduction Ratio
Let's define the reduction ratio (RR) of a filter as the ratio of the input range of a filter to the number of values that pass the filter. So to take the example from the GMP quote above, “Only 44 different values occur for squares mod 256”, this would be a reduction ratio of 256/44 or ~5.818.
In the experiments here, the object is to filter big presquare candidates and reduction ratios here implicitly refer to filtering these.
The motivating question is: what proportion of big presquare integers can the search skip over before it has to generate a small square candidate and test whether it is square? The higher the reduction ratio, the fewer integers the search loop will have to test. Against this needs to be weighed the cost of deriving the k LSDs of a big presquare candidate to use as input to a filter.
The Reduction Ratio Series
The reduction ratio series is defined as the sequence of reduction ratios generated in a given base as the the number of LSDs considered in the filter is incremented from 1.
Experimental Method
The experiments here investigate the relationships between the base and the number of LSDs, which together define the input to a big presquare filter, and its consequent reduction ratio.
I generated 1,000 experimental N values of the form pq by multiplying random large primes together. (Later I found that using numbers of the of the form pq was not necessary for the discoveries made, as I will show).
Initial experiments were in base 2 and used bits. For each N value the first experiment iterated over all possible bit patterns for the (for example 10) LSBs of a big presquare candidate and tested whether the resulting small square candidate could be proven not to be a square by inspecting only its k LSBs. where k <= 10. Note there is no significance to the choice of 10 bits here, except it sets an upper limit on k. There are 10 bits shown in the example output below, but some experiments used more than 20 bits.
A small square candidate can be rejected if its k LSBs alone prove it cannot be a square. The patterns in the LSBs of a big presquare resulting in a small square candidate rejected at k bits were counted as stop patterns at k bits.
For each N the programme printed a single line. For each value of k, the line showed the number of extra stop patterns discovered compared to the previous value of k, as the following example illustrates:
~/work # ./bits5.exe 10 debug pattern_mask = 1023, patterns_n = 1024 ============ number of bits in pattern = 10 total pattern space = 1024 bits=1: 0; bits=2: 512; bits=3: 256; bits=4: 0; bits=5: 0; bits=6: 0; bits=7: 0; bits=8: 0; bits=9: 0; bits=10: 0; bits=1: 0; bits=2: 512; bits=3: 0; bits=4: 256; bits=5: 0; bits=6: 128; bits=7: 32; bits=8: 32; bits=9: 8; bits=10: 8; bits=1: 0; bits=2: 512; bits=3: 0; bits=4: 256; bits=5: 128; bits=6: 0; bits=7: 0; bits=8: 0; bits=9: 0; bits=10: 0; bits=1: 0; bits=2: 512; bits=3: 256; bits=4: 0; bits=5: 0; bits=6: 0; bits=7: 0; bits=8: 0; bits=9: 0; bits=10: 0; ....
The first line here means that considering one bit (ie k = 1), there were no stop patterns. Considering two bits resulted in 512 stop patterns (half the pattern space). Considering three bits resulted in 256 additional stop patterns compared to considering two bits. Considering 4 bits resulted in no additional stop patterns, and so on.
Simple analysis with Unix tools sort and uniq showed immediately that the 1,000 randomlygenerated experimental N values produced only three unique series, as follows:
~/work # ./bits5.exe 10sort uniq c 1 debug pattern_mask = 1023, patterns_n = 1024 1 ============ number of bits in pattern = 10 245 bits=1: 0; bits=2: 512; bits=3: 0; bits=4: 256; bits=5: 0; bits=6: 128; bits=7: 32; bits=8: 32; bits=9: 8; bits=10: 8; 247 bits=1: 0; bits=2: 512; bits=3: 0; bits=4: 256; bits=5: 128; bits=6: 0; bits=7: 0; bits=8: 0; bits=9: 0; bits=10: 0; 508 bits=1: 0; bits=2: 512; bits=3: 256; bits=4: 0; bits=5: 0; bits=6: 0; bits=7: 0; bits=8: 0; bits=9: 0; bits=10: 0; 1 total pattern space = 1024 ~/work #
The relative frequency of these series also gave a clue to the bit patterns likely to be determining their membership. Approximately half the patterns were in one series, and for the other half, they were again split almost evenly between the two other series.
The same approach was then generalised from bits to an arbitrary base. A small number (for example 5, in the output that follows) of the LSDs of the number to be factorised were extracted in the chosen base. The following illustrates of the kind of output obtained, here using a maximum of 5 digits with base 3:
~/work # ./digits5 3 5 ============ number of digits in pattern = 5 total pattern space = 243 1d=162; 2d=0; 3d=0; 4d=0; 5d=0; lsd= 2. 1d=162; 2d=0; 3d=0; 4d=0; 5d=0; lsd= 2. 1d=162; 2d=0; 3d=0; 4d=0; 5d=0; lsd= 2. 1d=162; 2d=0; 3d=0; 4d=0; 5d=0; lsd= 2. 1d=162; 2d=0; 3d=0; 4d=0; 5d=0; lsd= 2. 1d=81; 2d=108; 3d=18; 4d=12; 5d=2; lsd= 1. ...
The last line here for example means that considering 1 LSD (k = 1) resulted in 81 stop patterns (out of 243 patterns). Considering 2 LSDs resulted in an additional 108 stop patterns, and so on.
Now the same analysis with sort and uniq showed that there were only two series:
~/work # ./digits5 3 5sortuniq c 473 1d=162; 2d=0; 3d=0; 4d=0; 5d=0; lsd= 2. 527 1d=81; 2d=108; 3d=18; 4d=12; 5d=2; lsd= 1. 1 ============ number of digits in pattern = 5 1 total pattern space = 243
This experiment was continued with all the odd prime bases up to at least 31, using fewer digits for the higher bases both for reasons of tractability and because it was quickly apparent that higher numbers of digits in higher bases did not yield much increase in the number of stop patterns.
In every odd prime base tested there were always two series, with approximately equal numbers of the experimental N values generating each series.
Printing various numbers of LSDs of N on each line soon showed that for the odd bases, the LSD of N predicted entirely which of the two series N would generate. In the printout above “lsd=” shows the LSD for each N.
Another programme was then created to go through every odd prime base up to 73 and for every N value verify that its LSD predicted one of two reduction ratios at first LSD and that there were equal numbers of values of LSD predicting each reduction ratio. The LSD values with the lower reduction ratio at first digit were designated series A, the higher series B, with results tabulated.
Finally, instead of using the 1,000 experimental N values and taking a modular residue to derive the k LSDs, I iterated exhaustively through the whole possible space that the k LSDs might occupy. (That means all values, except values with the least significant digit 0). For all the bases and k values I tested this, it did not yield any new series or results different to the experimental N values.
Experimental Results
Experimental results are in Tables 1  3 and Figure 1. The reduction ratios are presented in the tables as floating point numbers to allow easier comparison between filters for different bases.
bits  11  101  001 

1  1.00  1.00  1.00 
2  2.00  2.00  2.00 
3  4.00  2.00  2.00 
4  4.00  4.00  4.00 
5  4.00  8.00  4.00 
6  4.00  8.00  8.00 
7  4.00  8.00  10.67 
8  4.00  8.00  16.00 
9  4.00  8.00  18.29 
10  4.00  8.00  21.33 
11  4.00  8.00  22.26 
12  4.00  8.00  23.27 
13  4.00  8.00  23.54 
14  4.00  8.00  23.81 
15  4.00  8.00  23.88 
16  4.00  8.00  23.95 
17  4.00  8.00  23.97 
base  1  2  3  4 

3/A  1.50  4.50  6.75  10.12 
3/B  3.00  3.00  3.00  3.00 
5/A  1.67  3.57  4.03  4.25 
5/B  2.50  2.50  2.50  2.50 
7/A  1.75  3.06  3.24  3.29 
7/B  2.33  2.33  2.33  2.33 
11/A  1.83  2.63  2.68  2.69 
11/B  2.20  2.20  2.20  2.20 
13/A  1.86  2.52  2.56  2.56 
13/B  2.17  2.17  2.17  2.17 
17/A  1.89  2.39  2.41  2.41 
17/B  2.12  2.12  2.12  2.12 
19/A  1.90  2.34  2.36  2.36 
19/B  2.11  2.11  2.11  2.11 
23/A  1.92  2.28  2.29  2.29 
23/B  2.09  2.09  2.09  2.09 
digits  3  5  7  11 

1  1.50  1.67  1.75  1.83 
2  4.50  3.57  3.06  2.63 
3  6.75  4.03  3.24  2.68 
4  10.12  4.25  3.29  2.69 
5  11.05  4.27  3.29  2.69 
6  11.76  4.28  3.29  2.69 
7  11.89  4.29  3.29  2.69 
The following empirical rules are generalisations from these results. As far as this article goes, they can be considered only as conjectures since there is no mathematical proof here and at least some of them cannot be tested over the entire range to which they apply.
For example “For every odd prime base, there are two series of reduction ratios” cannot be proven by exhaustion, although it might be demonstrated over a range that matters for a practical implementation.
Odd prime bases
 For every odd prime base, there are two series of reduction ratios.
 In one series (‘A’) the reduction ratio increases with increasing numbers of digits considered.
 In the other series (‘B’) the reduction ratio does not increase beyond the first digit.
 Which of the two series is generated, A or B, depends only on the LSD of N in that base, as shown in Figure 1.
 Series A and B are equally probable, in the sense that the possible values of the LSD are split evenly between them.
 The progressive, A series will have the lower initial reduction ratio at first digit.
 The bigger the difference between the reduction ratios of the two series at first digit, the more the reduction ratio in the A series will gain as the number of LSDs increases. Table 3 shows a comparison.
 The larger the base, the less the reduction ratio increase in the A series will be
 The larger the base, the closer to 2 is the reduction ratio of both series at first digit.
Even prime base, 2
 Base 2 is unique in having three series rather than two.
 The selection of the series depends only on the two or three LSBs of N, as shown in Table 1.
 In one of the series the reduction ratio continues to increase with increasing numbers of bits.
 In the other two series, the reduction ratio stops increasing after a few bits. In one it reaches a plateau of 4 at 3 bits and in the other it reaches a plateau of 8 at 5 bits.
The alternate integers optimisation of Part 2
Table 1 provides perspective on the optimisation of Part 2. That optimisation was to consider only alternate integers  either odd or even  for the presquare after examining bit 1 of N. Table 1 shows that this as the particular case of considering two bits in the filter, which results in a reduction ratio of 2 in all three series.
Using these results to speed up Fermat factorisation in YAFU
The following is an example of how these results might be applied to speeding up Fermat factorisation. It is an empirical demonstration of speeding up factorisation of a particular N value in YAFU.
How YAFU optimises Fermat factorisation
YAFU's code for zFermat() includes the comment “Fermat's factorization method with a sievebased improvement provided by 'neonsignal'”.
In its search loop, YAFU reduces the big presquare modulo 176,400 and uses an array with 176,400 elements, called skip[]
, indexed by this residue, to decide which values in that range are
possible big presquares. The impossible ones are skipped. The table is computed for N before the main search loop.
The modulus is represented in the code by the variable M and is set as follows:
uint32 M = 2 * 2 * 2 * 2 * 3 * 3 * 5 * 5 * 7 * 7;
I don't have a formal proof for this, but it is intuitive that filtering in this way with modulus ab is equivalent to separately filtering with modulus a and modulus b. Only a candidate that passes both filters a and b will pass the single filter ab.
This means that what YAFU is doing, in the terms used in this article, is filtering by considering 4 LSBs in base 2, two LSDs in base 3, two LSDs in base 5 and two LSDs in base 7. The opportunity for improvement comes from the fact that YAFU does this blindly, always using the same number of these digits regardless of the value of N.
(YAFU separately then filters the candidates that pass this first filter with another two filters, using bigger prime bases 11  37, but this is not relevant to this simple demonstration).
For speed,
conflicting desired qualities
for the skip[]
table are a large reduction ratio
and small size.
The motivation for a small table comes from the fact that the table has to be consulted on every iteration of the main search loop and that in general the smaller it is, the more it will remain in any memory cache.
A way to speed up YAFU
The results in Table 1 show that for base 2, considering 4 bits in the filter, as YAFU does, will yield a reduction ratio of 4 in all three reduction ratio series. However this is almost certainly suboptimal for any particular N.
For example, in half the cases, where N ends in bit pattern ...11, considering only 3 bits would yield the same reduction ratio. For this bit pattern, the skip table is twice as big as necessary, for no gain at all in reduction ratio.
Table 1 shows that conversely, where N ends in bit pattern ...101, increasing the number of LSBs considered to 5 would double the reduction ratio from 4 to 8, while increasing the skip table size by only a factor of two.
In the remaining case, where N ends in ...001, Table 1 shows that increasing the number of LSBs considered to 6 would increase the reduction ratio to 8. In the current implementation of YAFU this would quadruple the skip table size.
Turning to base 3, YAFU always uses two LSDs, but we know from the experimental results that on average half the numbers to factorise will generate series B. In the B series, the second digit will not increase the reduction ratio but it will cause an unwanted increase in the size of the skip array by a factor of 3.
YAFU also uses two digits of base 5 and base 7, so the same reasoning applies to them.
So in a worstcase example, for a value of N that generates series B for base 3, 5 and 7 the table size will be 3 x 5 x 7 = 105 times bigger than it needs to be, for no gain at all in filtering.
These observations lead to the idea of adjusting or tuning the value of M to suit the number being factorised.
At this point it is relevant to note an implementation detail of YAFU that appears suboptimal and which potentially can affect
choices made in such tuning. The array of 16bit integers called skip[]
in YAFU is implemented as a sparselypopulated array whose index is the offset
into M. This means its size is always M. It also means most of the array positions are unused
because they are at offsets that the search will never read. In the terminology of this article, they represent stop patterns.
It is possible to redesign the array skip[]
instead to be a list of offsets from one filterpassing presquare candidate
to the next.
Then the number of elements in skip[]
would be the number of candidates that pass the filter, rather then the number of candidates
in the input range of the filter. Compared to the existing design, the size of skip[]
would be reduced by a
factor equal to the reduction ratio of the filter.
This observation is not directly related to the experimental results reported here, being a detail of a particular filtering implementation. However, it might affect optimisation when there is a compromise between skip table size and overall reduction ratio (presumably in any real system with a finite memory cache). The reason for this is that in each design the skip table size is a different function of the factors included in M. The result of the difference is that for any particular N the compromise might be optimised at a different point.
For example, as mentioned above, for a value of
N that ends in binary ...101
if the number of LSBs considered was increased
to 5, then the reduction ratio would double to 8.
In the current YAFU design the size of the skip array would also double.
But in the
modified design,
where skip[]
is a list of offsets
the skip table size would not increase at all, because
while the input range would be doubled,
the reduction ratio would also double and the number of candidates passing the filter would remain the same.
To avoid complicating the simple demonstration below with this, I have chosen a value of N that is amenable to tuning that both reduces M and increases the reduction ratio at the same time, so that tradeoffs do not need to be discussed.
An empirical demonstration of tuning YAFU's modulus
The version of zFermat() taken from YAFU as described in Part 2 was modified to accept an argument for the value of M to use as the modulus in its first filter. This zFermat() version also prints its reduction ratio for this first filter. Several simplifications to it had also been made, including removing a multiplication factor. It was used to factorise a particular value of N, both with its existing default value of 176,400 and with a value tuned to suit this particular N and no other difference.
The time taken on the same desktop machine in each case was measured.
The number to factorise was in hex ec2ce121d47fde316096e3a8942d4002b5e88f1f7c19ecd055. This was selected to make a simple and convincing demonstration.
The way to tune M as demonstrated here is to take N and look at its 3 LSBs and to reduce it modulo each of the small odd primes used in M, namely 3, 5, and 7. This derives the LSD in each of these bases.
In this case these results are as follows. The three LSBs are 101. The LSDs in the odd bases are: base 3 LSD = 1, base 5 LSD = 3 and base 7 LSD = 6.
Now the LSD in each odd base can be used in Figure 1 to find the reduction ratio series that N generates in that base. These are: base 3 series A, base 5 series B, and base 7 series B.
So since both base 5 and base 7 are the nonprogressive B series, we can eliminate the second LSD in both those bases from M, and the reduction ratio will not change. This will reduce the size of M by a factor of 35.
Table 1 shows that for this
pattern in the LSBs of N, the reduction ratio will be 8 rather than 4 if we increase
the number of bits to 5.
In the current YAFU implementation this will double the size of the skip[]
array.
Table 2 shows that for base 3 series A if we increase the number of digits considered to 4 from the YAFU default of 2, the reduction ratio will increase from 4.50 to 10.12. This will also make the skip table 9 times larger.
The size of this array will increase by a total factor of 18 as a result of adding one digit in base 2 and the two digits in base 3. However, we have also reduced it by a factor of 35 by removing the second digits in both base 5 and base 7. So the tuned skip table will be smaller than the original.
Based on this reasoning, I set the experimental value of M to be
M = 2^{5} x 3^{4} x 5 x 7 = 90,720
The results of running this factorisation in YAFU was that with the default value of M=176,400 the time taken was 13.5 seconds compared with 2.9 seconds with the tuned value of M=90,720
The reduction ratio of this filter for the default was 176400/1680 = 105.000 whereas for the tuned value it was 90720/192 = 472.500.
This has demonstrated empirically the potential to increase the speed of Fermat factorisation in YAFU, at least in some cases.
The next article will include explorations of a general algorithm or table to select the best value for M for a given N in a filtering system like this.
References
 [GMP]
 Page 111, Chapter 15 Algorithms in GNU Multiple Precision Arithmetic Library Edition 6.2.1, dated 14 November 2020. PDF with SHA256 767b5f8755125e2975b0c0b6e502fea92adbd58f4970836bd47ea7bca6ae470a
 [BILISOLY]

Searching for Patterns among Squares Modulo p, Roger Bilisoly, 2016
In February 2022 this paper was available as a PDF (SHA256 e4b097df0259e7e7325dfe061692190c8537e7e613efa2f4b481c0148da20ad4) at https://arxiv.org/pdf/1612.05852v1, linked from https://arxiv.org/abs/1612.05852
Correction History
15 April 2022: Change from “The last line here for example means that considering 1 LSD (k = 1) resulted in 162 stop patterns”. to “The last line here for example means that considering 1 LSD (k = 1) resulted in 81 stop patterns”
15 April 2022 Change from “Against this needs to be weighed the cost of deriving the k LSDs of a big square candidate” to “Against this needs to be weighed the cost of deriving the k LSDs of a big presquare candidate”
15 April 2022 Change from “Reduction Ratios” to “reduction ratios” in caption of Figure 1.