Empirical explorations of faster Fermat factorisation, part 9

by | Published


This is the ninth in a series of technical articles presenting original work towards an open-source, free software implementation of the Fermat factorisation algorithm optimised for speed. It uses concepts and terminology introduced in the previous articles and is not intended to be understood without them. See the Related section below for the other articles in the series.

This article reports an experiment to measure the performance of Fermat factorisation with filtering using the idea of the previous article (Part 8) for filtering with two filters by considering them in parallel rather than sequentially. It presents the software written for this experiment, under a free software GPL licence. This is the first implementation using filtering that does not incorporate code from YAFU.


The experiment compared the factorisation times of the following two programmes for the same N values on the same machine.

  1. New software written to implement the idea for the parallel combination of two filters of Part 8.
  2. The fastest software presented in Part 7, modified to reduce the number of its filters from three to two. (This was the YAFU zFermat() function with a reduced skip table, with the first filter tuned to suit N.)

Both these programmes used the same filter values as each other. The experiment used the same 24 values of N used in the benchmarks of Part 7. Each of these belongs to a different category of N, as described in Part 5. The first of the two filters was tuned to suit the category of N, with the same values as used in Part 7, shown in its Table 1. The other filter was set to the original YAFU value of 133331.

As described in Part 8, the YAFU combining of the two filters is sequential. The first filter is stored as a series of skip values from one candidate that passes the filter to the next. A candidate passing this first filter is then presented to the second filter, which is stored as a look-up table in the form of bit map.

In contrast the combining of two filters of Part 8 is parallel and symmetrical. Both filters are stored as a table of skip values from one candidate that passes the filter to the next. The search steps through both tables in parallel until it finds a matching cumulative offset from the last valid candidate in both tables. The following code snippet illustrates the basic algorithm.

    while (1)
        if (advance_requested > advance2_requested)
            advance2_requested += skip2[skindex2++];

            if (skindex2 >= skip2_n)
                skindex2 = 0;
        else if (advance2_requested > advance_requested)
            advance_requested += skip[skindex++];

            if (skindex >= skip_n)
                skindex = 0;
            Form the little square candidate here and test whether it is square.  Break from loop if found.

            // now start next round
            advance_requested = skip[skindex++];
            if (skindex >= skip_n)
                skindex = 0;

            advance2_requested = skip2[skindex2++];
            if (skindex2 >= skip2_n)
                skindex2 = 0;


The results of this experiment are shown in Table 1.

Table 1: Empirical measurements of Fermat factorisation time in seconds for two different implementations: 1. The YAFU zFermat() function, modified to use a compact filter table and with a tuned first filter as in Part 7 but with the third filter removed, leaving two filters. 2. The same two filters but with the new parallel filter combining algorithm. The measured times are shown for 24 categories of number to factorise. The last column shows the filter modulus used in the tuned filter.
N category Sequential Parallel Relative time Tuned RR
AAA1 34.7s 537.4s 15.5 2362.5
AAA3 147.5s 572.6s 3.9 443.0
AAA5 53.2s 320.9s 6.0 885.9
AAB1 46.9s 240.6s 5.1 700.0
AAB3 233.9s 443.8s 1.9 131.2
AAB5 226.7s 590.5s 2.6 262.5
ABA1 15.2s 165.0s 10.9 1653.8
ABA3 133.1s 382.7s 2.9 310.1
ABA5 132.8s 669.2s 5.0 620.2
ABB1 73.8s 308.7s 4.2 490.0
ABB3 805.6s 1340.7s 1.7 91.9
ABB5 328.6s 711.6s 2.2 183.8
BAA1 24.7s 274.6s 11.1 1800.0
BAA3 221.0s 673.6s 3.0 337.5
BAA5 116.2s 638.7s 5.5 675.0
BAB1 182.7s 776.4s 4.2 533.3
BAB3 687.6s 1172.6s 1.7 100.0
BAB5 189.5s 428.6s 2.3 200.0
BBA1 51.7s 467.5s 9.0 1260.0
BBA3 187.4s 485.2s 2.6 236.2
BBA5 143.2s 577.2s 4.0 472.5
BBB1 182.2s 630.7s 3.5 373.3
BBB3 639.1s 1005.5s 1.6 70.0
BBB5 335.6s 669.9s 2.0 140.0


The parallel algorithm for combining two filters was significantly slower than the sequential bit map implementation of YAFU, in some cases by more than an order of magnitude. However, the ratio between their times varies greatly with the category of N, from 1.6 to 15.5, and seems to correlate with the RR of the tuned filter. This suggests that the parallel filter combining might be improved by a different distribution of the prime factors between the two filters.

On theoretical reasoning, it may be that time is lost in the parallel algorithm when one of the filters has a very large skip value but skip values in the other filter are small. (For example the maximum skip value in the tuned filter for N of class AAA1 is 14,080 but in the other filter the maximum skip is only 134). Then the algorithm will have to make multiple increments in the other filter table to catch up with the the large skip. The algorithm using a bit map for the second filter can make a random-access query of the second filter at the same cost regardless of the length of the skip from the first filter.

GPL free software to download

Here is the experimental factorisation software using the parallel filter combining algorithm. It consists of the following files with the following SHA256 hashes at the following locations on this website.

5e7a51d83f82c5f9f61b3bea3efb64bbe6634935941ec00ada0ca0a939f4598f  /download/check2.cc
413175e31f6ac6c7121b3a21f485c57c885c24e1b8a48677098b2bccddbcbd9b  /download/check2.h
26d7ef95ff0a1346ac9986872381fd83f453d79dfc58123d349faf29b5c8b546  /download/demo_dual.cc
a2a6bc5730e69bcd769191b5073e227318e797216f17b249003fb281bcd8a2a7  /download/dual.cc
d790ae488a97b6b2480d193139e2d50bdf0b850cda1f9d226a10e01f3bf80f54  /download/dual.h
4a6d20f1c87d5ef32c9669f89d0d97d92706fe251549cf81e0e6b22dd60df788  /download/filter.cc
00a4ca0bfec01ac34fe77077334db44eb2d5d1cfd54bfc905fda1a829257db29  /download/filter.h
8037db0a4b7fc32e9e96741950e0d154e34c8580c680c22571fb1503522f4df8  /download/optima.gh
53301afc31eb5d82e2bf033abf91800a5fea36a67e85fe821324f38a32d64b3c  /download/result.h

It requires the GMP library. On GNU/Linux it has been built with

g++ --std=c++17 demo_dual.cc filter.cc dual.cc check2.cc -l:libgmp.a -o demo_dual


big square
Successful factorisation finds b2 - s2 = N where b2 is called here the big square. See Part 1.
GNU general public license
least significant bit
least significant digit
The input range, or modulus, of the filter. In effect, what happens during filtering is that a big presquare candidate is reduced mod M and the residue tested in the filter to discover whether it is viable. This is named after the variable in the YAFU code. See Part 3.
The number to factorise, of the form pq where p and q are large primes; possibly the modulus of an RSA public key.
The root x of any square x2 will be called the presquare.
reduction ratio, defined in Part 3.
Rivest Shamir Adleman
small square
Successful factorisation finds b2 - s2 = N where s2 is called here the small square. See Part 1.
table ratio. This is the number of elements needed in a table to implement a filter. It is the same as the number of candidates that pass the filter in the input range M of the filter. If M is the moduluse of the filter then by their definitions RR = M/TR.
Yet Another Factorisation Utility. See Part 2.