Empirical explorations of faster Fermat factorisation, part 9
by Stephen Hewitt | Published
Introduction
This is the ninth in a series of technical articles presenting original work towards an open-source, free software implementation of the Fermat factorisation algorithm optimised for speed. It uses concepts and terminology introduced in the previous articles and is not intended to be understood without them. See the Related section below for the other articles in the series.
This article reports an experiment to measure the performance of Fermat factorisation with filtering using the idea of the previous article (Part 8) for filtering with two filters by considering them in parallel rather than sequentially. It presents the software written for this experiment, under a free software GPL licence. This is the first implementation using filtering that does not incorporate code from YAFU.
Experiment
The experiment compared the factorisation times of the following two programmes for the same N values on the same machine.
- New software written to implement the idea for the parallel combination of two filters of Part 8.
The fastest software presented in Part 7, modified to reduce the number of its filters from three to two. (This was the YAFU
zFermat()
function with a reduced skip table, with the first filter tuned to suit N.)
Both these programmes used the same filter values as each other. The experiment used the same 24 values of N used in the benchmarks of Part 7. Each of these belongs to a different category of N, as described in Part 5. The first of the two filters was tuned to suit the category of N, with the same values as used in Part 7, shown in its Table 1. The other filter was set to the original YAFU value of 133331.
As described in Part 8, the YAFU combining of the two filters is sequential. The first filter is stored as a series of skip values from one candidate that passes the filter to the next. A candidate passing this first filter is then presented to the second filter, which is stored as a look-up table in the form of bit map.
In contrast the combining of two filters of Part 8 is parallel and symmetrical. Both filters are stored as a table of skip values from one candidate that passes the filter to the next. The search steps through both tables in parallel until it finds a matching cumulative offset from the last valid candidate in both tables. The following code snippet illustrates the basic algorithm.
while (1) { if (advance_requested > advance2_requested) { advance2_requested += skip2[skindex2++]; if (skindex2 >= skip2_n) skindex2 = 0; } else if (advance2_requested > advance_requested) { advance_requested += skip[skindex++]; if (skindex >= skip_n) skindex = 0; } else { Form the little square candidate here and test whether it is square. Break from loop if found. // now start next round advance_requested = skip[skindex++]; if (skindex >= skip_n) skindex = 0; advance2_requested = skip2[skindex2++]; if (skindex2 >= skip2_n) skindex2 = 0; } }
Results
The results of this experiment are shown in Table 1.
N category | Sequential | Parallel | Relative time | Tuned RR |
---|---|---|---|---|
AAA1 | 34.7s | 537.4s | 15.5 | 2362.5 |
AAA3 | 147.5s | 572.6s | 3.9 | 443.0 |
AAA5 | 53.2s | 320.9s | 6.0 | 885.9 |
AAB1 | 46.9s | 240.6s | 5.1 | 700.0 |
AAB3 | 233.9s | 443.8s | 1.9 | 131.2 |
AAB5 | 226.7s | 590.5s | 2.6 | 262.5 |
ABA1 | 15.2s | 165.0s | 10.9 | 1653.8 |
ABA3 | 133.1s | 382.7s | 2.9 | 310.1 |
ABA5 | 132.8s | 669.2s | 5.0 | 620.2 |
ABB1 | 73.8s | 308.7s | 4.2 | 490.0 |
ABB3 | 805.6s | 1340.7s | 1.7 | 91.9 |
ABB5 | 328.6s | 711.6s | 2.2 | 183.8 |
BAA1 | 24.7s | 274.6s | 11.1 | 1800.0 |
BAA3 | 221.0s | 673.6s | 3.0 | 337.5 |
BAA5 | 116.2s | 638.7s | 5.5 | 675.0 |
BAB1 | 182.7s | 776.4s | 4.2 | 533.3 |
BAB3 | 687.6s | 1172.6s | 1.7 | 100.0 |
BAB5 | 189.5s | 428.6s | 2.3 | 200.0 |
BBA1 | 51.7s | 467.5s | 9.0 | 1260.0 |
BBA3 | 187.4s | 485.2s | 2.6 | 236.2 |
BBA5 | 143.2s | 577.2s | 4.0 | 472.5 |
BBB1 | 182.2s | 630.7s | 3.5 | 373.3 |
BBB3 | 639.1s | 1005.5s | 1.6 | 70.0 |
BBB5 | 335.6s | 669.9s | 2.0 | 140.0 |
Discussion
The parallel algorithm for combining two filters was significantly slower than the sequential bit map implementation of YAFU, in some cases by more than an order of magnitude. However, the ratio between their times varies greatly with the category of N, from 1.6 to 15.5, and seems to correlate with the RR of the tuned filter. This suggests that the parallel filter combining might be improved by a different distribution of the prime factors between the two filters.
On theoretical reasoning, it may be that time is lost in the parallel algorithm when one of the filters has a very large skip value but skip values in the other filter are small. (For example the maximum skip value in the tuned filter for N of class AAA1 is 14,080 but in the other filter the maximum skip is only 134). Then the algorithm will have to make multiple increments in the other filter table to catch up with the the large skip. The algorithm using a bit map for the second filter can make a random-access query of the second filter at the same cost regardless of the length of the skip from the first filter.
GPL free software to download
Here is the experimental factorisation software using the parallel filter combining algorithm. It consists of the following files with the following SHA256 hashes at the following locations on this website.
5e7a51d83f82c5f9f61b3bea3efb64bbe6634935941ec00ada0ca0a939f4598f /download/check2.cc 413175e31f6ac6c7121b3a21f485c57c885c24e1b8a48677098b2bccddbcbd9b /download/check2.h 26d7ef95ff0a1346ac9986872381fd83f453d79dfc58123d349faf29b5c8b546 /download/demo_dual.cc a2a6bc5730e69bcd769191b5073e227318e797216f17b249003fb281bcd8a2a7 /download/dual.cc d790ae488a97b6b2480d193139e2d50bdf0b850cda1f9d226a10e01f3bf80f54 /download/dual.h 4a6d20f1c87d5ef32c9669f89d0d97d92706fe251549cf81e0e6b22dd60df788 /download/filter.cc 00a4ca0bfec01ac34fe77077334db44eb2d5d1cfd54bfc905fda1a829257db29 /download/filter.h 8037db0a4b7fc32e9e96741950e0d154e34c8580c680c22571fb1503522f4df8 /download/optima.gh 53301afc31eb5d82e2bf033abf91800a5fea36a67e85fe821324f38a32d64b3c /download/result.h
It requires the GMP library. On GNU/Linux it has been built with
g++ --std=c++17 demo_dual.cc filter.cc dual.cc check2.cc -l:libgmp.a -o demo_dual
Terminology
- big square
- Successful factorisation finds b2 - s2 = N where b2 is called here the big square. See Part 1.
- GPL
- GNU general public license
- LSB
- least significant bit
- LSD
- least significant digit
- M
- The input range, or modulus, of the filter. In effect, what happens during filtering is that a big presquare candidate is reduced mod M and the residue tested in the filter to discover whether it is viable. This is named after the variable in the YAFU code. See Part 3.
- N
- The number to factorise, of the form pq where p and q are large primes; possibly the modulus of an RSA public key.
- presquare
- The root x of any square x2 will be called the presquare.
- RR
- reduction ratio, defined in Part 3.
- RSA
- Rivest Shamir Adleman
- small square
- Successful factorisation finds b2 - s2 = N where s2 is called here the small square. See Part 1.
- TR
- table ratio. This is the number of elements needed in a table to implement a filter. It is the same as the number of candidates that pass the filter in the input range M of the filter. If M is the moduluse of the filter then by their definitions RR = M/TR.
- YAFU
- Yet Another Factorisation Utility. See Part 2.