It is well known that there are fairly efficient algorithms for determining whether an integer is a prime, however, it is a whole other story for integer factorization. I will describe here an interesting randomized algorithm for integer factorization, which has a bunch of desirable properties:
- It is fast, with expected runtime being roughly gcd (greatest common divisor) operations.
- It only uses a constant amount of space.
It goes like this. Assume a given integer divisible by at least two prime factors (it is not a prime power), so we can write where are coprime and . To emphasise, in the following we will not know what are, we will only make use of the fact that they exist. Also, our goal will be to find one of these.
Imagine sampling numbers from without replacement, where we will stop sampling as soon as the newly sampled number satisfies
Any occurrence of (1) we will refer to as collision. Notice that there are only available bins modulo , hence, by the pigeonhole principle, the above is guaranteed to stop at most after samples. Most importantly, notice that after we stop, we have and thus . Then we are finished, as is a non-trivial factor of !
Consider thus the following algorithm:
- Keep sampling numbers as described above, one by one.
- Check the condition for all .
How many samples do we need to gather until we fulfil the condition ? Well, the discussion in the previous paragraph this is at most as many as we need for the collision (1) to occur. The question can be therefore rephrased as: How many samples do we need until for some ? But this is nothing else than the Birthday paradox!
“What is the probability that two people in a room share a birthday?” is completely analogous to our question “What is the probability that among samples two of them fall into same equivalence class modulo ?”. The answer can be bounded:
And so already with we have the probability of collision around . This tells us to expect that the collision occurs after taking only around samples. Since we considered to be given by , the amount of samples that we expect to need is .
The backbone of the Pollard’s rho factorization will be the following heuristic. Choose an integer and a polynomial (Usual choice is 2 and .). Instead of randomly sampling the sequence without replacement, we will generate it deterministically, obtaining a pseudo-random sequence:
To make any rigorous statements about the sufficiency of “randomness” introduced this way is very tough. However, there are (perhaps less satisfying) empirical reasons why using a pseudo-random sequence like this is sufficient: the Pollard’s factorization seems to be working well. One of the algorithm’s successes was factorization of the ninth Fermat number:
In any case, we are now in position to describe the algorithm in its fullness. Consider (conceptually) a graph whose vertices are the elements of . Having chosen and polynomial , we will consider an oriented edge between vertices and (both ) precisely when for some . As there are finitely many and our generation of the sequence by is completely deterministic, we will cycle in the graph after a while. The whole situation looks like this (And this is where the in the name of the algorithm comes from.):
Believing that the pseudo-random sequence behaves itself, the Birthday paradox discussion tells us that the length of the cycle will be around . Now, how do we detect a cycle in this conceptual graph? Well, using (1) we can test whether and fall to the same vertex in the graph by testing whether is a non-trivial factor of n. And this operation is all we need in order to employ the »Floyd’s Tortoise and Hare cycle detection algorithm« (Tortoise and Hare move on the graph with speeds 1 and 2 respectively. The first vertex that they meet at must be contained in a cycle.), which can detect a cycle in time proportional to the length of the cycle (and constant space). Summarizing, we can expect to find a cycle and thus a proper factor of in time of performing gcd operations. And all of that with a remarkably short and simple code:
Here, I have included an upper bound on the length of the sequence , so we avoid infinite loops by giving up prematurely. It is important to note that, whether due to the generation of being non-random or due to being a prime (power), the Pollard’s rho algorithm as described may not terminate. For any practical applications it is reasonable to check whether your integer is a prime power before starting Pollard’s rho. As a general consequence of how we reasoned about the algorithm, the smaller factor your integer has, the faster Pollard’s rho will tend to be. This is why it was successful with factoring , whose first factor 1238926361552897 is relatively “small”.
As the last thing, let’s see how the above procedure deals with randomly chosen integers between and , running on PyPy3, 2.7 GHz Intel Core i5. You can also try it by yourself below:
|Integer||Factorization||Time taken (s)||Notes|
|96764544874272622169||230237209 x 420281957441||0.060839|
|55758240055524291931||131 x 425635420271177801||0.000058|
|52595570150902145711||2713 x 19386498396941447||0.000287|
|54530065810466921693||54530065810466921693 x 1||4.221130||Prime|
|76937877947075472797||37 x 2079402106677715481||0.000028|
|87081594094471557253||29 x 3002813589464536457||0.000070|
|86157342400390552069||3461 x 24893771280089729||0.000111|
|28197957983033436721||28197957983033436721 x 1||4.274322||Prime|
|29951290030313188349||59 x 507648983564630311||0.000032|
|74279717678385918737||84509 x 878956296706693||0.002071|
|44657213935372994281||44657213935372994281 x 1||4.510721||Prime|
|97912152549056935669||74531 x 1313710436584199||0.001908|
|20228236878280923857||23 x 879488559925257559||0.000026|
|51555948593461122523||9497 x 5428656269712659||0.000365|
|53229062465988471241||131 x 406328721114415811||0.000071|
|31315421851347505283||10513 x 2978733173342291||0.000107|
|89179931205901546169||43 x 2073951888509338283||0.000041|
|53196997219848404297||59 x 901644020675396683||0.000039|
|50597420757695715781||421 x 120183897286688161||0.000130|
|85242256987119573301||23 x 3706185086396503187||0.000026|
|11855105309203805442489519385571||22801765463 x 519920500385852312917||0.300823|