It is well known that there are fairly efficient algorithms for determining whether an integer is prime; integer factorization, however, is a whole other story. I will describe here an interesting randomized algorithm for integer factorization, which has a couple of desirable properties:

  • It is fast, with an expected runtime of roughly $O(\sqrt[4]{n})$ gcd (greatest common divisor) operations.
  • It only uses a constant amount of space.

It goes like this. Assume we are given an integer $n$ divisible by at least two distinct primes (i.e. it is not a prime power), so we can write $n = pq$ where $p, q > 1$ are coprime and $p \le \sqrt{n}$. To emphasise: in the following we will not know what $p$ and $q$ are, we will only make use of the fact that they exist. Also, our goal will be to find just one non-trivial factor of $n$.
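
As a toy instance of this setup (my own numbers, just for illustration): with $n = 8051 = 83 \cdot 97$ we may take $p = 83$ and $q = 97$, which are coprime, and indeed $p = 83 \le \sqrt{8051} \approx 89.7$.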

Imagine sampling numbers $x_1, x_2, x_3, \dots$ from $\{0, 1, \dots, n - 1\}$ without replacement, where we stop sampling as soon as the newly sampled number $x_j$ satisfies

$$x_i \equiv x_j \pmod{p} \quad \text{for some } i < j. \qquad (1)$$

Any occurrence of (1) we will refer to as a collision. Notice that there are only $p$ available bins modulo $p$, hence, by the pigeonhole principle, the above is guaranteed to stop after at most $p + 1$ samples. Most importantly, notice that after we stop, we have $p \mid x_j - x_i$ and thus $p \mid \gcd(x_j - x_i, n)$. Moreover, since we sample without replacement, $0 < |x_j - x_i| < n$, so $\gcd(x_j - x_i, n) < n$. Then we are finished, as $\gcd(x_j - x_i, n)$ is a non-trivial factor of $n$!
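
A tiny worked instance of this (again my own numbers): take $n = 91 = 7 \cdot 13$, so $p = 7$. If among the samples we happen to draw $x_i = 23$ and later $x_j = 44$, then $x_i \equiv x_j \equiv 2 \pmod{7}$, and indeed $\gcd(x_j - x_i, n) = \gcd(21, 91) = 7$, a non-trivial factor of $91$.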

Consider thus the following algorithm:

  • Keep sampling numbers $x_1, x_2, \dots$ as described above, one by one.
  • After each new sample $x_j$, check the condition $\gcd(x_j - x_i, n) > 1$ for all $i < j$ (a brute-force sketch of this follows right after the list).
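
Just to make this concrete, here is a minimal sketch of the naive version of this procedure (my own illustrative code, not the actual Pollard's rho): sample without replacement and check the gcd condition against every earlier sample.

import random
from math import gcd

def naive_collision_factor(n, max_samples=10**6):
    # Sample distinct numbers from {0, ..., n-1}; for each new sample,
    # check gcd(|x - y|, n) against every earlier sample y.  This needs
    # O(k^2) gcd operations and O(k) memory for k samples -- it only
    # illustrates the idea above, it is not the final algorithm.
    seen = []
    for _ in range(max_samples):
        x = random.randrange(n)
        if x in seen:                  # enforce sampling without replacement
            continue
        for y in seen:
            d = gcd(abs(x - y), n)     # d > 1 iff x and y collide modulo some factor of n
            if d > 1:                  # d < n is automatic, since 0 < |x - y| < n
                return d
        seen.append(x)
    return None                        # gave up (e.g. n is prime)

The quadratic number of gcd checks, and the memory needed to store all previous samples, are exactly what the rest of the post gets rid of.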

How many samples do we need to gather until we fulfil the condition $\gcd(x_j - x_i, n) > 1$? Well, by the discussion in the previous paragraph, this is at most as many as we need for the collision (1) to occur. The question can therefore be rephrased as: how many samples do we need until $x_i \equiv x_j \pmod{p}$ for some $i < j$? But this is nothing else than the Birthday paradox!

“What is the probability that two people in a room of $k$ people share a birthday?” is completely analogous to our question “What is the probability that among $k$ samples, two of them fall into the same equivalence class modulo $p$?”. The answer can be bounded (treating the residues modulo $p$ as roughly uniform and independent):

$$P(\text{collision among } k \text{ samples}) \;=\; 1 - \prod_{i=1}^{k-1}\left(1 - \frac{i}{p}\right) \;\ge\; 1 - e^{-k(k-1)/(2p)}.$$

And so already with $k \approx \sqrt{p}$ the probability of a collision is a constant bounded away from zero (the bound gives $1 - e^{-1/2} \approx 0.39$ for $k = \sqrt{p}$, and more than $1/2$ once $k \gtrsim 1.18\sqrt{p}$). This tells us to expect that a collision occurs after taking only around $\sqrt{p}$ samples. Since we arranged for $p$ to satisfy $p \le \sqrt{n}$, the number of samples that we expect to need is $O(\sqrt{p}) = O(\sqrt[4]{n})$.
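
As a quick sanity check of the $\sqrt{p}$ estimate, one can simulate the Birthday paradox directly; the snippet below (my own, with an arbitrarily chosen prime $p$) counts how many uniform residues modulo $p$ are drawn before one repeats.

import random

def samples_until_collision(p):
    # Draw uniform residues modulo p until one repeats; return how many were drawn.
    seen = set()
    while True:
        r = random.randrange(p)
        if r in seen:
            return len(seen) + 1
        seen.add(r)

p = 10007                      # an arbitrary prime, standing in for the unknown factor
trials = 2000
avg = sum(samples_until_collision(p) for _ in range(trials)) / trials
print(avg, p ** 0.5)           # the average is on the order of sqrt(p) (roughly 1.25 * sqrt(p))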

The backbone of Pollard's rho factorization will be the following heuristic. Choose an integer $x_0$ and a polynomial $f$ (the usual choice is $x_0 = 2$ and $f(x) = x^2 + 1$). Instead of randomly sampling the sequence $x_1, x_2, \dots$ without replacement, we will generate it deterministically, obtaining a pseudo-random sequence:

$$x_{i+1} = f(x_i) \bmod n.$$

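In code, generating this sequence is just repeated application of $f$ modulo $n$; a minimal sketch (the generator name is mine):

from itertools import islice

def pseudo_random_sequence(n, x0=2, f=lambda x: x * x + 1):
    # Yield x_1, x_2, ... where x_{i+1} = f(x_i) mod n.
    x = x0
    while True:
        x = f(x) % n
        yield x

print(list(islice(pseudo_random_sequence(8051), 8)))   # first few terms for a toy n
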
Making any rigorous statement about the sufficiency of the “randomness” introduced this way is very tough. However, there are (perhaps less satisfying) empirical reasons why using a pseudo-random sequence like this is sufficient: Pollard's rho factorization simply seems to work well in practice. One of the algorithm's successes was the factorization of the Fermat number $F_8$:

$$F_8 = 2^{2^8} + 1 = 1238926361552897 \cdot p_{62},$$

where $p_{62}$ denotes a 62-digit prime.

In any case, we are now in a position to describe the algorithm in its fullness. Consider (conceptually) a graph whose vertices are the elements of $\mathbb{Z}_p = \{0, 1, \dots, p - 1\}$. Having chosen $x_0$ and the polynomial $f$, we put an oriented edge between vertices $u$ and $v$ (both in $\mathbb{Z}_p$) precisely when $x_i \equiv u$ and $x_{i+1} \equiv v \pmod{p}$ for some $i$. Since $p \mid n$, the value of $x_{i+1} \bmod p$ depends only on $x_i \bmod p$; as there are finitely many vertices and our generation of the sequence by $f$ is completely deterministic, we will start cycling in the graph after a while. The whole situation thus looks like the Greek letter $\rho$: a tail leading into a cycle, which is where the name of the algorithm comes from.

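To see the $\rho$ shape directly, we can cheat and follow the sequence modulo a small prime $p$ that we secretly know, recording when a vertex is first revisited; a little sketch of mine, purely for intuition:

def rho_shape_mod_p(p, x0=2, f=lambda x: x * x + 1):
    # Walk x -> f(x) mod p and report (tail length, cycle length).
    first_seen = {}              # vertex -> index at which it was first visited
    x, i = x0 % p, 0
    while x not in first_seen:
        first_seen[x] = i
        x = f(x) % p
        i += 1
    tail = first_seen[x]         # number of steps before entering the cycle
    cycle = i - tail             # length of the cycle itself
    return tail, cycle

print(rho_shape_mod_p(10007))    # heuristically, both lengths are on the order of sqrt(p)
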
Believing that the pseudo-random sequence behaves like a truly random one, the Birthday paradox discussion tells us that the tail and cycle together will have length around $\sqrt{p}$. Now, how do we detect a cycle in this conceptual graph? Well, using (1) we can test whether $x_i$ and $x_j$ fall into the same vertex of the graph by testing whether $\gcd(x_j - x_i, n)$ is a non-trivial factor of $n$. And this operation is all we need in order to employ »Floyd's Tortoise and Hare cycle detection algorithm« (the tortoise and the hare move along the graph with speeds 1 and 2 respectively; the first vertex at which they meet must lie on the cycle), which can detect a cycle in time proportional to its length (and in constant space). Summarizing, we can expect to find a cycle, and thus a proper factor of $n$, in the time it takes to perform $O(\sqrt[4]{n})$ gcd operations. And all of that with remarkably short and simple code:

from math import gcd

def pollards_rho(n, x0=2, f=lambda x: x*x + 1, max_iter=10**6):
    """Try to find a non-trivial factor of n; return n on failure."""
    x, y = x0, x0
    for _ in range(max_iter):
        x, y = f(x) % n, f(f(y)) % n  # tortoise moves one step, hare moves two
        d = gcd(abs(x - y), n)
        if d > 1:
            return d  # note: d == n also means failure (x and y met modulo n itself)
    return n  # gave up; n may be prime, a prime power, or the parameters were unlucky

Here, I have included an upper bound (max_iter) on the length of the sequence $x_1, x_2, \dots$, so that we avoid infinite loops by giving up prematurely. It is important to note that, whether due to the generation of the $x_i$ being non-random or due to $n$ being a prime (power), Pollard's rho algorithm as described may not terminate. For any practical application it is reasonable to check whether your integer is a prime (power) before starting Pollard's rho. As a general consequence of how we reasoned about the algorithm, the smaller the smallest factor of your integer is, the faster Pollard's rho will tend to be. This is why it was successful at factoring $F_8$, whose smaller prime factor $1238926361552897$ is relatively “small”.
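
To sketch how this fits into a complete factorization routine (my own wrapper around the function above, not code from the post; it assumes the sympy package is available for the primality test, though any primality test would do):

from sympy import isprime     # assumption: sympy is installed; any primality test works

def factorize(n):
    # Recursively split n into prime factors using pollards_rho from above.
    # A rough sketch: production code would also trial-divide by small primes
    # and guard against pathological inputs such as prime powers.
    if n == 1:
        return []
    if n % 2 == 0:
        return [2] + factorize(n // 2)
    if isprime(n):
        return [n]
    d = n
    x0 = 2
    while d == n:                      # d == n means pollards_rho gave up
        d = pollards_rho(n, x0=x0)
        x0 += 1                        # retry from a different starting point
    return sorted(factorize(d) + factorize(n // d))

print(factorize(96764544874272622169))   # one of the composites from the table below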

As the last thing, let's see how the above procedure deals with randomly chosen integers between $10^{19}$ and $10^{20}$, running on PyPy3 on a 2.7 GHz Intel Core i5 (a sketch of a harness for reproducing such measurements follows the table):

| Integer | Factorization | Time taken (s) | Notes |
| --- | --- | --- | --- |
| 96764544874272622169 | 230237209 x 420281957441 | 0.060839 | |
| 55758240055524291931 | 131 x 425635420271177801 | 0.000058 | |
| 52595570150902145711 | 2713 x 19386498396941447 | 0.000287 | |
| 54530065810466921693 | 54530065810466921693 x 1 | 4.221130 | Prime |
| 76937877947075472797 | 37 x 2079402106677715481 | 0.000028 | |
| 87081594094471557253 | 29 x 3002813589464536457 | 0.000070 | |
| 86157342400390552069 | 3461 x 24893771280089729 | 0.000111 | |
| 28197957983033436721 | 28197957983033436721 x 1 | 4.274322 | Prime |
| 29951290030313188349 | 59 x 507648983564630311 | 0.000032 | |
| 74279717678385918737 | 84509 x 878956296706693 | 0.002071 | |
| 44657213935372994281 | 44657213935372994281 x 1 | 4.510721 | Prime |
| 97912152549056935669 | 74531 x 1313710436584199 | 0.001908 | |
| 20228236878280923857 | 23 x 879488559925257559 | 0.000026 | |
| 51555948593461122523 | 9497 x 5428656269712659 | 0.000365 | |
| 53229062465988471241 | 131 x 406328721114415811 | 0.000071 | |
| 31315421851347505283 | 10513 x 2978733173342291 | 0.000107 | |
| 89179931205901546169 | 43 x 2073951888509338283 | 0.000041 | |
| 53196997219848404297 | 59 x 901644020675396683 | 0.000039 | |
| 50597420757695715781 | 421 x 120183897286688161 | 0.000130 | |
| 85242256987119573301 | 23 x 3706185086396503187 | 0.000026 | |
| 11855105309203805442489519385571 | 22801765463 x 519920500385852312917 | 0.300823 | |
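
For completeness, here is a sketch of the kind of harness one might use to reproduce such measurements (my own reconstruction; the exact script behind the table is not shown in the post):

import random
import time

# Time pollards_rho (defined above) on a few random 20-digit integers.
for _ in range(5):
    n = random.randrange(10**19, 10**20)
    start = time.perf_counter()
    d = pollards_rho(n)
    elapsed = time.perf_counter() - start
    note = "gave up (n is likely prime)" if d == n else ""
    print(n, f"{d} x {n // d}", f"{elapsed:.6f}", note)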