# How is this related to big data and data science, and why is it such a big deal?

It is important big data science in multiple ways. First, data security and encryption relies on algorithms that typically use an encryption key: the key - at the very core of these algorithms - is essentially the product of two very large prime numbers. While there has been new developments to produce different algorithms not based on keys or prime numbers, large prime numbers are still central to most encryption schemes. Recent security breaches (eBay, Twitter, Target) show how critical data security is. Anyone who can come up with an algorithm that can easily factor products of large prime numbers, could jeopardize the security of all financial transactions, electricity networks, personal data, military and corporate secrets, aerial navigation, and much more. Pretty much all of us have indeed spent countless hours trying to solve this very old problem, to no avail. This is a problem for Internet pirates, aka reverse data scientists

In this article, I describe a methodology that might lead to a solution, or if not, one that leads to very interesting results and potentially applicable to other problems such as random number generation. And it definitely illustrates how chaos is far less chaotic and much more organized than most people think!

The other reason why this is an interesting data science problem is because the methodology proposed here involves creating and processing very big data to find patterns and insights that would help solve our problem: factoring a product of two large prime numbers. It is actually a very good application for data scientist candidates, to learn how to massage truly big data and find patterns. Note that in many similar mathematical problems, there is often an intensive data processing step, as well as theoretical investigations, to come up with a solution. In this article, we keep mathematics to a very elementary level, so that all of you can understand and participate. This has been added as an open project for candidates participating in our data science apprenticeship, called cracking the maths that make all financial transactions secure. You'll have to wear your pirate hat this time, to work on this project, as this is a kind of reverse-engineering adventure. We initially thought to include this in our challenge of the week series, but the scope goes far beyond that of a  weekly challenge.

Proposed methodology to factor an integer

We assume that z = x*y is the integer that we want to factorize, with x and y being the two factors that we try to identify, typically two large prime numbers, which makes it a very hard problem from a computational complexity point of view.

We will use this generic iterative algorithm to factorize z:

Algorithm:

1. Start with x(0) = a, y(0) = b, where a and b are two parameters.

2. Then proceed iteratively as follows:

x(n+1) = f(x(n), y(n), z, c);
y(n+1) = g(x(n), y(n), z, d);

Where c, d are two parameters, and f, g are integer-valued functions satisfying

f(x, y, z, c) = x if x*y = z
g(x, y, z, d) = y if x*y = z

Two of the most basic functions satisfying the previous conditions are

f(x, y, z, c) = modulo{(z - x*y) + x, c}
g(x, y, z, d) = modulo{(x*y - z) + y, d}

where c and d are large enough integers, larger than the expected factors x and y (solution of  z = x*y). We will use these two functions here.

The integer sequence (x(n), y(n)) with n = 0, 1, ... either converges to the solution (x, y), or becomes periodic for n large enough. This is a straightforward application of the fixed point theorem.

Indeed, this sequence converges very rarely, and most of the time, creates what seems like a chaotic series of numbers - so chaotic that such equations have been used to produce (poor) simulated random numbers - the opposite of what we are trying to do here: finding a strong pattern and fast convergence to (x, y). Yet, there are sets of parameters (a, b, c, d) that, given  z = x*y, result in very fast convergence to the solution (x, y), solving the factorization problem and identifying the large prime numbers x and y in as few as 2 iterations.

The problem

The problem, is of course, how to identify these parameters (a, b, c, d) that lead to fast convergence, or at least convergence that is fast enough to make it faster than the best factorization algorithms currently known. A solution consists of generating a very large number of (a, b, c, d) combinations for a few different z (product of two primes x, y), check out the few (a, b, c, d) that lead to fast convergence and see if some patterns emerge among these combinations.

We started to look into this, and we run some experiments on small z (we know big z's behave very differently from small z's, just like big data behaves very different from small data), but we did identify a few interesting facts:

1. If c or d is a power of 2, or a number with many divisors, there are far more (a, b, c, d) likely to result in convergence, given a pre-specified z = x * y.
2. If a (or b) is equal to x-1, x-2, 2*x, or 3*x, we can easily get very fast convergence. This is useless though, as it assumes that we know x beforehand, which is the number that we try very hard to uncover.
3. Using b = int(z / a), so that the initial estimates x(0), y(0) have a product a*b close to z, we observed a spectacular order in the (a, b, c, d) configurations that lead to convergence. For instance, if x=101, y=61 (thus z=6581), c=128, d=207, z/2 < a < z, and b = int(z / a),  then convergence is guaranteed if and only if modulo(a, d) belongs to {156, 13, 54, 154}. The number of iterations required for convergence is a function of modulo(a, d), with only 5 iterations if modulo(a, d) = 156, and 20 iterations if modulo(a, d) = 54. Hence our claim that this chaos is indeed.... remarquably organized!
4. Not all z's are created equal: z = 61*101 seems rather easy to solve, in the sense that many (a, b, c, d) work, with identifiable patterns, while z= 83*61 seems much tougher. So this methodology could work well to factor some z's and maybe not for other z's. Which z's are great candidates? This is an open question.
5. Even if you use a=1 and b=1, you get many instances of convergence (c and d values yielding convergence) especially if c or d has many divisors. But convergence is very slow, way too slow to be of any value when a = b =1.
6. An interesting exercise is to generate tons of (c, d) for a given z = x*y, and check which ones result in fast / slow / no convergence for many (say 10 million) values of (a, b). Rank these (c, d) according to number of couples (a, b) - out of 10 million - that yield convergence. Eliminate trivial couples (a, b) such as those listed in item 2. Find patterns.

Related issues

Just curious, does anyone know about an initiative to factorize products of large primes using Hadoop? I'm sure this must even been tried, and probably long ago. Hadoop would work well for this kind of problems. Note that checking whether a number is prime or not is a much simpler problem, if you use probabilistic algorithms

Finally, another approach to factoring z into x*y, is to replace x, y, z by three matrices with integer coefficients, X, Y, Z with det(X) = x, det(Y) = y, and det(Z) = z. Here 'det' means determinant, and we use the fact that det(X*Y) = det(X) * det(Y) = x*y, solving Z = X*Y rather than z = x*y. When X, Y, and Z are 2x2 matrices, a particular case is equivalent to working in the complex plane.

Views: 6658

Comment

Join Data Science Central Comment by Colin Bailey on August 28, 2014 at 4:51pm

A better analysis: Let Z_c = Z/cZ be the ring of integers mod c. Then for any z, if u is a unit of Z_c then

u, (z/u mod c) is a fixed point of (x, y)-> (z-x*y) + x mod c. Thus, if c=d (as in my example) there are many fixed points that are not integer factorizations of z. If c != d, an extended analysis of common units in Z_c and Z_d

shows that it is easy to find c,d that have multiple fixed points for the given functions.

One  might consider a multiplicative variant such as (x, y) -> (k^(z - x*y))*x mod c and (x,y) -> (l^(z-x*y))*y mod d. Typically p, q (the factors of z) are units of Z_c and Z_d (and if they are not then

gcd(c, z) or gcd(d,z) is likely to produce one of p or q).  If k is a non-unit of Z_c then this function only produces non-units and never reaches p or q. If k is a unit, but we choose a non-unit starting value for x, we again get only non-units. In the case when k and x are units we stay in the coset of x relative to the subgroup of Z_c^* generated by k. It seems like we could miss both p and q quite often. Comment by Colin Bailey on August 27, 2014 at 1:25am

These two functions don't seem to work. Take z = 23 * 47, c = d = 727, a = 6 and b = 59.

Then f = 6 and g = 59 is a fixed point, but not a factorization of z. In fact, a search shows that there are many values of c and d  that provide false solutions. Comment by Vincent Granville on August 23, 2014 at 7:38pm

Dear dr. Granville,
I have been briefly looking at the questions posed in the "Great data challenge: factoring the product of two large primes". I believe your proposed method is a special case of Pollard's rho method for factoring integers, and its generalization in the algebraic group factoring algorithms.
If correct, I believe these algorithms can be shown to be inferior in terms of speed to other integer factoring algorithms for very large integers such as those used for securing financial transactions (i.e. the RSA numbers, most notably RSA 1024).
The most successful integer factoring algorithm in recent years is the General Number Field Sieve (and the its cousin the quadratic sieve). Almost all of the recent efforts in integer factoring has revolved around slightly optimizing the implementation of this GNFS algorithm and drastically improving the computational power used.
Quite recent efforts are looking at Hadoop as a novel platform for factoring integers. However, a limiting factor in the current design of Hadoop is that data is stored in HDFS (on disk). This makes the read-write speed of the disk a severely limiting factor compared to being able to computing the factors in RAM in a setup using MPI (Message Passing Interface) for parallelism. Another important point is that factoring integers is primarily a computer-intensive task, and not necessarily a data-intensive task. In my opinion, Hadoop is primarily designed to be efficient in managing data.
In short, I believe the proposed method can not be faster or better for factoring very large integers (the RSA numbers especially). I did already write a short brute-force program to test your proposed method for the semiprimes consisting of a combination of primes in 1 - 100.000. Nonetheless I feel this is a very useful exercise for beginning data scientists such as myself, just to get a feel of working with computer-intensive tasks. I might even consider to take this as my project for the data science apprenticeship.
What are your thoughts on my remarks above?

ir. Tom De Smedt
MSc(eng) Biomedical Engineering Comment by Mirko Krivanek on June 10, 2014 at 8:01am

The matrix approach described in your last paragraph works as follows (if it ever works): start with initial matrices X(0), Y(0) with integer coefficients, compute the product Z(0) = X(0) * Y(0), then iteratively update the matrices X(n), Y(n) until det(X(n) * Y(n)) = z, making small adjustments at each iteration. Then det(X(n)) = x and det(Y(n)) = y, and the factorization is complete. At all times, the matrices must be integer-valued. Comment by Mirko Krivanek on June 5, 2014 at 3:40pm