How is this related to big data and data science, and why is it such a big deal?
It is important big data science in multiple ways. First, data security and encryption relies on algorithms that typically use an encryption key: the key - at the very core of these algorithms - is essentially the product of two very large prime numbers. While there has been new developments to produce different algorithms not based on keys or prime numbers, large prime numbers are still central to most encryption schemes. Recent security breaches (eBay, Twitter, Target) show how critical data security is. Anyone who can come up with an algorithm that can easily factor products of large prime numbers, could jeopardize the security of all financial transactions, electricity networks, personal data, military and corporate secrets, aerial navigation, and much more. Pretty much all of us have indeed spent countless hours trying to solve this very old problem, to no avail.
This is a problem for Internet pirates, aka reverse data scientists
In this article, I describe a methodology that might lead to a solution, or if not, one that leads to very interesting results and potentially applicable to other problems such as random number generation. And it definitely illustrates how chaos is far less chaotic and much more organized than most people think!
The other reason why this is an interesting data science problem is because the methodology proposed here involves creating and processing very big data to find patterns and insights that would help solve our problem: factoring a product of two large prime numbers. It is actually a very good application for data scientist candidates, to learn how to massage truly big data and find patterns. Note that in many similar mathematical problems, there is often an intensive data processing step, as well as theoretical investigations, to come up with a solution. In this article, we keep mathematics to a very elementary level, so that all of you can understand and participate. This has been added as an open project for candidates participating in our data science apprenticeship, called cracking the maths that make all financial transactions secure. You'll have to wear your pirate hat this time, to work on this project, as this is a kind of reverse-engineering adventure. We initially thought to include this in our challenge of the week series, but the scope goes far beyond that of a weekly challenge.
Proposed methodology to factor an integer
We assume that z = x*y is the integer that we want to factorize, with x and y being the two factors that we try to identify, typically two large prime numbers, which makes it a very hard problem from a computational complexity point of view.
We will use this generic iterative algorithm to factorize z:
Algorithm:
1. Start with x(0) = a, y(0) = b, where a and b are two parameters.
2. Then proceed iteratively as follows:
x(n+1) = f(x(n), y(n), z, c);
y(n+1) = g(x(n), y(n), z, d);
Where c, d are two parameters, and f, g are integer-valued functions satisfying
f(x, y, z, c) = x if x*y = z
g(x, y, z, d) = y if x*y = z
Two of the most basic functions satisfying the previous conditions are
f(x, y, z, c) = modulo{(z - x*y) + x, c}
g(x, y, z, d) = modulo{(x*y - z) + y, d}
where c and d are large enough integers, larger than the expected factors x and y (solution of z = x*y). We will use these two functions here.
The integer sequence (x(n), y(n)) with n = 0, 1, ... either converges to the solution (x, y), or becomes periodic for n large enough. This is a straightforward application of the fixed point theorem.
Indeed, this sequence converges very rarely, and most of the time, creates what seems like a chaotic series of numbers - so chaotic that such equations have been used to produce (poor) simulated random numbers - the opposite of what we are trying to do here: finding a strong pattern and fast convergence to (x, y). Yet, there are sets of parameters (a, b, c, d) that, given z = x*y, result in very fast convergence to the solution (x, y), solving the factorization problem and identifying the large prime numbers x and y in as few as 2 iterations.
The problem
The problem, is of course, how to identify these parameters (a, b, c, d) that lead to fast convergence, or at least convergence that is fast enough to make it faster than the best factorization algorithms currently known. A solution consists of generating a very large number of (a, b, c, d) combinations for a few different z (product of two primes x, y), check out the few (a, b, c, d) that lead to fast convergence and see if some patterns emerge among these combinations.
We started to look into this, and we run some experiments on small z (we know big z's behave very differently from small z's, just like big data behaves very different from small data), but we did identify a few interesting facts:
Related issues
Just curious, does anyone know about an initiative to factorize products of large primes using Hadoop? I'm sure this must even been tried, and probably long ago. Hadoop would work well for this kind of problems. Note that checking whether a number is prime or not is a much simpler problem, if you use probabilistic algorithms.
Finally, another approach to factoring z into x*y, is to replace x, y, z by three matrices with integer coefficients, X, Y, Z with det(X) = x, det(Y) = y, and det(Z) = z. Here 'det' means determinant, and we use the fact that det(X*Y) = det(X) * det(Y) = x*y, solving Z = X*Y rather than z = x*y. When X, Y, and Z are 2x2 matrices, a particular case is equivalent to working in the complex plane.
Related links
Comment
A better analysis: Let Z_c = Z/cZ be the ring of integers mod c. Then for any z, if u is a unit of Z_c then
u, (z/u mod c) is a fixed point of (x, y)-> (z-x*y) + x mod c. Thus, if c=d (as in my example) there are many fixed points that are not integer factorizations of z. If c != d, an extended analysis of common units in Z_c and Z_d
shows that it is easy to find c,d that have multiple fixed points for the given functions.
One might consider a multiplicative variant such as (x, y) -> (k^(z - x*y))*x mod c and (x,y) -> (l^(z-x*y))*y mod d. Typically p, q (the factors of z) are units of Z_c and Z_d (and if they are not then
gcd(c, z) or gcd(d,z) is likely to produce one of p or q). If k is a non-unit of Z_c then this function only produces non-units and never reaches p or q. If k is a unit, but we choose a non-unit starting value for x, we again get only non-units. In the case when k and x are units we stay in the coset of x relative to the subgroup of Z_c^* generated by k. It seems like we could miss both p and q quite often.
These two functions don't seem to work. Take z = 23 * 47, c = d = 727, a = 6 and b = 59.
Then f = 6 and g = 59 is a fixed point, but not a factorization of z. In fact, a search shows that there are many values of c and d that provide false solutions.
Here's a very interesting answer posted by one of our readers:
The matrix approach described in your last paragraph works as follows (if it ever works): start with initial matrices X(0), Y(0) with integer coefficients, compute the product Z(0) = X(0) * Y(0), then iteratively update the matrices X(n), Y(n) until det(X(n) * Y(n)) = z, making small adjustments at each iteration. Then det(X(n)) = x and det(Y(n)) = y, and the factorization is complete. At all times, the matrices must be integer-valued.
Instead of
use
where u(n) is a dummy sequence. You might even use two dummy sequences u(n) and v(n) rather than one. This will tremendously open up possibilities, and it's a classic in number theory, to solve complex problems.
© 2017 Data Science Central Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
You need to be a member of Data Science Central to add comments!
Join Data Science Central