The purpose of this article is to show how a practicing Data Scientist can implement a Locality Sensitive Hashing system from start to finish in order to drastically reduce the search time typically required when finding similar items in high dimensional spaces. Locality Sensitive Hashing achieves this efficiency by dramatically reducing the amount of data that must be stored when collecting features for comparison between similar item sets. In other words, Locality Sensitive Hashing compresses a high dimensional feature space while still retaining a random permutation of relevant features which, research has shown, can be used to determine an accurate approximation of the Jaccard similarity between data sets [2,3].
This article attempts to explain the concept of Locality Sensitive Hashing in simple and practical terms.
The concept of Locality Sensitive Hashing has been around for some time, with publications dating back as far as 1999 exploring its use for breaking the curse of dimensionality in nearest neighbor query problems. Since then, various applications of Locality Sensitive Hashing have appeared in academic publications all over the world. Even very successful technology companies like Google have published improved LSH algorithms using a consistent weighted sampling method "where the probability of drawing identical samples for a pair of inputs is equal to their Jaccard similarity". In fact, textbooks produced at some of America's most prestigious universities, like Stanford, now include entire chapters dedicated to finding similar items using Locality Sensitive Hashing.
However, well over ten years after the first publication, Data Scientists are still hard-pressed to find practical implementations of Locality Sensitive Hashing systems which demonstrate why this concept is useful in "big data" applications, or how one would go about creating such a system for finding similar items.
The exponential growth of data over the past twenty years has created many situations where collecting and retaining all of the relevant information for finding similar items is simply not feasible. First, consider indexing all the web pages in existence for the purpose of creating a new web search engine. Making a copy of every single web page's displayed text would not be practical for a number of reasons. The disk space alone required for such a task would be enormous, and searches against such a large amount of data would be far too slow to meet the expectations of today's web users.
Next, consider an additional form of unstructured data which is not quite as obvious: a gene sequence. Imagine breaking apart a gene sequence's text into smaller chunks for the purposes of machine learning and finding other similar sequences. Perhaps you are trying to identify the unknown taxonomic hierarchy of a gene sequence, or you are looking at partitions of an entire human genome to rapidly find segments which may contain high similarity to a known genetic mutation. Even if we created "smaller chunks" of only 100 characters in length, starting at each position within the entire gene sequence's text, an alphabet of only 4 unique characters (A, C, T, and G) could generate 4^100, or roughly 1.6 × 10^60, possible unique chunks. On the surface, this may not sound like a "big data" problem. However, we could all agree that even a reduced version of such a model would not fit on a laptop for use in a remote village in Africa.
Locality Sensitive Hashing can be used to address both of the challenges described above. It is a technique for fitting very large feature spaces into unusually small places. Even smaller feature spaces can benefit from Locality Sensitive Hashing through drastically reduced search times and disk space requirements. Instead of storing and searching against all available raw data, or even random samples of the raw data, we can use LSH techniques to create very compact signatures which replace the features typically stored for such searches. For example, in the case of displayed webpage text, all of the text tokens sampled from a document become a small collection of integers which is stored and used for all subsequent similar webpage comparisons.
Using the signatures produced by Locality Sensitive Hashing dramatically reduces both the storage space and the processing time required for similar item searches.
A form of Locality Sensitive Hashing called Minhashing reduces the size of a feature space by using a family of random hashing functions: each individual piece of raw input data is hashed by every function in the family, and only the minimum value produced by each unique hashing function is retained.
Wait a minute... What did he just say? What does this actually mean?
Let us start with a very practical example. I am using this example for simplicity only; I am not suggesting it is the best way to compare text documents for similarity. I have written one article with text tokenization examples using MapReduce style processing and another using N-grams which provide additional details on this topic. For this example, imagine that we are tokenizing documents into individual words or other token forms by whatever process we choose. If we wanted to compare our documents for similarity using this strategy, we would have to maintain a collection of all words produced during tokenization, organized by document, along with the frequency at which each word occurred (for frequency weighted calculations). This "word" collection would quickly grow as the number of documents increased, and the average length of the documents would also affect its size. In this example, our "feature space" size is dictated by the number of unique words we encounter, the number of unique documents we process, and the average length of each document. Using the Minhashing process, this feature space can be drastically reduced, shrinking both the size of our unique "word" collection and the time required to perform document similarity searches against it.
Here are the basic steps for implementing a Minhashing system:

1. Create a family of n unique random hashing functions.
2. Hash each individual token within a document's feature set using every one of the n hashing functions.
3. Retain only the minimum value produced by each unique hashing function.
4. Store the n minimum values as the document's minhash signature, and use signatures in place of raw features for all subsequent similarity comparisons.
What this means is that, given a collection of 300 unique hashing functions, a 2,500 word document becomes a minimum hash signature containing only 300 integer values which represent a random permutation of the 2,500 words originally contained within the document. Integer values are not only typically smaller than words, but the total number of items representing the document is now an order of magnitude smaller than its original form.
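To make that size reduction concrete, here is a minimal Python sketch of the idea. The names and the (a*x + b) mod p hashing scheme are illustrative choices of mine, not the article's C# implementation:

```python
import random

random.seed(42)
NUM_HASHES = 300
PRIME = 4294967311  # a prime just above 2**32
# Each "hash function" is a random pair (a, b) applied as (a*x + b) mod PRIME.
params = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
          for _ in range(NUM_HASHES)]

def minhash_signature(tokens):
    """Keep one minimum per hash function -- the document's signature."""
    values = {hash(t) % PRIME for t in tokens}
    return [min((a * v + b) % PRIME for v in values) for a, b in params]

doc = [f"word{i}" for i in range(2500)]  # stand-in for a 2,500 word document
sig = minhash_signature(doc)
print(len(sig))  # the signature is only 300 integers, whatever the document's length
```

However the document grows, the signature stays at exactly NUM_HASHES integers.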
The Jaccard similarity between two sets S and T is calculated by dividing the size of the intersection of S and T by the size of their union: J(S, T) = |S ∩ T| / |S ∪ T|.
However, the Jaccard similarity between two minhash signatures can simply be approximated by counting the positions at which the two signatures hold matching values. For additional explanation and reading on this topic, see section 3.3.3, pg. 80-81 of the Stanford textbook mentioned above. To state this in layman's terms: given two minhash signatures each containing 300 minimum hash values produced by 300 unique hashing functions, two signatures with 300 matching values would be 100% similar, while two signatures with only 150 matching values out of 300 would be 50% similar.
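Both calculations fit in a few lines of code. The following Python sketch (function names are my own) shows the exact Jaccard similarity alongside the signature-based approximation:

```python
def jaccard(s, t):
    """Exact Jaccard similarity: |S intersect T| / |S union T|."""
    return len(s & t) / len(s | t)

def signature_similarity(sig_a, sig_b):
    """Approximate Jaccard similarity: the fraction of signature
    positions at which the two minimum hash values match."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5

# 150 matching values out of 300 positions -> approximately 50% similar.
sig_a = list(range(300))
sig_b = list(range(150)) + [-1] * 150
print(signature_similarity(sig_a, sig_b))  # 0.5
```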
Creating a family of n unique hashing functions may sound complicated. However, I was able to create the following function in C# using only a couple of lines of code. Jon Skeet's post explains the logic behind this approach in more detail, although my version has been slightly modified to seed each hash function call with two random numbers. Using two random numbers in this way decreases the overall chance of collision and also makes the chance of generating two identical random hashing functions very rare. In fact, this possibility can be eliminated altogether by using a HashSet to ensure that only unique random number seeds are selected. It is also important to note that any data type can be hashed by making the inputData parameter a C# generic type.
Figure 1 - Function seeded with random numbers to produce any number of unique hashing functions.
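The original C# listing is not reproduced here, but the idea can be sketched in Python as follows: an FNV-1a-style hash whose starting state is perturbed by two seed values, so that each seed pair behaves as a distinct member of the hash-function family. The names below are hypothetical, not the article's code:

```python
FNV_PRIME = 16777619
FNV_OFFSET = 2166136261
MASK = 2**32 - 1  # keep the running hash within 32 bits

def lsh_hash(input_data, seed_one, seed_two):
    """FNV-1a-style hash perturbed by two seed values; each
    (seed_one, seed_two) pair defines a distinct hash function."""
    h = FNV_OFFSET
    h = ((h ^ seed_one) * FNV_PRIME) & MASK
    h = ((h ^ seed_two) * FNV_PRIME) & MASK
    for byte in str(input_data).encode("utf-8"):
        h = ((h ^ byte) * FNV_PRIME) & MASK
    return h

# The same input hashed under two different seed pairs yields different values.
print(lsh_hash("token", 17, 99) != lsh_hash("token", 23, 42))  # True
```

Because str() is applied before hashing, any input type can be fed through this sketch, mirroring the generic-input point made above.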
The hashing method used above is an FNV hash, which combines XOR operations with multiplications by a prime to create the seeded hash values. This hashing method has a very low chance of collision. While it works exceptionally well for demonstration purposes, it may be possible to achieve more accurate LSH search results by choosing a hashing method which has a higher probability of producing collisions between tokens with similar values. In practice, this is sometimes accomplished by creating a thin layer of processing directly before hash execution which intentionally converts similar tokens within a given threshold to the same value prior to hashing. This approach is left as an independent exercise for the reader.
Using the LSHHash() function above, any number of unique hashing functions can be created by simply selecting and saving random number pairs to be used as seed values. For example, to create a family of 300 unique random hashing functions, the code below can be used to select 300 random number seed pairs. The 300 seed pairs are always used in the exact same order to produce minhash signature values during the minhashing process. For high performance applications, random number seed hash codes could also be saved in order to avoid calculating the seed's hash code using the GetHashCode() method each time the function is called. In this scenario, the actual hash code value for the seedOne and seedTwo input parameters would be passed into the function instead of the random seed's original value.
Figure 2 - Function used to generate any number of random seed pairs used to create a family of random hashing functions.
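The seed-pair selection can be sketched in Python along these lines (the article's original is in C#; the names here are my own). A set guards against ever drawing the same pair twice, which eliminates the possibility of two identical hash functions:

```python
import random

def generate_seed_pairs(n, rng=None):
    """Select n unique random (seedOne, seedTwo) pairs. Storing them in a
    set guarantees no hash function in the family is duplicated."""
    rng = rng or random.Random()
    pairs = set()
    while len(pairs) < n:
        pairs.add((rng.randrange(2**32), rng.randrange(2**32)))
    return list(pairs)

# A family of 300 unique random hashing functions, defined by their seeds.
seed_pairs = generate_seed_pairs(300, random.Random(7))
print(len(seed_pairs))  # 300
```

The resulting list would be saved and reused in the same order for every signature, as described above.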
Each pair of hash seeds above is saved in the minhashes array. The length of this array is always equal to the minhash signature's length. For instance, when hashing function number one is needed, the seed pair from position zero within the minhashes array is provided as input to the LSHHash() function described above.
Once a family of random hashing functions has been created, minhash signatures can quickly be generated using any collection of input data. The following function demonstrates a minhash signature being created using a collection of integers. However, since a hash code can be generated using any data type, the input data could just as easily be a collection of strings, objects, or even binary data containing videos or images. This particular characteristic makes minhashing useful for application in a number of different feature reduction scenarios.
The getMinHashSignature() function takes a collection of integers as input; each integer within the collection is then hashed by each unique hashing function to create the minhash signature. Duplicate integer values are skipped, since we can be certain they will not produce a new minimum hash value. Once again, this collection of integers could just as easily be words pulled from a webpage, gene sequence tokens, objects, or any other type of input data. It is also important to note that this is where the feature space reduction takes place.
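As a rough Python analogue of getMinHashSignature() (the seeded hash below is a simple stand-in of my own, not the article's FNV implementation):

```python
def simple_seeded_hash(item, seed_one, seed_two):
    # Stand-in for the article's seeded hash (illustrative only):
    # mixes two seeds into Python's built-in hash of the item.
    return ((hash(item) ^ seed_one) * 2654435761 % 2**32) ^ seed_two

def get_minhash_signature(items, seed_pairs, hash_fn=simple_seeded_hash):
    """Hash every distinct item with every seeded hash function,
    keeping only the minimum value produced by each function."""
    signature = [float("inf")] * len(seed_pairs)
    seen = set()
    for item in items:
        if item in seen:
            continue  # duplicates cannot produce a new minimum
        seen.add(item)
        for i, (s1, s2) in enumerate(seed_pairs):
            value = hash_fn(item, s1, s2)
            if value < signature[i]:
                signature[i] = value
    return signature

# Deterministic demo seeds (real code would draw them randomly and save them).
seed_pairs = [(i * 7919 + 1, i * 104729 + 3) for i in range(300)]
sig = get_minhash_signature(list(range(1_000)) + list(range(500)), seed_pairs)
print(len(sig))  # 300
```

Note that the 1,500-item input (with 500 duplicates) still collapses to a 300-value signature.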
In the example application for this article, we generate 10 collections, each containing up to 100,000 integers. The example application also creates a single integer collection called "the query", which is then compared to each of the 10 integer collections previously mentioned, which we will now call "the documents". The lengths of the individual collections are randomly determined to better mimic real world similar item searches. First, the Jaccard similarity is calculated between "the query" and each of "the documents" as a similarity benchmark. Next, "the query" and "the documents" are all minhashed to create one minhash signature per collection. After minhashing occurs, only 400 minimum hash values remain for each collection of integers, resulting in a dramatic feature space reduction. Finally, the Jaccard similarity is calculated once again for comparison. The results are shown in the figure below.
Figure 4 - Similar Jaccard similarity values can be seen for each document before and after minhashing occurs.
When using minimum hash signatures produced by 400 unique hashing functions, the Locality Sensitive Hashing process retains both the highest and lowest scoring document rankings. Nine of the similarity values differ by less than 3% after minhashing occurs, with a single document's score (#8) showing a 5% difference. Increasing the number of hashing functions used will create even more accurate minhash similarity approximations, if required. The total number of collection items prior to minhashing is 67,925 integers. After minhashing, this count is reduced to only 4,400 items: 400 integer values for each of the 11 total collections which were minhashed. Since the example application generates new collections using random numbers, similar results can consistently be seen when executing the program repeatedly against different random number collections.
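A scaled-down version of this experiment can be run in a few lines of Python. The sizes, seed, and (a*x + b) mod p hashing scheme below are illustrative choices of mine, not the article's exact C# setup, but the before-and-after similarity values should land close together in the same way:

```python
import random

PRIME = 4294967311  # a prime just above 2**32
NUM_HASHES = 400

rng = random.Random(11)
params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
          for _ in range(NUM_HASHES)]

def minhash(values):
    """One minimum per hashing function -- the collection's signature."""
    return [min((a * v + b) % PRIME for v in values) for a, b in params]

def jaccard(s, t):
    return len(s & t) / len(s | t)

def estimate(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

# "The query" and one "document": overlapping random integer collections.
query = set(rng.sample(range(100_000), 5_000))
doc = set(rng.sample(range(100_000), 5_000)) | set(list(query)[:2_000])

true_sim = jaccard(query, doc)
approx = estimate(minhash(query), minhash(doc))
print(round(true_sim, 3), round(approx, 3))
```

The exact similarity uses every stored integer, while the approximation needs only the two 400-value signatures.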
Locality Sensitive Hashing offers the opportunity for dramatic feature space reduction when creating a similar item search system. Both large and small systems can benefit from LSH processing, since reduced feature spaces offer dramatic reductions in overall search times and disk space requirements. The empirical benchmarks presented above show two feature spaces producing comparable Jaccard similarity results, even though the minhashed feature space represents only around 6% of the original feature space's total size.