# Fast clustering algorithms for massive datasets

Here we discuss two potential algorithms that can perform clustering extremely fast on big data sets, as well as the graphical representation of such complex clustering structures. By extremely fast, we mean a computational complexity of order O(n), or even faster, such as O(n/log n). This is much faster than good Hierarchical Agglomerative Clustering algorithms, which are typically O(n^2 log n). By big data, we mean several million, possibly a billion, observations.

Potential applications:

• Creating a keyword taxonomy to categorize the entire universe of cleaned (standardized), valuable English keywords. We are talking about 10 million keywords made up of one, two or three tokens, that is, about 300 times the number of keywords found in a good English dictionary. The purpose might be to categorize all bid keywords that could be purchased by eBay and Amazon on Google (for pay-per-click ad campaigns), to better price them. This is the application discussed in this article.
• Clustering millions of documents (e.g. books on Amazon.com), or
• Clustering web pages, or even the entire Internet, which consists of about 100 million top websites and billions of web pages.

We also discuss whether it makes sense to perform such massive clustering, and how Map Reduce can help.

A. How to build a keyword taxonomy?

Here's the answer, from my earlier article What MapReduce can't do. Step 2 is the clustering part.

Step #1: pre-processing

You gather tons of keywords over the Internet with a web crawler (crawling Wikipedia or DMOZ directories), and compute the frequencies for each keyword and for each "keyword pair". A "keyword pair" is two keywords found on the same web page, or close to each other on the same web page. Also, by keyword, I mean stuff like "California insurance", so a keyword usually contains more than one token, but rarely more than three. With all the frequencies, you can create a table (typically containing many millions of keywords, even after keyword cleaning), where each entry is a pair of keywords and 3 numbers, e.g.

A="California insurance", B="home insurance", x=543, y=998, z=11

where

• x is the number of occurrences of keyword A in all the web pages that you crawled
• y is the number of occurrences of keyword B in all the web pages that you crawled
• z is the number of occurrences where A and B form a pair (e.g. they are found on the same page)

This "keyword pair" table can indeed be built very easily and efficiently using MapReduce. Note that the vast majority of keyword pairs A and B never co-occur, in other words, z=0. So by ignoring these null entries, your "keyword pair" table remains manageable, and might contain as few as 100 million entries.
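As a toy illustration of step #1 (a minimal sketch: the pages and keywords below are made up, and a real implementation would run as a MapReduce job over billions of crawled pages, with mappers emitting keywords and pairs per page and reducers summing counts), the table can be built with simple counters:

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each "page" is the list of cleaned keywords found on one crawled page.
pages = [
    ["california insurance", "home insurance", "car insurance"],
    ["california insurance", "home insurance"],
    ["home insurance", "mortgage rates"],
]

keyword_counts = Counter()  # x and y: occurrences of each individual keyword
pair_counts = Counter()     # z: co-occurrences of each unordered keyword pair

for page in pages:
    keyword_counts.update(page)
    # Count each unordered pair of distinct keywords appearing on the same page.
    for a, b in combinations(sorted(set(page)), 2):
        pair_counts[(a, b)] += 1

# One entry of the "keyword pair" table: A, B, x, y, z
a, b = "california insurance", "home insurance"
print(a, b, keyword_counts[a], keyword_counts[b], pair_counts[(a, b)])
# → california insurance home insurance 2 3 2
```

Pairs with z=0 are simply never materialized, which is what keeps the table size manageable.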

Note: This step #1 constitutes the final step of a number of interesting applications. For instance, it is used in search engine technology to identify or recommend keywords related to some other keywords. See an application here. This keyword API will be available (with source code and data) to students participating in our data science apprenticeship and (upon request) to bloggers posting good quality content on our network. Interestingly, a similar app was licensed to search engines (by Ask.com) for $500,000 a year, a few years ago.

Step #2: clustering

To create a taxonomy, you want to group similar keywords into clusters....
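The article stops short of the clustering details, but any clustering of keywords needs a similarity derived from the (x, y, z) counts of step #1. One plausible choice, an assumption on my part rather than the author's stated metric, is the fraction of the rarer keyword's occurrences that are co-occurrences:

```python
def similarity(x: int, y: int, z: int) -> float:
    """Fraction of the rarer keyword's occurrences that co-occur with the other.

    x, y: occurrence counts of keywords A and B; z: their co-occurrence count.
    Returns a value in [0, 1]; 0 if either keyword never occurs.
    """
    if min(x, y) == 0:
        return 0.0
    return z / min(x, y)

# Example entry from the article: A="California insurance", B="home insurance"
print(similarity(543, 998, 11))  # 11 / 543 ≈ 0.0203
```

With such a similarity in hand, keywords can be clustered by connecting pairs whose similarity exceeds a threshold, which is a natural fit for the O(n)-type algorithms the article promises, since only the non-null pairs ever need to be visited.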


### Replies to This Discussion

1. I think strings/phrases should have no fewer than three words: they must answer these questions: "What?", "What is happening with 'What'?" and "What does it look like?" I believe these questions describe all aspects of 'What'. Plus, in English, an article and various function words.

Their number is great but limited.

2. They are to be indexed against a dictionary; that's about 300,000 words, give or take: everything can be reduced to words. Therefore, the number of potential phrases is 300,000 to the third power.

3. I propose to cluster based on paragraphs, where a paragraph is a subdivision of a written composition that consists of one or more sentences, deals with one point or gives the words of one speaker, and can be extracted from text based on textual indicators such as a hard return or a tab.

4. The Internet is going to be the regular database.
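Point 3 above can be sketched in a few lines: splitting text into paragraphs on the textual indicators the commenter mentions (hard returns, tabs). This is a minimal illustration of that suggestion, not a production text segmenter:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split text into paragraphs on blank lines (hard returns) or a newline+tab."""
    parts = re.split(r"\n\s*\n|\n\t", text)
    return [p.strip() for p in parts if p.strip()]

doc = "First paragraph, one point.\n\nSecond paragraph, another point.\n\tIndented third."
print(split_paragraphs(doc))
# → ['First paragraph, one point.', 'Second paragraph, another point.', 'Indented third.']
```

Each resulting paragraph could then be treated as the clustering unit instead of the individual keyword.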