Here I will discuss a general framework to process web traffic data. The concept of Map-Reduce will be naturally introduced. Let's say you want to design a system to score Internet clicks, to measure the chance for a click to convert, or the chance to be fraudulent or un-billable. The data comes from a publisher or ad network; it could be Google. Conversion data is limited and poor (some conversions are tracked, some are not; some conversions are soft, just a click-out, and conversion rate is above 10%; some conversions are hard, for instance a credit card purchase, and conversion rate is below 1%). Here, for now, we just ignore the conversion data and focus on the low hanging fruits: click data. Other valuable data is impression data (for instance a click not associated with an impression is very suspicious). But impression data is huge, 20 times bigger than click data. We ignore impression data here.
Here, we work with complete click data collected over a 7-day time period. Let's assume that we have 50 million clicks in the data set. Working with a sample is risky, because much of the fraud is spread across a large number of affiliates, and involve clusters (small and large) of affiliates, and tons of IP addresses but few clicks per IP per day (low frequency).
The data set (ideally, a tab-separated text file, as CSV files can cause field misalignment here due to text values containing field separators) contains 60 fields: keyword (user query or advertiser keyword blended together, argh...), referral (actual referral domain or ad exchange domain, blended together, argh...), user agent (UA, a long string; UA is also known as browser, but it can be a bot), affiliate ID, partner ID (a partner has multiple affiliates), IP address, time, city and a bunch of other parameters.
The first step is to extract the relevant fields for this quick analysis (a few days of work). Based on domain expertise, we retained the following fields:
These 5 metrics are the base metrics to create the following summary table. Each (IP, Day, UA ID, Partner ID, Affiliate ID) represents our atomic (most granular) data bucket.
Building a summary table: the Map step
The summary table will be built as a text file (just like in Hadoop), the data key (for joins or groupings) being (IP, Day, UA ID, Partner ID, Affiliate ID). For each atomic bucket (IP, Day, UA ID, Partner ID, Affiliate ID) we also compute:
The list of UA's, for a specific bucket, looks like ~6723|9~45|1~784|2, meaning that in the bucket in question, there are three browsers (with ID 6723, 45 and 784), 12 clicks (9 + 1 + 2), and that (for instance) browser 6723 generated 9 clicks.
In Perl, these computations are easily performed, as you sequentially browse the data. The following updates the click count:
Updating the list of UA's associated with a bucket is a bit less easy, but still almost trivial.
The problem is that at some point, the hash table becomes too big and will slow down your Perl script to a crawl. The solution is to split the big data in smaller data sets (called subsets), and perform this operation separately on each subset. This is called the Map step, in Map-Reduce. You need to decide which fields to use for the mapping. Here, IP address is a good choice because it is very granular (good for load balance), and the most important metric. We can split the IP address field in 20 ranges based on the first byte of the IP address. This will result in 20 subsets. The splitting in 20 subsets is easily done by browsing sequentially the big data set with a Perl script, looking at the IP field, and throwing each observation in the right subset based on the IP address.
Building a summary table: the Reduce step
Now, after producing the 20 summary tables (one for each subset), we need to merge them together. We can't simply use hash table here, because they will grow too large and it won't work - the reason why we used the Map step in the first place.
Here's the work around: ...