By web spam, we mean any technique - using botnets or other forms of fake clicks - to manipulate web traffic statistics so that your articles appear at the top of search results pages or other lists of top articles. Web spam techniques exploit weaknesses in traffic-monitoring algorithms. In the simplest form, a rogue author crawls his own articles dozens or hundreds of times a day, hoping to be featured in the list of most popular articles.
Figure 1: articles #2 and #3 were artificially boosted to show up at the top
In Figure 1, data from Google Analytics shows evidence of web traffic manipulation (web spam): the average time spent per page is 60 times lower than normal for pages #2 and #3. Indeed, the numbers are so abnormal, and the page-view counts so large for these two pages, that they dragged the site-wide average time spent per page well below normal for the entire day in question.
Generally speaking, how do we detect these outliers?
Create a data dictionary, and check each core metric every day: page views, unique users, time spent per page, and the breakdowns of top browsers, top countries, top pages, and top referrers. Any time a value differs significantly from a normal day, drill down into other metrics to identify the cause. This can be fully automated and designed as an email alert system. It took two minutes to understand the problem in this particular (rather simple) instance, and we even found the culprit: articles #2 and #3 have the same author, and both were posted recently.
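The daily check described above can be sketched as a simple z-score test: for each metric, compare today's value against a trailing window of normal days and flag large deviations. This is a minimal illustration with made-up numbers and a hypothetical `find_outliers` function, not the author's actual system; a production version would feed the flagged metrics into an email alert.

```python
from statistics import mean, stdev

def find_outliers(history, today, z_threshold=3.0):
    """Return metrics whose value today deviates from the trailing
    history by more than z_threshold standard deviations."""
    alerts = {}
    for metric, values in history.items():
        mu, sigma = mean(values), stdev(values)
        if sigma == 0:
            continue  # no variation in history; cannot compute a z-score
        z = (today[metric] - mu) / sigma
        if abs(z) > z_threshold:
            alerts[metric] = round(z, 1)
    return alerts

# Fourteen "normal" days for two core metrics (illustrative numbers only):
history = {
    "page_views": [10_500, 9_800, 10_200, 10_900, 9_900, 10_400, 10_100,
                   10_600, 9_700, 10_300, 10_000, 10_800, 10_200, 10_500],
    "avg_time_per_page_sec": [95, 102, 98, 91, 105, 99, 97,
                              103, 94, 100, 96, 101, 98, 93],
}
# A spam day: page views explode while time per page collapses.
today = {"page_views": 48_000, "avg_time_per_page_sec": 4}

print(find_outliers(history, today))  # both metrics are flagged
```

Choosing the threshold is a judgment call: three standard deviations is a common starting point, but heavy-tailed traffic metrics may need a more robust baseline (e.g. median and interquartile range) to avoid false alarms.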