Massive Internet Attack Floods the World with Fake Data

Reddit is now at the center of an attack that has impacted millions of top domains (most of the Internet) since November 30. While Reddit appears at first glance to be the perpetrator, it is actually the victim. This behind-the-scenes scheme, run from Russia, generates huge amounts of fake traffic: as much as 10% of all Internet traffic.

It is not caught by Google Analytics, and thus results in phony web traffic statistics and flawed reports, which is the main issue people are complaining about. It has not been mentioned in any media, as far as I know. The attack, even though massive, looks rudimentary. I will explain the details shortly. It is launched either by a hacker playing some old tricks on a new scale (probably in collusion with a few Russian ISPs), or by professional criminals testing some devices, doing a rehearsal, testing how far they can go before being detected, or trying to distract us from a far more nefarious but smaller-scale attack taking place at the same time.

At this point, this ongoing attack is a nightmare mostly for web analysts, webmasters, and some data scientists, though any data scientist worth her salt should be able to precisely identify the fake traffic, and thus correct the phony numbers. Such attacks occurred in the past (from other countries), but this one is the biggest that I have ever seen. Users visiting the websites impacted by the fake traffic won’t notice anything: it is happening behind the scenes. It is not a DoS (denial of service) attack that hits a few domains with highly concentrated traffic to knock them down; instead, smaller traffic volumes (per targeted domain) hit millions of websites. If it were 10 times bigger, though, I would imagine that many websites would go offline. The perpetrator is clever enough to keep his scheme alive (avoiding being blocked) by not hitting too hard. Or maybe he has reached his limit in terms of available bandwidth.

How is Reddit involved?

The fake (non-human) clicks come with a fake referrer. Initially, on November 30, it started with lifehacĸer.com, as if the traffic was coming from that domain, but in fact the traffic was manufactured by a robot, not by real humans. In the last section, we show source code that can generate such fake traffic, faking both the browser and the referrer field, so that when the victim checks his web traffic statistics, the top referral is now a fake. Typically, hackers who plant fake referrer domains use their own domain, as a way to generate free traffic: if tens of millions of fake referrers are planted across millions of sites, you would expect many web analysts and webmasters to check out the referral domain that suddenly seems to be generating such a big proportion of their traffic. At least this is the way this scheme has been used in the past.

Note that in the case of lifehacĸer.com (the domain used by the fraudsters on November 30) the letter k is not actually k. It is a look-alike character, ĸ: Unicode U+0138, the Latin letter kra rather than a Cyrillic letter, consistent with the %C4%B8 encoding in the Reddit URL quoted in this article. Compare the two versions: lifehacĸer.com (with the look-alike character) and lifehacker.com (with a k). So the fraudster tried to leverage this confusion.
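A quick way to check a suspicious referrer domain for look-alike characters is to list every non-ASCII character it contains, along with its Unicode name and percent-encoding. The sketch below is only an illustration using Python's standard library; the function name describe_non_ascii is hypothetical, and the spoofed string is written with an explicit escape for the character that appears in this article.

```python
import unicodedata
from urllib.parse import quote


def describe_non_ascii(domain):
    """List (char, code point, Unicode name, UTF-8 percent-encoding)
    for every non-ASCII character found in a domain name."""
    findings = []
    for ch in domain:
        if ord(ch) > 127:
            findings.append(
                (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), quote(ch))
            )
    return findings


# The spoofed domain from the article (look-alike character written as an escape)
spoofed = "lifehac\u0138er.com"
genuine = "lifehacker.com"

print(describe_non_ascii(spoofed))  # one entry, for the look-alike character
print(describe_non_ascii(genuine))  # empty list: plain ASCII only
```

The percent-encoding reported for the look-alike character can be compared directly with the encoded sequence that shows up in planted referrer URLs.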

Starting on the second day, and still today, the domain being used changed from lifehacĸer.com to reddit.com. Indeed, the full URL planted in millions of web logs suddenly became

https://www.reddit.com/r/technology/comments/5foynf/lifehac%C4%B8er…,

as if Reddit suddenly started to spam the whole Internet. Yet the traffic still originated from the same Russian locations, using the same (possibly fake) browser, Safari version 9. Interestingly, the Reddit link in question is the only article (besides this very article) talking about the attack. So the hacker decided to plant fake Reddit referrers in web log files across the world. Doing so could get Reddit blacklisted by Google, as Google's algorithms could conclude that Reddit is using black hat SEO tricks to boost its traffic, something that typically gets a website blocked on Google. If, instead of Reddit, the hacker planted fake referrers using thousands of different domains, he could get many websites blocked on Google. Is that the plan? Probably not.

Why is this traffic not blocked? How to deal with this attack?

It will eventually be blocked, though it tends to adapt to blocking, and usually comes back in a slightly different form. It is not filtered out by Google Analytics, which means that the hacker, via the fake clicks, is able to trigger the JavaScript code found on all web pages that use Google Analytics for tracking and analytic reporting purposes. Google Analytics does very little explicit filtering, if any. It excludes most robots automatically (by design), simply because robots typically do not execute the JavaScript code found on web pages. But this one does, so the hacker must have gone the extra mile to add this feature to his web robot.

Are Alexa.com statistics also impacted by this robot? Alexa did not update its website rankings for several days, which is unusual. It did update the numbers on December 1, but now all the numbers are off. My guess is that this is not related to the lifehacĸer.com attack, but instead to some changes in the way Alexa ranks websites, which happened to coincide with the attack. For instance, Alexa could have added many subdomains to its list of websites, or used a different time frame (3 months rather than the last 30 days) to compute the website ranks, explaining why so many websites now suddenly have a rank that is significantly worse.

It is easy to block the fake traffic at the web server (Apache) level. And as always, the most robust traffic metric for your website is the number of new members, assuming you are able to detect and reject sign-ups from spammers and other undesirable people or robots. In this attack, no (fake) new members are being added. But the number of sessions, pageviews, and even (to a lesser extent) users are impacted.

What are the hacker’s motivations? Why is the attack so rudimentary?

The attack is not carried out by a data scientist, or if it is, it must be by a very dumb one: it is so easy to identify the fake traffic, based on location, browser, and referrer. It is as if the attacker wants you to discover the fake traffic and the extent of the attack, yet he is smart enough to keep it going and avoid being blocked. He is probably not acting alone.
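To illustrate how easy this identification can be, here is a minimal sketch that flags fake hits in a web server log by referrer alone. It rests on several assumptions: the log uses the Apache "combined" format, the marker strings are the spoofed domain and the Reddit thread path quoted in this article, and the sample lines (IP addresses, timestamps, user agents) are made up for the example. The names is_fake_hit and split_traffic are hypothetical. Checks on browser (Safari version 9) or location could be added the same way, the latter requiring a GeoIP database that is not covered here.

```python
import re
from collections import Counter

# Assumed Apache "combined" format:
# ip ident user [time] "request" status size "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Referrer signatures taken from the article (compared in lowercase)
FAKE_REFERRER_MARKERS = (
    "lifehac\u0138er.com",   # spoofed domain with the look-alike character
    "lifehac%c4%b8er",       # same name, percent-encoded
    "/comments/5foynf/",     # Reddit thread planted as referrer
)


def is_fake_hit(line):
    """Return True if the log line's referrer matches a known fake marker."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return False  # unparsable lines are left alone
    referrer = match.group("referrer").lower()
    return any(marker in referrer for marker in FAKE_REFERRER_MARKERS)


def split_traffic(lines):
    """Count fake versus kept hits."""
    counts = Counter()
    for line in lines:
        counts["fake" if is_fake_hit(line) else "kept"] += 1
    return counts


# Made-up sample lines; the Reddit URL is only the prefix quoted in the article
SAMPLE = [
    '203.0.113.5 - - [01/Dec/2016:10:00:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"https://www.reddit.com/r/technology/comments/5foynf/lifehac%C4%B8er" '
    '"Mozilla/5.0 (Macintosh) Version/9.0 Safari/601.1"',
    '198.51.100.7 - - [01/Dec/2016:10:00:05 +0000] "GET /blog HTTP/1.1" 200 2048 '
    '"https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0) Firefox/50.0"',
    '203.0.113.9 - - [30/Nov/2016:09:30:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"http://lifehac\u0138er.com/" "Mozilla/5.0 (Macintosh) Version/9.0 Safari/601.1"',
    '198.51.100.8 - - [01/Dec/2016:10:01:00 +0000] "GET /about HTTP/1.1" 200 1024 '
    '"-" "Mozilla/5.0 (X11; Linux) Chrome/54.0"',
]

counts = split_traffic(SAMPLE)
print(counts["fake"], "fake hits,", counts["kept"], "kept")
```

Subtracting the flagged hits from the reported totals gives a first correction of the session and pageview numbers; matching on the referrer only is a deliberate simplification.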

The hacker must have a database of millions of websites (the victims) with some indication of traffic volume for each website. Indeed, websites with lots of traffic are hit harder in terms of total number of fake clicks, but not so much in terms of proportion of fake traffic. Such lists of websites are easy to come by (I have my own based on years of web scraping) and some of them are even public. Quantcast used to publish such a list for the top one million websites. It is still available, but it is clearly outdated: many of the target websites (victims) that I checked were not on that list, despite their traffic volume.

As for the motivations behind this unusual attack, I don’t know. It could be to prove that the attacker is smarter than Google Analytics (in some ways, he is). Obviously, anyone carrying out an attack should use the dumbest possible technique that works, to avoid revealing advanced tricks to the people trying to catch or block him. If it works even though it is rudimentary, so be it: that is good news for the hacker. That said, there is some level of sophistication in it, but from a software engineering rather than a statistical engineering point of view. For instance, it must be deployed in some distributed environment to successfully generate so many clicks in so little time. But the algorithm that does that is actually a textbook example of how MapReduce works. From a statistical engineering point of view, though, you could not design something dumber. Yet I imagine that the hacker will add a bit of statistical engineering in his next release. Or use a botnet instead.

Interestingly, I’ve found an article entitled “A Russian Trump fan is celebrating by hacking Google Analytics,” though this could just be another piece of fake news.

Source code to plant fake referrers

The source code below is very basic: while it plants fake referrers, it does not trigger the JavaScript code used by Google Analytics to track traffic. It is also one of many different ways to achieve the same result, and clearly the hacker did not use such a script here: otherwise we would likely see tons of (fake, simulated) browsers associated with the attack, not just Safari version 9.

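#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Configure the user agent; the agent string is the (fake) browser reported to the server
my $ua = LWP::UserAgent->new;
$ua->agent("Fake Browser: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;)");
$ua->timeout(2);        # give up after 2 seconds
$ua->env_proxy;         # honor proxy settings from the environment
$ua->max_size(64000);   # read at most 64,000 bytes of the response

# Create an HTTP request
my $req = HTTP::Request->new(GET => 'http://www.TheVictim.com');

# Set the Accept header and the fake referrer
$req->header(Accept => "text/html, */*;q=0.1", Referer => 'http://www.FakeReferrer.com/');

# Pass request to the user agent and get a response back
my $res = $ua->request($req);

# Check the outcome of the response
if ($res->is_success) {
    print $res->content;
} else {
    print $res->status_line, "\n";
}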

Reddit thread discussing the attack
