
Black Hat Data Science

Black hat data science consists of techniques designed to fool existing algorithms (Google search, Amazon rankings, and so on) by compromising or tampering with the metrics they rely on (especially ratios), without physically touching or altering the data stored in their databases. It exploits flaws in these algorithms, and relies on reverse engineering to achieve its goal. Black hat data science is thus different from traditional hacking, which directly accesses the data to transform or steal it. Traditional hacking is considered a criminal activity, while black hat data science generally is not (though it may be considered an unfair business practice). If done properly, black hat data science may be very difficult to detect, more so than traditional hacking, which usually leaves intrusion trails.


1. How does it work: examples

We illustrate the concept with a few examples. Some have been in use for over 15 years.  

Twitter account deletion

Some undesirable Twitter accounts use all sorts of tricks to get more visibility and have their tweets show up at the top of your Twitter feed. This typically involves creating a large number of Twitter accounts, using many IP addresses to generate them, and then having all the accounts participating in the scheme retweet each other’s tweets. Think of a scheme used to promote political propaganda. If pushed too hard, all these Twitter accounts, including the original one (the master account), will be detected and blocked by Twitter, so the bad guy (like Cambridge Analytica) is eventually punished.

Black hat data science, in this context, is NOT used to boost your own tweets with a network of fake accounts, but to boost someone else’s tweets with these fake accounts, typically those of a bad guy that everyone would be happy to see blocked, so that the target account is the one that gets detected and blocked. It involves testing and analyzing how Twitter algorithms respond to your activity, to optimize efficiency. Thus, it is data science, and it can be used if contacting Twitter or lawyers does not produce any results.

Diluting fake reviews on Amazon or Yelp

Your restaurant is on Yelp, or your book is for sale on Amazon, and you get flooded with bad reviews, typically from competitors or organizations with some agenda in mind (for instance, an organization that does not like the fact that you serve foie gras in your restaurant, and decided to attack you). Talking to a lawyer is of no help. The solution could consist of using black hat data science instead, creating numerous fake accounts to generate enough good reviews to achieve the proper level of dilution without being caught. Again, it requires testing and analyzing how Amazon or Yelp relevancy algorithms work, in order to successfully defeat them. Since these algorithms are not robust and generate plenty of false positives and false negatives (which you have to study), it might be easier than you think to fool them.

Blacklisting a defaming website on Google

Someone has a website that is defaming you, and for whatever reason, it shows up first on Google search even though it is 5 years old. Contacting lawyers, their ISP, or Google is of no help. Yet the fact that the bad results show up at the top means that the Google indexing algorithm is very weak and badly designed (it can’t accurately identify illegal content, for instance): a fact that you can exploit to your own advantage. In Europe, you have the “Right to be Forgotten” legislation to help with this. But not in the US, and that’s where black hat data science comes in handy.

By using carefully crafted techniques, you can get the website in question automatically banned on Google. The techniques are similar to black hat SEO, except that black hat SEO is typically used by a webmaster to boost her own Google visibility, and usually results in the opposite: getting her own website banned. Here you use black hat SEO against someone else’s website, to get it banned.

One way to do it is to use weblog spamming, using the bad site’s URLs in the scheme rather than your own URLs. Again, it must be done right to work properly, thus I consider this to be data science rather than SEO (search engine optimization). If done poorly and the target website fails to get banned, you are actually promoting it (though this technique could still get it banned by its ISP).

Automatically erasing unwanted ads

This was successfully tested on Google ads long ago. Typically, ad relevancy algorithms compute the ratio of clicks to impressions (click-through rate, or CTR) as one of the main ingredients to assess how much to charge per click, and whether the ad in question should be in position #1 or #5, or not shown at all. By fabricating a large number of impressions and very few clicks for the ad in question, you might be able to stop it from showing up altogether. This technique requires many IP addresses and randomization techniques. You also need to find which very low artificial CTR works best, avoid being detected as a robot, and not overdo it (otherwise the algorithm will notice the attempt to fool it), so it involves some testing. It is data science. You might want to use this technique, for instance, against malicious ads, or if you believe your own ads (as an advertiser) are unjustifiably penalized while questionable competitors (probably using black hat data science of their own) always show up at the top.
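To make the metric itself concrete, here is a minimal sketch of how a click-through rate is computed, and why a ratio like this is sensitive to its denominator. The function name and all the numbers are hypothetical, chosen for illustration only; real ad platforms combine CTR with many other signals that are not public.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return CTR = clicks / impressions (0.0 when there are no impressions)."""
    if impressions <= 0:
        return 0.0
    return clicks / impressions


# Hypothetical numbers, for illustration only.
baseline = click_through_rate(clicks=50, impressions=1_000)
inflated = click_through_rate(clicks=50, impressions=10_000)

print(f"baseline CTR: {baseline:.2%}")                   # 5.00%
print(f"same clicks, 10x impressions: {inflated:.2%}")  # 0.50%
```

With the number of clicks held constant, multiplying the impressions by 10 divides the CTR by 10. This is the general weakness of any ranking signal built on a simple ratio.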

Killing email spam

Usually there is an unsubscribe button that email marketers are legally required to include in all their outbound messages. But the worst offenders, and scammers, will keep flooding your mailbox with unwanted messages, especially if you unsubscribe (to them, it means that your email address is active, so they can resell it to other spammers). One way to deal with this is to go to their website and subscribe to their “newsletter” using many fabricated email addresses (possibly Gmail or Yahoo ones) via different IP addresses, or using blacklisted email addresses, or email addresses used by spam monitoring systems to entrap spammers: addresses such as [email protected] (though the spammers might recognize that one) or, better, more discreet ones. Collecting the right set of email addresses for this purpose is the data science part. It is not easy, because these spammers use email list cleaning services precisely to avoid being attacked in this manner. But you can use the same email list cleaning services to detect email addresses that are not red-flagged, to fool the spammers.

2. How to implement black hat data science

As discussed, to make it work, you need to use many IP addresses. Anonymous proxies, as well as activity coming from the same machine, even if it uses 200 different IP addresses, won’t work. I suggest two options here that are legit (so not something like a botnet).

Small scale option

Get 10 interns, each one using 3 devices, with an average of 2 browsers per device, allowing them to create 6 accounts each. You can then simulate the activity of 60 accounts. This will work for restaurants unjustly penalized, for a cost below $1,000 per restaurant, to get rid of or dilute the hateful comments.

Larger scale option

Since this is all about cleaning the Internet, it might be possible to reach out to a large community of activists, as this business model could resonate well with them. Get 20,000 such activists, pay them the way Bitcoin pays its miners (using blockchain technology), and sign up clients. Reputation management and brand improvement sometimes cost more than $100,000 per case, yet usually provide little value. I believe this technology could provide a much better ROI, and that you could charge selected clients $20,000 to fix their situation (far below what a lawsuit would cost). Obviously, you would have to be very selective about which clients you want to work with, making sure that you only serve the good guys who have been erroneously penalized by a faulty algorithm.

For related articles from the same author, visit www.VincentGranville.com. Follow me on LinkedIn.
