Russian Troll Detection By Their Tweets


This project was personal for me. I experienced the propaganda machine of the Soviet Union and am horrified to see it used on Americans. As a young adult in Soviet Russia, I succumbed to brainwashing and had no idea what was really going on. “Everybody always lies” was the norm. I came to the US in the 90s and was amazed that deception is not normalized here as it is in my homeland. To my surprise, Americans trusted my words, while Soviet people would look at me with suspicion regardless of the situation. Now Americans’ trust has been abused by paid trolls on social networks, and it saddens me. I worry that deception will be normalized here in the States as well. I believe that propaganda is a form of psychological abuse.

When I saw on Kaggle that the Russian Troll Tweets data had become available, I wanted to help prevent people from getting brainwashed like I was. The difficulty with the data is that Russian propaganda uses Goebbels’s method of 60% truth to 40% lies, meaning that most tweets look like regular people’s tweets. For example, some accounts copy and publish a lot of American posts about fitness. Nevertheless, the trolls diligently follow their job instructions, and this can be detected.

I did my work in Python and published it on Kaggle. It consists of 3 parts because of Kaggle resource limits. The links to them are at the end of the article.

Part 1. EDA

My findings:

English-language tweets from Russian trolls occasionally contain stray Russian (Cyrillic) letters. Only around 0.067% of tweets marked as English contain them, yet around 10% of all accounts have them in at least some tweets.
Tweets are sometimes published too close together. One account, for example, spent 67 hours posting at least 40 tweets in each of those hours. A human being cannot sustain such activity, although the vast majority of accounts are not so obvious.
They dutifully include set two- and three-word phrases to push their agenda.
Occasionally they make errors typical of native Russian speakers who translate their message into English.
In the majority of cases, they imitate Americans rather well.
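
The first two findings can be checked with very little code. Below is a minimal sketch, not the code from my notebooks: the function names, the 40-tweets-per-hour threshold as a parameter, and the sample data are all illustrative assumptions. It flags text containing characters from the basic Cyrillic Unicode block and counts the clock hours in which one account posted suspiciously often.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Basic Cyrillic Unicode block (U+0400 to U+04FF).
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def has_cyrillic(text):
    """Return True if the text contains at least one Cyrillic character."""
    return bool(CYRILLIC.search(text))


def busy_hours(timestamps, min_tweets=40):
    """Count clock hours in which at least `min_tweets` tweets were posted.

    `timestamps` is an iterable of datetime objects for a single account.
    """
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return sum(1 for n in per_hour.values() if n >= min_tweets)


# Hypothetical sample: "w\u043erld" hides a Cyrillic "о" inside an English word.
sample_text = "Hello w\u043erld"
start = datetime(2016, 10, 1, 12, 0)
sample_times = [start + timedelta(minutes=m) for m in range(0, 60)]

print(has_cyrillic(sample_text))
print(busy_hours(sample_times))
```

A look-alike Cyrillic letter inside an otherwise English word is invisible to the eye, which is why a character-range check is more reliable than reading the tweets.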

Part 2. Feature Engineering

My findings:

The additional datasets that I used contain less information than the Russian Troll Tweets data. As a result, the combined data are not balanced with respect to troll presence. The additional datasets also have no proper datetime columns, so I cannot use clustering by post time.

The amount of posted text in the additional datasets does not allow a proper comparison of n-gram frequencies.

Nevertheless, we can detect differences in account behavior that result from the trolls' propaganda efforts. Below is a partial list of their tricks.

Trolls actively use URLs, Twitter handles, and hashtags to spread their information as widely as possible.
It appears that trolls have specific guidelines for message length.
In general, they tend to use longer words in their messages than normal people do. This could be because they have to include particular words and expressions.
I got the same emoji frequency for trolls and non-trolls overall. However, the second non-troll set does not contain emojis at all, so I believe that trolls usually do not use many emojis. The absence of emojis tells us that trolls’ messages are not personal, just work.
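
The behaviors in this list translate into simple per-tweet counts. The sketch below is an illustration under my own assumptions rather than the exact features from the notebook: the regular expressions, the feature names, and the rough emoji ranges are all simplifications chosen for brevity.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Rough approximation of emoji code points; real emoji coverage is wider.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
WORD_RE = re.compile(r"[A-Za-z']+")


def tweet_features(text):
    """Return simple count and length features for one tweet."""
    # Remove URLs, handles, and hashtags before measuring word length,
    # so that they do not inflate the average.
    plain = URL_RE.sub(" ", text)
    plain = HANDLE_RE.sub(" ", plain)
    plain = HASHTAG_RE.sub(" ", plain)
    words = WORD_RE.findall(plain)
    return {
        "n_urls": len(URL_RE.findall(text)),
        "n_handles": len(HANDLE_RE.findall(text)),
        "n_hashtags": len(HASHTAG_RE.findall(text)),
        "n_chars": len(text),
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
        "n_emojis": len(EMOJI_RE.findall(text)),
    }


# Hypothetical tweet, for illustration only.
print(tweet_features("Check this out @bob #news https://t.co/abc \U0001F600"))
```

Averaging these values per account, rather than per tweet, makes the differences between groups easier to see.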

Trolls make slightly more errors in English than non-trolls, although not in the way I expected, and these errors are much less significant for detection. Nevertheless, their use of punctuation differs slightly from that of native English speakers and can help with fine distinctions.

I used the Mutual Information score to find dependencies between the features and the output value, and then thinned out the variables by Pearson correlation.
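
A two-step selection of this kind could look like the sketch below. It is a minimal illustration, not my notebook code: the function name, the thresholds `mi_min` and `corr_max`, and the greedy order are assumptions. It ranks features by `mutual_info_classif` from scikit-learn and then keeps a feature only if its absolute Pearson correlation with every already-kept feature stays below the threshold.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif


def select_features(X, y, names, mi_min=0.01, corr_max=0.9, random_state=0):
    """Rank features by mutual information, then drop near-duplicates.

    X is a 2-D array (samples x features), y holds the class labels,
    and names lists the feature names in column order.
    Thresholds are illustrative defaults, not tuned values.
    """
    mi = mutual_info_classif(X, y, random_state=random_state)
    # Absolute Pearson correlation between every pair of feature columns.
    corr = np.abs(np.corrcoef(X, rowvar=False))

    kept = []
    for i in np.argsort(mi)[::-1]:  # strongest features first
        if mi[i] < mi_min:
            continue  # carries almost no information about the label
        if all(corr[i, j] <= corr_max for j in kept):
            kept.append(i)  # not redundant with anything already kept
    return [names[i] for i in kept], dict(zip(names, mi))
```

Going through the features from the highest score down means that, of two strongly correlated features, the one more informative about the label is the one that survives.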

Part 3. Machine Learning, an accuracy on tests is 99.6%

My findings:

My plan was to use Deep Learning for this data set. I looked up the work of others and checked the data myself. I discovered that the approach did not yield good classification results, so I decided to add more features to create weak predictors. The “weak” predictor turned out to be rather strong.
I noted that the most prominent properties are the ones related to propaganda techniques. Apparently, trolls have specific guidelines and they stick to them. This is convenient because we can set up filters to catch the most significant phenomena and then check the whole account's activity.
In addition, the most important prediction features turned out to depend not so much on language as on the methods of the troll accounts. Thus the same approach can be applied to other languages and is not limited to Russian trolls posting English texts.

I wrote a custom scikit-learn Transformer for my features to use with a Pipeline.
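
For readers who have not written one, here is a minimal sketch of how a custom transformer plugs into a Pipeline. It is not the transformer from my notebook: the class name `TweetStats`, the four features, and the choice of `LogisticRegression` as the final step are placeholder assumptions for illustration.

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
HANDLE_RE = re.compile(r"@\w+")


class TweetStats(BaseEstimator, TransformerMixin):
    """Turn raw tweet texts into a small numeric feature matrix."""

    def fit(self, X, y=None):
        # Nothing to learn: the features are plain counts.
        return self

    def transform(self, X):
        rows = [
            [
                len(text),
                len(URL_RE.findall(text)),
                len(HASHTAG_RE.findall(text)),
                len(HANDLE_RE.findall(text)),
            ]
            for text in X
        ]
        return np.array(rows, dtype=float)


def build_pipeline():
    """Chain the feature step with a placeholder classifier."""
    return Pipeline([
        ("stats", TweetStats()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
```

Calling `build_pipeline().fit(texts, labels)` then runs the feature extraction and the classifier as one object, so the same steps are applied to training and test data without extra bookkeeping.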

Here are my Kaggle notebooks. Some of the data cleaning is repeated at the start of each part. Please upvote if you like my work!

Part 1. EDA tinyurl.com/kwfpxd28

Part 2. Feature Engineering http://tinyurl.com/2p97cv4f

Part 3. Machine Learning, an accuracy on test is 99.6% http://tinyurl.com/mrytb676