This article is written by George McIntire.
"A lie gets halfway around the world before the truth has a chance to get its pants on." -Winston Churchill
Since the 2016 presidential election, one topic that has dominated political discourse is “fake news”. A number of political pundits claim that the rise of significantly biased and/or untrue news influenced the election, though a study by researchers from Stanford and New York University concluded otherwise. Nonetheless, fake news posts have exploited Facebook users’ feeds to propagate throughout the internet.
What is fake news?
Obviously, a deliberately misleading story is “fake news”, but lately the blathering discourse on social media has been changing its definition. Some now use the term to dismiss facts counter to their preferred viewpoints, the most prominent example being President Trump. Such a vaguely defined term is ripe for cynical manipulation.
The data science community has responded by taking action to fight the problem. There’s a Kaggle-style competition called the "Fake News Challenge", and Facebook is employing AI to filter fake news stories out of users’ feeds. Combating fake news is a classic text classification project with a straightforward proposition: can you build a model that can differentiate between “real” news and “fake” news?
And that’s exactly what I attempted to do for this project. I assembled a dataset of fake and real news and employed a Naive Bayes classifier in order to create a model to classify an article as fake or real based on its words and phrases.
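To make the idea of classifying an article by its words concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with only the Python standard library. It is an illustration of the technique, not the code used in this project: the class name, the tokenizer, and the four toy headlines are all invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior for this label
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Invented toy headlines, purely for illustration (not from the real dataset).
train_texts = [
    "shocking secret the government does not want you to know",
    "you will not believe this shocking miracle cure",
    "senate passes budget bill after lengthy debate",
    "central bank holds interest rates steady",
]
train_labels = ["FAKE", "FAKE", "REAL", "REAL"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("shocking miracle secret revealed")
print(prediction)
```

With so little training data the result only shows the mechanics: each label's score is its log prior plus the smoothed log likelihood of every known word, and the highest score wins.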
There were two parts to the data acquisition process: getting the “fake news” and getting the real news. The first part was quick. Kaggle released a fake news dataset comprising 13,000 articles published during the 2016 election cycle.
The second part was… a lot more difficult. To acquire the real news side of the dataset, I turned to All Sides, a website dedicated to hosting news and opinion articles from across the political spectrum. Articles on the website are categorized by topic (environment, economy, abortion, etc…) and by political leaning (left, center, and right). I used All Sides because it was the best way to web scrape thousands of articles from numerous media outlets of differing biases. Plus, it allowed me to download the full text of an article, something you cannot do with the New York Times and NPR APIs. After a long and arduous process I ended up scraping a total of 5279 articles. The articles in my real news dataset came from media organizations such as the New York Times, WSJ, Bloomberg, NPR, and the Guardian and were published in 2015 or 2016.
I decided to construct my full dataset with equal parts fake and real articles, thus making my model’s null accuracy 50%. I randomly selected 5279 articles from my fake news dataset to use in my complete dataset and left the remaining articles to be used as a testing set when my model was complete.
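The balancing step can be sketched as follows, using only the standard library. This is an assumption-laden illustration rather than the original code: the records are placeholders with made-up IDs, the fake pool is sized at the rounded 13,000 figure quoted above, and the seed and variable names are hypothetical.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Placeholder records standing in for the two sources (no article text).
fake_pool = [{"id": f"fake-{i}", "label": "FAKE"} for i in range(13000)]
real_articles = [{"id": f"real-{i}", "label": "REAL"} for i in range(5279)]

# Draw as many fake articles as there are real ones, without replacement.
chosen = set(random.sample(range(len(fake_pool)), k=len(real_articles)))
fake_sample = [a for i, a in enumerate(fake_pool) if i in chosen]
held_out_fake = [a for i, a in enumerate(fake_pool) if i not in chosen]

# Combine and shuffle so the two classes are interleaved.
dataset = fake_sample + real_articles
random.shuffle(dataset)

# Null accuracy: the score obtained by always predicting the most common label.
label_counts = Counter(a["label"] for a in dataset)
null_accuracy = max(label_counts.values()) / len(dataset)

print(len(dataset), len(held_out_fake), null_accuracy)
```

Because both classes contribute the same number of articles, always guessing one label is right exactly half the time, which is the 50% baseline any useful model has to beat.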
My finalized dataset consisted of 10558 articles, each with its headline, full body text, and label (real vs fake). The data is located here in this GitHub repo.