Adversarial analytics and business hacking: Amazon case study
Chances are that you have purchased a book, or visited a restaurant, as a result of reading fake reviews. The problem impacts companies such as Amazon and Yelp. On Facebook, massive disinformation campaigns funded by political money hit thousands of profiles; they are managed by public relations companies that create fake profiles and try to become friends with influencers. Here the focus is specifically on Amazon book reviews; the Facebook issue will be discussed later. The Yelp issue is well known and has resulted in a class action lawsuit: allegedly, Yelp's account managers create bad reviews for restaurants, and if you pay a monthly advertising fee, your rating suddenly improves dramatically.
(Picture: examples of bogus book reviews on Amazon.)
Amazon is selling books, so it has a conflict of interest when it comes to book (or product) reviews. The purpose of this article is three-fold: (1) to discuss how fake reviews can be detected, (2) to propose an experimental design to test fake reviews on Amazon, and (3) to assess the real business risk associated with reviews.
This is the new project for candidates interested in our data science apprenticeship. The full list of projects can be found here. The project description is as follows:
You will have to assess the proportion of fake book reviews on Amazon, test a fake review generator (possibly using EC2 to deploy the reviews), reverse engineer an Amazon algorithm, and identify how the review scoring engine can be improved. Extra mile: create and test your own review scoring engine. Scrape thousands of sampled Amazon reviews and score them, as well as the users posting these reviews.
Note that we do not study the impact of reviews and stars on purchasing behavior or pricing here; this will be the subject of another article.
1. Fake review detection
Which metrics would you use to detect fake reviews?
These are features that should probably be included in any fake review detection system. HDT (hidden decision trees) is a great data science technology for designing such review scoring engines. What other metrics would you suggest?
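As a rough illustration (not Amazon's actual algorithm, nor a full HDT implementation), a simple weighted-feature scorer over metrics like these might look as follows. All feature names, weights, and thresholds below are assumptions made for the sketch:

```python
# A minimal sketch of a feature-based review scorer. Feature names,
# weights, and thresholds are illustrative assumptions, not Amazon's
# actual metrics.
from dataclasses import dataclass

@dataclass
class Review:
    rating: int                  # stars, 1-5
    verified_purchase: bool      # did the reviewer buy the book on Amazon?
    reviewer_review_count: int   # total reviews posted by this user
    reviewer_reviews_today: int  # posting velocity for this user
    text_length: int             # characters in the review body

def fake_review_score(r: Review, book_avg_rating: float) -> float:
    """Return a suspicion score in [0, 1]; higher = more likely fake."""
    score = 0.0
    # Extreme ratings that deviate strongly from the book's average
    if abs(r.rating - book_avg_rating) >= 2:
        score += 0.3
    # Unverified purchases are easier to fake
    if not r.verified_purchase:
        score += 0.2
    # Brand-new accounts and very high posting velocity are red flags
    if r.reviewer_review_count < 3:
        score += 0.2
    if r.reviewer_reviews_today > 5:
        score += 0.2
    # Very short reviews carry little signal and are cheap to mass-produce
    if r.text_length < 100:
        score += 0.1
    return min(score, 1.0)
```

A production engine would learn these weights from labeled data rather than hard-coding them; the point of the sketch is only to show the kind of features involved.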
2. Experimental design and proof of concept: test fake reviews on Amazon
Here the data science apprentice is asked to try various strategies to post fake reviews for targeted books on Amazon, and to check what works (that is, what goes undetected by Amazon). The purpose is to reverse-engineer Amazon's review scoring algorithm (used to detect bogus reviews), identify weaknesses, and report them to Amazon.
Strategies will involve
You might have to fine-tune the suggested parameters to optimize the performance of your fake review posting process. Success here is measured by the proportion of 4- or 5-star books for which you managed to reduce the rating to 3 stars or below. The deliverable is a paper summarizing the results of your test, how scalable your strategy is (can it be automated?), and recommended fixes to make Amazon reviews more trustworthy (that is, designing a better review scoring system). A review scoring system scores the reviews and automatically "reviews the reviews" to decide which ones should be accepted.
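For the record, the success metric above is straightforward to compute; here is a minimal sketch, where the (before, after) rating pairs are made-up test data:

```python
# Computing the success metric described above: the fraction of targeted
# 4- or 5-star books whose average rating dropped to 3 stars or below.
# The (before, after) rating pairs are hypothetical test data.
def attack_success_rate(ratings: list[tuple[float, float]]) -> float:
    targets = [(b, a) for b, a in ratings if b >= 4.0]
    if not targets:
        return 0.0
    hits = sum(1 for b, a in targets if a <= 3.0)
    return hits / len(targets)

# Example: three targeted books; two dropped to 3 stars or below.
print(attack_success_rate([(4.5, 2.8), (4.0, 3.4), (5.0, 3.0)]))  # 0.666...
```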
3. The real business risk associated with reviews
Amazon authors are vulnerable to the following fraud, which would eventually result in significant business loss for Amazon:
Imagine a start-up company selling good reviews for $500 per book, with a $100 monthly fee. It would work as follows.
How scalable is this? A college student could easily make $500 a day by targeting only a few books each day, collecting the money via PayPal; that is roughly $100k per year (assuming about 200 working days). Because the money is relatively easy to make, a large number of educated, under-employed people could be interested in setting up such a scheme, collectively targeting thousands of authors each day. Or someone might find a way to automate the activity, perhaps using a botnet, and make millions of dollars each year. Many authors would eventually refuse to have their books listed on Amazon and choose to self-publish with platforms such as Lulu. Publishers would also opt out of Amazon. Revenue on Amazon from book sales would drop. Or Amazon could simply eliminate all reviews and not accept new ones.
Interestingly, it appears that Yelp might be making money with a similar scheme, out of fake reviews and blackmailing small businesses listed on its website. I've also seen companies selling fake Twitter followers or Facebook profiles, though they quickly disappear. Even LinkedIn was recently the victim of a massive scheme involving automatically generated fake profiles.
Conclusions
Websites relying on reviews (of books, products, restaurants, etc.) are vulnerable to massive attacks that could destroy their reputation and, eventually, their income.
How could Amazon protect itself from such a risk? By using a better review scoring engine; by relying more on its recommendation engine (users who purchased A also purchased B); by designing a better fraud-resistant user reputation engine and integrating user reputation as a metric in the review scoring engine; by displaying high-scoring reviews at the top, or more frequently; or by dropping user-generated reviews altogether.
Amazon could also categorize users, so that a data science book review by a user categorized as "interested in web design" does not carry the same weight as one by a user categorized as "interested in data science". Alternatively, a new company could emerge and start competing with Amazon by offering a much better user experience. Such a company could make additional revenue by offering authors the possibility of having their books featured at the top when a user searches for books, just as Google does with webmasters who want to promote their websites.
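A minimal sketch of two of these ideas combined, weighting each review both by the reviewer's reputation and by how well the reviewer's interest category matches the book's category; the categories, reputation values, and weights are illustrative assumptions:

```python
# Weight each review by reviewer reputation and by category match.
# All weights and example values are illustrative assumptions.
def weighted_star_rating(reviews: list[dict], book_category: str) -> float:
    """reviews: [{'stars': int, 'reputation': float in [0,1], 'category': str}, ...]"""
    total, weight_sum = 0.0, 0.0
    for r in reviews:
        # Full weight for matching interests, reduced weight otherwise
        match = 1.0 if r["category"] == book_category else 0.3
        w = match * r["reputation"]
        total += w * r["stars"]
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0

reviews = [
    {"stars": 5, "reputation": 0.9, "category": "data science"},
    {"stars": 1, "reputation": 0.2, "category": "web design"},
]
# The low-reputation, off-category 1-star review barely moves the rating:
print(round(weighted_star_rating(reviews, "data science"), 2))  # 4.75
```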
Note: I never write reviews, despite the many requests I receive from authors and publishers. I don't have the time, and I would expect to be paid to provide quality content (high-quality reviews, good or bad). No executive has time to spend writing reviews anyway, so if you write a book aimed at executives, you won't get any reviews from fellow executives. In short, all the reviews it does get will be worthless.
Comments
So, I've been working on a method to capture Amazon review data, which poses quite a technical challenge: reviews have to be scraped, since Amazon removed them from its API a few years back. Some books have hundreds of reviews, and some reviewers have posted thousands of reviews. Looking at some of the reviews from highly ranked reviewers is interesting, particularly their velocity, some of which seems questionable. Is it really possible to review more than one book in a day? This leads me to question what is meant by "fake review". A fake review could be an intentional attempt to bias an overall star rating, a review by somebody who hasn't actually experienced the material, or anything in between. Amazon may not want to discredit the reviews of some of its top reviewers, but a consumer might value something like a star rating of a star rating, indicating the likelihood of a review being "genuine", whatever that means!
So I'm proposing a slightly different project goal from the one indicated. Rather than attempting to beat the system by injecting fake reviews, I would like to look at developing a review recommender algorithm that outputs a review quality index. Scraping the necessary data, particularly data about all or most of the reviews provided by a reviewer, may require some effort to maintain, depending on how often Amazon changes the relevant page layouts. So it is unclear whether this modified project goal would be acceptable, but now that I have a reasonable amount of relevant data (on 2015 sci-fi and fantasy books) I'd like to investigate it anyway.
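As a rough sketch of such a quality index based only on the velocity signal mentioned above (the one-review-per-day threshold is an arbitrary assumption; a real index would combine many more signals):

```python
# A velocity-based review quality index. The one-review-per-day
# threshold is an arbitrary assumption for illustration only.
from datetime import date

def reviewer_velocity(review_dates: list[date]) -> float:
    """Average reviews per day over the reviewer's active span."""
    if len(review_dates) < 2:
        return 0.0
    span_days = (max(review_dates) - min(review_dates)).days or 1
    return len(review_dates) / span_days

def quality_index(review_dates: list[date]) -> float:
    """1.0 = plausible pace; decays toward 0 as velocity grows."""
    v = reviewer_velocity(review_dates)
    return 1.0 / (1.0 + max(0.0, v - 1.0))  # penalize > 1 review/day

dates = [date(2015, 1, 1), date(2015, 1, 2), date(2015, 1, 2)]
print(round(quality_index(dates), 2))  # 0.33: three reviews in two days looks rushed
```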
Initially this project looked like an interesting one to pursue, i.e. improving fake review detection. However:
'Success here is measured by the proportion of 4- or 5-stars books where you managed to reduce the number of stars, to 3 or below.'
Perhaps I have misunderstood, but isn't that an ethical concern?
Thanks for clarifying. If you're going to re-post, you should put the original author's name, not just a link. Unattributed copying looks like padding a site with others' content.
Hi Nancy, it was re-posted by our intern Livan. Original author is me.
I first caught this post on bigdatanews.com, where it was reposted under someone else's name. The link was there but it did not give proper authorship credit. Since there is a lot of reposting between these sites, I'll repost my comment from there.
The simple fix, which would not be popular, is that Amazon could protect itself by restricting reviews to those who had actually bought the product from Amazon directly, or that plus some minimum number of other purchases. That would at least raise the cost of fake reviews, without requiring a lot more math.
This is also called adversarial analytics.