I just got back from my vacation in Barcelona, Spain where I spent about 3 days, then rented a car and drove up north through the South of France. My last stop was Nice, France. The trip was a lot of fun and now I intend to find some data to help bring back great memories (hm...it sounds more geeky than I thought but anyway).

Barcelona is located in Catalonia region of Spain famous for its earthy dry reds as well as Cava - world's most delicious bubbly drink. I am a big wine fan which is why I thoroughly enjoyed the trip and decided to write about wine today.

If you wanted to analyze any wine-related topic which one would you choose? Would you map popular vineyards around the world on the heatmap? Or maybe you will play with different grape varieties and figure out which ones people like the most?

Every time I enter a wine store I always wonder - why do certain wines get carried by almost every liquor store and others look brand new to me (I've never seen them in other stores)?

So what does determine a given store's wine selection? Do managers already buy popular wines most people drink and are, therefore, trying to "spice up" their menu with bottles other stores will definitely not have? Or maybe they study customers and select bottles based on consumer preferences?

My big question is what makes wine sell?


What I am trying to solve here is a basic economics problem about demand, supply and equilibrium. If a store wants to be successful it will eventually buy only wines customers prefer and will sell in quantity and at the price optimal for the market. However, in order to find that equilibrium I need data for both supply and demand.

Wine.com is one of the biggest wine e-commerce retailers in the US. It allows customers to buy wine according to any price range, grape variety, country of origin, etc. It even has a public API that helps developers gain access to a large wine database and sync it up with their applications. In their wine catalog wine.com captures the following data:

  • Wine Name
  • Year (2001 - 2014)
  • Grape Variety (Red, White, Sparkling, Rosé, Dessert)
  • Region (wine producing region of the country)
  • Country of origin
  • # Reviews (customers gave to a specific bottle)
  • Original Price
  • Discounted Price
  • Regular Price (if no discount was applied)
  • Final Price (price at which wine was being sold)
  • Rating Name (source of customer ratings)
  • Rating Score (overall customer rating score)

In the past I received quite a few emails from the readers asking for datasets I used in my analyses. Now I figured out a way to give my readers access to the original data right in my blog post. So I am happy to announce

Insight Mine New Feature

Get Original Data from xdataset

A few stats about the dataset:

  • It contains 6,613 wine bottles sold on Wine.com produced between 2011 and 2014
  • 42% of them have rating score given by customers
  • 39% of them have at least 1 customer review available
  • Final price ranges between $4.99 and $339

Descriptive Analysis

From the interactive visualization below we can see that only wines from US, Western Europe, Argentina, Chile, Australia, South Africa and New Zealand are sold in the US. No Eastern Europe (how one can forget amazing Kindzmarauli from Georgia?), Asia or Middle East were on the list. I wouldn't be surprised if I saw this heatmap 2 weeks ago, but during one of my trips to spanish vineyards our guide mentioned that China is a rapidly emerging wine market and is expected to be the world's largest producer in the next 5 years.

A cool thing about this visualization is that you can pick any grape variety you like and check out where this wine is being produced the most. For example, I really like Merlot but I've always thought that it is mainly produced in the US, France and Italy. From the map it looks like Argentina and New Zealand also produce this wine which gives me another reason to go to a wine store this evening.

Bottom left scatter plot shows each bottle plotted based on price and rating it received from customers. If you look at all types of wine you may describe relatioship between price and ratings as y = α√x, but if you start segmenting wine by each type you will see diverging patterns. For example, with red wines it is generally true that higher price results in higher rating scores, but with white wines we see that some cheap bottles under $30 receive 90+ rating scores (e.g. Emilio Lustau Sherry from Spain) and some expensive bottles for $55 barely hit 90 in rating score (e.g. Louis Jadot Chardonnay from France). With bubbly drinks things get even more interesting. The pattern looks similar to red wine but is taken to the extreme. Vast majority of sparkling bottles are not pricy and get ratings somewhere in mid 90s, but there are also some serious outliers like Dom Perignon Limited Edition or Louis Roederer Cristal Brut that will cost you $200+ and will be worth the price.

Finally bottom right chart is our top 15 wine list (based on customer ratings) for crazy wine lovers. Gold, Silver and Bronze from different Chateau all go to France and also cost a fortune. But there is one spanish bottle in the 11th position which I may try tonight as it is relatively cheap compared to its peer winners -Faustino I Gran Reserva 2001 from Rioja region of Spain.

Click here to access high quality, interactive version of this chart

Statistical Analysis

Now when I played around with data I can start answering my question about wine market equilibrium. I will make the following assumption:

  • Rating score is a predictor of how well a bottle will sell. So in the absence of wine.com sale data I will use rating score as a proxy. This will be my demand variable.

So my analysis question is: What drives wine rating score and, hence, increases chances of a bottle to be sold?

I will try the following model for my hypothesis testing

Rating Score(i) = β0 + β1√Final_Price(i) + β2 Discount(i) + β3*N_Reviews(i) + εi


  • Rating Score - Rating score received by customers (sale proxy)
  • Final_Price - Price at which a bottle was offered
  • Discount - Whether discount was applied (1 - yes, 0 - no)
  • N_Reviews - # Reviews a bottle received

I used R lm function to perform the analysis and assess each factor's impact on the post's popularity. Here is the output of my regression analysis.

Alt text

R-squared in this model was 0.53 which is pretty good given we only had price of a bottle, reviews and discount as potential predictors of a rating score. We see from the result output that baseline rating score of wine.com bottles is 87.5. Every dollar added to the price increases bottle's rating by √0.72 = 0.85 points and each additional review bumps rating score by 0.04 points (or 25 additional reviews increase rating by 1 point). The latter finding is probably not that actionable since it takes a lot of reviews to move rating just a little bit. Presence of a discount does not affect wine's rating score.

But remember I mentioned earlier that depending on the type of wine rating score can have very different relationship with price? So I decided to run individual regressions on each wine type. Here is what I found

Alt text

For both red and sparkling wine my model was pretty well explained by the data I had. R-squared with reds was 0.58 and with bubbly it was 0.77. But in case of white, rosé and dessert wines R-squared was significantly lower (0.3, 0.3 and 0.43). So while my overall model was sufficiently well explained by the data when I dug into each wine type I got somewhat high variability in the ability of the data to explain my dependent variable.

Coefficients also had different magnitude. While final price was always a significant predictor of a rating score, it affected rosé wines to a lesser extent than dessert wine. # Reviews did not explain wine's rating score in every wine type. For example, it did not affect rosé and dessert wine ratings. It did, however, affect sparkling wine rating score negatively, but given how little this variable changes rating score overall, this finding may just be an artifact of my data.

The most interesting variable was presence of a discount. Remember it did not correlate with rating score in my general model? Once I segmented by wine type, it turned out to be a significant predictor of white, rosé and sparkling wine ratings. Discounted bubbly wine was rated higher while discounted white and rosé were rated lower.


Based on all findings I gained from this analysis, here is my insight into what does and does not make wine sell:

  • Generally price has a significant positive effect on wine's rating, but this relationship is not linear. Wines in the price range of $0 - $100 instantly gain appreciation with price increase (the higher the price, the better the quality, the higher the rating), but once the price point reaches $100 (by the way, I totally eyeballed $100 threshold) increase in price doesn't really do anything to the rating.
  • Sparkling wine ratings are the only ones that benefit from discounts. After all who won't be excited about 30 bucks off of a $250 bottle?
  • Contrary to a mainstream opinon, reviews don't move the needle in wine market. Whether you have 5 or 10 people commenting on your wine, its rating won't go up that much.

Why these findings won't help Wine.com

While my findings may or may not benefit this large online retailer, I'm sure Wine.com is already doing a bunch of analytics on their sales operations. Local businesses are the ones who would benefit from this type of analysis and will be able to tailor their sales strategy to customer needs if they understood them better.

According to Wine Searcher, there are about 1,150 local wine stores in New York. They all have websites and try to drive customers into the store. Most of them don't collect data for analytical purposes and don't use data to make critical decisions. I would love to give them analytics they need but there is one problem. Unlike Wine.com local businesses follow different website structure and, therefore, scraping needs to be customized for each and every website. I can't automate and scale scraping programmatically, but we can do it together! ☺

Here is how you can help me:

  • Fill out this form and let me know what wine store your would like to contribute with.
  • Download Kimono Labs plugin for Chrome and take a look at how they explain how to scrape data from the website -- it's super easy!
  • Download scraped csv file from Kimono Labs.
  • Obviously you are free to scrape howevery much information you want but here are fields that I will need to aggregate information into one file
    • Wine Name
    • Wine Type
    • Wine Region
    • # Reviews (where available)
    • Rating score (where available)
    • Wine Price
  • Sign up for xdataset portal where you can store your data.
  • Upload your dataset into xdataset libraryPlease note that once you select the dataset from your computer, just go ahead and click "Submit".
  • Please make sure you have column headers in your file, that the file is in .csv format and that you have "Wine" in the name of your dataset.

Once you upload your datasets I will injest them on my end and will aggregate them into 1 master dataset which I will then email to all contributors. At this point we all will have access to the dataset no one else has access to!

This is a pilot batch of 25 wine stores so if each of us spends half an hour to scrape one website then in 30 minutes we will have a dataset one person will have to spend 25*0.5 = 12.5 hours on.

It is Time to Bring Your Own Data!

Originally posted here.

Views: 2352

Tags: Analytics, BigData, Data


You need to be a member of Data Science Central to add comments!

Join Data Science Central

© 2021   TechTarget, Inc.   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service