I just got back from my vacation in Barcelona, Spain where I spent about 3 days, then rented a car and drove up north through the South of France. My last stop was Nice, France. The trip was a lot of fun and now I intend to find some data to help bring back great memories (hm...it sounds more geeky than I thought but anyway).
Barcelona is located in Catalonia region of Spain famous for its earthy dry reds as well as Cava - world's most delicious bubbly drink. I am a big wine fan which is why I thoroughly enjoyed the trip and decided to write about wine today.
If you wanted to analyze any wine-related topic which one would you choose? Would you map popular vineyards around the world on the heatmap? Or maybe you will play with different grape varieties and figure out which ones people like the most?
Every time I enter a wine store I always wonder - why do certain wines get carried by almost every liquor store and others look brand new to me (I've never seen them in other stores)?
So what does determine a given store's wine selection? Do managers already buy popular wines most people drink and are, therefore, trying to "spice up" their menu with bottles other stores will definitely not have? Or maybe they study customers and select bottles based on consumer preferences?
My big question is what makes wine sell?
What I am trying to solve here is a basic economics problem about demand, supply and equilibrium. If a store wants to be successful it will eventually buy only wines customers prefer and will sell in quantity and at the price optimal for the market. However, in order to find that equilibrium I need data for both supply and demand.
Wine.com is one of the biggest wine e-commerce retailers in the US. It allows customers to buy wine according to any price range, grape variety, country of origin, etc. It even has a public API that helps developers gain access to a large wine database and sync it up with their applications. In their wine catalog wine.com captures the following data:
In the past I received quite a few emails from the readers asking for datasets I used in my analyses. Now I figured out a way to give my readers access to the original data right in my blog post. So I am happy to announce
A few stats about the dataset:
From the interactive visualization below we can see that only wines from US, Western Europe, Argentina, Chile, Australia, South Africa and New Zealand are sold in the US. No Eastern Europe (how one can forget amazing Kindzmarauli from Georgia?), Asia or Middle East were on the list. I wouldn't be surprised if I saw this heatmap 2 weeks ago, but during one of my trips to spanish vineyards our guide mentioned that China is a rapidly emerging wine market and is expected to be the world's largest producer in the next 5 years.
A cool thing about this visualization is that you can pick any grape variety you like and check out where this wine is being produced the most. For example, I really like Merlot but I've always thought that it is mainly produced in the US, France and Italy. From the map it looks like Argentina and New Zealand also produce this wine which gives me another reason to go to a wine store this evening.
Bottom left scatter plot shows each bottle plotted based on price and rating it received from customers. If you look at all types of wine you may describe relatioship between price and ratings as y = α√x, but if you start segmenting wine by each type you will see diverging patterns. For example, with red wines it is generally true that higher price results in higher rating scores, but with white wines we see that some cheap bottles under $30 receive 90+ rating scores (e.g. Emilio Lustau Sherry from Spain) and some expensive bottles for $55 barely hit 90 in rating score (e.g. Louis Jadot Chardonnay from France). With bubbly drinks things get even more interesting. The pattern looks similar to red wine but is taken to the extreme. Vast majority of sparkling bottles are not pricy and get ratings somewhere in mid 90s, but there are also some serious outliers like Dom Perignon Limited Edition or Louis Roederer Cristal Brut that will cost you $200+ and will be worth the price.
Finally bottom right chart is our top 15 wine list (based on customer ratings) for crazy wine lovers. Gold, Silver and Bronze from different Chateau all go to France and also cost a fortune. But there is one spanish bottle in the 11th position which I may try tonight as it is relatively cheap compared to its peer winners -Faustino I Gran Reserva 2001 from Rioja region of Spain.
Click here to access high quality, interactive version of this chart
Now when I played around with data I can start answering my question about wine market equilibrium. I will make the following assumption:
So my analysis question is: What drives wine rating score and, hence, increases chances of a bottle to be sold?
I will try the following model for my hypothesis testing
Rating Score(i) = β0 + β1√Final_Price(i) + β2 Discount(i) + β3*N_Reviews(i) + εi
I used R lm function to perform the analysis and assess each factor's impact on the post's popularity. Here is the output of my regression analysis.
R-squared in this model was 0.53 which is pretty good given we only had price of a bottle, reviews and discount as potential predictors of a rating score. We see from the result output that baseline rating score of wine.com bottles is 87.5. Every dollar added to the price increases bottle's rating by √0.72 = 0.85 points and each additional review bumps rating score by 0.04 points (or 25 additional reviews increase rating by 1 point). The latter finding is probably not that actionable since it takes a lot of reviews to move rating just a little bit. Presence of a discount does not affect wine's rating score.
But remember I mentioned earlier that depending on the type of wine rating score can have very different relationship with price? So I decided to run individual regressions on each wine type. Here is what I found
For both red and sparkling wine my model was pretty well explained by the data I had. R-squared with reds was 0.58 and with bubbly it was 0.77. But in case of white, rosé and dessert wines R-squared was significantly lower (0.3, 0.3 and 0.43). So while my overall model was sufficiently well explained by the data when I dug into each wine type I got somewhat high variability in the ability of the data to explain my dependent variable.
Coefficients also had different magnitude. While final price was always a significant predictor of a rating score, it affected rosé wines to a lesser extent than dessert wine. # Reviews did not explain wine's rating score in every wine type. For example, it did not affect rosé and dessert wine ratings. It did, however, affect sparkling wine rating score negatively, but given how little this variable changes rating score overall, this finding may just be an artifact of my data.
The most interesting variable was presence of a discount. Remember it did not correlate with rating score in my general model? Once I segmented by wine type, it turned out to be a significant predictor of white, rosé and sparkling wine ratings. Discounted bubbly wine was rated higher while discounted white and rosé were rated lower.
Based on all findings I gained from this analysis, here is my insight into what does and does not make wine sell:
Why these findings won't help Wine.com
While my findings may or may not benefit this large online retailer, I'm sure Wine.com is already doing a bunch of analytics on their sales operations. Local businesses are the ones who would benefit from this type of analysis and will be able to tailor their sales strategy to customer needs if they understood them better.
According to Wine Searcher, there are about 1,150 local wine stores in New York. They all have websites and try to drive customers into the store. Most of them don't collect data for analytical purposes and don't use data to make critical decisions. I would love to give them analytics they need but there is one problem. Unlike Wine.com local businesses follow different website structure and, therefore, scraping needs to be customized for each and every website. I can't automate and scale scraping programmatically, but we can do it together! ☺
Here is how you can help me:
Once you upload your datasets I will injest them on my end and will aggregate them into 1 master dataset which I will then email to all contributors. At this point we all will have access to the dataset no one else has access to!
This is a pilot batch of 25 wine stores so if each of us spends half an hour to scrape one website then in 30 minutes we will have a dataset one person will have to spend 25*0.5 = 12.5 hours on.
Originally posted here.