Data Science Tells This Story About the Global Wine Markets 🍷

5/5 - (2 votes)

πŸ“– Background

Many people like to relax or party with a glass of wine. That makes wine an important industry in many countries. Understanding this market is important to the livelihood of many people.

For fun, consider the following fictional scenario:

🍷 Story: You work at a multinational consumer goods organization that is considering entering the wine production industry. Managers at your company would like to understand the market better before making a decision.

πŸ’Ύ The Data

This dataset is a subset of the University of Adelaide’s Annual Database of Global Wine Markets.

The dataset consists of a single CSV file, data/wine.csv.

Each row in the dataset represents the wine market in one country. There are 34 metrics for the wine industry covering both the production and consumption sides of the market.

import pandas as pd

wine = pd.read_csv("wine.csv")
print(wine)

πŸ’‘ Info: The pandas.read_csv() is a function in the Pandas library that reads data from a CSV file and creates a DataFrame object. It has various parameters for customization and can handle missing data, date parsing, and different data formats. It’s a useful tool for importing and manipulating CSV data in Python.

πŸ’ͺ Challenge

Explore the dataset to understand the global wine market.

The given analysis should satisfy four criteria: Technical approach (20%), Visualizations (20%), Storytelling (30%), and Insights and recommendations (30%).

The Technical approach will focus on the soundness of the approach and the quality of the code. Visualizations will assess whether the visualizations are appropriate and capable of providing clear insights. The Storytelling component will evaluate whether the data supports the narrative and if the narrative is detailed yet concise. The Insights and recommendations component will check for clarity, relevance to the domain, and recognition of analysis limitations.


🍷 Wine Market Analysis in Four Steps

Step 1: Data Preparation

Import the necessary libraries, and the dataset. Then, if necessary I clean the data and see what features are available for analysis.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

wine = pd.read_csv("wine.csv")

# Check DataFrame
print(wine.info())

I print some information about the DataFrame to get the column names and non-zero values with df.info() method.

πŸ’‘ Info: The Pandas DataFrame.info() method provides a concise summary of a DataFrame’s content and structure, including data types, column names, memory usage, and the presence of null values. It’s useful for data inspection, optimization, and error-checking.

Check β€žNaN” values:

wine[wine.isna().any(axis=1)]

There is some “NaN” that we have to manage. It is logical that if there is no β€žVine Area” then they cannot produce wine. So where there is 0 area, we change production to 0.

print(wine[['Country', 'Wine produced (ML)']][wine["Vine Area ('000 ha)"]==0])
	      Country    		Wine produced (ML)
6         Denmark           NaN
7         Finland           NaN
10        Ireland           NaN
12        Sweden            NaN
42        Hong Kong         NaN
46        Malaysia          NaN
47        Philippines       NaN
48        Singapore         NaN
wine.loc[wine["Vine Area ('000 ha)"] == 0, "Wine produced (ML)"] = 0

πŸ’‘ Info: The DataFrame.loc is a powerful Pandas method used for selecting or modifying data based on labels or boolean conditions. It allows for versatile data manipulations, including filtering, sorting, and value assignment.

You can watch an explainer video on it here:

Pandas loc() and iloc() - A Simple Guide

Step 2: Gain Data Overview

To find the biggest importers and exporters and to get a more comprehensive picture of the market, I have created some queries.

DataFrame.nlargest(n, columns) is the easiest way to perform the search, where “n” is the number of hits and “columns” is the name of the column being searched. nlargest() returns the values in sorted order.

best_10_importers_by_value = wine.nlargest(10, 'Value of wine imports (US$ mill)')
print(best_10_importers_by_value)
best_10_importers_by_liter = wine.nlargest(10, 'Wine import vol. (ML)')
print(best_10_importers_by_liter)
best_10_exporters_by_value = wine.nlargest(10, 'Value of wine exports (US$ mill)')
print(best_10_exporters_by_value)
best_10_exporters_by_liter = wine.nlargest(10, 'Wine export vol. (ML)')
print(best_10_exporters_by_liter)

Step 3: Create Diagrams

It is time to create diagrams.

Let’s look at imports/exports by country. I have put the import/export columns on the y-axis of a barplot for easy comparison. A barplot displays the relationship between a numeric (export/import) and a categorical (Countries) variable.

πŸ’‘ Info: The pandas.DataFrame.plot() is a method in the Pandas library that generates various visualizations from DataFrame objects. It’s easy to use and allows for customization of plot appearance and behavior. plot() is a useful tool for data exploration, communication, and hypothesis testing.

I used the pandas built-in plot function to create the chart. The plot function here takes the x and y values, the kind of graph, and the title as arguments.

best_10_importers_by_liter.plot(x =    'Country', y = ['Wine import vol. (ML)', 'Wine export vol. (ML)'], kind = 'bar', title = 'Import / Export by Country')

The first insight that I got, is that it’s a bit confusing that France has the largest exports but still takes the 4th (and third) place in imports… The French seem to like foreign wines.

See what countries do not produce enough wine to cover their own consumption! To do this, I subtracted wine production and exports from their own consumption in a new column.

#create new column to calculate wine demand
wine['wine_demand'] = wine['Wine consumed (ML)'] - (wine['Wine produced (ML)'] - wine['Wine export vol. (ML)'])

top_10_wine_demand = wine.nlargest(10, 'wine_demand')
print(top_10_wine_demand)

Or, visualized:

Is there enough GDP per capita for consumption?

I think that people who live in countries with high GDP per capita can afford more expensive and more wine.

I have created a seaborn relation plot, where the hue represents GDP and the y-axis represents wine demand.

πŸ’‘ Info: Seaborn is a Python data visualization library that offers a high-level interface for creating informative and attractive statistical graphics. It’s built on top of the Matplotlib library and includes several color palettes and themes, making it easy to create complex visualizations with minimal code. Seaborn is often used for data exploration, visualization in scientific research, and communication of data insights.

I set the plot style to 'darkgrid' for better look. Please note that this setting will remain as long as you do not change it, including the following graphs.

Seaborn’s relplot returns a FacetGrid object which has a set_xticklabels function to customize x labels.

sns.set_style('darkgrid')
chart = sns.relplot(data = top_10_wine_demand, x = 'Country', y = 'wine_demand', hue = "GDP per capita ('000 US$)")
chart.set_xticklabels(rotation = 65, horizontalalignment = 'right')

My main conclusion from this is that if you have a winery in Europe, the best place to sell your wine is in the UK and Germany, and otherwise, in the US.

Step 4: Competitor Analysis

And now, let’s look at the competitors:

Where is the cheapest wine from, and what country exports lot of cheap wine?

Since we have no data on this, I did a little feature engineering to find out which countries export wine at the lowest price per litre. Feature engineering

when we create a feature (a new column) to add useful information from existing data to your dataset.

wine['export_per_liter'] = wine['Value of wine exports (US$ mill)'] / wine['Wine export vol. (ML)']

top_10_cheapest = wine.nsmallest(10, 'export_per_liter')
print(top_10_cheapest)

Plot the findings:

top_10_cheapest.plot(x = 'Country', y = ['Value of wine exports (US$ mill)', 'Wine export vol. (ML)'], kind = 'bar', figsize = (8, 6))
plt.legend(loc = 'upper left', title = 'Cheapest wine exporters')

It is clear that Spain is by far the biggest exporter of cheap wine, followed by South Africa, but in much smaller quantities.

Conclusion

If you want to gain insight into large data sets, visualization is king and you don’t need fancy, complicated graphs to see the relationships behind the data clearly.

Understanding the tools is vital — without DataFrames, we wouldn’t have been able to pull off this analysis quickly and efficiently:

πŸ‘‰ Recommended Tutorial: Pandas in 10 Minutes