Red Wine data analysis (with R)


Can we use the physicochemical characteristics of a wine to predict its quality?

In this first blog post, I will walk through my exploration of the red wine data set (Cortez et al., 2009) in R and draw some conclusions.

This data set consists of 1599 observations and 12 variables. Below are the 12 variables:

  1. fixed acidity (tartaric acid – g / dm^3)
  2. volatile acidity (acetic acid – g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride – g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate – g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  12. quality (score between 0 and 10)
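As a minimal sketch of how the data could be loaded and checked against the dimensions above: this assumes a local copy of the semicolon-separated `winequality-red.csv` file from the UCI repository. The file name and the helper `load_red_wine` are my assumptions, not something from the original analysis.

```r
# Hypothetical helper: read the red wine CSV into a data frame.
# Assumes a semicolon-separated file with a header row.
load_red_wine <- function(path) {
  # check.names = TRUE converts headers such as "fixed acidity"
  # into syntactic names such as fixed.acidity
  read.csv(path, sep = ";", check.names = TRUE)
}

path <- "winequality-red.csv"  # assumed local file name
if (file.exists(path)) {
  wine <- load_red_wine(path)
  print(dim(wine))  # the post reports 1599 observations and 12 variables
  str(wine)         # type and first values of each variable
}
```

The `file.exists()` guard only keeps the snippet from failing when the file is not in the working directory.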

How is the alcohol percentage distributed in the data?


We can see that most of these Portuguese wines have an alcohol content of around 9.5% by volume.
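A distribution like this can be inspected with a histogram in base R. The sketch below is not the code behind the original plot. It runs on a handful of made-up toy values so that it is self-contained, and the 0.5-point bin width is my own choice. With the real data, `alcohol` would be the alcohol column of the data frame.

```r
# Toy values for illustration only; NOT taken from the real data set.
alcohol <- c(9.4, 9.8, 9.8, 9.5, 10.5, 9.4, 11.2, 9.5, 12.8, 9.2)

# Bin edges in 0.5-point steps covering the whole range.
bins <- seq(floor(min(alcohol)), ceiling(max(alcohol)), by = 0.5)

# Compute the bin counts without drawing, to find the tallest bar.
h <- hist(alcohol, breaks = bins, plot = FALSE)
peak <- h$mids[which.max(h$counts)]  # midpoint of the most populated bin

print(summary(alcohol))
print(peak)

# Draw the histogram itself.
hist(alcohol, breaks = bins,
     main = "Alcohol distribution", xlab = "Alcohol (% by volume)")
```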

Can the pH tell us something?


As the quality increases, the pH tends to decrease, although this trend is not observed between wines of quality 5 and 6. The pH measures how acidic or basic a wine is: a low pH means more acidic and a high pH means more basic. Wine usually has a pH of around 3 to 4. A low pH is also a barrier to microbial growth, so it helps protect the wine from degradation.
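One way to check a trend like this is to compare the median pH for each quality score and draw a boxplot per score. This is only a sketch on made-up toy values, not the original analysis, and using the median as the summary is my assumption. The column names `quality` and `pH` follow the variable list above.

```r
# Toy data for illustration only; NOT the real measurements.
wine <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 8),
  pH      = c(3.50, 3.40, 3.30, 3.32, 3.31, 3.33, 3.25, 3.20)
)

# Median pH for each quality score, one row per score.
ph_by_quality <- aggregate(pH ~ quality, data = wine, FUN = median)
print(ph_by_quality)

# One box per quality score.
boxplot(pH ~ quality, data = wine,
        xlab = "Quality score", ylab = "pH")
```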

Is there a correlation between citric acid and pH?


I categorized the quality into 3 groups: low (3-4), medium (5-6) and high (7-8). The pH decreases as the citric acid increases, and this relationship is stronger in high-quality wines. A high-quality wine is more likely to have a low pH and a high citric acid content. In the medium and high groups, citric acid ranges from 0 to about 0.8 g / dm^3.
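The grouping described above can be sketched with `cut()`, followed by a per-group correlation. The toy values are made up for illustration, the column name `quality_group` is hypothetical, and Pearson correlation is my assumption about how to measure the strength of the relationship.

```r
# Toy data for illustration only; NOT the real measurements.
wine <- data.frame(
  quality     = c(3, 4, 4, 5, 5, 6, 6, 7, 7, 8),
  citric.acid = c(0.00, 0.10, 0.30, 0.05, 0.25, 0.40, 0.10, 0.30, 0.50, 0.70),
  pH          = c(3.60, 3.50, 3.45, 3.50, 3.35, 3.20, 3.40, 3.35, 3.20, 3.00)
)

# Bin the scores: (2,4] -> low, (4,6] -> medium, (6,8] -> high.
wine$quality_group <- cut(wine$quality, breaks = c(2, 4, 6, 8),
                          labels = c("low", "medium", "high"))

# Pearson correlation between citric acid and pH within each group.
cor_by_group <- sapply(split(wine, wine$quality_group),
                       function(g) cor(g$citric.acid, g$pH))

print(table(wine$quality_group))
print(round(cor_by_group, 2))
```

A more negative value for a group means that pH drops more consistently as citric acid rises in that group.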

To conclude

I did not show all the steps and code here because the post would be too long, so I have only included the main results. This analysis may help to check the authenticity and quality of red wine, but the data have some limitations. For example, we would need to know the legal limits for each parameter in Portugal in order to remove wines outside those limits from the analysis. In the next post, I will use machine learning to classify and predict wine quality.