In these exercises we'll apply groupwise analysis to our dataset.
Run the code cell below to load the data before running the exercises.
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
#pd.set_option("display.max_rows", 5)
from learntools.core import binder; binder.bind(globals())
from learntools.pandas.grouping_and_sorting import *
print("Setup complete.")
Setup complete.
reviews.head()
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
Who are the most common wine reviewers in the dataset? Create a Series
whose index is the taster_twitter_handle
category from the dataset, and whose values count how many reviews each person wrote.
# Your code here
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
# reviews_written = reviews.groupby('taster_twitter_handle').size()
# Check your answer
q1.check()
Correct:
reviews_written = reviews.groupby('taster_twitter_handle').size()
or
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
q1.hint()
#q1.solution()
Hint: Use the groupby
operation and size()
(or count()
).
What is the best wine I can buy for a given amount of money? Create a Series
whose index is wine prices and whose values is the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0
dollars is at the top and 3300.0
dollars is at the bottom).
best_rating_per_price = reviews.groupby('price').points.max()
# Check your answer
q2.check()
Correct
#q2.hint()
#q2.solution()
What are the minimum and maximum prices for each variety
of wine? Create a DataFrame
whose index is the variety
category from the dataset and whose values are the min
and max
values thereof.
price_extremes = reviews.groupby('variety').price.agg(['min', 'max'])
# Check your answer
q3.check()
Correct
q3.hint()
q3.solution()
Hint: Use agg()
.
Solution:
price_extremes = reviews.groupby('variety').price.agg([min, max])
What are the most expensive wine varieties? Create a variable sorted_varieties
containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).
sorted_varieties = price_extremes.sort_values(['min', 'max'], ascending=False)
# Check your answer
q4.check()
Correct
#q4.hint()
#q4.solution()
Create a Series
whose index is reviewers and whose values is the average review score given out by that reviewer. Hint: you will need the taster_name
and points
columns.
reviewer_mean_ratings = reviews.groupby('taster_name').points.mean()
# Check your answer
q5.check()
85.855421686747
Correct
#q5.hint()
#q5.solution()
Are there significant differences in the average scores assigned by the various reviewers? Run the cell below to use the describe()
method to see a summary of the range of values.
reviewer_mean_ratings.describe()
count 19.000000 mean 88.233026 std 1.243610 min 85.855422 25% 87.323501 50% 88.536235 75% 88.975256 max 90.562551 Name: points, dtype: float64
What combination of countries and varieties are most common? Create a Series
whose index is a MultiIndex
of {country, variety}
pairs. For example, a pinot noir produced in the US should map to {"US", "Pinot Noir"}
. Sort the values in the Series
in descending order based on wine count.
country_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)
# Check your answer
q6.check()
Correct
q6.hint()
#q6.solution()
Hint: Use groupby()
, and provide a list of columns to group by. You may also find size()
and sort_values()
to be useful!
Move on to the data types and missing data.
Have questions or comments? Visit the Learn Discussion forum to chat with other Learners.