Now you are ready to get a deeper understanding of your data.
Run the following cell to load your data and some utility functions (including code to check your answers).
import pandas as pd
pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
from learntools.core import binder; binder.bind(globals())
from learntools.pandas.summary_functions_and_maps import *
print("Setup complete.")
reviews.head()
Setup complete.
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
What is the median of the points column in the reviews DataFrame?
median_points = reviews.points.median()
# Check your answer
q1.check()
Correct
#q1.hint()
#q1.solution()
Solution:
median_points = reviews.points.median()
What countries are represented in the dataset? (Your answer should not include any duplicates.)
countries = reviews.country.unique()
# Check your answer
q2.check()
Correct
#q2.hint()
#q2.solution()
How often does each country appear in the dataset? Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.
reviews_per_country = reviews.country.value_counts()
# Check your answer
q3.check()
Correct
#q3.hint()
#q3.solution()
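The three summary methods used so far can be tried on a tiny table before running them on the full dataset. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the wine reviews file.

```python
import pandas as pd

# Hypothetical miniature stand-in for the reviews DataFrame (invented values).
toy = pd.DataFrame({
    "country": ["Italy", "US", "US", "Portugal", "US"],
    "points": [87, 90, 85, 88, 92],
})

# Middle value of the sorted scores.
median_points = toy.points.median()

# Distinct values, in order of first appearance (returns a NumPy array).
countries = toy.country.unique()

# Series mapping each country to its number of rows, most frequent first.
reviews_per_country = toy.country.value_counts()

print(median_points)
print(list(countries))
print(reviews_per_country)
```

Note that unique() returns an array of the distinct values, while value_counts() returns a Series indexed by those values, so the second is the one to use when the counts themselves are needed.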
Create a variable centered_price containing a version of the price column with the mean price subtracted.
(Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.)
mean_price = reviews.price.mean()
centered_price = reviews.price.map(lambda r: r - mean_price)
# Check your answer
q4.check()
Correct
q4.hint()
#q4.solution()
Hint: To get the mean of a column in a Pandas DataFrame, use the mean function.
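Centering can also be written without map, because subtracting a scalar from a Series is applied to every element. The sketch below compares the two forms on a small, invented `prices` Series (not real data) that includes one missing value.

```python
import pandas as pd

# Hypothetical prices; None is stored as NaN in a float Series.
prices = pd.Series([15.0, None, 14.0, 13.0, 65.0])

# mean() skips NaN by default: (15 + 14 + 13 + 65) / 4
mean_price = prices.mean()

# Element-by-element version, as in the answer above.
centered_map = prices.map(lambda p: p - mean_price)

# Vectorized version: the scalar is broadcast across the whole Series.
centered_vec = prices - mean_price

print(mean_price)
print(centered_vec)
```

Both forms leave the missing price as NaN. The vectorized form is usually shorter and faster on large columns.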
I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.
bargain_idmax = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idmax, 'title']
# Check your answer
q5.check()
Correct
q5.hint()
q5.solution()
Hint: The idxmax method may be useful here.
Solution:
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
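To see why idxmax pairs with loc, it may help to run the same two steps on a small table. This is a sketch with invented titles, prices, and index labels; one price is missing to show that the resulting NaN ratio is skipped by default.

```python
import pandas as pd

# Hypothetical data with non-default index labels to show that
# idxmax returns a label, not a position.
toy = pd.DataFrame(
    {
        "title": ["Wine A", "Wine B", "Wine C", "Wine D"],
        "points": [87, 86, 90, 88],
        "price": [15.0, 4.0, None, 20.0],
    },
    index=[10, 11, 12, 13],
)

# Element-wise ratio; NaN wherever the price is missing.
ratio = toy.points / toy.price

# Index label of the largest ratio (NaN entries are skipped by default).
best_idx = ratio.idxmax()

# Look up the title by label.
bargain = toy.loc[best_idx, "title"]

print(best_idx, bargain)
```

If several wines tie for the highest ratio, idxmax returns only the first matching label, so the result names one of the best bargains rather than all of them.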
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset.
reviews.description.head()
0    Aromas include tropical fruit, broom, brimston...
1    This is ripe and fruity, a wine that is smooth...
2    Tart and snappy, the flavors of lime flesh and...
3    Pineapple rind, lemon pith and orange blossom ...
4    Much like the regular bottling from 2012, this...
Name: description, dtype: object
# descriptor_counts = reviews.description.map(lambda r: 'tropical' in r or 'fruity' in r).value_counts()
n_tropical = reviews.description.map(lambda d: 'tropical' in d).sum()
n_fruit = reviews.description.map(lambda d: 'fruity' in d).sum()
descriptor_counts = pd.Series([n_tropical, n_fruit], index=['tropical', 'fruity'])
# Check your answer
q6.check()
Correct
q6.hint()
q6.solution()
Hint: Use a map to check each description for the string tropical, then count up the number of times this is True. Repeat this for fruity. Finally, create a Series combining the two values.
Solution:
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
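The same count can be sketched with the pandas string accessor instead of map. The `descriptions` Series below is invented for illustration; `regex=False` asks for a plain substring test, which mirrors the `in` check used in the solution.

```python
import pandas as pd

# Hypothetical descriptions (not taken from the dataset).
descriptions = pd.Series([
    "Aromas include tropical fruit and broom.",
    "This is ripe and fruity, a wine that is smooth.",
    "Tropical notes up front, then a fruity finish.",
    "Tart and snappy, with flavors of lime.",
])

# Boolean Series per word; summing counts the True values.
n_tropical = descriptions.str.contains("tropical", regex=False).sum()
n_fruity = descriptions.str.contains("fruity", regex=False).sum()

counts = pd.Series([n_tropical, n_fruity], index=["tropical", "fruity"])
print(counts)
```

Like the map version, this counts descriptions that contain the word at least once rather than total occurrences, and the match is case-sensitive, so the capitalized "Tropical" in the third description is not counted.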
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.
Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.
Create a Series star_ratings with the number of stars corresponding to each review in the dataset.
reviews[reviews.country == 'Canada']
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
454 | Canada | An aromatic knockout with notes of peach, papa... | Reserve Icewine | 92 | 30.0 | Ontario | Niagara-On-The-Lake | NaN | Sean P. Sullivan | @wawinereport | Pillitteri 2012 Reserve Icewine Vidal (Niagara... | Vidal | Pillitteri |
2616 | Canada | A slightly earthy, spicy nose leads, followed ... | Fusion | 83 | 12.0 | Ontario | Niagara Peninsula | NaN | Susan Kostrzewa | @suskostrzewa | Pillitteri 2004 Fusion Gewürztraminer-Riesling... | Gewürztraminer-Riesling | Pillitteri |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129528 | Canada | A delicious though somewhat reserved wine with... | Icewine | 89 | 45.0 | Ontario | Niagara Peninsula | NaN | Sean P. Sullivan | @wawinereport | Henry of Pelham 2011 Icewine Vidal (Niagara Pe... | Vidal | Henry of Pelham |
129581 | Canada | Smooth and engaging, this offers classic varie... | NaN | 90 | 38.0 | British Columbia | Okanagan Valley | NaN | Paul Gregutt | @paulgwine | Burrowing Owl 2013 Syrah (Okanagan Valley) | Syrah | Burrowing Owl |
257 rows × 13 columns
# This code was accepted as correct even though it is wrong: it ignores the Canada rule 🤷
# def calculate_stars_from_points(points):
#     if points > 94:
#         return 3
#     if points > 84:
#         return 2
#     return 1
# star_ratings = reviews.points.map(calculate_stars_from_points)
def calculate_stars(row):
    if row.country == 'Canada':
        return 3
    if row.points > 94:
        return 3
    if row.points > 84:
        return 2
    return 1
star_ratings = reviews.apply(calculate_stars, axis='columns')
# Check your answer
q7.check()
Correct
#q7.hint()
q7.solution()
Solution:
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1
star_ratings = reviews.apply(stars, axis='columns')
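Row-wise apply calls a Python function once per row, which can be slow on a table of this size. As an alternative sketch (not the exercise's official solution), the same rules can be expressed with boolean masks and NumPy's select, which picks the first matching condition for each row. The `toy` DataFrame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews covering each branch of the rating rules.
toy = pd.DataFrame({
    "country": ["Canada", "US", "Italy", "France", "Canada"],
    "points": [83, 95, 85, 84, 97],
})

# Conditions are checked in order; the first True one wins for each row.
conditions = [
    ((toy.country == "Canada") | (toy.points >= 95)).to_numpy(),  # 3 stars
    (toy.points >= 85).to_numpy(),                                # 2 stars
]
star_ratings = pd.Series(
    np.select(conditions, [3, 2], default=1),  # everything else: 1 star
    index=toy.index,
)

# Row-wise version, for comparison with the solution above.
def stars(row):
    if row.country == "Canada" or row.points >= 95:
        return 3
    if row.points >= 85:
        return 2
    return 1

row_wise = toy.apply(stars, axis="columns")
print(star_ratings.tolist())
```

The Canadian wine with 83 points still receives 3 stars because the country check is part of the first condition, matching the order of the if statements in the row-wise function.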
Continue to grouping and sorting.