This notebook is an exercise in the Data Visualization course. You can reference the tutorial at this link.
In this exercise, you will use your new knowledge to propose a solution to a real-world scenario. To succeed, you will need to import data into Python, answer questions using the data, and generate scatter plots to understand patterns in the data.
You work for a major candy producer, and your goal is to write a report that your company can use to guide the design of its next product. Soon after starting your research, you stumble across this very interesting dataset containing results from a fun survey to crowdsource favorite candies.
Run the next cell to import and configure the Python libraries that you need to complete the exercise.
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")
Setup Complete
The questions below will give you feedback on your work. Run the following cell to set up our feedback system.
# Set up code checking
import os
if not os.path.exists("../input/candy.csv"):
    os.symlink("../input/data-for-datavis/candy.csv", "../input/candy.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.data_viz_to_coder.ex4 import *
print("Setup Complete")
Setup Complete
Read the candy data file into candy_data. Use the "id" column to label the rows.
# Path of the file to read
candy_filepath = "../input/candy.csv"
# Fill in the line below to read the file into a variable candy_data
candy_data = pd.read_csv(candy_filepath, index_col="id")
# Run the line below with no changes to check that you've loaded the data correctly
step_1.check()
Correct
# Lines below will give you a hint or solution code
# step_1.hint()
# step_1.solution()
Hint: Use pd.read_csv, and follow it with two pieces of text that are enclosed in parentheses and separated by commas. (1) The filepath for the dataset is provided in candy_filepath. (2) Use the "id" column to label the rows.
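As a minimal sketch of what index_col="id" does, the snippet below reads a small excerpt from an in-memory string instead of from candy.csv. The inline CSV text and the names excerpt_csv and excerpt are illustrative assumptions; the values are copied from the first rows printed in this notebook, and only a few columns are included.

```python
import io

import pandas as pd

# Illustrative excerpt: a few columns of the first three rows of the candy
# data, written inline so the example does not depend on ../input/candy.csv.
excerpt_csv = io.StringIO(
    "id,competitorname,chocolate,sugarpercent,winpercent\n"
    "0,100 Grand,Yes,0.732,66.971725\n"
    "1,3 Musketeers,Yes,0.604,67.602936\n"
    "2,Air Heads,No,0.906,52.341465\n"
)

# index_col="id" turns the "id" column into the row labels instead of
# keeping it as an ordinary data column.
excerpt = pd.read_csv(excerpt_csv, index_col="id")

print(excerpt.index.name)     # name of the row index
print(list(excerpt.columns))  # "id" is no longer among the columns
```

With the index set this way, a row can be selected by its id, for example excerpt.loc[1].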
Use a Python command to print the first five rows of the data.
# Print the first five rows of the data
candy_data.head()
id | competitorname | chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | pricepercent | winpercent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100 Grand | Yes | No | Yes | No | No | Yes | No | Yes | No | 0.732 | 0.860 | 66.971725 |
1 | 3 Musketeers | Yes | No | No | No | Yes | No | No | Yes | No | 0.604 | 0.511 | 67.602936 |
2 | Air Heads | No | Yes | No | No | No | No | No | No | No | 0.906 | 0.511 | 52.341465 |
3 | Almond Joy | Yes | No | No | Yes | No | No | No | Yes | No | 0.465 | 0.767 | 50.347546 |
4 | Baby Ruth | Yes | No | Yes | Yes | Yes | No | No | Yes | No | 0.604 | 0.767 | 56.914547 |
The dataset contains 83 rows, where each corresponds to a different candy bar. There are 13 columns:
'competitorname' contains the name of the candy bar.
The next nine columns (from 'chocolate' to 'pluribus') describe the candy. For instance, rows with chocolate candies have "Yes" in the 'chocolate' column (and candies without chocolate have "No" in the same column).
'sugarpercent' provides some indication of the amount of sugar, where higher values signify higher sugar content.
'pricepercent' shows the price per unit, relative to the other candies in the dataset.
'winpercent' is calculated from the survey results; higher values indicate that the candy was more popular with survey respondents.
Use the first five rows of the data to answer the questions below.
# Fill in the line below: Which candy was more popular with survey respondents:
# '3 Musketeers' or 'Almond Joy'? (Please enclose your answer in single quotes.)
more_popular = '3 Musketeers'
# Fill in the line below: Which candy has higher sugar content: 'Air Heads'
# or 'Baby Ruth'? (Please enclose your answer in single quotes.)
more_sugar = 'Air Heads'
# Check your answers
step_2.check()
Correct
# Lines below will give you a hint or solution code
#step_2.hint()
#step_2.solution()
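The two answers above can also be derived from the data rather than read off the table by eye. The sketch below assumes a hypothetical helper, higher_of, and a hand-copied frame, first_five, holding only the three columns needed from the first five rows; neither is part of the exercise.

```python
import pandas as pd

# First five rows as printed by candy_data.head(), limited to the columns
# needed here. Copied by hand for illustration.
first_five = pd.DataFrame(
    {
        "competitorname": ["100 Grand", "3 Musketeers", "Air Heads", "Almond Joy", "Baby Ruth"],
        "sugarpercent": [0.732, 0.604, 0.906, 0.465, 0.604],
        "winpercent": [66.971725, 67.602936, 52.341465, 50.347546, 56.914547],
    }
)


def higher_of(data, column, name_a, name_b):
    """Return whichever of the two candy names has the larger value in `column`."""
    values = data.set_index("competitorname")[column]
    return name_a if values[name_a] > values[name_b] else name_b


more_popular = higher_of(first_five, "winpercent", "3 Musketeers", "Almond Joy")
more_sugar = higher_of(first_five, "sugarpercent", "Air Heads", "Baby Ruth")
print(more_popular, "|", more_sugar)
```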
Do people tend to prefer candies with higher sugar content?
Create a scatter plot that shows the relationship between 'sugarpercent' (on the horizontal x-axis) and 'winpercent' (on the vertical y-axis). Don't add a regression line just yet -- you'll do that in the next step!
# Scatter plot showing the relationship between 'sugarpercent' and 'winpercent'
sns.scatterplot(x=candy_data['sugarpercent'], y=candy_data['winpercent'])
# Check your answer
step_3.a.check()
Correct
# Lines below will give you a hint or solution code
# step_3.a.hint()
# step_3.a.solution_plot()
Does the scatter plot show a strong correlation between the two variables? If so, are candies with more sugar relatively more or less popular with the survey respondents?
#step_3.b.hint()
# Check your answer (Run this code cell to receive credit!)
step_3.b.solution()
Solution: The scatter plot does not show a strong correlation between the two variables. Since there is no clear relationship between the two variables, this tells us that sugar content does not play a strong role in candy popularity.
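One way to put a number on "no strong correlation" is the Pearson correlation coefficient, which ranges from -1 to 1, with values near 0 indicating a weak linear relationship. The sketch below is an illustration, not part of the exercise: pearson_r is a hypothetical helper, and it is run on only the first five rows shown earlier, so the printed value says little about the full dataset. In the notebook itself the same call could be applied to candy_data.

```python
import pandas as pd

# Only the first five rows from candy_data.head(), copied by hand.
# Five points are far too few to draw conclusions; this is a usage sketch.
first_five = pd.DataFrame(
    {
        "sugarpercent": [0.732, 0.604, 0.906, 0.465, 0.604],
        "winpercent": [66.971725, 67.602936, 52.341465, 50.347546, 56.914547],
    }
)


def pearson_r(data, x, y):
    """Pearson correlation between two numeric columns of a DataFrame."""
    return float(data[x].corr(data[y]))  # Series.corr defaults to Pearson


r = pearson_r(first_five, "sugarpercent", "winpercent")
print(f"Pearson r on the five-row excerpt: {r:.3f}")
```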
# Scatter plot w/ regression line showing the relationship between 'sugarpercent' and 'winpercent'
sns.regplot(x=candy_data['sugarpercent'], y=candy_data['winpercent'])
# Check your answer
step_4.a.check()
Correct
# Lines below will give you a hint or solution code
#step_4.a.hint()
#step_4.a.solution_plot()
According to the plot above, is there a slight correlation between 'winpercent' and 'sugarpercent'? What does this tell you about the candy that people tend to prefer?
#step_4.b.hint()
# Check your answer (Run this code cell to receive credit!)
step_4.b.solution()
Solution: Since the regression line has a slightly positive slope, this tells us that there is a slightly positive correlation between 'winpercent' and 'sugarpercent'. Thus, people have a slight preference for candies containing relatively more sugar.
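The slope that the regression line displays can also be computed directly. The sketch below fits an ordinary least-squares line with numpy.polyfit, which is a stand-in chosen for illustration rather than a description of how seaborn draws its line internally. The helper name fit_line is hypothetical, and the data is only the five-row excerpt shown earlier, so the resulting numbers are not the ones for the full dataset.

```python
import numpy as np
import pandas as pd

# Five-row excerpt copied from candy_data.head(); illustrative only.
first_five = pd.DataFrame(
    {
        "sugarpercent": [0.732, 0.604, 0.906, 0.465, 0.604],
        "winpercent": [66.971725, 67.602936, 52.341465, 50.347546, 56.914547],
    }
)


def fit_line(data, x, y):
    """Least-squares fit of y = slope * x + intercept; returns (slope, intercept)."""
    slope, intercept = np.polyfit(data[x], data[y], deg=1)
    return float(slope), float(intercept)


slope, intercept = fit_line(first_five, "sugarpercent", "winpercent")
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A positive slope means the fitted 'winpercent' rises as 'sugarpercent' rises; how far to trust it depends on how many points support the fit.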
In the code cell below, create a scatter plot to show the relationship between 'pricepercent' (on the horizontal x-axis) and 'winpercent' (on the vertical y-axis). Use the 'chocolate' column to color-code the points. Don't add any regression lines just yet -- you'll do that in the next step!
# Scatter plot showing the relationship between 'pricepercent', 'winpercent', and 'chocolate'
sns.scatterplot(x=candy_data['pricepercent'], y=candy_data['winpercent'], hue=candy_data['chocolate'])
# Check your answer
step_5.check()
Correct
# Lines below will give you a hint or solution code
#step_5.hint()
#step_5.solution_plot()
Can you see any interesting patterns in the scatter plot? We'll investigate this plot further by adding regression lines in the next step!
Create the same scatter plot you created in Step 5, but now with two regression lines, corresponding to (1) chocolate candies and (2) candies without chocolate.
# Color-coded scatter plot w/ regression lines
sns.lmplot(x='pricepercent', y='winpercent', hue='chocolate', data=candy_data)
# Check your answer
step_6.a.check()
Correct
# Lines below will give you a hint or solution code
#step_6.a.hint()
#step_6.a.solution_plot()
Using the regression lines, what conclusions can you draw about the effects of chocolate and price on candy popularity?
#step_6.b.hint()
# Check your answer (Run this code cell to receive credit!)
step_6.b.solution()
Solution: We'll begin with the regression line for chocolate candies. Since this line has a slightly positive slope, we can say that more expensive chocolate candies tend to be more popular (than relatively cheaper chocolate candies). Likewise, since the regression line for candies without chocolate has a negative slope, we can say that if candies don't contain chocolate, they tend to be more popular when they are cheaper. One important note, however, is that the dataset is quite small -- so we shouldn't invest too much trust in these patterns! To inspire more confidence in the results, we should add more candies to the dataset.
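Comparing the two regression lines amounts to fitting one line per group and comparing the slopes. The sketch below shows that idea with numpy.polyfit and a pandas groupby. The helper slopes_by_group is hypothetical, and the toy frame holds made-up values that are NOT from the candy dataset; they were constructed so that one group trends upward with price and the other downward.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, NOT taken from the candy dataset: built so that
# the "Yes" group rises with price and the "No" group falls.
toy = pd.DataFrame(
    {
        "chocolate": ["Yes", "Yes", "Yes", "No", "No", "No"],
        "pricepercent": [0.0, 0.5, 1.0, 0.0, 0.5, 1.0],
        "winpercent": [50.0, 55.0, 60.0, 50.0, 45.0, 40.0],
    }
)


def slopes_by_group(data, x, y, group):
    """Fit a separate least-squares line per group and return each slope."""
    return {
        label: float(np.polyfit(rows[x], rows[y], deg=1)[0])
        for label, rows in data.groupby(group)
    }


slopes = slopes_by_group(toy, "pricepercent", "winpercent", "chocolate")
print(slopes)
```

In the notebook, the same helper could be called with candy_data in place of toy to read off the slope of each line drawn by sns.lmplot.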
# Scatter plot showing the relationship between 'chocolate' and 'winpercent'
sns.swarmplot(x=candy_data['chocolate'], y=candy_data['winpercent'])
# Check your answer
step_7.a.check()
Correct
# Lines below will give you a hint or solution code
#step_7.a.hint()
#step_7.a.solution_plot()
You decide to dedicate a section of your report to the fact that chocolate candies tend to be more popular than candies without chocolate. Which plot is more appropriate to tell this story: the plot from Step 6, or the plot from Step 7?
#step_7.b.hint()
# Check your answer (Run this code cell to receive credit!)
step_7.b.solution()
Solution: In this case, the categorical scatter plot from Step 7 is the more appropriate plot. While both plots tell the desired story, the plot from Step 6 conveys far more information that could distract from the main point.
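If the report needs a single number to accompany the categorical plot, a per-group average is one option. The sketch below groups by 'chocolate' and averages 'winpercent'. It uses only the five rows printed earlier (four chocolate candies and one without), so it illustrates the call rather than the result for the full dataset; the name first_five is an assumption of this example.

```python
import pandas as pd

# Five-row excerpt copied from candy_data.head(); illustrative only.
first_five = pd.DataFrame(
    {
        "competitorname": ["100 Grand", "3 Musketeers", "Air Heads", "Almond Joy", "Baby Ruth"],
        "chocolate": ["Yes", "Yes", "No", "Yes", "Yes"],
        "winpercent": [66.971725, 67.602936, 52.341465, 50.347546, 56.914547],
    }
)

# Average popularity per group; the result is a Series indexed by "No"/"Yes".
mean_win = first_five.groupby("chocolate")["winpercent"].mean()
print(mean_win)
```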
Explore histograms and density plots.
Have questions or comments? Visit the Learn Discussion forum to chat with other Learners.