This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.
In this exercise, you'll apply what you learned in the Handling missing values tutorial.
The questions below will give you feedback on your work. Run the following cell to set up the feedback system.
from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex1 import *
print("Setup Complete")
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3347: DtypeWarning: Columns (22,32) have mixed types.Specify dtype option on import or set low_memory=False. if (await self.run_code(code, result, async_=asy)):
Setup Complete
Run the next code cell to load in the libraries and dataset you'll use to complete the exercise.
# modules we'll use
import pandas as pd
import numpy as np
# read in all our data
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")
# set seed for reproducibility
np.random.seed(0)
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3156: DtypeWarning: Columns (22,32) have mixed types.Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
Use the code cell below to print the first five rows of the sf_permits
DataFrame.
# TODO: Your code here!
sf_permits.head()
Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Number Suffix | Street Name | Street Suffix | ... | Existing Construction Type | Existing Construction Type Description | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201505065519 | 4 | sign - erect | 05/06/2015 | 0326 | 023 | 140 | NaN | Ellis | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 3.0 | Tenderloin | 94102.0 | (37.785719256680785, -122.40852313194863) | 1380611233945 |
1 | 201604195146 | 4 | sign - erect | 04/19/2016 | 0306 | 007 | 440 | NaN | Geary | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 3.0 | Tenderloin | 94102.0 | (37.78733980600732, -122.41063199757738) | 1420164406718 |
2 | 201605278609 | 3 | additions alterations or repairs | 05/27/2016 | 0595 | 203 | 1647 | NaN | Pacific | Av | ... | 1.0 | constr type 1 | 1.0 | constr type 1 | NaN | 3.0 | Russian Hill | 94109.0 | (37.7946573324287, -122.42232562979227) | 1424856504716 |
3 | 201611072166 | 8 | otc alterations permit | 11/07/2016 | 0156 | 011 | 1230 | NaN | Pacific | Av | ... | 5.0 | wood frame (5) | 5.0 | wood frame (5) | NaN | 3.0 | Nob Hill | 94109.0 | (37.79595867909168, -122.41557405519474) | 1443574295566 |
4 | 201611283529 | 6 | demolitions | 11/28/2016 | 0342 | 001 | 950 | NaN | Market | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 6.0 | Tenderloin | 94102.0 | (37.78315261897309, -122.40950883997789) | 144548169992 |
5 rows × 43 columns
Does the dataset have any missing values? Once you have an answer, run the code cell below to get credit for your work.
# Check your answer (Run this code cell to receive credit!)
q1.check()
Correct:
The first five rows of the data does show that several columns have missing values. You can see this in the "Street Number Suffix", "Proposed Construction Type" and "Site Permit" columns, among others.
# Line below will give you a hint
#q1.hint()
What percentage of the values in the dataset are missing? Your answer should be a number between 0 and 100. (If 1/4 of the values in the dataset are missing, the answer is 25.)
# TODO: Your code here!
total_values = np.product(sf_permits.shape)
missing_values = sf_permits.isnull().sum().sum()
percent_missing = (missing_values/total_values) * 100
print(percent_missing)
# Check your answer
q2.check()
26.26002315058403
Correct
# Lines below will give you a hint or solution code
#q2.hint()
#q2.solution()
Look at the columns "Street Number Suffix" and "Zipcode" from the San Francisco Building Permits dataset. Both of these contain missing values.
Once you have an answer, run the code cell below.
# Check your answer (Run this code cell to receive credit!)
q3.check()
Correct:
If a value in the "Street Number Suffix" column is missing, it is likely because it does not exist. If a value in the "Zipcode" column is missing, it was not recorded.
# Line below will give you a hint
q3.hint()
Hint: Do all addresses generally have a street number suffix? Do all addresses generally have a zipcode?
If you removed all of the rows of sf_permits
with missing values, how many rows are left?
Note: Do not change the value of sf_permits
when checking this.
# TODO: Your code here!
sf_permits.dropna()
Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Number Suffix | Street Name | Street Suffix | ... | Existing Construction Type | Existing Construction Type Description | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID |
---|
0 rows × 43 columns
Once you have an answer, run the code cell below.
# Check your answer (Run this code cell to receive credit!)
q4.check()
Correct:
There are no rows remaining in the dataset!
# Line below will give you a hint
#q4.hint()
Now try removing all the columns with empty values.
sf_permits_with_na_dropped
that has all of the columns with empty values removed. sf_permits
DataFrame? Use this number to set the value of the dropped_columns
variable below.# TODO: Your code here
sf_permits_with_na_dropped = sf_permits.dropna(axis=1)
dropped_columns = sf_permits.shape[1] - sf_permits_with_na_dropped.shape[1]
print(dropped_columns)
# Check your answer
q5.check()
31
Correct
# Lines below will give you a hint or solution code
#q5.hint()
#q5.solution()
Try replacing all the NaN's in the sf_permits
data with the one that comes directly after it and then replacing any remaining NaN's with 0. Set the result to a new DataFrame sf_permits_with_na_imputed
.
# TODO: Your code here
sf_permits_with_na_imputed = sf_permits.fillna(method='bfill', axis=0).fillna(0)
# Check your answer
q6.check()
Correct
# Lines below will give you a hint or solution code
#q6.hint()
#q6.solution()
If you're looking for more practice handling missing values:
sf_permits
dataset, which has some missing values. How would you go about figuring out what the actual zipcode of each address should be? (You might try using another dataset. You can search for datasets about San Fransisco on the Datasets listing.) In the next lesson, learn how to apply scaling and normalization to transform your data.
Have questions or comments? Visit the Learn Discussion forum to chat with other Learners.