In this exercise, you'll apply what you learned in the Handling missing values tutorial.

Setup¶

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

In [1]:

from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex1 import *
print("Setup Complete")

/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3347: DtypeWarning: Columns (22,32) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):

Setup Complete

In [2]:

# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

# set seed for reproducibility
np.random.seed(0) 

/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3156: DtypeWarning: Columns (22,32) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [3]:

# TODO: Your code here!
sf_permits.head()

Out[3]:

	Permit Number	Permit Type	Permit Type Definition	Permit Creation Date	Block	Lot	Street Number	Street Number Suffix	Street Name	Street Suffix	...	Existing Construction Type	Existing Construction Type Description	Proposed Construction Type	Proposed Construction Type Description	Site Permit	Supervisor District	Neighborhoods - Analysis Boundaries	Zipcode	Location	Record ID
0	201505065519	4	sign - erect	05/06/2015	0326	023	140	NaN	Ellis	St	...	3.0	constr type 3	NaN	NaN	NaN	3.0	Tenderloin	94102.0	(37.785719256680785, -122.40852313194863)	1380611233945
1	201604195146	4	sign - erect	04/19/2016	0306	007	440	NaN	Geary	St	...	3.0	constr type 3	NaN	NaN	NaN	3.0	Tenderloin	94102.0	(37.78733980600732, -122.41063199757738)	1420164406718
2	201605278609	3	additions alterations or repairs	05/27/2016	0595	203	1647	NaN	Pacific	Av	...	1.0	constr type 1	1.0	constr type 1	NaN	3.0	Russian Hill	94109.0	(37.7946573324287, -122.42232562979227)	1424856504716
3	201611072166	8	otc alterations permit	11/07/2016	0156	011	1230	NaN	Pacific	Av	...	5.0	wood frame (5)	5.0	wood frame (5)	NaN	3.0	Nob Hill	94109.0	(37.79595867909168, -122.41557405519474)	1443574295566
4	201611283529	6	demolitions	11/28/2016	0342	001	950	NaN	Market	St	...	3.0	constr type 3	NaN	NaN	NaN	6.0	Tenderloin	94102.0	(37.78315261897309, -122.40950883997789)	144548169992

5 rows × 43 columns

In [4]:

# Check your answer (Run this code cell to receive credit!)
q1.check()

Correct:

The first five rows of the data does show that several columns have missing values. You can see this in the "Street Number Suffix", "Proposed Construction Type" and "Site Permit" columns, among others.

In [5]:

# Line below will give you a hint
#q1.hint()

In [6]:

# TODO: Your code here!
total_values = np.product(sf_permits.shape)
missing_values = sf_permits.isnull().sum().sum()

percent_missing = (missing_values/total_values) * 100

print(percent_missing)

# Check your answer
q2.check()

26.26002315058403

Correct

In [7]:

# Lines below will give you a hint or solution code
#q2.hint()
#q2.solution()

3) Figure out why the data is missing¶

Look at the columns "Street Number Suffix" and "Zipcode" from the San Francisco Building Permits dataset. Both of these contain missing values.

Which, if either, are missing because they don't exist?
Which, if either, are missing because they weren't recorded?

Once you have an answer, run the code cell below.

In [8]:

# Check your answer (Run this code cell to receive credit!)
q3.check()

Correct:

If a value in the "Street Number Suffix" column is missing, it is likely because it does not exist. If a value in the "Zipcode" column is missing, it was not recorded.

In [9]:

# Line below will give you a hint
q3.hint()

Hint: Do all addresses generally have a street number suffix? Do all addresses generally have a zipcode?

4) Drop missing values: rows¶

If you removed all of the rows of sf_permits with missing values, how many rows are left?

Note: Do not change the value of sf_permits when checking this.

In [10]:

# TODO: Your code here!
sf_permits.dropna()

Out[10]:

	Permit Number	Permit Type	Permit Type Definition	Permit Creation Date	Block	Lot	Street Number	Street Number Suffix	Street Name	Street Suffix	...	Existing Construction Type	Existing Construction Type Description	Proposed Construction Type	Proposed Construction Type Description	Site Permit	Supervisor District	Neighborhoods - Analysis Boundaries	Zipcode	Location	Record ID

0 rows × 43 columns

In [11]:

# Check your answer (Run this code cell to receive credit!)
q4.check()

Correct:

There are no rows remaining in the dataset!

In [12]:

# Line below will give you a hint
#q4.hint()

5) Drop missing values: columns¶

Now try removing all the columns with empty values.

Create a new DataFrame called sf_permits_with_na_dropped that has all of the columns with empty values removed.
How many columns were removed from the original sf_permits DataFrame? Use this number to set the value of the dropped_columns variable below.

In [13]:

# TODO: Your code here
sf_permits_with_na_dropped = sf_permits.dropna(axis=1)

dropped_columns = sf_permits.shape[1] - sf_permits_with_na_dropped.shape[1]

print(dropped_columns)

# Check your answer
q5.check()

Correct

In [14]:

# Lines below will give you a hint or solution code
#q5.hint()
#q5.solution()

6) Fill in missing values automatically¶

Try replacing all the NaN's in the sf_permits data with the one that comes directly after it and then replacing any remaining NaN's with 0. Set the result to a new DataFrame sf_permits_with_na_imputed.

In [15]:

# TODO: Your code here
sf_permits_with_na_imputed = sf_permits.fillna(method='bfill', axis=0).fillna(0)

# Check your answer
q6.check()

Correct

In [16]:

# Lines below will give you a hint or solution code
#q6.hint()
#q6.solution()

More practice¶

If you're looking for more practice handling missing values:

Check out this noteboook on handling missing values using scikit-learn's imputer.
Look back at the "Zipcode" column in the sf_permits dataset, which has some missing values. How would you go about figuring out what the actual zipcode of each address should be? (You might try using another dataset. You can search for datasets about San Fransisco on the Datasets listing.)

Keep going¶

In the next lesson, learn how to apply scaling and normalization to transform your data.

Setup¶

1) Take a first look at the data¶

2) How many missing data points do we have?¶

3) Figure out why the data is missing¶

4) Drop missing values: rows¶

5) Drop missing values: columns¶

6) Fill in missing values automatically¶

More practice¶

Keep going¶