This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.


In this exercise, you'll apply what you learned in the Handling missing values tutorial.

Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

1) Take a first look at the data

Run the next code cell to load in the libraries and dataset you'll use to complete the exercise.

Use the code cell below to print the first five rows of the sf_permits DataFrame.
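As a minimal sketch of what that looks like, the example below builds a small stand-in DataFrame and prints its first five rows with `head()`. The rows are made up for illustration and are not taken from the real dataset; in the notebook you would call `head()` on the `sf_permits` DataFrame that the previous cell loads.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sf_permits; the values are invented for illustration.
sf_permits = pd.DataFrame({
    "Permit Number": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "Street Number Suffix": [np.nan, "A", np.nan, np.nan, np.nan, "B"],
    "Zipcode": [94109.0, 94102.0, np.nan, 94110.0, 94103.0, 94107.0],
})

# head() returns the first five rows by default.
first_five = sf_permits.head()
print(first_five)
```

Missing entries show up as `NaN` in the printed output, which is usually the quickest visual clue that a dataset has gaps.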

Does the dataset have any missing values? Once you have an answer, run the code cell below to get credit for your work.

2) How many missing data points do we have?

What percentage of the values in the dataset are missing? Your answer should be a number between 0 and 100. (If 1/4 of the values in the dataset are missing, the answer is 25.)
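One way to compute this, sketched on a toy DataFrame rather than the real data, is to count the missing cells and divide by the total number of cells. The variable names `total_cells`, `total_missing`, and `percent_missing` are illustrative choices, not names the exercise requires.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows x 3 columns = 12 cells, 3 of them missing.
sf_permits = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, np.nan],
})

# Total number of cells (rows times columns).
total_cells = sf_permits.size

# isnull() marks each missing cell; the first sum() counts per column,
# the second sum() adds those column counts together.
total_missing = sf_permits.isnull().sum().sum()

percent_missing = total_missing / total_cells * 100
print(percent_missing)
```

Here 3 of the 12 cells are missing, so the result is 25, matching the 1/4 example in the question.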

3) Figure out why the data is missing

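Look at the columns "Street Number Suffix" and "Zipcode" from the San Francisco Building Permits dataset. Both of these contain missing values. For each column, decide whether the values are more likely missing because they don't exist or because they weren't recorded.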
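Counting the missing entries in each column can help inform that judgment. The sketch below uses invented rows, so the counts it prints say nothing about the real dataset. As a rough heuristic (an assumption, not a rule), a column that is empty for most rows often holds a value that simply doesn't exist for most records, while a column with only a few gaps is more likely to have values that were not recorded.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sf_permits; the values are invented for illustration.
sf_permits = pd.DataFrame({
    "Street Number Suffix": [np.nan, "A", np.nan, np.nan],
    "Zipcode": [94109.0, 94102.0, np.nan, 94110.0],
})

# Count missing values in just the two columns of interest.
missing_counts = sf_permits[["Street Number Suffix", "Zipcode"]].isnull().sum()
print(missing_counts)
```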

Once you have an answer, run the code cell below.

4) Drop missing values: rows

If you removed every row of sf_permits that contains at least one missing value, how many rows would be left?
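A minimal sketch of the check on toy data: `dropna()` returns a new DataFrame and leaves the original untouched unless you reassign it, so it is safe to use for counting. The name `rows_left` is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the last row is complete.
sf_permits = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, np.nan, 4.0],
})

# dropna() with default arguments drops every row that has at least one NaN.
# The result is a new object; sf_permits itself is not modified.
rows_left = sf_permits.dropna().shape[0]
print(rows_left)
```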

Note: Do not change the value of sf_permits when checking this.

Once you have an answer, run the code cell below.

5) Drop missing values: columns

Now try removing all the columns that contain at least one missing value.
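Dropping columns uses the same `dropna()` method with `axis=1`. The sketch below runs on toy data, and the names `permits_without_na_columns` and `n_dropped` are hypothetical; use whatever variable names the exercise cell asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; columns "b" and "c" each contain a NaN.
sf_permits = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [np.nan, 2.0, 3.0],
    "c": [1.0, np.nan, 3.0],
})

# axis=1 tells dropna() to drop columns (instead of rows) that contain any NaN.
permits_without_na_columns = sf_permits.dropna(axis=1)

# Number of columns removed relative to the original DataFrame.
n_dropped = sf_permits.shape[1] - permits_without_na_columns.shape[1]
print(n_dropped)
```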

6) Fill in missing values automatically

Try replacing each NaN in the sf_permits data with the value that comes directly after it in the same column, then replace any remaining NaNs with 0. Assign the result to a new DataFrame named sf_permits_with_na_imputed.
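A sketch of this two-step fill on toy data: `bfill()` copies the next valid value in each column backward into the gap, and `fillna(0)` then handles any NaN that had no later value to borrow from, such as one in the last row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in the middle and at the end of a column.
sf_permits = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [np.nan, 2.0, np.nan, 4.0],
})

# Step 1: backfill along the rows (axis=0), so each NaN takes the next value below it.
# Step 2: replace whatever is still NaN with 0.
sf_permits_with_na_imputed = sf_permits.bfill(axis=0).fillna(0)
print(sf_permits_with_na_imputed)
```

In column `x`, the second entry takes the value 3.0 from the row below it, while the final entry has nothing after it and becomes 0.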

More practice

If you're looking for more practice handling missing values:

Keep going

In the next lesson, learn how to apply scaling and normalization to transform your data.
