This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.


In this exercise, you'll apply what you learned in the Parsing dates tutorial.

Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

Get our environment set up

The first thing we'll need to do is load in the libraries and dataset we'll be using. We'll be working with a dataset containing information on earthquakes that occurred between 1965 and 2016.

1) Check the data type of our date column

You'll be working with the "Date" column from the earthquakes dataframe. Investigate this column now: does it look like it contains dates? What is the dtype of the column?
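As a rough sketch of how such a check might look, the snippet below builds a tiny stand-in dataframe (`earthquakes_demo` and its values are made up for illustration, not taken from the real dataset) and inspects the column's first rows and dtype.

```python
import pandas as pd

# Hypothetical stand-in for the earthquakes dataframe (values are made up)
earthquakes_demo = pd.DataFrame({"Date": ["01/02/1965", "01/04/1965", "01/05/1965"]})

# Look at the first few entries: do they look like dates?
print(earthquakes_demo["Date"].head())

# Check the dtype: a column of date strings is not a datetime column
date_dtype = earthquakes_demo["Date"].dtype
print(date_dtype)
print(pd.api.types.is_datetime64_any_dtype(earthquakes_demo["Date"]))
```

A column can look like dates when printed and still be stored as plain strings, which is why checking the dtype is worth the extra line.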

Once you have answered the question above, run the code cell below to get credit for your work.

2) Convert our date columns to datetime

Most of the entries in the "Date" column follow the same format: "month/day/four-digit year". However, the entry at index 3378 follows a completely different pattern. Run the code cell below to see this.

This does appear to be an issue with data entry: ideally, all entries in the column have the same format. We can get an idea of how widespread this issue is by checking the length of each entry in the "Date" column.
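One way to run that length check is sketched below. The series `dates_demo` is a made-up sample, and the long ISO-style entry is only an assumed example of what a "different pattern" could look like.

```python
import pandas as pd

# Hypothetical sample: two "month/day/year" entries and one differently formatted entry
dates_demo = pd.Series(["01/02/1965", "01/04/1965", "1975-02-23T02:58:41.000Z"])

# Length of each entry, then how many entries share each length
date_lengths = dates_demo.str.len()
length_counts = date_lengths.value_counts()
print(length_counts)
```

Entries in "month/day/four-digit year" form are 10 characters long, so any other length in the counts points to rows worth inspecting.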

It looks like two more rows have dates in a different format. Run the code cell below to obtain the indices corresponding to those rows and print the data.
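A minimal sketch of looking up those rows, using a made-up dataframe (`quakes_demo`, including its "Magnitude" values, is hypothetical) and assuming that the unusual entries are the ones whose length is not 10:

```python
import pandas as pd

# Hypothetical stand-in data; the values are invented for illustration
quakes_demo = pd.DataFrame({
    "Date": ["01/02/1965", "1975-02-23T02:58:41.000Z", "01/05/1965"],
    "Magnitude": [6.0, 5.6, 5.8],
})

# Keep only the lengths that differ from the usual 10 characters
date_lengths = quakes_demo["Date"].str.len()
odd_indices = date_lengths[date_lengths != 10].index

print(odd_indices.tolist())
print(quakes_demo.loc[odd_indices])
```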

Given all of this information, it's your turn to create a new column "date_parsed" in the earthquakes dataset that has correctly parsed dates in it.

Note: When completing this problem, you are allowed to (but are not required to) amend the entries in the "Date" and "Time" columns. Do not remove any rows from the dataset.
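One possible approach, shown here on made-up data rather than the real dataset, is to rewrite the unusual entries into the common format and then parse the whole column with a single explicit format. The sketch assumes the unusual entries are ISO-style timestamps whose first 10 characters are "year-month-day"; check the actual rows before relying on that.

```python
import pandas as pd

# Hypothetical stand-in data; the real dataset may differ
quakes_demo = pd.DataFrame(
    {"Date": ["01/02/1965", "02/23/1975", "1985-04-28T02:53:41.530Z"]}
)

# Assumption: anything that is not 10 characters long is an ISO-style timestamp
is_odd = quakes_demo["Date"].str.len() != 10

# Rewrite the odd entries as "month/day/year" without dropping any rows
fixed = quakes_demo["Date"].copy()
fixed.loc[is_odd] = (
    pd.to_datetime(fixed[is_odd].str[:10], format="%Y-%m-%d").dt.strftime("%m/%d/%Y")
)

# Parse the now-uniform column with one explicit format
quakes_demo["date_parsed"] = pd.to_datetime(fixed, format="%m/%d/%Y")
print(quakes_demo)
```

Passing an explicit `format` makes the parse fail loudly on any entry that still does not match, which is often preferable to letting pandas guess.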

3) Select the day of the month

Create a Pandas Series day_of_month_earthquakes containing the day of the month from the "date_parsed" column.
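For a column that already has a datetime dtype, the `.dt` accessor exposes the day of the month. A small sketch on made-up dates (`parsed_demo` is a hypothetical stand-in for the "date_parsed" column):

```python
import pandas as pd

# Hypothetical parsed dates standing in for the "date_parsed" column
parsed_demo = pd.Series(pd.to_datetime(["1965-01-02", "1975-02-23", "1985-04-28"]))

# .dt.day returns the day of the month for each entry
day_of_month_demo = parsed_demo.dt.day
print(day_of_month_demo)
```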

4) Plot the day of the month to check the date parsing

Plot the days of the month from your earthquake dataset.

Does the graph make sense to you?
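As a point of comparison, the sketch below plots the day of the month for every date in one non-leap year (made-up data, not the earthquake dataset), using matplotlib as an assumed plotting library. With correctly parsed dates, values should fall between 1 and 31, and days 29 to 31 should be less common because not every month has them.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one entry per calendar day of a non-leap year
days_demo = pd.Series(pd.date_range("2015-01-01", "2015-12-31")).dt.day

# One bin per day of the month (1 through 31)
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(days_demo, bins=range(1, 33))
ax.set_xlabel("Day of month")
ax.set_ylabel("Number of entries")
plt.show()
```

If month and day had been swapped during parsing, the plot would typically show a pile-up on days 1 to 12 instead of this fairly even spread.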

(Optional) Bonus Challenge

For an extra challenge, you'll work with a Smithsonian dataset that documents Earth's volcanoes and their eruptive history over the past 10,000 years.

Run the next code cell to load the data.

Try parsing the column "Last Known Eruption" from the volcanos dataframe. This column contains a mixture of text ("Unknown") and years both before the common era (BCE, also known as BC) and in the common era (CE, also known as AD).

Count the total number of volcano eruptions that are BCE, CE, and Unknown.
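A possible starting point is sketched below on a made-up series. It assumes entries look like "2010 CE", "8300 BCE", or "Unknown"; inspect the real column first, since its exact formatting is not shown here.

```python
import pandas as pd

# Hypothetical sample of "Last Known Eruption" values (assumed format)
eruptions_demo = pd.Series(["2010 CE", "8300 BCE", "Unknown", "1050 BCE", "1981 CE"])

# Pull out the era suffix; entries without one are treated as "Unknown"
era = eruptions_demo.str.extract(r"\s(BCE|CE)$", expand=False).fillna("Unknown")
era_counts = era.value_counts()
print(era_counts)

# Optional: turn the year into a signed number (negative for BCE, NaN for Unknown)
years = pd.to_numeric(eruptions_demo.str.extract(r"^(\d+)", expand=False))
signed_year = years.where(era != "BCE", -years)
print(signed_year)
```

A signed numeric year is used here instead of a datetime because years this far in the past fall outside the range that pandas timestamps can represent.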

(Optional) More practice

If you're interested in graphing time series, check out this tutorial.

You can also look into passing columns that you know contain dates to the parse_dates argument of read_csv. (The documentation is here.) Note that this method can be very slow, but depending on your needs it may sometimes be handy.
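A minimal sketch of `parse_dates`, reading from an in-memory string so that it runs without any file (the CSV contents are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical CSV contents; a real call would pass a file path instead
csv_text = "Date,Magnitude\n01/02/1965,6.0\n01/04/1965,5.8\n"

# Ask read_csv to parse the "Date" column while loading
quakes_from_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(quakes_from_csv.dtypes)
```

It is still worth checking the resulting dtype after loading, since a column with inconsistent formats may not be converted the way you expect.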

Keep going

In the next lesson, learn how to work with character encodings.


Have questions or comments? Visit the Learn Discussion forum to chat with other Learners.