This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.
In this exercise, you'll apply what you learned in the Parsing dates tutorial.
The questions below will give you feedback on your work. Run the following cell to set up the feedback system.
from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex3 import *
print("Setup Complete")
Setup Complete
The first thing we'll need to do is load in the libraries and dataset we'll be using. We'll be working with a dataset containing information on earthquakes that occurred between 1965 and 2016.
# modules we'll use
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
# read in our data
earthquakes = pd.read_csv("../input/earthquake-database/database.csv")
# set seed for reproducibility
np.random.seed(0)
You'll be working with the "Date" column from the earthquakes dataframe. Investigate this column now: does it look like it contains dates? What is the dtype of the column?
# TODO: Your code here!
earthquakes['Date'].head()
0    01/02/1965
1    01/04/1965
2    01/05/1965
3    01/08/1965
4    01/09/1965
Name: Date, dtype: object
Once you have answered the question above, run the code cell below to get credit for your work.
# Check your answer (Run this code cell to receive credit!)
q1.check()
Correct:
The "Date" column in the earthquakes
DataFrame does have dates. The dtype is "object".
# Line below will give you a hint
#q1.hint()
Most of the entries in the "Date" column follow the same format: "month/day/four-digit year". However, the entry at index 3378 follows a completely different pattern. Run the code cell below to see this.
earthquakes[3378:3383]
| | Date | Time | Latitude | Longitude | Type | Depth | Depth Error | Depth Seismic Stations | Magnitude | Magnitude Type | ... | Magnitude Seismic Stations | Azimuthal Gap | Horizontal Distance | Horizontal Error | Root Mean Square | ID | Source | Location Source | Magnitude Source | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3378 | 1975-02-23T02:58:41.000Z | 1975-02-23T02:58:41.000Z | 8.017 | 124.075 | Earthquake | 623.0 | NaN | NaN | 5.6 | MB | ... | NaN | NaN | NaN | NaN | NaN | USP0000A09 | US | US | US | Reviewed |
| 3379 | 02/23/1975 | 03:53:36 | -21.727 | -71.356 | Earthquake | 33.0 | NaN | NaN | 5.6 | MB | ... | NaN | NaN | NaN | NaN | NaN | USP0000A0A | US | US | US | Reviewed |
| 3380 | 02/23/1975 | 07:34:11 | -10.879 | 166.667 | Earthquake | 33.0 | NaN | NaN | 5.5 | MS | ... | NaN | NaN | NaN | NaN | NaN | USP0000A0C | US | US | US | Reviewed |
| 3381 | 02/25/1975 | 05:20:05 | -7.388 | 149.798 | Earthquake | 33.0 | NaN | NaN | 5.5 | MB | ... | NaN | NaN | NaN | NaN | NaN | USP0000A12 | US | US | US | Reviewed |
| 3382 | 02/26/1975 | 04:48:55 | 85.047 | 97.969 | Earthquake | 33.0 | NaN | NaN | 5.6 | MS | ... | NaN | NaN | NaN | NaN | NaN | USP0000A1H | US | US | US | Reviewed |
5 rows × 21 columns
This does appear to be an issue with data entry: ideally, all entries in the column have the same format. We can get an idea of how widespread this issue is by checking the length of each entry in the "Date" column.
date_lengths = earthquakes.Date.str.len()
date_lengths.value_counts()
10    23409
24        3
Name: Date, dtype: int64
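The `str.len()` trick generalizes to any column where a single expected format implies a fixed width. A minimal sketch on toy data (not the earthquake file):

```python
import pandas as pd

# toy column: two well-formed dates and one ISO-style outlier
dates = pd.Series(["01/02/1965", "01/04/1965", "1975-02-23T02:58:41.000Z"])

lengths = dates.str.len()
# well-formed mm/dd/yyyy entries are exactly 10 characters long
malformed = dates[lengths != 10]
print(malformed.tolist())  # the entries that need special handling
```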
Looks like there are two more rows that have a date in a different format. Run the code cell below to obtain the indices corresponding to those rows and print the data.
indices = np.where(date_lengths == 24)[0]
print('Indices with corrupted data:', indices)
earthquakes.loc[indices]
Indices with corrupted data: [ 3378 7512 20650]
| | Date | Time | Latitude | Longitude | Type | Depth | Depth Error | Depth Seismic Stations | Magnitude | Magnitude Type | ... | Magnitude Seismic Stations | Azimuthal Gap | Horizontal Distance | Horizontal Error | Root Mean Square | ID | Source | Location Source | Magnitude Source | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3378 | 1975-02-23T02:58:41.000Z | 1975-02-23T02:58:41.000Z | 8.017 | 124.075 | Earthquake | 623.0 | NaN | NaN | 5.6 | MB | ... | NaN | NaN | NaN | NaN | NaN | USP0000A09 | US | US | US | Reviewed |
| 7512 | 1985-04-28T02:53:41.530Z | 1985-04-28T02:53:41.530Z | -32.998 | -71.766 | Earthquake | 33.0 | NaN | NaN | 5.6 | MW | ... | NaN | NaN | NaN | NaN | 1.30 | USP0002E81 | US | US | HRV | Reviewed |
| 20650 | 2011-03-13T02:23:34.520Z | 2011-03-13T02:23:34.520Z | 36.344 | 142.344 | Earthquake | 10.1 | 13.9 | 289.0 | 5.8 | MWC | ... | NaN | 32.3 | NaN | NaN | 1.06 | USP000HWQP | US | US | GCMT | Reviewed |
3 rows × 21 columns
Given all of this information, it's your turn to create a new column "date_parsed" in the earthquakes dataset that has correctly parsed dates in it.
Note: When completing this problem, you are allowed to (but are not required to) amend the entries in the "Date" and "Time" columns. Do not remove any rows from the dataset.
# TODO: Your code here
earthquakes.loc[3378, "Date"] = "02/23/1975"
earthquakes.loc[7512, "Date"] = "04/28/1985"
earthquakes.loc[20650, "Date"] = "03/13/2011"
earthquakes['date_parsed'] = pd.to_datetime(earthquakes['Date'], format="%m/%d/%Y")
# Check your answer
q2.check()
Correct
# Lines below will give you a hint or solution code
q2.hint()
q2.solution()
Hint: Since there are only three rows with a fancy type, you might consider manually editing them. For instance, you can begin by setting earthquakes.loc[3378, "Date"] = "02/23/1975".
Solution:
earthquakes.loc[3378, "Date"] = "02/23/1975"
earthquakes.loc[7512, "Date"] = "04/28/1985"
earthquakes.loc[20650, "Date"] = "03/13/2011"
earthquakes['date_parsed'] = pd.to_datetime(earthquakes['Date'], format="%m/%d/%Y")
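Manual edits work here because only three rows are affected. When malformed rows are too numerous to fix by hand, a two-pass parse is one alternative (a sketch on toy data, not the solution the checker expects; errors="coerce" turns unparseable entries into NaT so the second pass can fill them):

```python
import pandas as pd

# toy column mixing the two formats seen in the earthquake data
dates = pd.Series(["01/02/1965", "1975-02-23T02:58:41.000Z", "02/25/1975"])

# first pass: parse the dominant mm/dd/yyyy format; failures become NaT
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# second pass: fill the NaT gaps with ISO-8601-parsed values
iso = pd.to_datetime(dates, format="%Y-%m-%dT%H:%M:%S.%fZ", errors="coerce")
parsed = parsed.fillna(iso)
```

This avoids hard-coding any row indices, so it keeps working if the dataset is re-exported with different corrupted rows.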
Create a Pandas Series day_of_month_earthquakes containing the day of the month from the "date_parsed" column.
# try to get the day of the month from the date column
day_of_month_earthquakes = earthquakes['date_parsed'].dt.day
# Check your answer
q3.check()
Correct
# Lines below will give you a hint or solution code
#q3.hint()
#q3.solution()
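The .dt accessor used above exposes all the usual calendar components once a column has a true datetime dtype. A small self-contained sketch:

```python
import pandas as pd

# parse two dates, then pull out their calendar components
s = pd.to_datetime(pd.Series(["01/02/1965", "03/13/2011"]), format="%m/%d/%Y")
print(s.dt.day.tolist())    # day of the month
print(s.dt.month.tolist())  # month number
print(s.dt.year.tolist())   # four-digit year
```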
Plot the days of the month from your earthquake dataset.
# TODO: Your code here!
sns.histplot(day_of_month_earthquakes, kde=False, bins=31)
<AxesSubplot:xlabel='date_parsed', ylabel='Count'>
Does the graph make sense to you?
# Check your answer (Run this code cell to receive credit!)
q4.check()
Correct:
The graph should make sense: it shows a relatively even distribution in days of the month, which is what we would expect.
# Line below will give you a hint
#q4.hint()
For an extra challenge, you'll work with a Smithsonian dataset that documents Earth's volcanoes and their eruptive history over the past 10,000 years.
Run the next code cell to load the data.
volcanos = pd.read_csv("../input/volcanic-eruptions/database.csv")
Try parsing the column "Last Known Eruption" from the volcanos dataframe. This column contains a mixture of text ("Unknown") and years both before the common era (BCE, also known as BC) and in the common era (CE, also known as AD).
volcanos['Last Known Eruption'].sample(5)
556     Unknown
557     Unknown
530    3500 BCE
130     1863 CE
1179    Unknown
Name: Last Known Eruption, dtype: object
volcano_eruptions_before_common_era = volcanos[volcanos['Last Known Eruption'].str.contains(' BCE')]
volcano_eruptions_common_era = volcanos[volcanos['Last Known Eruption'].str.contains(' CE')]
unknown_volcano_eruptions = volcanos[volcanos['Last Known Eruption'] == 'Unknown']
total_amount_of_data_from_all_the_groups = len(volcano_eruptions_before_common_era) + len(volcano_eruptions_common_era) + len(unknown_volcano_eruptions)
if len(volcanos) != total_amount_of_data_from_all_the_groups:
    print("Some data is missing")
print("BCE amount: {} this means a {:.2f}% of the dataset".format(volcano_eruptions_before_common_era.shape[0], (volcano_eruptions_before_common_era.shape[0]/total_amount_of_data_from_all_the_groups) * 100))
print("CE amount: {} this means a {:.2f}% of the dataset".format(volcano_eruptions_common_era.shape[0], (volcano_eruptions_common_era.shape[0]/total_amount_of_data_from_all_the_groups) * 100))
print("Unknown amount: {} this means a {:.2f}% of the dataset".format(unknown_volcano_eruptions.shape[0], (unknown_volcano_eruptions.shape[0]/total_amount_of_data_from_all_the_groups) * 100))
BCE amount: 174 this means a 11.54% of the dataset CE amount: 697 this means a 46.22% of the dataset Unknown amount: 637 this means a 42.24% of the dataset
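Since datetime types can't represent years like 3500 BCE, one way to make the column numeric is to map each entry to a signed integer year, negating BCE values. A sketch with a hypothetical to_year helper on toy data:

```python
import numpy as np
import pandas as pd

# toy sample of the three kinds of values in "Last Known Eruption"
eruptions = pd.Series(["Unknown", "3500 BCE", "1863 CE"])

def to_year(value):
    """Map '3500 BCE' -> -3500, '1863 CE' -> 1863, 'Unknown' -> NaN."""
    if value == "Unknown":
        return np.nan
    number, era = value.split()
    return -int(number) if era == "BCE" else int(number)

years = eruptions.map(to_year)
```

The signed-integer column can then be sorted, filtered, and plotted like any other numeric column.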
If you're interested in graphing time series, check out this tutorial.
You can also look into passing columns that you know contain dates to the parse_dates argument of read_csv. (The documentation is here.) Do note that this method can be very slow, but depending on your needs it may sometimes be handy to use.
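A minimal sketch of the parse_dates approach, using a tiny inline CSV rather than the competition files:

```python
import io
import pandas as pd

csv_text = "Date,Magnitude\n01/02/1965,6.0\n01/04/1965,5.8\n"

# parse_dates tells read_csv to convert the named columns while loading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(df["Date"].dtype)  # a datetime64 dtype, not object
```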
In the next lesson, learn how to work with character encodings.
Have questions or comments? Visit the Learn Discussion forum to chat with other Learners.