{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Welcome to the **Data Cleaning** course on Kaggle Learn!\n",
"\n",
"Data cleaning is a key part of data science, but it can be deeply frustrating. Why are some of your text fields garbled? What should you do about those missing values? Why aren’t your dates formatted correctly? How can you quickly clean up inconsistent data entry? In this course, you'll learn why you've run into these problems and, more importantly, how to fix them!\n",
"\n",
"You’ll learn how to tackle some of the most common data cleaning problems so you can get to actually analyzing your data faster. You’ll work through five hands-on exercises with real, messy data and answer some of the most commonly asked data cleaning questions.\n",
"\n",
"In this notebook, we'll look at how to deal with missing values.\n",
"\n",
"# Take a first look at the data\n",
"\n",
"The first thing we'll need to do is load in the libraries and dataset we'll be using. \n",
"\n",
"For demonstration, we'll use a dataset of events that occurred in American Football games. In the [**following exercise**](https://www.kaggle.com/kernels/fork/10824396), you'll apply your new skills to a dataset of building permits issued in San Francisco."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2021-08-21T21:40:28.713493Z",
"iopub.status.busy": "2021-08-21T21:40:28.712940Z",
"iopub.status.idle": "2021-08-21T21:40:38.104710Z",
"shell.execute_reply": "2021-08-21T21:40:38.103507Z",
"shell.execute_reply.started": "2021-08-21T21:40:28.713353Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3170: DtypeWarning: Columns (25,51) have mixed types.Specify dtype option on import or set low_memory=False.\n",
" interactivity=interactivity, compiler=compiler, result=result)\n"
]
}
],
"source": [
"# modules we'll use\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# read in all our data\n",
"nfl_data = pd.read_csv(\"../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv\")\n",
"\n",
"# set seed for reproducibility\n",
"np.random.seed(0) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first thing to do when you get a new dataset is take a look at some of it. This lets you see that it all read in correctly and gives you an idea of what's going on with the data. In this case, let's see if there are any missing values, which will be represented with `NaN` or `None`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2021-08-21T21:40:38.108253Z",
"iopub.status.busy": "2021-08-21T21:40:38.107966Z",
"iopub.status.idle": "2021-08-21T21:40:38.157960Z",
"shell.execute_reply": "2021-08-21T21:40:38.156626Z",
"shell.execute_reply.started": "2021-08-21T21:40:38.108225Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Date GameID Drive qtr down time TimeUnder TimeSecs \\\n",
"0 2009-09-10 2009091000 1 1 NaN 15:00 15 3600.0 \n",
"1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593.0 \n",
"2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556.0 \n",
"3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515.0 \n",
"4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507.0 \n",
"\n",
" PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre \\\n",
"0 0.0 TEN ... NaN 0.485675 0.514325 \n",
"1 7.0 PIT ... 1.146076 0.546433 0.453567 \n",
"2 37.0 PIT ... NaN 0.551088 0.448912 \n",
"3 41.0 PIT ... -5.031425 0.510793 0.489207 \n",
"4 8.0 PIT ... NaN 0.461217 0.538783 \n",
"\n",
" Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season \n",
"0 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009 \n",
"1 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009 \n",
"2 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009 \n",
"3 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009 \n",
"4 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009 \n",
"\n",
"[5 rows x 102 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# look at the first five rows of the nfl_data file. \n",
"# I can see a handful of missing data already!\n",
"nfl_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yep, it looks like there are some missing values. \n",
"\n",
"# How many missing data points do we have?\n",
"\n",
"Ok, now we know that we do have some missing values. Let's see how many we have in each column. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2021-08-21T21:40:38.159748Z",
"iopub.status.busy": "2021-08-21T21:40:38.159484Z",
"iopub.status.idle": "2021-08-21T21:40:39.230761Z",
"shell.execute_reply": "2021-08-21T21:40:39.229509Z",
"shell.execute_reply.started": "2021-08-21T21:40:38.159722Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Date 0\n",
"GameID 0\n",
"Drive 0\n",
"qtr 0\n",
"down 61154\n",
"time 224\n",
"TimeUnder 0\n",
"TimeSecs 224\n",
"PlayTimeDiff 444\n",
"SideofField 528\n",
"dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get the number of missing data points per column\n",
"missing_values_count = nfl_data.isnull().sum()\n",
"\n",
"# look at the # of missing points in the first ten columns\n",
"missing_values_count[0:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That seems like a lot! It might be helpful to see what percentage of the values in our dataset are missing, to give us a better sense of the scale of this problem:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2021-08-21T21:40:39.233557Z",
"iopub.status.busy": "2021-08-21T21:40:39.233027Z",
"iopub.status.idle": "2021-08-21T21:40:39.240681Z",
"shell.execute_reply": "2021-08-21T21:40:39.239612Z",
"shell.execute_reply.started": "2021-08-21T21:40:39.233504Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"24.87214126835169\n"
]
}
],
"source": [
"# how many total missing values do we have?\n",
"total_cells = np.prod(nfl_data.shape)\n",
"total_missing = missing_values_count.sum()\n",
"\n",
"# percent of data that is missing\n",
"percent_missing = (total_missing/total_cells) * 100\n",
"print(percent_missing)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wow, almost a quarter of the cells in this dataset are empty! In the next step, we're going to take a closer look at some of the columns with missing values and try to figure out what might be going on with them.\n",
"\n",
"# Figure out why the data is missing\n",
" \n",
"This is the point at which we get into the part of data science that I like to call \"data intuition\", by which I mean \"really looking at your data and trying to figure out why it is the way it is and how that will affect your analysis\". It can be a frustrating part of data science, especially if you're newer to the field and don't have a lot of experience. For dealing with missing values, you'll need to use your intuition to figure out why the value is missing. One of the most important questions you can ask yourself to help figure this out is this:\n",
"\n",
"> **Is this value missing because it wasn't recorded or because it doesn't exist?**\n",
"\n",
"If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children), then it doesn't make sense to try to guess what it might be. You probably want to keep these values as `NaN`. On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row. This is called **imputation**, and we'll learn how to do it next! :)\n",
"\n",
"Let's work through an example. Looking at the number of missing values in the `nfl_data` dataframe, I notice that the column \"TimeSecs\" has a lot of missing values in it: "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2021-08-21T21:40:39.243601Z",
"iopub.status.busy": "2021-08-21T21:40:39.242820Z",
"iopub.status.idle": "2021-08-21T21:40:39.258000Z",
"shell.execute_reply": "2021-08-21T21:40:39.256307Z",
"shell.execute_reply.started": "2021-08-21T21:40:39.243540Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Date 0\n",
"GameID 0\n",
"Drive 0\n",
"qtr 0\n",
"down 61154\n",
"time 224\n",
"TimeUnder 0\n",
"TimeSecs 224\n",
"PlayTimeDiff 444\n",
"SideofField 528\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# look at the # of missing points in the first ten columns\n",
"missing_values_count[0:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By looking at [the documentation](https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016), I can see that this column has information on the number of seconds left in the game when the play was made. This means that these values are probably missing because they were not recorded, rather than because they don't exist. So, it would make sense for us to try to guess what they should be rather than just leaving them as NA's.\n",
"\n",
"On the other hand, there are other fields, like \"PenalizedTeam\", that also have a lot of missing values. In this case, though, the field is missing because if there was no penalty, then it doesn't make sense to say *which* team was penalized. For this column, it would make more sense to either leave it empty or to add a third value like \"neither\" and use that to replace the NA's.\n",
"\n",
"> **Tip:** This is a great place to read over the dataset documentation if you haven't already! If you're working with a dataset that you've gotten from another person, you can also try reaching out to them to get more information.\n",
"\n",
"If you're doing very careful data analysis, this is the point at which you'd look at each column individually to figure out the best strategy for filling those missing values. For the rest of this notebook, we'll cover some \"quick and dirty\" techniques that can help you with missing values but will probably also end up removing some useful information or adding some noise to your data.\n",
"\n",
"# Drop missing values\n",
"\n",
"If you're in a hurry or don't have a reason to figure out why your values are missing, one option is to just remove any rows or columns that contain missing values. (Note: I don't generally recommend this approach for important projects! It's usually worth taking the time to go through your data and really look at all the columns with missing values one by one to get to know your dataset.) \n",
"\n",
"If you're sure you want to drop rows with missing values, pandas does have a handy function, `dropna()`, to help you do this. Let's try it out on our NFL dataset!"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2021-08-21T21:40:39.260262Z",
"iopub.status.busy": "2021-08-21T21:40:39.259805Z",
"iopub.status.idle": "2021-08-21T21:40:40.400324Z",
"shell.execute_reply": "2021-08-21T21:40:40.399638Z",
"shell.execute_reply.started": "2021-08-21T21:40:39.260210Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Empty DataFrame\n",
"Columns: [Date, GameID, Drive, qtr, down, time, TimeUnder, TimeSecs, PlayTimeDiff, SideofField, yrdln, yrdline100, ydstogo, ydsnet, GoalToGo, FirstDown, posteam, DefensiveTeam, desc, PlayAttempted, Yards.Gained, sp, Touchdown, ExPointResult, TwoPointConv, DefTwoPoint, Safety, Onsidekick, PuntResult, PlayType, Passer, Passer_ID, PassAttempt, PassOutcome, PassLength, AirYards, YardsAfterCatch, QBHit, PassLocation, InterceptionThrown, Interceptor, Rusher, Rusher_ID, RushAttempt, RunLocation, RunGap, Receiver, Receiver_ID, Reception, ReturnResult, Returner, BlockingPlayer, Tackler1, Tackler2, FieldGoalResult, FieldGoalDistance, Fumble, RecFumbTeam, RecFumbPlayer, Sack, Challenge.Replay, ChalReplayResult, Accepted.Penalty, PenalizedTeam, PenaltyType, PenalizedPlayer, Penalty.Yards, PosTeamScore, DefTeamScore, ScoreDiff, AbsScoreDiff, HomeTeam, AwayTeam, Timeout_Indicator, Timeout_Team, posteam_timeouts_pre, HomeTimeouts_Remaining_Pre, AwayTimeouts_Remaining_Pre, HomeTimeouts_Remaining_Post, AwayTimeouts_Remaining_Post, No_Score_Prob, Opp_Field_Goal_Prob, Opp_Safety_Prob, Opp_Touchdown_Prob, Field_Goal_Prob, Safety_Prob, Touchdown_Prob, ExPoint_Prob, TwoPoint_Prob, ExpPts, EPA, airEPA, yacEPA, Home_WP_pre, Away_WP_pre, Home_WP_post, Away_WP_post, Win_Prob, WPA, airWPA, ...]\n",
"Index: []\n",
"\n",
"[0 rows x 102 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# remove all the rows that contain a missing value\n",
"nfl_data.dropna()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Oh dear, it looks like that's removed all our data! 😱 This is because every row in our dataset had at least one missing value. We might have better luck removing all the *columns* that have at least one missing value instead."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"execution": {
"iopub.execute_input": "2021-08-21T21:40:40.401755Z",
"iopub.status.busy": "2021-08-21T21:40:40.401350Z",
"iopub.status.idle": "2021-08-21T21:40:41.556634Z",
"shell.execute_reply": "2021-08-21T21:40:41.555530Z",
"shell.execute_reply.started": "2021-08-21T21:40:40.401727Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Date GameID Drive qtr TimeUnder ydstogo ydsnet \\\n",
"0 2009-09-10 2009091000 1 1 15 0 0 \n",
"1 2009-09-10 2009091000 1 1 15 10 5 \n",
"2 2009-09-10 2009091000 1 1 15 5 2 \n",
"3 2009-09-10 2009091000 1 1 14 8 2 \n",
"4 2009-09-10 2009091000 1 1 14 8 2 \n",
"\n",
" PlayAttempted Yards.Gained sp ... Timeout_Indicator Timeout_Team \\\n",
"0 1 39 0 ... 0 None \n",
"1 1 5 0 ... 0 None \n",
"2 1 -3 0 ... 0 None \n",
"3 1 0 0 ... 0 None \n",
"4 1 0 0 ... 0 None \n",
"\n",
" posteam_timeouts_pre HomeTimeouts_Remaining_Pre AwayTimeouts_Remaining_Pre \\\n",
"0 3 3 3 \n",
"1 3 3 3 \n",
"2 3 3 3 \n",
"3 3 3 3 \n",
"4 3 3 3 \n",
"\n",
" HomeTimeouts_Remaining_Post AwayTimeouts_Remaining_Post ExPoint_Prob \\\n",
"0 3 3 0.0 \n",
"1 3 3 0.0 \n",
"2 3 3 0.0 \n",
"3 3 3 0.0 \n",
"4 3 3 0.0 \n",
"\n",
" TwoPoint_Prob Season \n",
"0 0.0 2009 \n",
"1 0.0 2009 \n",
"2 0.0 2009 \n",
"3 0.0 2009 \n",
"4 0.0 2009 \n",
"\n",
"[5 rows x 41 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# remove all columns with at least one missing value\n",
"columns_with_na_dropped = nfl_data.dropna(axis=1)\n",
"columns_with_na_dropped.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2021-08-21T21:40:41.558748Z",
"iopub.status.busy": "2021-08-21T21:40:41.558325Z",
"iopub.status.idle": "2021-08-21T21:40:41.564737Z",
"shell.execute_reply": "2021-08-21T21:40:41.563861Z",
"shell.execute_reply.started": "2021-08-21T21:40:41.558718Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Columns in original dataset: 102 \n",
"\n",
"Columns with na's dropped: 41\n"
]
}
],
"source": [
"# just how much data did we lose?\n",
"print(\"Columns in original dataset: %d \\n\" % nfl_data.shape[1])\n",
"print(\"Columns with na's dropped: %d\" % columns_with_na_dropped.shape[1])"
]
},
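Besides dropping every row or every column that contains any missing value, `dropna()` also supports a middle ground through its `thresh` parameter, which keeps only the rows (or columns) that have at least a given number of non-missing values. Here's a minimal sketch on a small hypothetical DataFrame (the names `df`, `rows_kept`, and `cols_kept` are just for illustration, not from the NFL data):

```python
import numpy as np
import pandas as pd

# a small hypothetical frame standing in for a messier dataset
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# keep only rows that have at least 2 non-missing values
rows_kept = df.dropna(thresh=2)

# keep only columns that have at least 2 non-missing values
cols_kept = df.dropna(axis=1, thresh=2)
```

Here `thresh=2` drops the middle row (it has only one non-missing value) and the mostly-empty column `b`, while keeping everything else.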
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've lost quite a bit of data, but at this point we have successfully removed all the `NaN`'s from our data. \n",
"\n",
"# Filling in missing values automatically\n",
"\n",
"Another option is to try and fill in the missing values. For this next bit, I'm getting a small sub-section of the NFL data so that it will print well."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"execution": {
"iopub.execute_input": "2021-08-21T21:40:41.567194Z",
"iopub.status.busy": "2021-08-21T21:40:41.566583Z",
"iopub.status.idle": "2021-08-21T21:40:41.611960Z",
"shell.execute_reply": "2021-08-21T21:40:41.611176Z",
"shell.execute_reply.started": "2021-08-21T21:40:41.567156Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" EPA airEPA yacEPA Home_WP_pre Away_WP_pre Home_WP_post \\\n",
"0 2.014474 NaN NaN 0.485675 0.514325 0.546433 \n",
"1 0.077907 -1.068169 1.146076 0.546433 0.453567 0.551088 \n",
"2 -1.402760 NaN NaN 0.551088 0.448912 0.510793 \n",
"3 -1.712583 3.318841 -5.031425 0.510793 0.489207 0.461217 \n",
"4 2.097796 NaN NaN 0.461217 0.538783 0.558929 \n",
"\n",
" Away_WP_post Win_Prob WPA airWPA yacWPA Season \n",
"0 0.453567 0.485675 0.060758 NaN NaN 2009 \n",
"1 0.448912 0.546433 0.004655 -0.032244 0.036899 2009 \n",
"2 0.489207 0.551088 -0.040295 NaN NaN 2009 \n",
"3 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009 \n",
"4 0.441071 0.461217 0.097712 NaN NaN 2009 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get a small subset of the NFL dataset\n",
"subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()\n",
"subset_nfl_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the pandas `fillna()` function to fill in missing values in a dataframe for us. One option we have is to specify what we want the `NaN` values to be replaced with. Here, I'm saying that I would like to replace all the `NaN` values with 0."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"execution": {
"iopub.execute_input": "2021-08-21T21:40:41.613715Z",
"iopub.status.busy": "2021-08-21T21:40:41.613205Z",
"iopub.status.idle": "2021-08-21T21:40:41.632696Z",
"shell.execute_reply": "2021-08-21T21:40:41.631477Z",
"shell.execute_reply.started": "2021-08-21T21:40:41.613681Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" EPA airEPA yacEPA Home_WP_pre Away_WP_pre Home_WP_post \\\n",
"0 2.014474 0.000000 0.000000 0.485675 0.514325 0.546433 \n",
"1 0.077907 -1.068169 1.146076 0.546433 0.453567 0.551088 \n",
"2 -1.402760 0.000000 0.000000 0.551088 0.448912 0.510793 \n",
"3 -1.712583 3.318841 -5.031425 0.510793 0.489207 0.461217 \n",
"4 2.097796 0.000000 0.000000 0.461217 0.538783 0.558929 \n",
"\n",
" Away_WP_post Win_Prob WPA airWPA yacWPA Season \n",
"0 0.453567 0.485675 0.060758 0.000000 0.000000 2009 \n",
"1 0.448912 0.546433 0.004655 -0.032244 0.036899 2009 \n",
"2 0.489207 0.551088 -0.040295 0.000000 0.000000 2009 \n",
"3 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009 \n",
"4 0.441071 0.461217 0.097712 0.000000 0.000000 2009 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# replace all NA's with 0\n",
"subset_nfl_data.fillna(0)"
]
},
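The same `fillna()` idea also works for categorical columns like the "PenalizedTeam" field discussed earlier, where an explicit third value makes more sense than 0. A minimal sketch on a small hypothetical Series (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for a column like "PenalizedTeam",
# where NaN means "no penalty on this play"
penalized = pd.Series(["PIT", np.nan, "TEN", np.nan])

# replace the NA's with an explicit third value
filled = penalized.fillna("neither")
print(filled.tolist())  # → ['PIT', 'neither', 'TEN', 'neither']
```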
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I could also be a bit more savvy and replace each missing value with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"execution": {
"iopub.execute_input": "2021-08-21T21:40:41.635168Z",
"iopub.status.busy": "2021-08-21T21:40:41.634630Z",
"iopub.status.idle": "2021-08-21T21:40:41.664025Z",
"shell.execute_reply": "2021-08-21T21:40:41.662939Z",
"shell.execute_reply.started": "2021-08-21T21:40:41.635131Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" EPA airEPA yacEPA Home_WP_pre Away_WP_pre Home_WP_post \\\n",
"0 2.014474 -1.068169 1.146076 0.485675 0.514325 0.546433 \n",
"1 0.077907 -1.068169 1.146076 0.546433 0.453567 0.551088 \n",
"2 -1.402760 3.318841 -5.031425 0.551088 0.448912 0.510793 \n",
"3 -1.712583 3.318841 -5.031425 0.510793 0.489207 0.461217 \n",
"4 2.097796 0.000000 0.000000 0.461217 0.538783 0.558929 \n",
"\n",
" Away_WP_post Win_Prob WPA airWPA yacWPA Season \n",
"0 0.453567 0.485675 0.060758 -0.032244 0.036899 2009 \n",
"1 0.448912 0.546433 0.004655 -0.032244 0.036899 2009 \n",
"2 0.489207 0.551088 -0.040295 0.106663 -0.156239 2009 \n",
"3 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009 \n",
"4 0.441071 0.461217 0.097712 0.000000 0.000000 2009 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# replace each NA with the value that comes directly after it in the same column,\n",
"# then replace any remaining NA's with 0\n",
"subset_nfl_data.bfill().fillna(0)"
]
},
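Since we mentioned imputation earlier, here's one more quick option: filling each numeric column's missing values with a summary statistic such as that column's mean. This is a sketch on a small hypothetical DataFrame (the column names are borrowed from the NFL data, but the values are made up):

```python
import numpy as np
import pandas as pd

# a small hypothetical numeric frame with some gaps
df = pd.DataFrame({
    "airEPA": [1.0, np.nan, 3.0],
    "yacEPA": [np.nan, 2.0, 4.0],
})

# fillna() accepts a Series of per-column fill values,
# so passing df.mean() imputes each column with its own mean
imputed = df.fillna(df.mean())
```

Mean imputation is quick, but like filling with 0 it can distort a column's distribution; for careful analysis, it's still worth looking at each column individually as discussed above.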
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Your turn\n",
"\n",
"Write your own code to [**deal with missing values**](https://www.kaggle.com/kernels/fork/10824396) in a dataset of building permits issued in San Francisco."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"\n",
"*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/172650) to chat with other Learners.*"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}