{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "focused-flour",
   "metadata": {
    "papermill": {
     "duration": 0.005665,
     "end_time": "2021-06-04T13:12:49.538115",
     "exception": false,
     "start_time": "2021-06-04T13:12:49.532450",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Using Pandas to Get Familiar With Your Data\n",
    "\n",
    "The first step in any machine learning project is familiarize yourself with the data.  You'll use the Pandas library for this.  Pandas is the primary tool data scientists use for exploring and manipulating data.  Most people abbreviate pandas in their code as `pd`.  We do this with the command"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "major-friendly",
   "metadata": {
    "collapsed": true,
    "execution": {
     "iopub.execute_input": "2021-06-04T13:12:49.554657Z",
     "iopub.status.busy": "2021-06-04T13:12:49.553764Z",
     "iopub.status.idle": "2021-06-04T13:12:49.557160Z",
     "shell.execute_reply": "2021-06-04T13:12:49.556569Z"
    },
    "jupyter": {
     "outputs_hidden": true
    },
    "papermill": {
     "duration": 0.014261,
     "end_time": "2021-06-04T13:12:49.557323",
     "exception": false,
     "start_time": "2021-06-04T13:12:49.543062",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "speaking-journalism",
   "metadata": {
    "papermill": {
     "duration": 0.004569,
     "end_time": "2021-06-04T13:12:49.567093",
     "exception": false,
     "start_time": "2021-06-04T13:12:49.562524",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "The most important part of the Pandas library is the DataFrame.  A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database. \n",
    "\n",
    "Pandas has powerful methods for most things you'll want to do with this type of data.  \n",
    "\n",
    "As an example, we'll look at [data about home prices](https://www.kaggle.com/dansbecker/melbourne-housing-snapshot) in Melbourne, Australia. In the hands-on exercises, you will apply the same processes to a new dataset, which has home prices in Iowa.\n",
    "\n",
    "The example (Melbourne) data is at the file path **`../input/melbourne-housing-snapshot/melb_data.csv`**.\n",
    "\n",
    "We load and explore the data with the following commands:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "level-reference",
   "metadata": {
    "collapsed": true,
    "execution": {
     "iopub.execute_input": "2021-06-04T13:12:49.581805Z",
     "iopub.status.busy": "2021-06-04T13:12:49.581209Z",
     "iopub.status.idle": "2021-06-04T13:12:49.755456Z",
     "shell.execute_reply": "2021-06-04T13:12:49.755931Z"
    },
    "jupyter": {
     "outputs_hidden": true
    },
    "papermill": {
     "duration": 0.184292,
     "end_time": "2021-06-04T13:12:49.756102",
     "exception": false,
     "start_time": "2021-06-04T13:12:49.571810",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Rooms</th>\n",
       "      <th>Price</th>\n",
       "      <th>Distance</th>\n",
       "      <th>Postcode</th>\n",
       "      <th>Bedroom2</th>\n",
       "      <th>Bathroom</th>\n",
       "      <th>Car</th>\n",
       "      <th>Landsize</th>\n",
       "      <th>BuildingArea</th>\n",
       "      <th>YearBuilt</th>\n",
       "      <th>Lattitude</th>\n",
       "      <th>Longtitude</th>\n",
       "      <th>Propertycount</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>13580.000000</td>\n",
       "      <td>1.358000e+04</td>\n",
       "      <td>13580.000000</td>\n",
       "      <td>13580.000000</td>\n",
       "      <td>13580.000000</td>\n",
       "      <td>13580.000000</td>\n",
       "      <td>13518.000000</td>\n",
       "      <td>13580.000000</td>\n",
       "      <td>7130.000000</td>\n",
       "      <td>8205.000000</td>\n",
       "      <td>13580.000000</td>\n",
       "      <td>13580.000000</td>\n",
       "      <td>13580.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>2.937997</td>\n",
       "      <td>1.075684e+06</td>\n",
       "      <td>10.137776</td>\n",
       "      <td>3105.301915</td>\n",
       "      <td>2.914728</td>\n",
       "      <td>1.534242</td>\n",
       "      <td>1.610075</td>\n",
       "      <td>558.416127</td>\n",
       "      <td>151.967650</td>\n",
       "      <td>1964.684217</td>\n",
       "      <td>-37.809203</td>\n",
       "      <td>144.995216</td>\n",
       "      <td>7454.417378</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.955748</td>\n",
       "      <td>6.393107e+05</td>\n",
       "      <td>5.868725</td>\n",
       "      <td>90.676964</td>\n",
       "      <td>0.965921</td>\n",
       "      <td>0.691712</td>\n",
       "      <td>0.962634</td>\n",
       "      <td>3990.669241</td>\n",
       "      <td>541.014538</td>\n",
       "      <td>37.273762</td>\n",
       "      <td>0.079260</td>\n",
       "      <td>0.103916</td>\n",
       "      <td>4378.581772</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>8.500000e+04</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3000.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1196.000000</td>\n",
       "      <td>-38.182550</td>\n",
       "      <td>144.431810</td>\n",
       "      <td>249.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>6.500000e+05</td>\n",
       "      <td>6.100000</td>\n",
       "      <td>3044.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>177.000000</td>\n",
       "      <td>93.000000</td>\n",
       "      <td>1940.000000</td>\n",
       "      <td>-37.856822</td>\n",
       "      <td>144.929600</td>\n",
       "      <td>4380.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>9.030000e+05</td>\n",
       "      <td>9.200000</td>\n",
       "      <td>3084.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>440.000000</td>\n",
       "      <td>126.000000</td>\n",
       "      <td>1970.000000</td>\n",
       "      <td>-37.802355</td>\n",
       "      <td>145.000100</td>\n",
       "      <td>6555.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>1.330000e+06</td>\n",
       "      <td>13.000000</td>\n",
       "      <td>3148.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>651.000000</td>\n",
       "      <td>174.000000</td>\n",
       "      <td>1999.000000</td>\n",
       "      <td>-37.756400</td>\n",
       "      <td>145.058305</td>\n",
       "      <td>10331.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>10.000000</td>\n",
       "      <td>9.000000e+06</td>\n",
       "      <td>48.100000</td>\n",
       "      <td>3977.000000</td>\n",
       "      <td>20.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>433014.000000</td>\n",
       "      <td>44515.000000</td>\n",
       "      <td>2018.000000</td>\n",
       "      <td>-37.408530</td>\n",
       "      <td>145.526350</td>\n",
       "      <td>21650.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Rooms         Price      Distance      Postcode      Bedroom2  \\\n",
       "count  13580.000000  1.358000e+04  13580.000000  13580.000000  13580.000000   \n",
       "mean       2.937997  1.075684e+06     10.137776   3105.301915      2.914728   \n",
       "std        0.955748  6.393107e+05      5.868725     90.676964      0.965921   \n",
       "min        1.000000  8.500000e+04      0.000000   3000.000000      0.000000   \n",
       "25%        2.000000  6.500000e+05      6.100000   3044.000000      2.000000   \n",
       "50%        3.000000  9.030000e+05      9.200000   3084.000000      3.000000   \n",
       "75%        3.000000  1.330000e+06     13.000000   3148.000000      3.000000   \n",
       "max       10.000000  9.000000e+06     48.100000   3977.000000     20.000000   \n",
       "\n",
       "           Bathroom           Car       Landsize  BuildingArea    YearBuilt  \\\n",
       "count  13580.000000  13518.000000   13580.000000   7130.000000  8205.000000   \n",
       "mean       1.534242      1.610075     558.416127    151.967650  1964.684217   \n",
       "std        0.691712      0.962634    3990.669241    541.014538    37.273762   \n",
       "min        0.000000      0.000000       0.000000      0.000000  1196.000000   \n",
       "25%        1.000000      1.000000     177.000000     93.000000  1940.000000   \n",
       "50%        1.000000      2.000000     440.000000    126.000000  1970.000000   \n",
       "75%        2.000000      2.000000     651.000000    174.000000  1999.000000   \n",
       "max        8.000000     10.000000  433014.000000  44515.000000  2018.000000   \n",
       "\n",
       "          Lattitude    Longtitude  Propertycount  \n",
       "count  13580.000000  13580.000000   13580.000000  \n",
       "mean     -37.809203    144.995216    7454.417378  \n",
       "std        0.079260      0.103916    4378.581772  \n",
       "min      -38.182550    144.431810     249.000000  \n",
       "25%      -37.856822    144.929600    4380.000000  \n",
       "50%      -37.802355    145.000100    6555.000000  \n",
       "75%      -37.756400    145.058305   10331.000000  \n",
       "max      -37.408530    145.526350   21650.000000  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# save filepath to variable for easier access\n",
    "melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'\n",
    "# read the data and store data in DataFrame titled melbourne_data\n",
    "melbourne_data = pd.read_csv(melbourne_file_path) \n",
    "# print a summary of the data in Melbourne data\n",
    "melbourne_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "potential-ordinary",
   "metadata": {
    "papermill": {
     "duration": 0.00519,
     "end_time": "2021-06-04T13:12:49.767034",
     "exception": false,
     "start_time": "2021-06-04T13:12:49.761844",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Interpreting Data Description\n",
    "The results show 8 numbers for each column in your original dataset. The first number, the **count**,  shows how many rows have non-missing values.  \n",
    "\n",
    "Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.\n",
    "\n",
    "The second value is the **mean**, which is the average.  Under that, **std** is the standard deviation, which measures how numerically spread out the values are.\n",
    "\n",
    "To interpret the **min**, **25%**, **50%**, **75%** and **max** values, imagine sorting each column from lowest to highest value.  The first (smallest) value is the min.  If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values.  That is the **25%** value (pronounced \"25th percentile\").  The 50th and 75th percentiles are defined analogously, and the **max** is the largest number.\n",
    "\n",
    "\n",
    "# Your Turn\n",
    "Get started with your **[first coding exercise](https://www.kaggle.com/kernels/fork/1258954)**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "corrected-madrid",
   "metadata": {
    "papermill": {
     "duration": 0.005059,
     "end_time": "2021-06-04T13:12:49.777572",
     "exception": false,
     "start_time": "2021-06-04T13:12:49.772513",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "---\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161285) to chat with other Learners.*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 7.794213,
   "end_time": "2021-06-04T13:12:51.331858",
   "environment_variables": {},
   "exception": null,
   "input_path": "__notebook__.ipynb",
   "output_path": "__notebook__.ipynb",
   "parameters": {},
   "start_time": "2021-06-04T13:12:43.537645",
   "version": "2.3.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}