{ "cells": [ { "cell_type": "markdown", "id": "focused-flour", "metadata": { "papermill": { "duration": 0.005665, "end_time": "2021-06-04T13:12:49.538115", "exception": false, "start_time": "2021-06-04T13:12:49.532450", "status": "completed" }, "tags": [] }, "source": [ "# Using Pandas to Get Familiar With Your Data\n", "\n", "The first step in any machine learning project is to familiarize yourself with the data. You'll use the Pandas library for this. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as `pd`. We do this with the command:" ] }, { "cell_type": "code", "execution_count": 1, "id": "major-friendly", "metadata": { "collapsed": true, "execution": { "iopub.execute_input": "2021-06-04T13:12:49.554657Z", "iopub.status.busy": "2021-06-04T13:12:49.553764Z", "iopub.status.idle": "2021-06-04T13:12:49.557160Z", "shell.execute_reply": "2021-06-04T13:12:49.556569Z" }, "jupyter": { "outputs_hidden": true }, "papermill": { "duration": 0.014261, "end_time": "2021-06-04T13:12:49.557323", "exception": false, "start_time": "2021-06-04T13:12:49.543062", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "speaking-journalism", "metadata": { "papermill": { "duration": 0.004569, "end_time": "2021-06-04T13:12:49.567093", "exception": false, "start_time": "2021-06-04T13:12:49.562524", "status": "completed" }, "tags": [] }, "source": [ "The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database. \n", "\n", "Pandas has powerful methods for most things you'll want to do with this type of data. \n", "\n", "As an example, we'll look at [data about home prices](https://www.kaggle.com/dansbecker/melbourne-housing-snapshot) in Melbourne, Australia. 
In the hands-on exercises, you will apply the same processes to a new dataset, which has home prices in Iowa.\n", "\n", "The example (Melbourne) data is at the file path **`../input/melbourne-housing-snapshot/melb_data.csv`**.\n", "\n", "We load and explore the data with the following commands:" ] }, { "cell_type": "code", "execution_count": 2, "id": "level-reference", "metadata": { "collapsed": true, "execution": { "iopub.execute_input": "2021-06-04T13:12:49.581805Z", "iopub.status.busy": "2021-06-04T13:12:49.581209Z", "iopub.status.idle": "2021-06-04T13:12:49.755456Z", "shell.execute_reply": "2021-06-04T13:12:49.755931Z" }, "jupyter": { "outputs_hidden": true }, "papermill": { "duration": 0.184292, "end_time": "2021-06-04T13:12:49.756102", "exception": false, "start_time": "2021-06-04T13:12:49.571810", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.000000 | 1.358000e+04 | 13580.000000 | 13580.000000 | 13580.000000 | 13580.000000 | 13518.000000 | 13580.000000 | 7130.000000 | 8205.000000 | 13580.000000 | 13580.000000 | 13580.000000 |
| mean | 2.937997 | 1.075684e+06 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.967650 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 6.393107e+05 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.079260 | 0.103916 | 4378.581772 |
| min | 1.000000 | 8.500000e+04 | 0.000000 | 3000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1196.000000 | -38.182550 | 144.431810 | 249.000000 |
| 25% | 2.000000 | 6.500000e+05 | 6.100000 | 3044.000000 | 2.000000 | 1.000000 | 1.000000 | 177.000000 | 93.000000 | 1940.000000 | -37.856822 | 144.929600 | 4380.000000 |
| 50% | 3.000000 | 9.030000e+05 | 9.200000 | 3084.000000 | 3.000000 | 1.000000 | 2.000000 | 440.000000 | 126.000000 | 1970.000000 | -37.802355 | 145.000100 | 6555.000000 |
| 75% | 3.000000 | 1.330000e+06 | 13.000000 | 3148.000000 | 3.000000 | 2.000000 | 2.000000 | 651.000000 | 174.000000 | 1999.000000 | -37.756400 | 145.058305 | 10331.000000 |
| max | 10.000000 | 9.000000e+06 | 48.100000 | 3977.000000 | 20.000000 | 8.000000 | 10.000000 | 433014.000000 | 44515.000000 | 2018.000000 | -37.408530 | 145.526350 | 21650.000000 |