{ "cells": [ { "cell_type": "markdown", "id": "focused-flour", "metadata": { "papermill": { "duration": 0.005665, "end_time": "2021-06-04T13:12:49.538115", "exception": false, "start_time": "2021-06-04T13:12:49.532450", "status": "completed" }, "tags": [] }, "source": [ "# Using Pandas to Get Familiar With Your Data\n", "\n", "The first step in any machine learning project is to familiarize yourself with the data. You'll use the Pandas library for this. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate Pandas in their code as `pd`. We do this with the command:" ] }, { "cell_type": "code", "execution_count": 1, "id": "major-friendly", "metadata": { "collapsed": true, "execution": { "iopub.execute_input": "2021-06-04T13:12:49.554657Z", "iopub.status.busy": "2021-06-04T13:12:49.553764Z", "iopub.status.idle": "2021-06-04T13:12:49.557160Z", "shell.execute_reply": "2021-06-04T13:12:49.556569Z" }, "jupyter": { "outputs_hidden": true }, "papermill": { "duration": 0.014261, "end_time": "2021-06-04T13:12:49.557323", "exception": false, "start_time": "2021-06-04T13:12:49.543062", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "speaking-journalism", "metadata": { "papermill": { "duration": 0.004569, "end_time": "2021-06-04T13:12:49.567093", "exception": false, "start_time": "2021-06-04T13:12:49.562524", "status": "completed" }, "tags": [] }, "source": [ "The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database. \n", "\n", "Pandas has powerful methods for most things you'll want to do with this type of data. \n", "\n", "As an example, we'll look at [data about home prices](https://www.kaggle.com/dansbecker/melbourne-housing-snapshot) in Melbourne, Australia. 
In the hands-on exercises, you will apply the same processes to a new dataset, which has home prices in Iowa.\n", "\n", "The example (Melbourne) data is at the file path **`../input/melbourne-housing-snapshot/melb_data.csv`**.\n", "\n", "We load and explore the data with the following commands:" ] }, { "cell_type": "code", "execution_count": 2, "id": "level-reference", "metadata": { "collapsed": true, "execution": { "iopub.execute_input": "2021-06-04T13:12:49.581805Z", "iopub.status.busy": "2021-06-04T13:12:49.581209Z", "iopub.status.idle": "2021-06-04T13:12:49.755456Z", "shell.execute_reply": "2021-06-04T13:12:49.755931Z" }, "jupyter": { "outputs_hidden": true }, "papermill": { "duration": 0.184292, "end_time": "2021-06-04T13:12:49.756102", "exception": false, "start_time": "2021-06-04T13:12:49.571810", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", 
"<table border=\"1\" class=\"dataframe\">\n", 
" <thead>\n", 
" <tr style=\"text-align: right;\"><th></th><th>Rooms</th><th>Price</th><th>Distance</th><th>Postcode</th><th>Bedroom2</th><th>Bathroom</th><th>Car</th><th>Landsize</th><th>BuildingArea</th><th>YearBuilt</th><th>Lattitude</th><th>Longtitude</th><th>Propertycount</th></tr>\n", 
" </thead>\n", 
" <tbody>\n", 
" <tr><th>count</th><td>13580.000000</td><td>1.358000e+04</td><td>13580.000000</td><td>13580.000000</td><td>13580.000000</td><td>13580.000000</td><td>13518.000000</td><td>13580.000000</td><td>7130.000000</td><td>8205.000000</td><td>13580.000000</td><td>13580.000000</td><td>13580.000000</td></tr>\n", 
" <tr><th>mean</th><td>2.937997</td><td>1.075684e+06</td><td>10.137776</td><td>3105.301915</td><td>2.914728</td><td>1.534242</td><td>1.610075</td><td>558.416127</td><td>151.967650</td><td>1964.684217</td><td>-37.809203</td><td>144.995216</td><td>7454.417378</td></tr>\n", 
" <tr><th>std</th><td>0.955748</td><td>6.393107e+05</td><td>5.868725</td><td>90.676964</td><td>0.965921</td><td>0.691712</td><td>0.962634</td><td>3990.669241</td><td>541.014538</td><td>37.273762</td><td>0.079260</td><td>0.103916</td><td>4378.581772</td></tr>\n", 
" <tr><th>min</th><td>1.000000</td><td>8.500000e+04</td><td>0.000000</td><td>3000.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>1196.000000</td><td>-38.182550</td><td>144.431810</td><td>249.000000</td></tr>\n", 
" <tr><th>25%</th><td>2.000000</td><td>6.500000e+05</td><td>6.100000</td><td>3044.000000</td><td>2.000000</td><td>1.000000</td><td>1.000000</td><td>177.000000</td><td>93.000000</td><td>1940.000000</td><td>-37.856822</td><td>144.929600</td><td>4380.000000</td></tr>\n", 
" <tr><th>50%</th><td>3.000000</td><td>9.030000e+05</td><td>9.200000</td><td>3084.000000</td><td>3.000000</td><td>1.000000</td><td>2.000000</td><td>440.000000</td><td>126.000000</td><td>1970.000000</td><td>-37.802355</td><td>145.000100</td><td>6555.000000</td></tr>\n", 
" <tr><th>75%</th><td>3.000000</td><td>1.330000e+06</td><td>13.000000</td><td>3148.000000</td><td>3.000000</td><td>2.000000</td><td>2.000000</td><td>651.000000</td><td>174.000000</td><td>1999.000000</td><td>-37.756400</td><td>145.058305</td><td>10331.000000</td></tr>\n", 
" <tr><th>max</th><td>10.000000</td><td>9.000000e+06</td><td>48.100000</td><td>3977.000000</td><td>20.000000</td><td>8.000000</td><td>10.000000</td><td>433014.000000</td><td>44515.000000</td><td>2018.000000</td><td>-37.408530</td><td>145.526350</td><td>21650.000000</td></tr>\n", 
" </tbody>\n", 
"</table>\n", 
"</div>
" ], "text/plain": [ " Rooms Price Distance Postcode Bedroom2 \\\n", "count 13580.000000 1.358000e+04 13580.000000 13580.000000 13580.000000 \n", "mean 2.937997 1.075684e+06 10.137776 3105.301915 2.914728 \n", "std 0.955748 6.393107e+05 5.868725 90.676964 0.965921 \n", "min 1.000000 8.500000e+04 0.000000 3000.000000 0.000000 \n", "25% 2.000000 6.500000e+05 6.100000 3044.000000 2.000000 \n", "50% 3.000000 9.030000e+05 9.200000 3084.000000 3.000000 \n", "75% 3.000000 1.330000e+06 13.000000 3148.000000 3.000000 \n", "max 10.000000 9.000000e+06 48.100000 3977.000000 20.000000 \n", "\n", " Bathroom Car Landsize BuildingArea YearBuilt \\\n", "count 13580.000000 13518.000000 13580.000000 7130.000000 8205.000000 \n", "mean 1.534242 1.610075 558.416127 151.967650 1964.684217 \n", "std 0.691712 0.962634 3990.669241 541.014538 37.273762 \n", "min 0.000000 0.000000 0.000000 0.000000 1196.000000 \n", "25% 1.000000 1.000000 177.000000 93.000000 1940.000000 \n", "50% 1.000000 2.000000 440.000000 126.000000 1970.000000 \n", "75% 2.000000 2.000000 651.000000 174.000000 1999.000000 \n", "max 8.000000 10.000000 433014.000000 44515.000000 2018.000000 \n", "\n", " Lattitude Longtitude Propertycount \n", "count 13580.000000 13580.000000 13580.000000 \n", "mean -37.809203 144.995216 7454.417378 \n", "std 0.079260 0.103916 4378.581772 \n", "min -38.182550 144.431810 249.000000 \n", "25% -37.856822 144.929600 4380.000000 \n", "50% -37.802355 145.000100 6555.000000 \n", "75% -37.756400 145.058305 10331.000000 \n", "max -37.408530 145.526350 21650.000000 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# save filepath to variable for easier access\n", "melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'\n", "# read the data and store data in DataFrame titled melbourne_data\n", "melbourne_data = pd.read_csv(melbourne_file_path) \n", "# print a summary of the data in Melbourne data\n", "melbourne_data.describe()" ] }, { 
"cell_type": "markdown", "id": "potential-ordinary", "metadata": { "papermill": { "duration": 0.00519, "end_time": "2021-06-04T13:12:49.767034", "exception": false, "start_time": "2021-06-04T13:12:49.761844", "status": "completed" }, "tags": [] }, "source": [ "# Interpreting Data Description\n", "The results show 8 numbers for each column in your original dataset. The first number, the **count**, shows how many rows have non-missing values. \n", "\n", "Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.\n", "\n", "The second value is the **mean**, which is the average. Under that, **std** is the standard deviation, which measures how numerically spread out the values are.\n", "\n", "To interpret the **min**, **25%**, **50%**, **75%** and **max** values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the **25%** value (pronounced \"25th percentile\"). The 50th and 75th percentiles are defined analogously, and the **max** is the largest number.\n", "\n", "\n", "# Your Turn\n", "Get started with your **[first coding exercise](https://www.kaggle.com/kernels/fork/1258954)**" ] }, { "cell_type": "markdown", "id": "corrected-madrid", "metadata": { "papermill": { "duration": 0.005059, "end_time": "2021-06-04T13:12:49.777572", "exception": false, "start_time": "2021-06-04T13:12:49.772513", "status": "completed" }, "tags": [] }, "source": [ "---\n", "\n", "\n", "\n", "\n", "*Have questions or comments? 
Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161285) to chat with other Learners.*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" }, "papermill": { "default_parameters": {}, "duration": 7.794213, "end_time": "2021-06-04T13:12:51.331858", "environment_variables": {}, "exception": null, "input_path": "__notebook__.ipynb", "output_path": "__notebook__.ipynb", "parameters": {}, "start_time": "2021-06-04T13:12:43.537645", "version": "2.3.2" } }, "nbformat": 4, "nbformat_minor": 5 }