{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**This notebook is an exercise in the [Pandas](https://www.kaggle.com/learn/pandas) course. You can reference the tutorial at [this link](https://www.kaggle.com/residentmario/grouping-and-sorting).**\n", "\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "In these exercises we'll apply groupwise analysis to our dataset.\n", "\n", "Run the code cell below to load the data before running the exercises." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setup complete.\n" ] } ], "source": [ "import pandas as pd\n", "\n", "reviews = pd.read_csv(\"../input/wine-reviews/winemag-data-130k-v2.csv\", index_col=0)\n", "#pd.set_option(\"display.max_rows\", 5)\n", "\n", "from learntools.core import binder; binder.bind(globals())\n", "from learntools.pandas.grouping_and_sorting import *\n", "print(\"Setup complete.\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrydescriptiondesignationpointspriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
0ItalyAromas include tropical fruit, broom, brimston...Vulkà Bianco87NaNSicily & SardiniaEtnaNaNKerin O’Keefe@kerinokeefeNicosia 2013 Vulkà Bianco (Etna)White BlendNicosia
1PortugalThis is ripe and fruity, a wine that is smooth...Avidagos8715.0DouroNaNNaNRoger Voss@vossrogerQuinta dos Avidagos 2011 Avidagos Red (Douro)Portuguese RedQuinta dos Avidagos
2USTart and snappy, the flavors of lime flesh and...NaN8714.0OregonWillamette ValleyWillamette ValleyPaul Gregutt@paulgwineRainstorm 2013 Pinot Gris (Willamette Valley)Pinot GrisRainstorm
3USPineapple rind, lemon pith and orange blossom ...Reserve Late Harvest8713.0MichiganLake Michigan ShoreNaNAlexander PeartreeNaNSt. Julian 2013 Reserve Late Harvest Riesling ...RieslingSt. Julian
4USMuch like the regular bottling from 2012, this...Vintner's Reserve Wild Child Block8765.0OregonWillamette ValleyWillamette ValleyPaul Gregutt@paulgwineSweet Cheeks 2012 Vintner's Reserve Wild Child...Pinot NoirSweet Cheeks
\n", "
" ], "text/plain": [ " country description \\\n", "0 Italy Aromas include tropical fruit, broom, brimston... \n", "1 Portugal This is ripe and fruity, a wine that is smooth... \n", "2 US Tart and snappy, the flavors of lime flesh and... \n", "3 US Pineapple rind, lemon pith and orange blossom ... \n", "4 US Much like the regular bottling from 2012, this... \n", "\n", " designation points price province \\\n", "0 Vulkà Bianco 87 NaN Sicily & Sardinia \n", "1 Avidagos 87 15.0 Douro \n", "2 NaN 87 14.0 Oregon \n", "3 Reserve Late Harvest 87 13.0 Michigan \n", "4 Vintner's Reserve Wild Child Block 87 65.0 Oregon \n", "\n", " region_1 region_2 taster_name \\\n", "0 Etna NaN Kerin O’Keefe \n", "1 NaN NaN Roger Voss \n", "2 Willamette Valley Willamette Valley Paul Gregutt \n", "3 Lake Michigan Shore NaN Alexander Peartree \n", "4 Willamette Valley Willamette Valley Paul Gregutt \n", "\n", " taster_twitter_handle title \\\n", "0 @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) \n", "1 @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) \n", "2 @paulgwine  Rainstorm 2013 Pinot Gris (Willamette Valley) \n", "3 NaN St. Julian 2013 Reserve Late Harvest Riesling ... \n", "4 @paulgwine  Sweet Cheeks 2012 Vintner's Reserve Wild Child... \n", "\n", " variety winery \n", "0 White Blend Nicosia \n", "1 Portuguese Red Quinta dos Avidagos \n", "2 Pinot Gris Rainstorm \n", "3 Riesling St. Julian \n", "4 Pinot Noir Sweet Cheeks " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.\n", "Who are the most common wine reviewers in the dataset? Create a `Series` whose index is the `taster_twitter_handle` category from the dataset, and whose values count how many reviews each person wrote." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.16666666666666666, \"interactionType\": 1, \"questionType\": 1, \"questionId\": \"1_ReviewsWritten\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Correct: \n", "\n", "\n", "```python\n", "reviews_written = reviews.groupby('taster_twitter_handle').size()\n", "```\n", "or\n", "```python\n", "reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()\n", "```\n" ], "text/plain": [ "Correct: \n", "\n", "\n", "```python\n", "reviews_written = reviews.groupby('taster_twitter_handle').size()\n", "```\n", "or\n", "```python\n", "reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()\n", "```" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Your code here\n", "reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()\n", "# reviews_written = reviews.groupby('taster_twitter_handle').size()\n", "\n", "\n", "# Check your answer\n", "q1.check()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"interactionType\": 2, \"questionType\": 1, \"questionId\": \"1_ReviewsWritten\", \"learnToolsVersion\": \"0.3.4\", \"valueTowardsCompletion\": 0.0, \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\", \"outcomeType\": 4}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Hint: Use the `groupby` operation and `size()` (or `count()`)." ], "text/plain": [ "Hint: Use the `groupby` operation and `size()` (or `count()`)." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "q1.hint()\n", "#q1.solution()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.\n", "What is the best wine I can buy for a given amount of money? Create a `Series` whose index is wine prices and whose values is the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that `4.0` dollars is at the top and `3300.0` dollars is at the bottom)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.16666666666666666, \"interactionType\": 1, \"questionType\": 1, \"questionId\": \"2_BestRatingPerPrice\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Correct" ], "text/plain": [ "Correct" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "best_rating_per_price = reviews.groupby('price').points.max()\n", "\n", "# Check your answer\n", "q2.check()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#q2.hint()\n", "#q2.solution()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.\n", "What are the minimum and maximum prices for each `variety` of wine? Create a `DataFrame` whose index is the `variety` category from the dataset and whose values are the `min` and `max` values thereof." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.16666666666666666, \"interactionType\": 1, \"questionType\": 1, \"questionId\": \"3_PriceExtremes\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Correct" ], "text/plain": [ "Correct" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "price_extremes = reviews.groupby('variety').price.agg(['min', 'max'])\n", "\n", "# Check your answer\n", "q3.check()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"interactionType\": 2, \"questionType\": 1, \"questionId\": \"3_PriceExtremes\", \"learnToolsVersion\": \"0.3.4\", \"valueTowardsCompletion\": 0.0, \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\", \"outcomeType\": 4}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Hint: Use `agg()`." ], "text/plain": [ "Hint: Use `agg()`." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"interactionType\": 3, \"questionType\": 1, \"questionId\": \"3_PriceExtremes\", \"learnToolsVersion\": \"0.3.4\", \"valueTowardsCompletion\": 0.0, \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\", \"outcomeType\": 4}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Solution: \n", "```python\n", "price_extremes = reviews.groupby('variety').price.agg([min, max])\n", "```" ], "text/plain": [ "Solution: \n", "```python\n", "price_extremes = reviews.groupby('variety').price.agg([min, max])\n", "```" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "q3.hint()\n", "q3.solution()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.\n", "What are the most expensive wine varieties? Create a variable `sorted_varieties` containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties)." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.16666666666666666, \"interactionType\": 1, \"questionType\": 1, \"questionId\": \"4_SortedVarieties\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Correct" ], "text/plain": [ "Correct" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sorted_varieties = price_extremes.sort_values(['min', 'max'], ascending=False)\n", "\n", "# Check your answer\n", "q4.check()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#q4.hint()\n", "#q4.solution()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.\n", "Create a `Series` whose index is reviewers and whose values is the average review score given out by that reviewer. Hint: you will need the `taster_name` and `points` columns." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "85.855421686747\n" ] }, { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.16666666666666666, \"interactionType\": 1, \"questionType\": 1, \"questionId\": \"5_ReviewerMeanRatings\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Correct" ], "text/plain": [ "Correct" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "reviewer_mean_ratings = reviews.groupby('taster_name').points.mean()\n", "\n", "# Check your answer\n", "q5.check()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#q5.hint()\n", "#q5.solution()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Are there significant differences in the average scores assigned by the various reviewers? Run the cell below to use the `describe()` method to see a summary of the range of values." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 19.000000\n", "mean 88.233026\n", "std 1.243610\n", "min 85.855422\n", "25% 87.323501\n", "50% 88.536235\n", "75% 88.975256\n", "max 90.562551\n", "Name: points, dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviewer_mean_ratings.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6.\n", "What combination of countries and varieties are most common? Create a `Series` whose index is a `MultiIndex`of `{country, variety}` pairs. For example, a pinot noir produced in the US should map to `{\"US\", \"Pinot Noir\"}`. Sort the values in the `Series` in descending order based on wine count." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.16666666666666666, \"interactionType\": 1, \"questionType\": 1, \"questionId\": \"6_GroupbyCountryVariety\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Correct" ], "text/plain": [ "Correct" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "country_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)\n", "\n", "# Check your answer\n", "q6.check()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"interactionType\": 2, \"questionType\": 1, \"questionId\": \"6_GroupbyCountryVariety\", \"learnToolsVersion\": \"0.3.4\", \"valueTowardsCompletion\": 0.0, \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\", \"outcomeType\": 4}}, \"*\")" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "Hint: Use `groupby()`, and provide a list of columns to group by. You may also find `size()` and `sort_values()` to be useful!" ], "text/plain": [ "Hint: Use `groupby()`, and provide a list of columns to group by. You may also find `size()` and `sort_values()` to be useful!" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "q6.hint()\n", "#q6.solution()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Keep going\n", "\n", "Move on to the [**data types and missing data**](https://www.kaggle.com/residentmario/data-types-and-missing-values)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "\n", "\n", "\n", "*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161299) to chat with other Learners.*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 4 }