{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "automatic-outline",
   "metadata": {
    "papermill": {
     "duration": 0.014651,
     "end_time": "2021-06-03T16:37:56.863936",
     "exception": false,
     "start_time": "2021-06-03T16:37:56.849285",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "In this notebook, we're going to learn how to clean up inconsistent text entries.\n",
    "\n",
    "Let's get started!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "listed-assistant",
   "metadata": {
    "papermill": {
     "duration": 0.01258,
     "end_time": "2021-06-03T16:37:56.889851",
     "exception": false,
     "start_time": "2021-06-03T16:37:56.877271",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Get our environment set up\n",
    "\n",
    "The first thing we'll need to do is load in the libraries and dataset we'll be using. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ideal-fountain",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-06-03T16:37:56.920562Z",
     "iopub.status.busy": "2021-06-03T16:37:56.919322Z",
     "iopub.status.idle": "2021-06-03T16:37:56.982421Z",
     "shell.execute_reply": "2021-06-03T16:37:56.983042Z"
    },
    "papermill": {
     "duration": 0.080465,
     "end_time": "2021-06-03T16:37:56.983410",
     "exception": false,
     "start_time": "2021-06-03T16:37:56.902945",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# modules we'll use\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# helpful modules\n",
    "import fuzzywuzzy\n",
    "from fuzzywuzzy import process\n",
    "import chardet\n",
    "\n",
    "# read in all our data\n",
    "professors = pd.read_csv(\"../input/pakistan-intellectual-capital/pakistan_intellectual_capital.csv\")\n",
    "\n",
    "# set seed for reproducibility\n",
    "np.random.seed(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "modern-florence",
   "metadata": {
    "papermill": {
     "duration": 0.013841,
     "end_time": "2021-06-03T16:37:57.010919",
     "exception": false,
     "start_time": "2021-06-03T16:37:56.997078",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Do some preliminary text pre-processing\n",
    "\n",
    "We'll begin by taking a quick look at the first few rows of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "heavy-album",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-06-03T16:37:57.041684Z",
     "iopub.status.busy": "2021-06-03T16:37:57.040937Z",
     "iopub.status.idle": "2021-06-03T16:37:57.078378Z",
     "shell.execute_reply": "2021-06-03T16:37:57.079001Z"
    },
    "papermill": {
     "duration": 0.054858,
     "end_time": "2021-06-03T16:37:57.079207",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.024349",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>S#</th>\n",
       "      <th>Teacher Name</th>\n",
       "      <th>University Currently Teaching</th>\n",
       "      <th>Department</th>\n",
       "      <th>Province University Located</th>\n",
       "      <th>Designation</th>\n",
       "      <th>Terminal Degree</th>\n",
       "      <th>Graduated from</th>\n",
       "      <th>Country</th>\n",
       "      <th>Year</th>\n",
       "      <th>Area of Specialization/Research Interests</th>\n",
       "      <th>Other Information</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>Dr. Abdul Basit</td>\n",
       "      <td>University of Balochistan</td>\n",
       "      <td>Computer Science &amp; IT</td>\n",
       "      <td>Balochistan</td>\n",
       "      <td>Assistant Professor</td>\n",
       "      <td>PhD</td>\n",
       "      <td>Asian Institute of Technology</td>\n",
       "      <td>Thailand</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Software Engineering &amp; DBMS</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>Dr. Waheed Noor</td>\n",
       "      <td>University of Balochistan</td>\n",
       "      <td>Computer Science &amp; IT</td>\n",
       "      <td>Balochistan</td>\n",
       "      <td>Assistant Professor</td>\n",
       "      <td>PhD</td>\n",
       "      <td>Asian Institute of Technology</td>\n",
       "      <td>Thailand</td>\n",
       "      <td>NaN</td>\n",
       "      <td>DBMS</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>Dr. Junaid Baber</td>\n",
       "      <td>University of Balochistan</td>\n",
       "      <td>Computer Science &amp; IT</td>\n",
       "      <td>Balochistan</td>\n",
       "      <td>Assistant Professor</td>\n",
       "      <td>PhD</td>\n",
       "      <td>Asian Institute of Technology</td>\n",
       "      <td>Thailand</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Information processing, Multimedia mining</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "      <td>Dr. Maheen Bakhtyar</td>\n",
       "      <td>University of Balochistan</td>\n",
       "      <td>Computer Science &amp; IT</td>\n",
       "      <td>Balochistan</td>\n",
       "      <td>Assistant Professor</td>\n",
       "      <td>PhD</td>\n",
       "      <td>Asian Institute of Technology</td>\n",
       "      <td>Thailand</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NLP, Information Retrieval, Question Answering...</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>24</td>\n",
       "      <td>25</td>\n",
       "      <td>Samina Azim</td>\n",
       "      <td>Sardar Bahadur Khan Women's University</td>\n",
       "      <td>Computer Science</td>\n",
       "      <td>Balochistan</td>\n",
       "      <td>Lecturer</td>\n",
       "      <td>BS</td>\n",
       "      <td>Balochistan University of Information Technolo...</td>\n",
       "      <td>Pakistan</td>\n",
       "      <td>2005.0</td>\n",
       "      <td>VLSI Electronics DLD Database</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Unnamed: 0  S#         Teacher Name  \\\n",
       "0           2   3      Dr. Abdul Basit   \n",
       "1           4   5      Dr. Waheed Noor   \n",
       "2           5   6     Dr. Junaid Baber   \n",
       "3           6   7  Dr. Maheen Bakhtyar   \n",
       "4          24  25          Samina Azim   \n",
       "\n",
       "            University Currently Teaching             Department  \\\n",
       "0               University of Balochistan  Computer Science & IT   \n",
       "1               University of Balochistan  Computer Science & IT   \n",
       "2               University of Balochistan  Computer Science & IT   \n",
       "3               University of Balochistan  Computer Science & IT   \n",
       "4  Sardar Bahadur Khan Women's University       Computer Science   \n",
       "\n",
       "  Province University Located          Designation Terminal Degree  \\\n",
       "0                 Balochistan  Assistant Professor             PhD   \n",
       "1                 Balochistan  Assistant Professor             PhD   \n",
       "2                 Balochistan  Assistant Professor             PhD   \n",
       "3                 Balochistan  Assistant Professor             PhD   \n",
       "4                 Balochistan             Lecturer              BS   \n",
       "\n",
       "                                      Graduated from   Country    Year  \\\n",
       "0                      Asian Institute of Technology  Thailand     NaN   \n",
       "1                      Asian Institute of Technology  Thailand     NaN   \n",
       "2                      Asian Institute of Technology  Thailand     NaN   \n",
       "3                      Asian Institute of Technology  Thailand     NaN   \n",
       "4  Balochistan University of Information Technolo...  Pakistan  2005.0   \n",
       "\n",
       "           Area of Specialization/Research Interests Other Information  \n",
       "0                        Software Engineering & DBMS               NaN  \n",
       "1                                               DBMS               NaN  \n",
       "2          Information processing, Multimedia mining               NaN  \n",
       "3  NLP, Information Retrieval, Question Answering...               NaN  \n",
       "4                      VLSI Electronics DLD Database               NaN  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "professors.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "sensitive-turkish",
   "metadata": {
    "papermill": {
     "duration": 0.013814,
     "end_time": "2021-06-03T16:37:57.107467",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.093653",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Say we're interested in cleaning up the \"Country\" column to make sure there's no data entry inconsistencies in it. We could go through and check each row by hand, of course, and hand-correct inconsistencies when we find them. There's a more efficient way to do this, though!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "fitted-relief",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-06-03T16:37:57.140832Z",
     "iopub.status.busy": "2021-06-03T16:37:57.140150Z",
     "iopub.status.idle": "2021-06-03T16:37:57.151629Z",
     "shell.execute_reply": "2021-06-03T16:37:57.151025Z"
    },
    "papermill": {
     "duration": 0.030226,
     "end_time": "2021-06-03T16:37:57.151814",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.121588",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([' Germany', ' New Zealand', ' Sweden', ' USA', 'Australia',\n",
       "       'Austria', 'Canada', 'China', 'Finland', 'France', 'Greece',\n",
       "       'HongKong', 'Ireland', 'Italy', 'Japan', 'Macau', 'Malaysia',\n",
       "       'Mauritius', 'Netherland', 'New Zealand', 'Norway', 'Pakistan',\n",
       "       'Portugal', 'Russian Federation', 'Saudi Arabia', 'Scotland',\n",
       "       'Singapore', 'South Korea', 'SouthKorea', 'Spain', 'Sweden',\n",
       "       'Thailand', 'Turkey', 'UK', 'USA', 'USofA', 'Urbana', 'germany'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get all the unique values in the 'Country' column\n",
    "countries = professors['Country'].unique()\n",
    "\n",
    "# sort them alphabetically and then take a closer look\n",
    "countries.sort()\n",
    "countries"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "rough-sender",
   "metadata": {
    "papermill": {
     "duration": 0.016397,
     "end_time": "2021-06-03T16:37:57.182905",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.166508",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Just looking at this, I can see some problems due to inconsistent data entry: ' Germany', and 'germany', for example, or ' New Zealand' and 'New Zealand'.\n",
    "\n",
    "The first thing I'm going to do is make everything lower case (I can change it back at the end if I like) and remove any white spaces at the beginning and end of cells. Inconsistencies in capitalizations and trailing white spaces are very common in text data and you can fix a good 80% of your text data entry inconsistencies by doing this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "linear-controversy",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-06-03T16:37:57.221039Z",
     "iopub.status.busy": "2021-06-03T16:37:57.220281Z",
     "iopub.status.idle": "2021-06-03T16:37:57.223484Z",
     "shell.execute_reply": "2021-06-03T16:37:57.222831Z"
    },
    "papermill": {
     "duration": 0.026284,
     "end_time": "2021-06-03T16:37:57.223649",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.197365",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# convert to lower case\n",
    "professors['Country'] = professors['Country'].str.lower()\n",
    "# remove trailing white spaces\n",
    "professors['Country'] = professors['Country'].str.strip()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "consistent-granny",
   "metadata": {
    "papermill": {
     "duration": 0.014033,
     "end_time": "2021-06-03T16:37:57.253078",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.239045",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Next we're going to tackle more difficult inconsistencies.\n",
    "\n",
    "# Use fuzzy matching to correct inconsistent data entry\n",
    "\n",
    "Alright, let's take another look at the 'Country' column and see if there's any more data cleaning we need to do."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "married-occupation",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-06-03T16:37:57.289892Z",
     "iopub.status.busy": "2021-06-03T16:37:57.289169Z",
     "iopub.status.idle": "2021-06-03T16:37:57.293530Z",
     "shell.execute_reply": "2021-06-03T16:37:57.292915Z"
    },
    "papermill": {
     "duration": 0.025909,
     "end_time": "2021-06-03T16:37:57.293693",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.267784",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['australia', 'austria', 'canada', 'china', 'finland', 'france',\n",
       "       'germany', 'greece', 'hongkong', 'ireland', 'italy', 'japan',\n",
       "       'macau', 'malaysia', 'mauritius', 'netherland', 'new zealand',\n",
       "       'norway', 'pakistan', 'portugal', 'russian federation',\n",
       "       'saudi arabia', 'scotland', 'singapore', 'south korea',\n",
       "       'southkorea', 'spain', 'sweden', 'thailand', 'turkey', 'uk',\n",
       "       'urbana', 'usa', 'usofa'], dtype=object)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get all the unique values in the 'Country' column\n",
    "countries = professors['Country'].unique()\n",
    "\n",
    "# sort them alphabetically and then take a closer look\n",
    "countries.sort()\n",
    "countries"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "frozen-brake",
   "metadata": {
    "papermill": {
     "duration": 0.014568,
     "end_time": "2021-06-03T16:37:57.323461",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.308893",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "It does look like there is another inconsistency: 'southkorea' and 'south korea' should be the same. \n",
    "\n",
    "We're going to use the [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) package to help identify which strings are closest to each other. This dataset is small enough that we could probably could correct errors by hand, but that approach doesn't scale well. (Would you want to correct a thousand errors by hand? What about ten thousand? Automating things as early as possible is generally a good idea. Plus, it’s fun!)\n",
    "\n",
    "> **Fuzzy matching:** The process of automatically finding text strings that are very similar to the target string. In general, a string is considered \"closer\" to another one the fewer characters you'd need to change if you were transforming one string into another. So \"apple\" and \"snapple\" are two changes away from each other (add \"s\" and \"n\") while \"in\" and \"on\" and one change away (rplace \"i\" with \"o\"). You won't always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little time.\n",
    "\n",
    "Fuzzywuzzy returns a ratio given two strings. The closer the ratio is to 100, the smaller the edit distance between the two strings. Here, we're going to get the ten strings from our list of cities that have the closest distance to \"d.i khan\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "rational-excitement",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-06-03T16:37:57.362620Z",
     "iopub.status.busy": "2021-06-03T16:37:57.361589Z",
     "iopub.status.idle": "2021-06-03T16:37:57.366211Z",
     "shell.execute_reply": "2021-06-03T16:37:57.365525Z"
    },
    "papermill": {
     "duration": 0.02748,
     "end_time": "2021-06-03T16:37:57.366374",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.338894",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('south korea', 100),\n",
       " ('southkorea', 48),\n",
       " ('saudi arabia', 43),\n",
       " ('norway', 35),\n",
       " ('austria', 33),\n",
       " ('ireland', 33),\n",
       " ('pakistan', 32),\n",
       " ('portugal', 32),\n",
       " ('scotland', 32),\n",
       " ('australia', 30)]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get the top 10 closest matches to \"south korea\"\n",
    "matches = fuzzywuzzy.process.extract(\"south korea\", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)\n",
    "\n",
    "# take a look at them\n",
    "matches"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cardiovascular-vocabulary",
   "metadata": {
    "papermill": {
     "duration": 0.015035,
     "end_time": "2021-06-03T16:37:57.397065",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.382030",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We can see that two of the items in the cities are very close to \"south korea\": \"south korea\" and \"southkorea\". Let's replace all rows in our \"Country\" column that have a ratio of > 47 with \"south korea\". \n",
    "\n",
    "To do this, I'm going to write a function. (It's a good idea to write a general purpose function you can reuse if you think you might have to do a specific task more than once or twice. This keeps you from having to copy and paste code too often, which saves time and can help prevent mistakes.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "dried-glucose",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-06-03T16:37:57.438979Z",
     "iopub.status.busy": "2021-06-03T16:37:57.437687Z",
     "iopub.status.idle": "2021-06-03T16:37:57.440863Z",
     "shell.execute_reply": "2021-06-03T16:37:57.440012Z"
    },
    "papermill": {
     "duration": 0.028259,
     "end_time": "2021-06-03T16:37:57.441097",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.412838",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# function to replace rows in the provided column of the provided dataframe\n",
    "# that match the provided string above the provided ratio with the provided string\n",
    "def replace_matches_in_column(df, column, string_to_match, min_ratio = 47):\n",
    "    # get a list of unique strings\n",
    "    strings = df[column].unique()\n",
    "    \n",
    "    # get the top 10 closest matches to our input string\n",
    "    matches = fuzzywuzzy.process.extract(string_to_match, strings, \n",
    "                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)\n",
    "\n",
    "    # only get matches with a ratio > 90\n",
    "    close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]\n",
    "\n",
    "    # get the rows of all the close matches in our dataframe\n",
    "    rows_with_matches = df[column].isin(close_matches)\n",
    "\n",
    "    # replace all rows with close matches with the input matches \n",
    "    df.loc[rows_with_matches, column] = string_to_match\n",
    "    \n",
    "    # let us know the function's done\n",
    "    print(\"All done!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "removed-scholarship",
   "metadata": {
    "papermill": {
     "duration": 0.015468,
     "end_time": "2021-06-03T16:37:57.475371",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.459903",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Now that we have a function, we can put it to the test!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "demonstrated-turkish",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-06-03T16:37:57.515456Z",
     "iopub.status.busy": "2021-06-03T16:37:57.514514Z",
     "iopub.status.idle": "2021-06-03T16:37:57.519340Z",
     "shell.execute_reply": "2021-06-03T16:37:57.518777Z"
    },
    "papermill": {
     "duration": 0.028261,
     "end_time": "2021-06-03T16:37:57.519495",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.491234",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "All done!\n"
     ]
    }
   ],
   "source": [
    "# use the function we just wrote to replace close matches to \"south korea\" with \"south korea\"\n",
    "replace_matches_in_column(df=professors, column='Country', string_to_match=\"south korea\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "medieval-catering",
   "metadata": {
    "papermill": {
     "duration": 0.015937,
     "end_time": "2021-06-03T16:37:57.551793",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.535856",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "And now let's check the unique values in our \"Country\" column again and make sure we've tidied up \"south korea\" correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "cultural-shade",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-06-03T16:37:57.591721Z",
     "iopub.status.busy": "2021-06-03T16:37:57.591022Z",
     "iopub.status.idle": "2021-06-03T16:37:57.595764Z",
     "shell.execute_reply": "2021-06-03T16:37:57.595118Z"
    },
    "papermill": {
     "duration": 0.02783,
     "end_time": "2021-06-03T16:37:57.595918",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.568088",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['australia', 'austria', 'canada', 'china', 'finland', 'france',\n",
       "       'germany', 'greece', 'hongkong', 'ireland', 'italy', 'japan',\n",
       "       'macau', 'malaysia', 'mauritius', 'netherland', 'new zealand',\n",
       "       'norway', 'pakistan', 'portugal', 'russian federation',\n",
       "       'saudi arabia', 'scotland', 'singapore', 'south korea', 'spain',\n",
       "       'sweden', 'thailand', 'turkey', 'uk', 'urbana', 'usa', 'usofa'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get all the unique values in the 'Country' column\n",
    "countries = professors['Country'].unique()\n",
    "\n",
    "# sort them alphabetically and then take a closer look\n",
    "countries.sort()\n",
    "countries"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "czech-partner",
   "metadata": {
    "papermill": {
     "duration": 0.016837,
     "end_time": "2021-06-03T16:37:57.629830",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.612993",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Excellent! Now we only have \"south korea\" in our dataframe and we didn't have to change anything by hand. \n",
    "\n",
    "# Your turn!\n",
    "\n",
    "Therre are still some inconsistencies in the dataset.  [**Continue cleaning the data**](https://www.kaggle.com/kernels/fork/10824407) in a final exercise."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "laughing-korea",
   "metadata": {
    "papermill": {
     "duration": 0.016185,
     "end_time": "2021-06-03T16:37:57.662743",
     "exception": false,
     "start_time": "2021-06-03T16:37:57.646558",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "---\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/172650) to chat with other Learners.*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 8.058355,
   "end_time": "2021-06-03T16:37:58.492921",
   "environment_variables": {},
   "exception": null,
   "input_path": "__notebook__.ipynb",
   "output_path": "__notebook__.ipynb",
   "parameters": {},
   "start_time": "2021-06-03T16:37:50.434566",
   "version": "2.3.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}