{ "cells": [ { "cell_type": "markdown", "id": "automatic-outline", "metadata": { "papermill": { "duration": 0.014651, "end_time": "2021-06-03T16:37:56.863936", "exception": false, "start_time": "2021-06-03T16:37:56.849285", "status": "completed" }, "tags": [] }, "source": [ "In this notebook, we're going to learn how to clean up inconsistent text entries.\n", "\n", "Let's get started!" ] }, { "cell_type": "markdown", "id": "listed-assistant", "metadata": { "papermill": { "duration": 0.01258, "end_time": "2021-06-03T16:37:56.889851", "exception": false, "start_time": "2021-06-03T16:37:56.877271", "status": "completed" }, "tags": [] }, "source": [ "# Get our environment set up\n", "\n", "The first thing we'll need to do is load in the libraries and dataset we'll be using. " ] }, { "cell_type": "code", "execution_count": 1, "id": "ideal-fountain", "metadata": { "execution": { "iopub.execute_input": "2021-06-03T16:37:56.920562Z", "iopub.status.busy": "2021-06-03T16:37:56.919322Z", "iopub.status.idle": "2021-06-03T16:37:56.982421Z", "shell.execute_reply": "2021-06-03T16:37:56.983042Z" }, "papermill": { "duration": 0.080465, "end_time": "2021-06-03T16:37:56.983410", "exception": false, "start_time": "2021-06-03T16:37:56.902945", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# modules we'll use\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# helpful modules\n", "import fuzzywuzzy\n", "from fuzzywuzzy import process\n", "import chardet\n", "\n", "# read in all our data\n", "professors = pd.read_csv(\"../input/pakistan-intellectual-capital/pakistan_intellectual_capital.csv\")\n", "\n", "# set seed for reproducibility\n", "np.random.seed(0)" ] }, { "cell_type": "markdown", "id": "modern-florence", "metadata": { "papermill": { "duration": 0.013841, "end_time": "2021-06-03T16:37:57.010919", "exception": false, "start_time": "2021-06-03T16:37:56.997078", "status": "completed" }, "tags": [] }, "source": [ "# Do some preliminary text pre-processing\n", "\n", "We'll begin by taking a quick look at the first few rows of the data." ] }, { "cell_type": "code", "execution_count": 2, "id": "heavy-album", "metadata": { "execution": { "iopub.execute_input": "2021-06-03T16:37:57.041684Z", "iopub.status.busy": "2021-06-03T16:37:57.040937Z", "iopub.status.idle": "2021-06-03T16:37:57.078378Z", "shell.execute_reply": "2021-06-03T16:37:57.079001Z" }, "papermill": { "duration": 0.054858, "end_time": "2021-06-03T16:37:57.079207", "exception": false, "start_time": "2021-06-03T16:37:57.024349", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0S#Teacher NameUniversity Currently TeachingDepartmentProvince University LocatedDesignationTerminal DegreeGraduated fromCountryYearArea of Specialization/Research InterestsOther Information
023Dr. Abdul BasitUniversity of BalochistanComputer Science & ITBalochistanAssistant ProfessorPhDAsian Institute of TechnologyThailandNaNSoftware Engineering & DBMSNaN
145Dr. Waheed NoorUniversity of BalochistanComputer Science & ITBalochistanAssistant ProfessorPhDAsian Institute of TechnologyThailandNaNDBMSNaN
256Dr. Junaid BaberUniversity of BalochistanComputer Science & ITBalochistanAssistant ProfessorPhDAsian Institute of TechnologyThailandNaNInformation processing, Multimedia miningNaN
367Dr. Maheen BakhtyarUniversity of BalochistanComputer Science & ITBalochistanAssistant ProfessorPhDAsian Institute of TechnologyThailandNaNNLP, Information Retrieval, Question Answering...NaN
42425Samina AzimSardar Bahadur Khan Women's UniversityComputer ScienceBalochistanLecturerBSBalochistan University of Information Technolo...Pakistan2005.0VLSI Electronics DLD DatabaseNaN
\n", "
" ], "text/plain": [ " Unnamed: 0 S# Teacher Name \\\n", "0 2 3 Dr. Abdul Basit \n", "1 4 5 Dr. Waheed Noor \n", "2 5 6 Dr. Junaid Baber \n", "3 6 7 Dr. Maheen Bakhtyar \n", "4 24 25 Samina Azim \n", "\n", " University Currently Teaching Department \\\n", "0 University of Balochistan Computer Science & IT \n", "1 University of Balochistan Computer Science & IT \n", "2 University of Balochistan Computer Science & IT \n", "3 University of Balochistan Computer Science & IT \n", "4 Sardar Bahadur Khan Women's University Computer Science \n", "\n", " Province University Located Designation Terminal Degree \\\n", "0 Balochistan Assistant Professor PhD \n", "1 Balochistan Assistant Professor PhD \n", "2 Balochistan Assistant Professor PhD \n", "3 Balochistan Assistant Professor PhD \n", "4 Balochistan Lecturer BS \n", "\n", " Graduated from Country Year \\\n", "0 Asian Institute of Technology Thailand NaN \n", "1 Asian Institute of Technology Thailand NaN \n", "2 Asian Institute of Technology Thailand NaN \n", "3 Asian Institute of Technology Thailand NaN \n", "4 Balochistan University of Information Technolo... Pakistan 2005.0 \n", "\n", " Area of Specialization/Research Interests Other Information \n", "0 Software Engineering & DBMS NaN \n", "1 DBMS NaN \n", "2 Information processing, Multimedia mining NaN \n", "3 NLP, Information Retrieval, Question Answering... NaN \n", "4 VLSI Electronics DLD Database NaN " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "professors.head()" ] }, { "cell_type": "markdown", "id": "sensitive-turkish", "metadata": { "papermill": { "duration": 0.013814, "end_time": "2021-06-03T16:37:57.107467", "exception": false, "start_time": "2021-06-03T16:37:57.093653", "status": "completed" }, "tags": [] }, "source": [ "Say we're interested in cleaning up the \"Country\" column to make sure there's no data entry inconsistencies in it. We could go through and check each row by hand, of course, and hand-correct inconsistencies when we find them. There's a more efficient way to do this, though!" ] }, { "cell_type": "code", "execution_count": 3, "id": "fitted-relief", "metadata": { "execution": { "iopub.execute_input": "2021-06-03T16:37:57.140832Z", "iopub.status.busy": "2021-06-03T16:37:57.140150Z", "iopub.status.idle": "2021-06-03T16:37:57.151629Z", "shell.execute_reply": "2021-06-03T16:37:57.151025Z" }, "papermill": { "duration": 0.030226, "end_time": "2021-06-03T16:37:57.151814", "exception": false, "start_time": "2021-06-03T16:37:57.121588", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([' Germany', ' New Zealand', ' Sweden', ' USA', 'Australia',\n", " 'Austria', 'Canada', 'China', 'Finland', 'France', 'Greece',\n", " 'HongKong', 'Ireland', 'Italy', 'Japan', 'Macau', 'Malaysia',\n", " 'Mauritius', 'Netherland', 'New Zealand', 'Norway', 'Pakistan',\n", " 'Portugal', 'Russian Federation', 'Saudi Arabia', 'Scotland',\n", " 'Singapore', 'South Korea', 'SouthKorea', 'Spain', 'Sweden',\n", " 'Thailand', 'Turkey', 'UK', 'USA', 'USofA', 'Urbana', 'germany'],\n", " dtype=object)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get all the unique values in the 'Country' column\n", "countries = professors['Country'].unique()\n", "\n", "# sort them alphabetically and then take a closer look\n", "countries.sort()\n", "countries" ] }, { "cell_type": "markdown", "id": "rough-sender", "metadata": { "papermill": { "duration": 0.016397, "end_time": "2021-06-03T16:37:57.182905", "exception": false, "start_time": "2021-06-03T16:37:57.166508", "status": "completed" }, "tags": [] }, "source": [ "Just looking at this, I can see some problems due to inconsistent data entry: ' Germany', and 'germany', for example, or ' New Zealand' and 'New Zealand'.\n", "\n", "The first thing I'm going to do is make everything lower case (I can change it back at the end if I like) and remove any white spaces at the beginning and end of cells. Inconsistencies in capitalizations and trailing white spaces are very common in text data and you can fix a good 80% of your text data entry inconsistencies by doing this." ] }, { "cell_type": "code", "execution_count": 4, "id": "linear-controversy", "metadata": { "execution": { "iopub.execute_input": "2021-06-03T16:37:57.221039Z", "iopub.status.busy": "2021-06-03T16:37:57.220281Z", "iopub.status.idle": "2021-06-03T16:37:57.223484Z", "shell.execute_reply": "2021-06-03T16:37:57.222831Z" }, "papermill": { "duration": 0.026284, "end_time": "2021-06-03T16:37:57.223649", "exception": false, "start_time": "2021-06-03T16:37:57.197365", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# convert to lower case\n", "professors['Country'] = professors['Country'].str.lower()\n", "# remove trailing white spaces\n", "professors['Country'] = professors['Country'].str.strip()" ] }, { "cell_type": "markdown", "id": "consistent-granny", "metadata": { "papermill": { "duration": 0.014033, "end_time": "2021-06-03T16:37:57.253078", "exception": false, "start_time": "2021-06-03T16:37:57.239045", "status": "completed" }, "tags": [] }, "source": [ "Next we're going to tackle more difficult inconsistencies.\n", "\n", "# Use fuzzy matching to correct inconsistent data entry\n", "\n", "Alright, let's take another look at the 'Country' column and see if there's any more data cleaning we need to do." ] }, { "cell_type": "code", "execution_count": 5, "id": "married-occupation", "metadata": { "execution": { "iopub.execute_input": "2021-06-03T16:37:57.289892Z", "iopub.status.busy": "2021-06-03T16:37:57.289169Z", "iopub.status.idle": "2021-06-03T16:37:57.293530Z", "shell.execute_reply": "2021-06-03T16:37:57.292915Z" }, "papermill": { "duration": 0.025909, "end_time": "2021-06-03T16:37:57.293693", "exception": false, "start_time": "2021-06-03T16:37:57.267784", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array(['australia', 'austria', 'canada', 'china', 'finland', 'france',\n", " 'germany', 'greece', 'hongkong', 'ireland', 'italy', 'japan',\n", " 'macau', 'malaysia', 'mauritius', 'netherland', 'new zealand',\n", " 'norway', 'pakistan', 'portugal', 'russian federation',\n", " 'saudi arabia', 'scotland', 'singapore', 'south korea',\n", " 'southkorea', 'spain', 'sweden', 'thailand', 'turkey', 'uk',\n", " 'urbana', 'usa', 'usofa'], dtype=object)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get all the unique values in the 'Country' column\n", "countries = professors['Country'].unique()\n", "\n", "# sort them alphabetically and then take a closer look\n", "countries.sort()\n", "countries" ] }, { "cell_type": "markdown", "id": "frozen-brake", "metadata": { "papermill": { "duration": 0.014568, "end_time": "2021-06-03T16:37:57.323461", "exception": false, "start_time": "2021-06-03T16:37:57.308893", "status": "completed" }, "tags": [] }, "source": [ "It does look like there is another inconsistency: 'southkorea' and 'south korea' should be the same. \n", "\n", "We're going to use the [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) package to help identify which strings are closest to each other. This dataset is small enough that we could probably could correct errors by hand, but that approach doesn't scale well. (Would you want to correct a thousand errors by hand? What about ten thousand? Automating things as early as possible is generally a good idea. Plus, it’s fun!)\n", "\n", "> **Fuzzy matching:** The process of automatically finding text strings that are very similar to the target string. In general, a string is considered \"closer\" to another one the fewer characters you'd need to change if you were transforming one string into another. So \"apple\" and \"snapple\" are two changes away from each other (add \"s\" and \"n\") while \"in\" and \"on\" and one change away (rplace \"i\" with \"o\"). You won't always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little time.\n", "\n", "Fuzzywuzzy returns a ratio given two strings. The closer the ratio is to 100, the smaller the edit distance between the two strings. Here, we're going to get the ten strings from our list of cities that have the closest distance to \"d.i khan\"." ] }, { "cell_type": "code", "execution_count": 6, "id": "rational-excitement", "metadata": { "execution": { "iopub.execute_input": "2021-06-03T16:37:57.362620Z", "iopub.status.busy": "2021-06-03T16:37:57.361589Z", "iopub.status.idle": "2021-06-03T16:37:57.366211Z", "shell.execute_reply": "2021-06-03T16:37:57.365525Z" }, "papermill": { "duration": 0.02748, "end_time": "2021-06-03T16:37:57.366374", "exception": false, "start_time": "2021-06-03T16:37:57.338894", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "[('south korea', 100),\n", " ('southkorea', 48),\n", " ('saudi arabia', 43),\n", " ('norway', 35),\n", " ('austria', 33),\n", " ('ireland', 33),\n", " ('pakistan', 32),\n", " ('portugal', 32),\n", " ('scotland', 32),\n", " ('australia', 30)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get the top 10 closest matches to \"south korea\"\n", "matches = fuzzywuzzy.process.extract(\"south korea\", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)\n", "\n", "# take a look at them\n", "matches" ] }, { "cell_type": "markdown", "id": "cardiovascular-vocabulary", "metadata": { "papermill": { "duration": 0.015035, "end_time": "2021-06-03T16:37:57.397065", "exception": false, "start_time": "2021-06-03T16:37:57.382030", "status": "completed" }, "tags": [] }, "source": [ "We can see that two of the items in the cities are very close to \"south korea\": \"south korea\" and \"southkorea\". Let's replace all rows in our \"Country\" column that have a ratio of > 47 with \"south korea\". \n", "\n", "To do this, I'm going to write a function. (It's a good idea to write a general purpose function you can reuse if you think you might have to do a specific task more than once or twice. This keeps you from having to copy and paste code too often, which saves time and can help prevent mistakes.)" ] }, { "cell_type": "code", "execution_count": 7, "id": "dried-glucose", "metadata": { "execution": { "iopub.execute_input": "2021-06-03T16:37:57.438979Z", "iopub.status.busy": "2021-06-03T16:37:57.437687Z", "iopub.status.idle": "2021-06-03T16:37:57.440863Z", "shell.execute_reply": "2021-06-03T16:37:57.440012Z" }, "papermill": { "duration": 0.028259, "end_time": "2021-06-03T16:37:57.441097", "exception": false, "start_time": "2021-06-03T16:37:57.412838", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# function to replace rows in the provided column of the provided dataframe\n", "# that match the provided string above the provided ratio with the provided string\n", "def replace_matches_in_column(df, column, string_to_match, min_ratio = 47):\n", " # get a list of unique strings\n", " strings = df[column].unique()\n", " \n", " # get the top 10 closest matches to our input string\n", " matches = fuzzywuzzy.process.extract(string_to_match, strings, \n", " limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)\n", "\n", " # only get matches with a ratio > 90\n", " close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]\n", "\n", " # get the rows of all the close matches in our dataframe\n", " rows_with_matches = df[column].isin(close_matches)\n", "\n", " # replace all rows with close matches with the input matches \n", " df.loc[rows_with_matches, column] = string_to_match\n", " \n", " # let us know the function's done\n", " print(\"All done!\")" ] }, { "cell_type": "markdown", "id": "removed-scholarship", "metadata": { "papermill": { "duration": 0.015468, "end_time": "2021-06-03T16:37:57.475371", "exception": false, "start_time": "2021-06-03T16:37:57.459903", "status": "completed" }, "tags": [] }, "source": [ "Now that we have a function, we can put it to the test!" ] }, { "cell_type": "code", "execution_count": 8, "id": "demonstrated-turkish", "metadata": { "execution": { "iopub.execute_input": "2021-06-03T16:37:57.515456Z", "iopub.status.busy": "2021-06-03T16:37:57.514514Z", "iopub.status.idle": "2021-06-03T16:37:57.519340Z", "shell.execute_reply": "2021-06-03T16:37:57.518777Z" }, "papermill": { "duration": 0.028261, "end_time": "2021-06-03T16:37:57.519495", "exception": false, "start_time": "2021-06-03T16:37:57.491234", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All done!\n" ] } ], "source": [ "# use the function we just wrote to replace close matches to \"south korea\" with \"south korea\"\n", "replace_matches_in_column(df=professors, column='Country', string_to_match=\"south korea\")" ] }, { "cell_type": "markdown", "id": "medieval-catering", "metadata": { "papermill": { "duration": 0.015937, "end_time": "2021-06-03T16:37:57.551793", "exception": false, "start_time": "2021-06-03T16:37:57.535856", "status": "completed" }, "tags": [] }, "source": [ "And now let's check the unique values in our \"Country\" column again and make sure we've tidied up \"south korea\" correctly." ] }, { "cell_type": "code", "execution_count": 9, "id": "cultural-shade", "metadata": { "execution": { "iopub.execute_input": "2021-06-03T16:37:57.591721Z", "iopub.status.busy": "2021-06-03T16:37:57.591022Z", "iopub.status.idle": "2021-06-03T16:37:57.595764Z", "shell.execute_reply": "2021-06-03T16:37:57.595118Z" }, "papermill": { "duration": 0.02783, "end_time": "2021-06-03T16:37:57.595918", "exception": false, "start_time": "2021-06-03T16:37:57.568088", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array(['australia', 'austria', 'canada', 'china', 'finland', 'france',\n", " 'germany', 'greece', 'hongkong', 'ireland', 'italy', 'japan',\n", " 'macau', 'malaysia', 'mauritius', 'netherland', 'new zealand',\n", " 'norway', 'pakistan', 'portugal', 'russian federation',\n", " 'saudi arabia', 'scotland', 'singapore', 'south korea', 'spain',\n", " 'sweden', 'thailand', 'turkey', 'uk', 'urbana', 'usa', 'usofa'],\n", " dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get all the unique values in the 'Country' column\n", "countries = professors['Country'].unique()\n", "\n", "# sort them alphabetically and then take a closer look\n", "countries.sort()\n", "countries" ] }, { "cell_type": "markdown", "id": "czech-partner", "metadata": { "papermill": { "duration": 0.016837, "end_time": "2021-06-03T16:37:57.629830", "exception": false, "start_time": "2021-06-03T16:37:57.612993", "status": "completed" }, "tags": [] }, "source": [ "Excellent! Now we only have \"south korea\" in our dataframe and we didn't have to change anything by hand. \n", "\n", "# Your turn!\n", "\n", "Therre are still some inconsistencies in the dataset. [**Continue cleaning the data**](https://www.kaggle.com/kernels/fork/10824407) in a final exercise." ] }, { "cell_type": "markdown", "id": "laughing-korea", "metadata": { "papermill": { "duration": 0.016185, "end_time": "2021-06-03T16:37:57.662743", "exception": false, "start_time": "2021-06-03T16:37:57.646558", "status": "completed" }, "tags": [] }, "source": [ "---\n", "\n", "\n", "\n", "\n", "*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/172650) to chat with other Learners.*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" }, "papermill": { "default_parameters": {}, "duration": 8.058355, "end_time": "2021-06-03T16:37:58.492921", "environment_variables": {}, "exception": null, "input_path": "__notebook__.ipynb", "output_path": "__notebook__.ipynb", "parameters": {}, "start_time": "2021-06-03T16:37:50.434566", "version": "2.3.2" } }, "nbformat": 4, "nbformat_minor": 5 }