{ "cells": [ { "cell_type": "markdown", "id": "automatic-outline", "metadata": { "papermill": { "duration": 0.014651, "end_time": "2021-06-03T16:37:56.863936", "exception": false, "start_time": "2021-06-03T16:37:56.849285", "status": "completed" }, "tags": [] }, "source": [ "In this notebook, we're going to learn how to clean up inconsistent text entries.\n", "\n", "Let's get started!" ] }, { "cell_type": "markdown", "id": "listed-assistant", "metadata": { "papermill": { "duration": 0.01258, "end_time": "2021-06-03T16:37:56.889851", "exception": false, "start_time": "2021-06-03T16:37:56.877271", "status": "completed" }, "tags": [] }, "source": [ "# Get our environment set up\n", "\n", "The first thing we'll need to do is load in the libraries and dataset we'll be using. " ] }, { "cell_type": "code", "execution_count": 1, "id": "ideal-fountain", "metadata": { "execution": { "iopub.execute_input": "2021-06-03T16:37:56.920562Z", "iopub.status.busy": "2021-06-03T16:37:56.919322Z", "iopub.status.idle": "2021-06-03T16:37:56.982421Z", "shell.execute_reply": "2021-06-03T16:37:56.983042Z" }, "papermill": { "duration": 0.080465, "end_time": "2021-06-03T16:37:56.983410", "exception": false, "start_time": "2021-06-03T16:37:56.902945", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# modules we'll use\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# helpful modules\n", "import fuzzywuzzy\n", "from fuzzywuzzy import process\n", "import chardet\n", "\n", "# read in all our data\n", "professors = pd.read_csv(\"../input/pakistan-intellectual-capital/pakistan_intellectual_capital.csv\")\n", "\n", "# set seed for reproducibility\n", "np.random.seed(0)" ] }, { "cell_type": "markdown", "id": "modern-florence", "metadata": { "papermill": { "duration": 0.013841, "end_time": "2021-06-03T16:37:57.010919", "exception": false, "start_time": "2021-06-03T16:37:56.997078", "status": "completed" }, "tags": [] }, "source": [ "# Do some preliminary text 
pre-processing\n", "\n", "We'll begin by taking a quick look at the first few rows of the data." ] }, { "cell_type": "code", "execution_count": 2, "id": "heavy-album", "metadata": { "execution": { "iopub.execute_input": "2021-06-03T16:37:57.041684Z", "iopub.status.busy": "2021-06-03T16:37:57.040937Z", "iopub.status.idle": "2021-06-03T16:37:57.078378Z", "shell.execute_reply": "2021-06-03T16:37:57.079001Z" }, "papermill": { "duration": 0.054858, "end_time": "2021-06-03T16:37:57.079207", "exception": false, "start_time": "2021-06-03T16:37:57.024349", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
|   | Unnamed: 0 | S# | Teacher Name | University Currently Teaching | Department | Province University Located | Designation | Terminal Degree | Graduated from | Country | Year | Area of Specialization/Research Interests | Other Information |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3 | Dr. Abdul Basit | University of Balochistan | Computer Science & IT | Balochistan | Assistant Professor | PhD | Asian Institute of Technology | Thailand | NaN | Software Engineering & DBMS | NaN |
| 1 | 4 | 5 | Dr. Waheed Noor | University of Balochistan | Computer Science & IT | Balochistan | Assistant Professor | PhD | Asian Institute of Technology | Thailand | NaN | DBMS | NaN |
| 2 | 5 | 6 | Dr. Junaid Baber | University of Balochistan | Computer Science & IT | Balochistan | Assistant Professor | PhD | Asian Institute of Technology | Thailand | NaN | Information processing, Multimedia mining | NaN |
| 3 | 6 | 7 | Dr. Maheen Bakhtyar | University of Balochistan | Computer Science & IT | Balochistan | Assistant Professor | PhD | Asian Institute of Technology | Thailand | NaN | NLP, Information Retrieval, Question Answering... | NaN |
| 4 | 24 | 25 | Samina Azim | Sardar Bahadur Khan Women's University | Computer Science | Balochistan | Lecturer | BS | Balochistan University of Information Technolo... | Pakistan | 2005.0 | VLSI Electronics DLD Database | NaN |