This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.
In this exercise, you'll apply what you learned in the Inconsistent data entry tutorial.
The questions below will give you feedback on your work. Run the following cell to set up the feedback system.
from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex5 import *
print("Setup Complete")
Setup Complete
The first thing we'll need to do is load in the libraries and dataset we'll be using. We use the same dataset from the tutorial.
# modules we'll use
import pandas as pd
import numpy as np
# helpful modules
import fuzzywuzzy
from fuzzywuzzy import process
import chardet
# read in all our data
professors = pd.read_csv("../input/pakistan-intellectual-capital/pakistan_intellectual_capital.csv")
# set seed for reproducibility
np.random.seed(0)
Next, we'll redo all of the work that we did in the tutorial.
# convert to lower case
professors['Country'] = professors['Country'].str.lower()
# remove trailing white spaces
professors['Country'] = professors['Country'].str.strip()
# get the top 10 closest matches to "south korea"
countries = professors['Country'].unique()
matches = fuzzywuzzy.process.extract("south korea", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
def replace_matches_in_column(df, column, string_to_match, min_ratio = 47):
# get a list of unique strings
strings = df[column].unique()
# get the top 10 closest matches to our input string
matches = fuzzywuzzy.process.extract(string_to_match, strings,
limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
# only get matches with a ratio > 90
close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]
# get the rows of all the close matches in our dataframe
rows_with_matches = df[column].isin(close_matches)
# replace all rows with close matches with the input matches
df.loc[rows_with_matches, column] = string_to_match
# let us know the function's done
print("All done!")
replace_matches_in_column(df=professors, column='Country', string_to_match="south korea")
countries = professors['Country'].unique()
All done!
Write code below to take a look at all the unique values in the "Graduated from" column.
graduated_from = professors['Graduated from'].unique()
graduated_from.sort()
graduated_from
array([' Columbia University', ' Delft University of Technology', ' Iowa State University', ' University of Central Florida', ' University of Innsbruck', ' University of Texas at Arlington (UTA)', ' University of Turin', 'Abasyn University', 'Abdul Wali Khan University, Mardan', 'Abdus Salam School of Mathematical Sciences,GC University', 'Agricultural University Peshawar', 'Allama Iqbal Open University', 'Asian Institute of Technology', 'Aston University, Birmingham', 'Australian National University, Caneberra', 'BUKC', 'Bahauddin Zakariya University', 'Bahria University', 'Bahria University,Islamabad', 'Balochistan University of Information Technology, Engineering and Management Sciences', 'Barani Institute of Information Technology', 'Beaconhouse National University', 'Beihang University', 'Beijing Institute of Technology', 'Beijing Institute of Technology Beijing', 'Beijing University of Posts & Telecommunications', 'Biztek Institute Of Business & Technology,Karachi', 'Blekinge Institute of Technology', 'Brock University Canada', 'Brunel University', 'CECOS University of Information Technology and Emerging Sciences,Peshawar', 'COMSATS Institute of Information Technology', 'COMSATS Institute of Information Technology,Islamabad', 'COMSATS Institute of Information Technology,Lahore', 'COMSATS Institute of Information Technology,Vehari', 'COMSATS Institute of Information Technology,Wah Cantt', 'California State University', 'Capital University of Science & Technology', 'Capital University of Science and Technology', 'Carnegie Mellon University, Pittsburgh', 'Centre for Advanced Studies in Engineering', 'Chalmers University of Technology', 'Chinese Academy of Sciences', 'Chosun University', 'City University of Science and Technology', 'Colorado State University', 'Colorado Technical University', 'Columbia University', 'Concordia University,Montreal', 'Coventry University', 'Cranfield University', 'DUET,Karachi', 'DePaul University, Chicago', 'Dresden University Of Technology, Dresden\xa0', 'Eindhoven University of Technology (TU/e)', 'FAST– National University of Computer and Emerging Sciences', 'FAST– National University of Computer and Emerging Sciences,Chiniot-Faisalabad', 'FAST– National University of Computer and Emerging Sciences,Islamabad', 'FAST– National University of Computer and Emerging Sciences,Lahore', 'FAST– National University of Computer and Emerging Sciences,Peshawar', 'Fatima Jinnah Women University, Rawalpindi', 'Fedral Urdu University', 'Florida Atlantic University', 'Foundation University', 'Galilée - Université Paris 13', 'George Mason University', 'George Washington University', 'Georgetown University,DC', 'Ghulam Ishaq Khan Institute of Science and Technology', 'Gomal University', 'Government College University', 'Government College University, Faisalabad', 'Government College University,Faisalabad', 'Graz University of Technology', 'Grenoble', 'Griffith University', 'Griffith University,Nathan Campus', 'Guildford', 'Gwangju Institute of Science and Technology', 'HITEC University,Taxila', 'Hamburg University of Technology', 'Hamdard University', 'Hanyang University, Ansan', 'Harbin Institute of Technology', 'Huazhong University of Science and Technology (HUST), Wuhan', 'IBMS KP Agricultural University Peshawar', 'INRIA Saclay Ile-de-France', 'INSA de Lyon, Rhone', 'IQRA University', 'IQRA University,Islamabad', 'IQRA University,Karachi', 'ISRA University', 'Illinois Institute of Technology', 'Ilmenau University of Technology', 'Imperial College, University of London', 'Information Technology University (ITU)', 'Institute Of Managment Sciences, Peshawar', 'Institute of Business Administration', 'Institute of Business Administration,Karachi', 'Institute of Business Administration,Sukkur', 'Institute of Management Sciences, Peshawar', 'International Islamic University', 'International Islamic University,Islamabad', 'Islamia College University ', 'JKU', 'JNU', 'Jinnah University for Women', 'John Moorse University, Liverpool', 'Jonkoping University', 'KTH Royal Institute of Technology', 'King Abdullah University of Science and Technology', 'Kingston University London', 'Kohat University of Science & Technology, Kohat', 'Kyung Hee University', 'Kyungpook National University', 'Kyushu University,Fukuoka', 'Lahore College for Women University', 'Lahore Leads University', 'Lahore University of Management Sciences', 'Linköping University', 'Liverpool John Moores University', 'London University', 'Loughborough University', 'Manchester Metropolitan University', 'Manchester University', 'Massachusetts Institute of Technology', 'Max Planck Institute for Computer Science', 'Mehran University of Engineering & Technology', 'Mid Sweden University', 'Middle East Technical University', 'Middlesex University', 'Minhaj University Lahore', 'Mohammad Ali Jinnah University', 'Monash University', 'Muroran Institute of Technology,Hokkaido', 'Myongji University', 'NCSU', 'NED University of Engineering And Technology', 'Nancy 2 University', 'Nanyang Tech University', 'National College of Business Administration and Economics', 'National Textile University', 'National University of Modern Languages', 'National University of Modern Languages,Islamabad', 'National University of Sciences and Technology', 'National University of Singapore', 'New York Institute of Technology', 'North Dakota State University', 'Northeastern University,Boston', 'Norwegian University of Science and Technology (NTNU),', 'Nottingham Trent University', 'Oxford Brookes University', 'PAF-Karachi Institute of Economics and Technology', 'Pace University, New York', 'Pakistan Institute of Engineering and Applied Sciences', 'Paris Descartes University', 'Paris Tech University of Eurecom', 'Pir Mehr Ali Shah Arid Agriculture University', 'Pohang University of Science and Technology', 'Politecnico di Milano', 'Politecnico di Torino', 'Pompeu Fabra University Barcelona', 'Preston', 'Punjab University College of Information Technology', 'Purdue University', 'Quaid-e-Awam University of Engineering, Science & Technology', 'Quaid-i-Azam University', 'Queen Mary University of London', 'RWTH Aachen University', 'Razak School of Engineering and Advanced Technology, Universiti Teknologi Malaysia (UTM)', 'Riphah International University', 'Riphah International University,Faisalabad', 'Rutgers State University of New Jersey, NJ', 'SRH Hochschule Heidelberg', 'SSindh Agriculture University', 'Saarland University', 'Sapienza University of Rome', "Sardar Bahadur Khan Women's University", 'Seoul National University', 'Shah Abdul Latif University, Khairpur', 'Shaheed Zulfikar Ali Bhutto Institute of Science and Technology', 'Shaheed Zulfikar Ali Bhutto Institute of Science and Technology,Islamabad', 'Sindh Agriculture University', 'Sindh University', 'Sir Syed University of Engineering and Technology', 'Skolkovo Institute of Science and Technology,\xa0', 'South Asian University', 'Staffordshire University', 'Stanford University', 'State University of New York System', 'Stockholm University', 'Sungkyunkwan University', 'Superior University, Lahore', 'Swansea', 'Swedish University of Agricultural Sciences, Uppsala', 'Swinburne University Of Technology', 'TU Berlin', 'TU Wien', 'Technical University of Braunschweig', 'Technical University of Graz', 'Temple University', 'The Islamia University of Bahawalpur ', 'The Ohio State University', 'The Queens University of Belfast', 'The State University of New Jersey', 'The University of Auckland', 'The University of Birmingham', 'The University of Cambridge', 'The University of Leeds', 'The University of Manchester', 'The University of Queensland', 'The University of Texas at Austin', 'The University of York', 'Tilburg University', 'Tokyo Institute of Technology', 'Tsinghua University', 'United Nations University International Institute for Software Technology (UNU-IIST)', 'Univ of Porto/Univ of Aveiro Portugal/Uni of Minho', "Universite d'Evry Val d'Essonne", 'Universiti Putra Malaysia Putra', 'Universiti Technologi', 'Universiti Teknologi PETRONAS', 'Universiti Tun Hussein Onn Malaysia', 'University Institute of Information Technology', 'University Of Caen', 'University Of Oslo', 'University Of Salford', 'University Of Southern California', 'University Of Waterloo', 'University Paris', 'University of Abertay Dundee', 'University of Agriculture', 'University of Agriculture Faisalabad', 'University of Agriculture, Faisalabad', 'University of Agriculture, Faisalabad\xa0', 'University of Arid Agriculture', 'University of Balochistan', 'University of Bath', 'University of Bayreuth', 'University of BedfordShire', 'University of Bedfordshire', 'University of Bergen', 'University of Birmingham', 'University of Bologna', 'University of Bonn', 'University of Bradford', 'University of Bristol', 'University of British Columbia', 'University of Canterbury', 'University of Central Florida', 'University of Central Missouri', 'University of Central Punjab', 'University of Colorado\xa0', 'University of Dundee', 'University of Engineering & Technology', 'University of Engineering and Technology', 'University of Engineering and Technology,Peshawar', 'University of Engineering and Technology,Taxila', 'University of Essex', 'University of Florida', 'University of Freiburg', 'University of Genova', 'University of Glasgow', 'University of Grenoble', 'University of Gujrat', 'University of Huddersfield', 'University of Illinois', 'University of Innsbruck', 'University of Karachi', 'University of Kent', 'University of Konstanz', 'University of Kuala Lumpur', 'University of Lahore', 'University of Leeds', 'University of Leicester', 'University of Limerick', 'University of Liverpool', 'University of Malaga', 'University of Malaya', 'University of Management and Technology', 'University of Manchester', 'University of Manchester Institute of Science and Technology', 'University of Mississippi\xa0', 'University of New South Wales, Sydney', 'University of Nice, Sophia Antipolis', 'University of Northampton', 'University of Notre Dame Indiana\xa0', 'University of Orleans', 'University of Oviedo', 'University of Paisley', 'University of Paris', 'University of Paris-Est', 'University of Patras', 'University of Peshawar', 'University of Pittsburgh', 'University of Plymouth', 'University of Porto', 'University of Regina', 'University of Rochester', 'University of Rome Tor Vergata', 'University of Saarland', 'University of Salford', 'University of Shanghai for Science and Technology', 'University of South Australia', 'University of South Brittany', 'University of South Florida', 'University of Southampton', 'University of Southern California', 'University of Stirling', 'University of Stuttgart', 'University of Sunderland', 'University of Surrey', 'University of Sussex', 'University of Technology', 'University of Trento', 'University of Turbat', 'University of Ulm', 'University of Vienna', 'University of Virginia', 'University of Wales', 'University of Wales,Aberystwyth', 'University of Westminster', 'University of York', 'University of the Punjab', 'University of the Punjab,Gujranwala', 'University of the West Scotland', 'University of\xa0Liverpool John Moores University', 'Universität Salzburg', 'Université Henri Poincaré, Nancy 1,', 'Université de la Rochelle', 'Universtiy of Karachi', 'Universtiy of Lahore', 'Uppsala University', 'Usman Institute of Technology', 'Usman Institute of Technology (Hamdard University)', 'Vienna University of Technology', 'Virtual University of Pakistan', 'Vrije University, Amsterdam', 'Wayne State University', 'Xiamen university', 'Zhejiang University', '\xa0Boston University', '\xa0Hongik University', '\xa0Nanyang Technological University', '\xa0National University of Sciences and Technology-NIIT', '\xa0University of Bedfordshire', '\xa0University of Bonn', '\xa0University of Missouri, KC', '\xa0University of Windsor', 'Åbo Akademi University,'], dtype=object)
len(professors['Graduated from'].unique())
350
Do you notice any inconsistencies in the data? Can any of the inconsistencies in the data be fixed by removing white spaces at the beginning and end of cells?
Once you have answered these questions, run the code cell below to get credit for your work.
Si, algunos de los problemas pueden ser resueltos eliminando espacios en blanco al inicio y final de los campos. Por otra parte, tenemos algunos carácteres con una codificación erronea \xa0
# Check your answer (Run this code cell to receive credit!)
q1.check()
Correct:
There are inconsistencies that can be fixed by removing white spaces at the beginning and end of cells. For instance, "University of Central Florida" and " University of Central Florida" both appear in the column.
# Line below will give you a hint
# q1.hint()
Convert every entry in the "Graduated from" column in the professors
DataFrame to remove white spaces at the beginning and end of cells.
professors['Graduated from'] = professors['Graduated from'].str.strip()
# Check your answer
q2.check()
Correct
# Lines below will give you a hint or solution code
#q2.hint()
#q2.solution()
In the tutorial, we focused on cleaning up inconsistencies in the "Country" column. Run the code cell below to view the list of unique values that we ended with.
# get all the unique values in the 'City' column
countries = professors['Country'].unique()
# sort them alphabetically and then take a closer look
countries.sort()
countries
array(['australia', 'austria', 'canada', 'china', 'finland', 'france', 'germany', 'greece', 'hongkong', 'ireland', 'italy', 'japan', 'macau', 'malaysia', 'mauritius', 'netherland', 'new zealand', 'norway', 'pakistan', 'portugal', 'russian federation', 'saudi arabia', 'scotland', 'singapore', 'south korea', 'spain', 'sweden', 'thailand', 'turkey', 'uk', 'urbana', 'usa', 'usofa'], dtype=object)
Take another look at the "Country" column and see if there's any more data cleaning we need to do.
It looks like 'usa' and 'usofa' should be the same country. Correct the "Country" column in the dataframe so that 'usofa' appears instead as 'usa'.
Use the most recent version of the DataFrame (with the whitespaces at the beginning and end of cells removed) from question 2.
# my first approach
# professors['Country'] = professors['Country'].replace('usofa', 'usa')
matches = fuzzywuzzy.process.extract("usa", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
matches
[('usa', 100), ('usofa', 75), ('austria', 60), ('australia', 50), ('spain', 50), ('urbana', 44), ('uk', 40), ('malaysia', 36), ('pakistan', 36), ('portugal', 36)]
replace_matches_in_column(df=professors, column='Country', string_to_match="usa", min_ratio=70)
# Check your answer
q3.check()
All done!
Correct
# Lines below will give you a hint or solution code
#q3.hint()
q3.solution()
Solution:
matches = fuzzywuzzy.process.extract("usa", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
replace_matches_in_column(df=professors, column='Country', string_to_match="usa", min_ratio=70)
Congratulations for completing the Data Cleaning course on Kaggle Learn!
To practice your new skills, you're encouraged to download and investigate some of Kaggle's Datasets.
Have questions or comments? Visit the Learn Discussion forum to chat with other Learners.