This notebook is an exercise in the Data Cleaning course. You can refer back to the corresponding tutorial if you need a refresher.


In this exercise, you'll apply what you learned in the Character encodings tutorial.

Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.
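The setup cell itself isn't reproduced here, but on Kaggle it usually looks something like the sketch below (the exact learntools module path, data_cleaning.ex4, is an assumption about which exercise this notebook corresponds to):

```python
# set up code checking; the module path is an assumption about the exercise number
from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex4 import *
print("Setup Complete")
```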

Get our environment set up

The first thing we'll need to do is load in the libraries we'll be using.
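A minimal version of that import cell might look like this (the choice of charset_normalizer as the encoding-detection helper is an assumption; older versions of the course used chardet, which exposes the same detect() interface):

```python
# modules we'll use throughout the exercise
import pandas as pd
import numpy as np

# helper for guessing character encodings (assumption: charset_normalizer)
import charset_normalizer

# set seed for reproducibility
np.random.seed(0)
```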

1) What are encodings?

You're working with a dataset composed of bytes. Run the code cell below to print a sample entry.

You notice that it doesn't use the standard UTF-8 encoding.

Use the next code cell to create a variable new_entry that changes the encoding from "big5-tw" to "utf-8". new_entry should have the bytes datatype.
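As a hint, the conversion is a two-step decode/encode round trip. Here's a minimal sketch, assuming the printed sample is stored in a variable named sample_entry (that name is an assumption):

```python
# decode the bytes using the encoding they were actually written in,
# then re-encode the resulting string as UTF-8 bytes
before = sample_entry.decode("big5-tw")  # bytes -> str
new_entry = before.encode("utf-8")       # str -> bytes, now UTF-8

print(type(new_entry))  # should report <class 'bytes'>
```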

2) Reading in files with encoding problems

Use the code cell below to read in this file at path "../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv".

Figure out what the correct encoding should be and read in the file to a DataFrame police_killings.

Feel free to use any additional code cells for supplemental work. To get credit for finishing this question, you'll need to run q2.check() and get a result of Correct.
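One common workflow is to let an encoding detector guess from the raw bytes and then pass that guess to pd.read_csv. A hedged sketch follows; the byte-sample size and the encoding you ultimately land on may differ:

```python
import pandas as pd
import charset_normalizer

path = "../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv"

# look at the first chunk of raw bytes and ask charset_normalizer for a guess
with open(path, "rb") as f:
    guess = charset_normalizer.detect(f.read(100000))
print(guess)  # dict with 'encoding', 'confidence', and 'language' keys

# read the file with the detected encoding
police_killings = pd.read_csv(path, encoding=guess["encoding"])

q2.check()
```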

3) Saving your files with UTF-8 encoding

Save a version of the police killings dataset to CSV with UTF-8 encoding. Your answer will be marked correct after saving this file.

Note: When using the to_csv() method, supply only the name of the file (e.g., "my_file.csv"). This saves the file at the filepath "/kaggle/working/my_file.csv".
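Because pandas writes CSV files as UTF-8 by default, no encoding argument is needed. A minimal sketch (the output filename is just an example):

```python
# to_csv() writes UTF-8 by default, so this alone satisfies the requirement;
# the file lands at /kaggle/working/my_file.csv
police_killings.to_csv("my_file.csv")
```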

(Optional) More practice

Check out this dataset of files in different character encodings. Can you read in all the files with their original encodings and then save them out as UTF-8 files?
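A sketch of one way to tackle that, assuming the dataset is attached under ../input/character-encoding-examples/ and contains plain-text files (both the directory path and the *.txt pattern are assumptions):

```python
import glob
import charset_normalizer

# guess each file's encoding, decode it, and re-save it as UTF-8
for path in glob.glob("../input/character-encoding-examples/*.txt"):
    with open(path, "rb") as f:
        raw = f.read()
    guess = charset_normalizer.detect(raw)
    text = raw.decode(guess["encoding"])
    out_name = path.split("/")[-1]  # keep the original filename
    with open(out_name, "w", encoding="utf-8") as out:
        out.write(text)  # saved as UTF-8 in /kaggle/working
```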

If you have a file that's in UTF-8 but has just a couple of weird-looking characters in it, you can try out the ftfy module and see if it helps.
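For reference, ftfy's main entry point is ftfy.fix_text; a tiny illustration (the garbled string below is an invented example of mojibake):

```python
import ftfy

# fix_text tries to repair text that was decoded with the wrong encoding
broken = "The puppyâ€™s bowl"
print(ftfy.fix_text(broken))  # prints a repaired version of the string
```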

Keep going

In the final lesson, learn how to clean up inconsistent text entries in your dataset.


Have questions or comments? Visit the Learn Discussion forum to chat with other learners.