Introduction

Applying machine learning to real-world data involves many steps, starting with collecting the data and ending with generating predictions. (We work with the seven steps of machine learning, as defined by Yufeng Guo.)

It all begins with Step 1: Gather the data. In industry, building a dataset requires care to avoid problems such as target leakage, where the training data contains information that won't be available when the model makes real predictions. When participating in a Kaggle competition, this step is already completed for you.

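The Intro to Machine Learning and Intermediate Machine Learning courses teach most of the remaining steps, from preparing the data through training, evaluating, and tuning a model to generating predictions.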

That leaves Step 3: Select a model. There are many different types of models. Which one should you select for your problem? When you're just getting started, the best option is to try several of them and build your own intuition; there aren't any universally accepted rules. There are also many useful Kaggle notebooks where you can see how and when other Kagglers used different models.
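One way to act on the "try several models" advice is to fit each candidate on the same training split and compare errors on held-out rows. The sketch below does this with two toy models written in plain Python. The function names, the 25% validation fraction, and the choice of mean absolute error are illustrative assumptions, not something the courses prescribe.

```python
import random
import statistics
from typing import Callable, Dict, Mapping, Sequence

Predictor = Callable[[float], float]
Trainer = Callable[[Sequence[float], Sequence[float]], Predictor]


def fit_mean(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Baseline: ignore the feature and always predict the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_nearest_neighbor(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """1-nearest-neighbor: return the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]


def mean_absolute_error(model: Predictor, xs, ys) -> float:
    """Average absolute gap between predictions and true targets."""
    return statistics.fmean(abs(model(x) - y) for x, y in zip(xs, ys))


def compare_models(
    trainers: Mapping[str, Trainer],
    xs: Sequence[float],
    ys: Sequence[float],
    valid_fraction: float = 0.25,
    seed: int = 0,
) -> Dict[str, float]:
    """Fit every candidate on one training split; report validation MAE."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    n_valid = max(1, int(len(indices) * valid_fraction))
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    train_x = [xs[i] for i in train_idx]
    train_y = [ys[i] for i in train_idx]
    valid_x = [xs[i] for i in valid_idx]
    valid_y = [ys[i] for i in valid_idx]
    return {
        name: mean_absolute_error(trainer(train_x, train_y), valid_x, valid_y)
        for name, trainer in trainers.items()
    }


# Synthetic data for demonstration only (not taxi fares).
demo_x = [float(i) for i in range(40)]
demo_y = [2.0 * x + 1.0 for x in demo_x]
demo_scores = compare_models(
    {"mean baseline": fit_mean, "nearest neighbor": fit_nearest_neighbor},
    demo_x,
    demo_y,
)
for model_name, mae in sorted(demo_scores.items(), key=lambda item: item[1]):
    print(f"{model_name}: validation MAE = {mae:.2f}")
```

Because every candidate sees the same split, the scores are directly comparable. Adding another model only requires one more entry in the dictionary.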

Mastering the machine learning process takes a lot of time and practice. While you're still learning, you can turn to automated machine learning (AutoML) tools to generate predictions for you.

Automated machine learning (AutoML)

In this notebook, you'll learn how to use Google Cloud AutoML Tables to automate the machine learning process. While Kaggle has already taken care of the data collection, AutoML Tables will take care of all remaining steps.

AutoML Tables is a paid service. In the exercise that follows this tutorial, we'll show you how to claim $300 of free credits that you can use to train your own models!

Note: This lesson is optional. It is not required to complete the Intro to Machine Learning course.

Code

We'll work with data from the New York City Taxi Fare Prediction competition. In this competition, we want you to predict the fare amount (inclusive of tolls) for a taxi ride in New York City, given the pickup and dropoff locations, number of passengers, and the pickup date and time.

To do this, we'll use a Python class that calls on AutoML Tables. To use this code, you need only define a handful of variables.

All of these variables will make more sense when you run your own code in the following exercise!
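As a rough illustration of the kind of settings such a wrapper might need, here is a hypothetical configuration sketch. Every field name and sample value is an assumption made for this example; none of it comes from the tutorial's own code or from the AutoML Tables API. The point is only that grouping the settings in one validated object catches typos before any paid training starts.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AutoMLRunConfig:
    """Hypothetical settings bundle; every field name here is illustrative."""

    project_id: str      # cloud project billed for training
    bucket_name: str     # storage bucket that receives the uploaded CSVs
    train_filepath: str  # labeled rows used for training
    test_filepath: str   # unlabeled rows to predict on
    target_column: str   # column the model should learn to predict
    id_column: str       # column that identifies each row in the submission
    train_budget: int    # compute budget; the unit depends on the service

    def __post_init__(self) -> None:
        # Reject blank strings so a missing setting fails fast.
        for field in fields(self):
            value = getattr(self, field.name)
            if isinstance(value, str) and not value.strip():
                raise ValueError(f"{field.name} must not be empty")
        if self.train_budget <= 0:
            raise ValueError("train_budget must be positive")
        if self.target_column == self.id_column:
            raise ValueError("target_column and id_column must differ")


# Placeholder values, chosen only to show the shape of the object.
example_config = AutoMLRunConfig(
    project_id="my-project",
    bucket_name="my-bucket",
    train_filepath="train.csv",
    test_filepath="test.csv",
    target_column="fare_amount",
    id_column="key",
    train_budget=1000,
)
print(example_config)
```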

Next, we train a model and use it to generate predictions on the test dataset.

After completing these steps, we have a file that we can submit to the competition! In the code cell below, we load this submission file and view the first several rows.
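A minimal sketch of that kind of check, using only the standard library, might look like the following. The helper name and the default of five rows are assumptions for illustration. In a notebook, pandas' read_csv followed by head() does the same job.

```python
import csv
from itertools import islice
from typing import Dict, List


def preview_csv(path: str, n_rows: int = 5) -> List[Dict[str, str]]:
    """Return the first n_rows data rows of a CSV file as dictionaries."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)  # first line is treated as the header
        return list(islice(reader, n_rows))
```

Calling preview_csv with the path of the submission file returns one dictionary per row, keyed by column name, which makes it easy to confirm that the expected columns are present and the values look plausible.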

And how well does it perform? The competition provides a starter notebook with a simple linear model that predicts the fare amount from the distance between the pickup and dropoff locations. The AutoML Tables model outperforms that notebook, and it ranks better than roughly half of the total submissions to the competition.
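For a sense of what a distance-based baseline involves, here is a minimal sketch. It assumes great-circle (haversine) distance as the single feature and ordinary least squares for the fit. The starter notebook's actual feature and fitting code may differ, and the function names and sample numbers below are made up for illustration.

```python
import math
from typing import Sequence, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding error before taking asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def fit_fare_line(
    distances: Sequence[float], fares: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of fare = intercept + slope * distance."""
    n = len(distances)
    mean_d = sum(distances) / n
    mean_f = sum(fares) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    if sxx == 0:
        raise ValueError("all distances are identical; slope is undefined")
    sxy = sum((d - mean_d) * (f - mean_f) for d, f in zip(distances, fares))
    slope = sxy / sxx
    return mean_f - slope * mean_d, slope


# Made-up trips: (distance in km, fare in dollars).
sample_distances = [1.0, 2.0, 3.0, 4.0]
sample_fares = [4.0, 5.5, 7.0, 8.5]
intercept, slope = fit_fare_line(sample_distances, sample_fares)
print(f"fare = {intercept:.2f} + {slope:.2f} * distance_km")
```

A model like this has only two parameters, so it cannot capture effects such as time of day or traffic, which is one reason a more flexible model can be expected to beat it.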

Keep going

Run your own code using AutoML Tables to make a submission to a Kaggle competition!


Have questions or comments? Visit the Learn Discussion forum to chat with other Learners.