When applying machine learning to real-world data, there are many steps involved in the process -- starting with collecting the data and ending with generating predictions. (We work with the seven steps of machine learning, as defined by Yufeng Guo.)
It all begins with Step 1: Gather the data. In industry, there are important considerations you need to take into account when building a dataset, such as target leakage. When participating in a Kaggle competition, this step is already completed for you.
In the Intro to Machine Learning and the Intermediate Machine Learning courses, you can learn how to complete Step 2: Prepare the data.
That leaves Step 3: Select a model. There are many different types of models. Which one should you select for your problem? When you're just getting started, the best option is simply to try everything and build your own intuition -- there aren't any universally accepted rules. There are also many useful Kaggle notebooks (like this one) where you can see how and when other Kagglers used different models.
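The "try everything" approach can be sketched with scikit-learn: fit a few different model families and compare them with cross-validation. This is a hypothetical illustration on synthetic data, not part of the tutorial's pipeline; the models and scoring metric here are just examples.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# A few candidate model families to compare
candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

for name, model in candidates.items():
    # scikit-learn reports negated errors; flip the sign for readability
    scores = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_root_mean_squared_error")
    print(f"{name}: mean RMSE = {scores.mean():.2f}")
```

Comparing cross-validated scores like this is a quick way to build the intuition mentioned above before committing to any one model.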
Mastering the machine learning process involves a lot of time and practice. While you're still learning, you can turn to automated machine learning (AutoML) tools to generate intelligent predictions.
In this notebook, you'll learn how to use Google Cloud AutoML Tables to automate the machine learning process. While Kaggle has already taken care of the data collection, AutoML Tables will take care of all remaining steps.
AutoML Tables is a paid service. In the exercise that follows this tutorial, we'll show you how to claim $300 of free credits that you can use to train your own models!
We'll work with data from the New York City Taxi Fare Prediction competition. In this competition, we want you to predict the fare amount (inclusive of tolls) for a taxi ride in New York City, given the pickup and dropoff locations, number of passengers, and the pickup date and time.
To do this, we'll use a Python class that calls on AutoML Tables. To use this code, you need only define the following variables:
- PROJECT_ID - The name of your Google Cloud project. All of the work that you'll do in Google Cloud is organized in "projects".
- BUCKET_NAME - The name of your Google Cloud Storage bucket. In order to work with AutoML, we'll need to create a storage bucket, where we'll upload the Kaggle dataset.
- DATASET_DISPLAY_NAME - The name of your dataset.
- TRAIN_FILEPATH - The filepath for the training data (train.csv file) from the competition.
- TEST_FILEPATH - The filepath for the test data (test.csv file) from the competition.
- TARGET_COLUMN - The name of the column in your training data that contains the values you'd like to predict.
- ID_COLUMN - The name of the column containing IDs.
- MODEL_DISPLAY_NAME - The name of your model.
- TRAIN_BUDGET - How long you want your model to train (use 1000 for 1 hour, 2000 for 2 hours, and so on).

All of these variables will make more sense when you run your own code in the following exercise!
# Save CSV file with first 2 million rows only
import pandas as pd
train_df = pd.read_csv("../input/new-york-city-taxi-fare-prediction/train.csv", nrows=2_000_000)
train_df.to_csv("train_small.csv", index=False)
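The nrows argument to pd.read_csv keeps only the first rows of a large file, which is how the cell above trims the competition's training data down to 2 million rows. A minimal self-contained illustration of the same idea, using a toy in-memory CSV instead of the competition file:

```python
import io
import pandas as pd

# Toy CSV standing in for a much larger training file
csv_text = "key,fare_amount\n" + "\n".join(f"r{i},{i * 1.5}" for i in range(10))

# Read only the first 5 rows, mirroring nrows=2_000_000 above
small = pd.read_csv(io.StringIO(csv_text), nrows=5)
small.to_csv("train_tiny.csv", index=False)
print(small.shape)
```

Because nrows stops parsing early, this is much faster and lighter on memory than loading the full file and slicing afterward.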
PROJECT_ID = 'kaggle-playground-170215'
BUCKET_NAME = 'automl-tutorial-alexis'
DATASET_DISPLAY_NAME = 'taxi_fare_dataset'
TRAIN_FILEPATH = "../working/train_small.csv"
TEST_FILEPATH = "../input/new-york-city-taxi-fare-prediction/test.csv"
TARGET_COLUMN = 'fare_amount'
ID_COLUMN = 'key'
MODEL_DISPLAY_NAME = 'tutorial_model'
TRAIN_BUDGET = 4000
# Import the class defining the wrapper
from automl_tables_wrapper import AutoMLTablesWrapper
# Create an instance of the wrapper
amw = AutoMLTablesWrapper(project_id=PROJECT_ID,
bucket_name=BUCKET_NAME,
dataset_display_name=DATASET_DISPLAY_NAME,
train_filepath=TRAIN_FILEPATH,
test_filepath=TEST_FILEPATH,
target_column=TARGET_COLUMN,
id_column=ID_COLUMN,
model_display_name=MODEL_DISPLAY_NAME,
train_budget=TRAIN_BUDGET)
Preparing clients ...
Clients successfully created!
GCS bucket found.
File train.csv uploaded to train.csv.
File test.csv uploaded to test.csv.
Dataset found.
Set target column.
Set columns to nullable.
Ready to train model.
Next, we train a model and use it to generate predictions on the test dataset.
# Create and train the model
amw.train_model()
# Get predictions
amw.get_predictions()
Training model ...
Finished training model.
Getting predictions ...
Submission ready for download!
After completing these steps, we have a file that we can submit to the competition! In the code cell below, we load this submission file and view the first several rows.
submission_df = pd.read_csv("../working/submission.csv")
submission_df.head()
| | key | fare_amount |
|---|---|---|
| 0 | 2012-11-03 17:11:00.00000069 | 11.240405 |
| 1 | 2011-06-01 07:37:00.00000036 | 6.535698 |
| 2 | 2014-04-27 02:57:00.00000012 | 6.091896 |
| 3 | 2011-12-13 22:00:00.000000189 | 6.235514 |
| 4 | 2010-08-14 02:13:00.00000026 | 13.023638 |
So, how well does the model perform? The competition provides a starter notebook with a simple linear model that predicts the fare amount from the distance between the pickup and dropoff locations. The AutoML approach outperforms that notebook, ranking better than roughly half of all submissions to the competition.
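For context, a distance-based baseline like the starter notebook's can be sketched as follows. This is a hypothetical reconstruction, not the starter notebook's actual code: the haversine formula gives the great-circle distance between pickup and dropoff, and a linear model maps distance to fare. The toy coordinates and fares below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Toy rides standing in for the competition's pickup/dropoff columns
pickup_lat = np.array([40.76, 40.71, 40.75])
pickup_lon = np.array([-73.98, -74.00, -73.99])
dropoff_lat = np.array([40.77, 40.73, 40.69])
dropoff_lon = np.array([-73.96, -73.99, -74.18])
fares = np.array([6.5, 8.0, 35.0])

# Fit fare as a linear function of trip distance
dist = haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
model = LinearRegression().fit(dist.reshape(-1, 1), fares)
print(model.predict(np.array([[5.0]])))  # predicted fare for a 5 km trip
```

Beating a baseline like this one without any feature engineering or model selection on your part is what makes AutoML Tables a useful starting point.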
Run your own code using AutoML Tables to make a submission to a Kaggle competition!
Have questions or comments? Visit the Learn Discussion forum to chat with other learners.