When applying machine learning to real-world data, there are many steps involved in the process -- starting with collecting the data and ending with generating predictions. (We work with the seven steps of machine learning, as defined by Yufeng Guo.)
It all begins with Step 1: Gather the data. In industry, there are important considerations you need to take into account when building a dataset, such as target leakage. When participating in a Kaggle competition, this step is already completed for you.
In the Intro to Machine Learning and the Intermediate Machine Learning courses, you can learn how to complete Step 2: Prepare the data.
That leaves Step 3: Select a model. There are many different types of models. Which one should you select for your problem? When you're just getting started, the best option is simply to try everything and build your own intuition -- there aren't any universally accepted rules. There are also many useful Kaggle notebooks (like this one) where you can see how and when other Kagglers used different models.
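The "try everything" approach can be sketched with scikit-learn: fit a few different model families and compare them with cross-validation. This is a hypothetical illustration on synthetic data, not part of the tutorial's pipeline; the models and scoring metric here are just examples.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# A few candidate model families to compare
candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

for name, model in candidates.items():
    # scikit-learn reports negated errors; flip the sign for readability
    scores = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_root_mean_squared_error")
    print(f"{name}: mean RMSE = {scores.mean():.2f}")
```

Comparing cross-validated scores like this is a quick way to build the intuition mentioned above before committing to any one model.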
Mastering the machine learning process involves a lot of time and practice. While you're still learning, you can turn to automated machine learning (AutoML) tools to generate intelligent predictions.
In this notebook, you'll learn how to use Google Cloud AutoML Tables to automate the machine learning process. While Kaggle has already taken care of the data collection, AutoML Tables will take care of all remaining steps.
AutoML Tables is a paid service. In the exercise that follows this tutorial, we'll show you how to claim $300 of free credits that you can use to train your own models!
We'll work with data from the New York City Taxi Fare Prediction competition. In this competition, we want you to predict the fare amount (inclusive of tolls) for a taxi ride in New York City, given the pickup and dropoff locations, number of passengers, and the pickup date and time.
To do this, we'll use a Python class that calls on AutoML Tables. To use this code, you need only define the following variables:
- PROJECT_ID - The name of your Google Cloud project. All of the work that you'll do in Google Cloud is organized in "projects".
- BUCKET_NAME - The name of your Google Cloud Storage bucket. In order to work with AutoML, we'll need to create a storage bucket, where we'll upload the Kaggle dataset.
- DATASET_DISPLAY_NAME - The name of your dataset.
- TRAIN_FILEPATH - The filepath for the training data (train.csv file) from the competition.
- TEST_FILEPATH - The filepath for the test data (test.csv file) from the competition.
- TARGET_COLUMN - The name of the column in your training data that contains the values you'd like to predict.
- ID_COLUMN - The name of the column containing IDs.
- MODEL_DISPLAY_NAME - The name of your model.
- TRAIN_BUDGET - How long you want your model to train (use 1000 for 1 hour, 2000 for 2 hours, and so on).

All of these variables will make more sense when you run your own code in the following exercise!
# Save CSV file with first 2 million rows only
import pandas as pd
train_df = pd.read_csv("../input/new-york-city-taxi-fare-prediction/train.csv", nrows=2_000_000)
train_df.to_csv("train_small.csv", index=False)
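The nrows argument to pd.read_csv keeps only the first rows of a large file, which is how the cell above trims the competition's training data down to 2 million rows. A minimal self-contained illustration of the same idea, using a toy in-memory CSV instead of the competition file:

```python
import io
import pandas as pd

# Toy CSV standing in for a much larger training file
csv_text = "key,fare_amount\n" + "\n".join(f"r{i},{i * 1.5}" for i in range(10))

# Read only the first 5 rows, mirroring nrows=2_000_000 above
small = pd.read_csv(io.StringIO(csv_text), nrows=5)
small.to_csv("train_tiny.csv", index=False)
print(small.shape)
```

Because nrows stops parsing early, this is much faster and lighter on memory than loading the full file and slicing afterward.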
PROJECT_ID = 'kaggle-playground-170215'
BUCKET_NAME = 'automl-tutorial-alexis'
DATASET_DISPLAY_NAME = 'taxi_fare_dataset'
TRAIN_FILEPATH = "../working/train_small.csv"
TEST_FILEPATH = "../input/new-york-city-taxi-fare-prediction/test.csv"
TARGET_COLUMN = 'fare_amount'
ID_COLUMN = 'key'
MODEL_DISPLAY_NAME = 'tutorial_model'
TRAIN_BUDGET = 4000
# Import the class defining the wrapper
from automl_tables_wrapper import AutoMLTablesWrapper
# Create an instance of the wrapper
amw = AutoMLTablesWrapper(project_id=PROJECT_ID,
bucket_name=BUCKET_NAME,
dataset_display_name=DATASET_DISPLAY_NAME,
train_filepath=TRAIN_FILEPATH,
test_filepath=TEST_FILEPATH,
target_column=TARGET_COLUMN,
id_column=ID_COLUMN,
model_display_name=MODEL_DISPLAY_NAME,
train_budget=TRAIN_BUDGET)
Preparing clients ...
Clients successfully created!
GCS bucket found.
File train.csv uploaded to train.csv.
File test.csv uploaded to test.csv.
Dataset found.
Set target column.
Set columns to nullable.
Ready to train model.
Next, we train a model and use it to generate predictions on the test dataset.
# Create and train the model
amw.train_model()
# Get predictions
amw.get_predictions()
Training model ...
Finished training model.
Getting predictions ...
Submission ready for download!
After completing these steps, we have a file that we can submit to the competition! In the code cell below, we load this submission file and view the first several rows.
submission_df = pd.read_csv("../working/submission.csv")
submission_df.head()
| | key | fare_amount |
|---|---|---|
| 0 | 2012-11-03 17:11:00.00000069 | 11.240405 |
| 1 | 2011-06-01 07:37:00.00000036 | 6.535698 |
| 2 | 2014-04-27 02:57:00.00000012 | 6.091896 |
| 3 | 2011-12-13 22:00:00.000000189 | 6.235514 |
| 4 | 2010-08-14 02:13:00.00000026 | 13.023638 |
So, how well does the model perform? The competition provides a starter notebook with a simple linear model that predicts the fare amount from the distance between the pickup and dropoff locations. The AutoML approach outperforms that notebook, ranking better than roughly half of all submissions to the competition.
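For context, a distance-based baseline like the starter notebook's can be sketched as follows. This is a hypothetical reconstruction, not the starter notebook's actual code: the haversine formula gives the great-circle distance between pickup and dropoff, and a linear model maps distance to fare. The toy coordinates and fares below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Toy rides standing in for the competition's pickup/dropoff columns
pickup_lat = np.array([40.76, 40.71, 40.75])
pickup_lon = np.array([-73.98, -74.00, -73.99])
dropoff_lat = np.array([40.77, 40.73, 40.69])
dropoff_lon = np.array([-73.96, -73.99, -74.18])
fares = np.array([6.5, 8.0, 35.0])

# Fit fare as a linear function of trip distance
dist = haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
model = LinearRegression().fit(dist.reshape(-1, 1), fares)
print(model.predict(np.array([[5.0]])))  # predicted fare for a 5 km trip
```

Beating a baseline like this one without any feature engineering or model selection on your part is what makes AutoML Tables a useful starting point.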
Run your own code using AutoML Tables to make a submission to a Kaggle competition!
Have questions or comments? Visit the Learn Discussion forum to chat with other learners.