This notebook is an exercise in the Introduction to Machine Learning course. You can reference the tutorial at this link.
You've built a model. In this exercise you will test how good your model is.
Run the cell below to set up your coding environment where the previous exercise left off.
# Code you have previously used to load data
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]
# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)
print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())
predictions = iowa_model.predict(X)
print("Mean absolute error:", mean_absolute_error(y, predictions))
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex4 import *
print("Setup Complete")
First in-sample predictions: [208500. 181500. 223500. 140000. 250000.] Actual target values for those homes: [208500, 181500, 223500, 140000, 250000] Mean absolute error: 62.35433789954339 Setup Complete
# Import the train_test_split function and uncomment
from sklearn.model_selection import train_test_split
# fill in and uncomment
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
# Check your answer
step_1.check()
Correct
# The lines below will show you a hint or the solution.
# step_1.hint()
# step_1.solution()
Create a DecisionTreeRegressor
model and fit it to the relevant data.
Set random_state
to 1 again when creating the model.
# You imported DecisionTreeRegressor in your last exercise
# and that code has been copied to the setup code above. So, no need to
# import it again
# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)
# Check your answer
step_2.check()
[186500. 184000. 130000. 92000. 164500. 220000. 335000. 144152. 215000. 262000.] [186500. 184000. 130000. 92000. 164500. 220000. 335000. 144152. 215000. 262000.]
Correct
# step_2.hint()
# step_2.solution()
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)
# Check your answer
step_3.check()
Correct
# step_3.hint()
# step_3.solution()
Inspect your predictions and actual values from validation data.
# print the top few validation predictions
print(val_predictions[:5])
# print the top few actual prices from validation data
print(y[:5].to_list())
[186500. 184000. 130000. 92000. 164500.] [208500, 181500, 223500, 140000, 250000]
What do you notice that is different from what you saw with in-sample predictions (which are printed after the top code cell in this page).
Do you remember why validation predictions differ from in-sample (or training) predictions? This is an important idea from the last lesson.
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)
# uncomment following line to see the validation_mae
print(val_mae)
# Check your answer
step_4.check()
29652.931506849316
Correct
# step_4.hint()
# step_4.solution()
Is that MAE good? There isn't a general rule for what values are good that applies across applications. But you'll see how to use (and improve) this number in the next step.
You are ready for Underfitting and Overfitting.
Have questions or comments? Visit the Learn Discussion forum to chat with other Learners.