Created using: PyCaret 2.2
Date Updated: November 11, 2020
Welcome to the Binary Classification Tutorial (CLF101) - Level Beginner. This tutorial assumes that you
are new to PyCaret and looking to get started with binary classification using the pycaret.classification
module.
In this tutorial we will learn how to get the data, set up the PyCaret environment, compare and create
models, tune hyperparameters, analyze model performance with plots, finalize the model, predict on the
hold-out and unseen data, and save/load the model for later use.
Read Time : Approx. 30 Minutes
The first step to get started with PyCaret is to install pycaret. Installation is easy and will only take a few minutes. Follow the instructions below:
pip install pycaret
If you are installing from within a notebook (e.g. Jupyter or Google Colab), prefix the command with an
exclamation mark:
!pip install pycaret
If you are running this notebook on Google colab, run the following code at top of your notebook to
display interactive visuals.
from pycaret.utils import enable_colab
enable_colab()
Binary classification is a supervised machine learning technique where the goal is to predict categorical class labels which are discrete and unordered, such as Pass/Fail, Positive/Negative, Default/Not-Default, etc. A few real-world use cases for classification are listed below:
PyCaret's classification module (pycaret.classification) is a supervised machine learning module which is
used for classifying elements into binary groups based on various techniques and algorithms. Some common
use cases of classification problems include predicting customer default (yes or no), customer churn
(customer will leave or stay), and disease found (positive or negative).
The PyCaret classification module can be used for Binary or Multi-class classification problems. It has over 18 algorithms and 14 plots to analyze the performance of models. Be it hyper-parameter tuning, ensembling or advanced techniques like stacking, PyCaret's classification module has it all.
For this tutorial we will use a dataset from UCI called Default of Credit Card Clients Dataset. This dataset contains information on default payments, demographic factors, credit data, payment history, and billing statements of credit card clients in Taiwan from April 2005 to September 2005. There are 24,000 samples and 25 features. Short descriptions of each column are as follows:
Target Column: default (1 = payment default, 0 = no default)
Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
The original dataset and data dictionary can be found here.
You can download the data from the original source found here and load it using pandas (Learn How), or
you can use PyCaret's data repository to load the data using the get_data() function (this will require
an internet connection).
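For reference, a minimal sketch of loading the downloaded file with pandas (the file name here is an assumption; adjust it to wherever you saved the file):
import pandas as pd
# hypothetical local path -- adjust to your download location
dataset = pd.read_csv('credit.csv')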
from pycaret.datasets import get_data
dataset = get_data('credit')
#check the shape of data
dataset.shape
In order to demonstrate the predict_model() function on unseen data, a sample of 1,200 records has been
withheld from the original dataset to be used for predictions. This should not be confused with a
train/test split, as this particular split is performed to simulate a real-life scenario. Another way to
think about this is that these 1,200 records were not available at the time the machine learning
experiment was performed.
data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index)
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
The setup()
function initializes the environment in pycaret and creates the transformation
pipeline to prepare the data for modeling and deployment. setup()
must be called before
executing any other function in pycaret. It takes two mandatory parameters: a pandas dataframe and the
name of the target column. All other parameters are optional and are used to customize the pre-processing
pipeline (we will see them in later tutorials).
When setup()
is executed, PyCaret's inference algorithm will automatically infer the data
types for all features based on certain properties. The data type should be inferred correctly but this is
not always the case. To account for this, PyCaret displays a table containing the features and their
inferred data types after setup()
is executed. If all of the data types are correctly identified, enter can be pressed to continue or quit
can be typed to end the experiment. Ensuring that the data types are correct is of fundamental importance
in PyCaret as it automatically performs a few pre-processing tasks which are imperative to any machine
learning experiment. These tasks are performed differently for each data type, which means it is very
important for them to be correctly configured.
In later tutorials we will learn how to overwrite PyCaret's inferred data types using the
numeric_features and categorical_features parameters in setup().
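As a sketch only (not executed in this tutorial), overriding the inferred types could look like the commented lines below; the column choices are purely illustrative:
# illustrative only -- force specific columns to a given data type
# exp = setup(data = data, target = 'default',
#             numeric_features = ['AGE'],
#             categorical_features = ['EDUCATION'],
#             session_id = 123)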
from pycaret.classification import *
exp_clf101 = setup(data = data, target = 'default', session_id=123)
Once the setup has been successfully executed it prints the information grid which contains several
important pieces of information. Most of the information is related to the pre-processing pipeline which
is constructed when setup() is executed. The majority of these features are out of scope for the purposes
of this tutorial, however a few important things to note at this stage include:
session_id : If no session_id is passed, a random number is automatically generated and distributed as a
seed to all functions. In this experiment, the session_id is set to 123 for later reproducibility.
Train/Test split : The data is split into a training set and a test/hold-out set (70/30 by default); this
ratio can be changed using the train_size parameter in setup().
Notice how a few tasks that are imperative to modeling are automatically handled, such as missing value
imputation (in this case there are no missing values in the training data, but we still need imputers for
unseen data), categorical encoding, etc. Most of the parameters in setup() are optional and are used for
customizing the pre-processing pipeline. These parameters are out of scope for this tutorial but as you
progress to the intermediate and expert levels, we will cover them in much greater detail.
Comparing all models to evaluate performance is the recommended starting point for modeling once the setup is completed (unless you know exactly what kind of model you need, which is often not the case). This function trains all models in the model library and scores them using stratified cross validation for metric evaluation. The output prints a score grid that shows the average Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC across the folds (10 by default) along with training times.
best_model = compare_models()
Two simple words of code (not even a line) have trained and evaluated over 15 models using cross
validation. The score grid printed above highlights the highest performing metric for comparison purposes
only. The grid is sorted by 'Accuracy' (highest to lowest) by default, which can be changed by passing
the sort parameter. For example, compare_models(sort = 'Recall') will sort the grid by Recall instead of
Accuracy. To change the number of folds from the default value of 10, use the fold parameter. For
example, compare_models(fold = 5) will compare all models using 5-fold cross validation. Reducing the
number of folds will improve the training time. By default, compare_models returns the best performing
model based on the default sort order, but it can also return a list of the top N models by using the
n_select parameter.
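As a quick sketch of these options (not executed as part of this experiment):
# illustrative only: sort by Recall, use 5-fold CV, and return the top 3 models as a list
top3 = compare_models(sort = 'Recall', fold = 5, n_select = 3)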
print(best_model)
create_model is the most granular function in PyCaret and is often the foundation behind most of
PyCaret's functionality. As the name suggests, this function trains and evaluates a model using cross
validation that can be set with the fold parameter. The output prints a score grid that shows Accuracy,
AUC, Recall, Precision, F1, Kappa and MCC by fold.
For the remaining part of this tutorial, we will work with the models below as our candidate models. The
selections are for illustration purposes only and do not necessarily mean they are the top performing or
ideal for this type of data.
There are 18 classifiers available in PyCaret's model library. To see the list of all classifiers, either
check the docstring or use the models() function to see the library.
models()
dt = create_model('dt')
#trained model object is stored in the variable 'dt'.
print(dt)
knn = create_model('knn')
rf = create_model('rf')
Notice that the mean score of all models matches the score printed by compare_models(). This is because
the metrics printed in the compare_models() score grid are the average scores across all CV folds.
Similar to compare_models(), if you want to change the fold parameter from the default value of 10 to a
different value then you can use the fold parameter. For example, create_model('dt', fold = 5) will
create a Decision Tree Classifier using 5-fold stratified CV.
When a model is created using the create_model() function it uses the default hyperparameters to train
the model. In order to tune hyperparameters, the tune_model() function is used. This function
automatically tunes the hyperparameters of a model using random grid search on a pre-defined search
space. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC by
fold for the best model. To use a custom search grid, you can pass the custom_grid parameter to the
tune_model function (see the KNN tuning in section 9.2 below).
tuned_dt = tune_model(dt)
#tuned model object is stored in the variable 'tuned_dt'.
print(tuned_dt)
import numpy as np
# n_neighbors must be at least 1, so the custom search space starts at 1
tuned_knn = tune_model(knn, custom_grid = {'n_neighbors' : np.arange(1,51,1)})
print(tuned_knn)
tuned_rf = tune_model(rf)
By default, tune_model optimizes Accuracy but this can be changed using the optimize parameter. For
example, tune_model(dt, optimize = 'AUC') will search for the hyperparameters of a Decision Tree
Classifier that result in the highest AUC instead of Accuracy. For the purposes of this example, we have
used the default metric Accuracy only for the sake of simplicity. Generally, when the dataset is
imbalanced (such as the credit dataset we are working with), Accuracy is not a good metric to consider.
The methodology behind selecting the right metric to evaluate a classifier is beyond the scope of this
tutorial but if you would like to learn more about it, you can click here to read an article on how to
choose the right evaluation metric.
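For illustration only (not used in the remainder of this tutorial), tuning for AUC instead of Accuracy would look like this:
# illustrative only: optimize the search for AUC rather than the default Accuracy
tuned_dt_auc = tune_model(dt, optimize = 'AUC')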
Metrics alone are not the only criteria you should consider when finalizing the best model for
production. Other factors to consider include training time, the standard deviation across CV folds, etc.
As you progress through the tutorial series we will discuss those factors in detail at the intermediate
and expert levels. For now, let's move forward considering the Tuned Random Forest Classifier, tuned_rf,
as our best model for the remainder of this tutorial.
Before model finalization, the plot_model() function can be used to analyze the performance across
different aspects such as the AUC, confusion matrix, decision boundary, etc. This function takes a
trained model object and returns a plot based on the test/hold-out set.
There are 15 different plots available; please see the plot_model() docstring for the list of available
plots.
plot_model(tuned_rf, plot = 'auc')
plot_model(tuned_rf, plot = 'pr')
plot_model(tuned_rf, plot='feature')
plot_model(tuned_rf, plot = 'confusion_matrix')
Another way to analyze the performance of models is to use the evaluate_model()
function which displays a user interface for all of the available plots for a given model. It internally
uses the plot_model()
function.
evaluate_model(tuned_rf)
Before finalizing the model, it is advisable to perform one final check by predicting on the
test/hold-out set and reviewing the evaluation metrics. If you look at the information grid in Section 6
above, you will see that 30% (6,841 samples) of the data has been separated out as a test/hold-out
sample. All of the evaluation metrics we have seen above are cross-validated results based on the
training set (70%) only. Now, using our final trained model stored in the tuned_rf variable, we will
predict against the hold-out sample and evaluate the metrics to see if they are materially different from
the CV results.
predict_model(tuned_rf);
The accuracy on the test/hold-out set is 0.8116 compared to 0.8203 achieved on the tuned_rf CV results
(in section 9.3 above). This is not a significant difference. If there is a large variation between the
test/hold-out and
CV results, then this would normally indicate over-fitting but could also be due to several other factors
and would require further investigation. In this case, we will move forward with finalizing the model and
predicting on unseen data (the 5% that we had separated in the beginning and never exposed to PyCaret).
(TIP : It's always good to look at the standard deviation of CV results when using
create_model()
.)
Model finalization is the last step in the experiment. A normal machine learning workflow in PyCaret
starts with setup()
, followed by comparing all models using compare_models()
and
shortlisting a few candidate models (based on the metric of interest) to perform several modeling
techniques such as hyperparameter tuning, ensembling, stacking etc. This workflow will eventually lead you
to the best model for use in making predictions on new and unseen data. The finalize_model()
function fits the model onto the complete dataset including the test/hold-out sample (30% in this case).
The purpose of this function is to train the model on the complete dataset before it is deployed in
production.
final_rf = finalize_model(tuned_rf)
#Final Random Forest model parameters for deployment
print(final_rf)
One final word of caution: once the model is finalized using finalize_model(), the entire dataset
including the test/hold-out set is used for training. As such, if the model is used for predictions on
the hold-out set after finalize_model() is used, the information grid printed will be misleading as you
are trying to predict on the same data that was used for modeling. In order to demonstrate this point
only, we will use final_rf under predict_model() to compare the information grid with the one above in
section 11.
predict_model(final_rf);
Notice how the AUC in final_rf
has increased to 0.7526
from
0.7407
, even though the model is the same. This is because the
final_rf
variable has been trained on the complete dataset including the test/hold-out set.
The predict_model() function is also used to predict on the unseen dataset. The only difference from
section 11 above is that this time we will pass data_unseen to the data parameter. data_unseen is the
variable created at the beginning of the tutorial and contains 5% (1,200 samples) of the original dataset
which was never exposed to PyCaret (see section 5 for an explanation).
unseen_predictions = predict_model(final_rf, data=data_unseen)
unseen_predictions.head()
The Label and Score columns are added to the data_unseen set. Label is the prediction and Score is the
probability of the prediction. Notice that the predicted results are concatenated to the original dataset
while all the transformations are automatically performed in the background. You can also check the
metrics on these predictions since the actual target column default is available. To do that we will use
the pycaret.utils module. See the example below:
from pycaret.utils import check_metric
check_metric(unseen_predictions['default'], unseen_predictions['Label'], metric = 'Accuracy')
We have now finished the experiment by finalizing the tuned_rf model, which is now stored in the final_rf
variable. We have also used the model stored in final_rf to predict on data_unseen. This brings us to the
end of our experiment, but one question remains: what happens when you have more new data to predict on?
Do you have to go through the entire experiment again? The answer is no, PyCaret's built-in save_model()
function allows you to save the model along with the entire transformation pipeline for later use.
save_model(final_rf,'Final RF Model 11Nov2020')
(TIP : It's always good to include the date in the filename when saving models; it's good for version control.)
To load a saved model at a future date in the same or an alternative environment, we would use PyCaret's
load_model()
function and then easily apply the saved model on new unseen data for
prediction.
saved_final_rf = load_model('Final RF Model 11Nov2020')
Once the model is loaded in the environment, you can simply use it to predict on any new data using the
same predict_model()
function. Below we have applied the loaded model to predict the same
data_unseen
that we used in section 13 above.
new_prediction = predict_model(saved_final_rf, data=data_unseen)
new_prediction.head()
Notice that the results of unseen_predictions
and new_prediction
are identical.
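A quick way to verify this programmatically (a sketch; it assumes both prediction dataframes are still in memory):
# should print True if the two prediction dataframes match exactly
print(unseen_predictions.equals(new_prediction))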
from pycaret.utils import check_metric
check_metric(new_prediction['default'], new_prediction['Label'], metric = 'Accuracy')
This tutorial has covered the entire machine learning pipeline, from data ingestion and pre-processing to
model training, hyperparameter tuning, prediction, and saving the model for later use. We have completed
all of these steps in less than 10 commands which are naturally constructed and very intuitive to
remember, such as create_model(), tune_model(), and compare_models().
Re-creating the entire experiment without PyCaret would have taken well over 100 lines of code in most
libraries.
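As a recap, here is a minimal sketch of the end-to-end workflow used in this tutorial (same data and target column; the saved filename is illustrative):
from pycaret.classification import *
from pycaret.datasets import get_data

data = get_data('credit')                                          # load the sample dataset
clf = setup(data = data, target = 'default', session_id = 123)     # initialize the experiment
best = compare_models()                                            # train and compare all models
tuned = tune_model(best)                                           # tune the best model's hyperparameters
final = finalize_model(tuned)                                      # refit on the complete dataset
save_model(final, 'my_final_pipeline')                             # save model + transformation pipeline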
We have only covered the basics of pycaret.classification. In the following tutorials we will go deeper
into advanced pre-processing, ensembling, generalized stacking and other techniques that allow you to
fully customize your machine learning pipeline and are a must-know for any data scientist.
See you at the next tutorial. Follow the link to Binary Classification Tutorial (CLF102) - Intermediate Level