Created using: PyCaret 2.2
Date Updated: November 11, 2020
Welcome to the Binary Classification Tutorial (CLF101) - Level Beginner. This tutorial assumes that you
are new to PyCaret and looking to get started with binary classification using the pycaret.classification
module.
In this tutorial we will learn how to get the data, set up the PyCaret environment, compare and create
models, tune hyperparameters, analyze model performance with plots, finalize the model, predict on the
hold-out and unseen data, and save/load the model for later use.
Read Time : Approx. 30 Minutes
The first step to get started with PyCaret is to install pycaret. Installation is easy and will only take a few minutes. Follow the instructions below:
pip install pycaret
If you are installing from within a notebook (e.g. Jupyter or Google Colab), prefix the command with an
exclamation mark:
!pip install pycaret
If you are running this notebook on Google colab, run the following code at top of your notebook to
display interactive visuals.
from pycaret.utils import enable_colab
enable_colab()
Binary classification is a supervised machine learning technique where the goal is to predict categorical class labels which are discrete and unordered, such as Pass/Fail, Positive/Negative, Default/Not-Default, etc. A few real-world use cases for classification are listed below:
PyCaret's classification module (pycaret.classification) is a supervised machine learning module which is
used for classifying elements into binary groups based on various techniques and algorithms. Some common
use cases of classification problems include predicting customer default (yes or no), customer churn
(customer will leave or stay), and disease found (positive or negative).
The PyCaret classification module can be used for Binary or Multi-class classification problems. It has over 18 algorithms and 14 plots to analyze the performance of models. Be it hyper-parameter tuning, ensembling or advanced techniques like stacking, PyCaret's classification module has it all.
For this tutorial we will use a dataset from UCI called Default of Credit Card Clients Dataset. This dataset contains information on default payments, demographic factors, credit data, payment history, and billing statements of credit card clients in Taiwan from April 2005 to September 2005. There are 24,000 samples and 25 features. Short descriptions of each column are as follows:
Target Column: default (1 = payment default, 0 = no default)
Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
The original dataset and data dictionary can be found here.
You can download the data from the original source found here and load it using pandas (Learn How), or
you can use PyCaret's data repository to load the data using the get_data() function (this will require
an internet connection).
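For reference, a minimal sketch of loading the downloaded file with pandas (the file name here is an assumption; adjust it to wherever you saved the file):
import pandas as pd
# hypothetical local path -- adjust to your download location
dataset = pd.read_csv('credit.csv')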
from pycaret.datasets import get_data
dataset = get_data('credit')
#check the shape of data
dataset.shape
In order to demonstrate the predict_model() function on unseen data, a sample of 1,200 records has been
withheld from the original dataset to be used for predictions. This should not be confused with a
train/test split, as this particular split is performed to simulate a real-life scenario. Another way to
think about this is that these 1,200 records were not available at the time the machine learning
experiment was performed.
data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index)
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
The setup()
function initializes the environment in pycaret and creates the transformation
pipeline to prepare the data for modeling and deployment. setup()
must be called before
executing any other function in pycaret. It takes two mandatory parameters: a pandas dataframe and the
name of the target column. All other parameters are optional and are used to customize the pre-processing
pipeline (we will see them in later tutorials).
When setup()
is executed, PyCaret's inference algorithm will automatically infer the data
types for all features based on certain properties. The data type should be inferred correctly but this is
not always the case. To account for this, PyCaret displays a table containing the features and their
inferred data types after setup()
is executed. If all of the data types are correctly identified, enter can be pressed to continue or quit
can be typed to end the experiment. Ensuring that the data types are correct is of fundamental importance
in PyCaret as it automatically performs a few pre-processing tasks which are imperative to any machine
learning experiment. These tasks are performed differently for each data type, which means it is very
important for them to be correctly configured.
In later tutorials we will learn how to overwrite PyCaret's inferred data types using the
numeric_features and categorical_features parameters in setup().
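As a sketch only (not executed in this tutorial), overriding the inferred types could look like the commented lines below; the column choices are purely illustrative:
# illustrative only -- force specific columns to a given data type
# exp = setup(data = data, target = 'default',
#             numeric_features = ['AGE'],
#             categorical_features = ['EDUCATION'],
#             session_id = 123)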
from pycaret.classification import *
exp_clf101 = setup(data = data, target = 'default', session_id=123)
Once the setup has been successfully executed it prints the information grid which contains several
important pieces of information. Most of the information is related to the pre-processing pipeline which
is constructed when setup() is executed. The majority of these features are out of scope for the purposes
of this tutorial, however a few important things to note at this stage include:
session_id : If no session_id is passed, a random number is automatically generated and distributed as a
seed to all functions. In this experiment, the session_id is set to 123 for later reproducibility.
Train/Test split : The data is split into a training set and a test/hold-out set (70/30 by default); this
ratio can be changed using the train_size parameter in setup().
Notice how a few tasks that are imperative to modeling are automatically handled, such as missing value
imputation (in this case there are no missing values in the training data, but we still need imputers for
unseen data), categorical encoding, etc. Most of the parameters in setup() are optional and are used for
customizing the pre-processing pipeline. These parameters are out of scope for this tutorial but as you
progress to the intermediate and expert levels, we will cover them in much greater detail.
Comparing all models to evaluate performance is the recommended starting point for modeling once the setup is completed (unless you know exactly what kind of model you need, which is often not the case). This function trains all models in the model library and scores them using stratified cross validation for metric evaluation. The output prints a score grid that shows the average Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC across the folds (10 by default) along with training times.
best_model = compare_models()
Two simple words of code (not even a line) have trained and evaluated over 15 models using cross
validation. The score grid printed above highlights the highest performing metric for comparison purposes
only. The grid is sorted by 'Accuracy' (highest to lowest) by default, which can be changed by passing
the sort parameter. For example, compare_models(sort = 'Recall') will sort the grid by Recall instead of
Accuracy. To change the number of folds from the default value of 10, use the fold parameter. For
example, compare_models(fold = 5) will compare all models using 5-fold cross validation. Reducing the
number of folds will improve the training time. By default, compare_models returns the best performing
model based on the default sort order, but it can also return a list of the top N models by using the
n_select parameter.
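As a quick sketch of these options (not executed as part of this experiment):
# illustrative only: sort by Recall, use 5-fold CV, and return the top 3 models as a list
top3 = compare_models(sort = 'Recall', fold = 5, n_select = 3)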
print(best_model)
create_model is the most granular function in PyCaret and is often the foundation behind most of
PyCaret's functionality. As the name suggests, this function trains and evaluates a model using cross
validation that can be set with the fold parameter. The output prints a score grid that shows Accuracy,
AUC, Recall, Precision, F1, Kappa and MCC by fold.
For the remaining part of this tutorial, we will work with the models below as our candidate models. The
selections are for illustration purposes only and do not necessarily mean they are the top performing or
ideal for this type of data.
There are 18 classifiers available in PyCaret's model library. To see the list of all classifiers, either
check the docstring or use the models() function to see the library.
models()
dt = create_model('dt')
#trained model object is stored in the variable 'dt'.
print(dt)
knn = create_model('knn')
rf = create_model('rf')
Notice that the mean score of all models matches the score printed by compare_models(). This is because
the metrics printed in the compare_models() score grid are the average scores across all CV folds.
Similar to compare_models(), if you want to change the fold parameter from the default value of 10 to a
different value then you can use the fold parameter. For example, create_model('dt', fold = 5) will
create a Decision Tree Classifier using 5-fold stratified CV.
When a model is created using the create_model() function it uses the default hyperparameters to train
the model. In order to tune hyperparameters, the tune_model() function is used. This function
automatically tunes the hyperparameters of a model using random grid search on a pre-defined search
space. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC by
fold for the best model. To use a custom search grid, you can pass the custom_grid parameter to the
tune_model function (see the KNN tuning in section 9.2 below).
tuned_dt = tune_model(dt)
#tuned model object is stored in the variable 'tuned_dt'.
print(tuned_dt)
import numpy as np
# n_neighbors must be at least 1, so the custom search space starts at 1
tuned_knn = tune_model(knn, custom_grid = {'n_neighbors' : np.arange(1,51,1)})
print(tuned_knn)
tuned_rf = tune_model(rf)
By default, tune_model optimizes Accuracy but this can be changed using the optimize parameter. For
example, tune_model(dt, optimize = 'AUC') will search for the hyperparameters of a Decision Tree
Classifier that result in the highest AUC instead of Accuracy. For the purposes of this example, we have
used the default metric Accuracy only for the sake of simplicity. Generally, when the dataset is
imbalanced (such as the credit dataset we are working with), Accuracy is not a good metric to consider.
The methodology behind selecting the right metric to evaluate a classifier is beyond the scope of this
tutorial but if you would like to learn more about it, you can click here to read an article on how to
choose the right evaluation metric.
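For illustration only (not used in the remainder of this tutorial), tuning for AUC instead of Accuracy would look like this:
# illustrative only: optimize the search for AUC rather than the default Accuracy
tuned_dt_auc = tune_model(dt, optimize = 'AUC')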
Metrics alone are not the only criteria you should consider when finalizing the best model for
production. Other factors to consider include training time, the standard deviation across CV folds, etc.
As you progress through the tutorial series we will discuss those factors in detail at the intermediate
and expert levels. For now, let's move forward considering the Tuned Random Forest Classifier, tuned_rf,
as our best model for the remainder of this tutorial.
Before model finalization, the plot_model() function can be used to analyze the performance across
different aspects such as the AUC, confusion matrix, decision boundary, etc. This function takes a
trained model object and returns a plot based on the test/hold-out set.
There are 15 different plots available; please see the plot_model() docstring for the list of available
plots.
plot_model(tuned_rf, plot = 'auc')
plot_model(tuned_rf, plot = 'pr')
plot_model(tuned_rf, plot='feature')
plot_model(tuned_rf, plot = 'confusion_matrix')
Another way to analyze the performance of models is to use the evaluate_model()
function which displays a user interface for all of the available plots for a given model. It internally
uses the plot_model()
function.
evaluate_model(tuned_rf)
Before finalizing the model, it is advisable to perform one final check by predicting on the
test/hold-out set and reviewing the evaluation metrics. If you look at the information grid in Section 6
above, you will see that 30% (6,841 samples) of the data has been separated out as a test/hold-out
sample. All of the evaluation metrics we have seen above are cross-validated results based on the
training set (70%) only. Now, using our final trained model stored in the tuned_rf variable, we will
predict against the hold-out sample and evaluate the metrics to see if they are materially different from
the CV results.
predict_model(tuned_rf);
The accuracy on the test/hold-out set is 0.8116 compared to 0.8203 achieved on the tuned_rf CV results
(in section 9.3 above). This is not a significant difference. If there is a large variation between the
test/hold-out and
CV results, then this would normally indicate over-fitting but could also be due to several other factors
and would require further investigation. In this case, we will move forward with finalizing the model and
predicting on unseen data (the 5% that we had separated in the beginning and never exposed to PyCaret).
(TIP : It's always good to look at the standard deviation of CV results when using
create_model()
.)
Model finalization is the last step in the experiment. A normal machine learning workflow in PyCaret
starts with setup()
, followed by comparing all models using compare_models()
and
shortlisting a few candidate models (based on the metric of interest) to perform several modeling
techniques such as hyperparameter tuning, ensembling, stacking etc. This workflow will eventually lead you
to the best model for use in making predictions on new and unseen data. The finalize_model()
function fits the model onto the complete dataset including the test/hold-out sample (30% in this case).
The purpose of this function is to train the model on the complete dataset before it is deployed in
production.
final_rf = finalize_model(tuned_rf)
#Final Random Forest model parameters for deployment
print(final_rf)
One final word of caution: once the model is finalized using finalize_model(), the entire dataset
including the test/hold-out set is used for training. As such, if the model is used for predictions on
the hold-out set after finalize_model() is used, the information grid printed will be misleading as you
are trying to predict on the same data that was used for modeling. In order to demonstrate this point
only, we will use final_rf under predict_model() to compare the information grid with the one above in
section 11.
predict_model(final_rf);
Notice how the AUC in final_rf
has increased to 0.7526
from
0.7407
, even though the model is the same. This is because the
final_rf
variable has been trained on the complete dataset including the test/hold-out set.
The predict_model() function is also used to predict on the unseen dataset. The only difference from
section 11 above is that this time we will pass data_unseen to the data parameter. data_unseen is the
variable created at the beginning of the tutorial and contains 5% (1,200 samples) of the original dataset
which was never exposed to PyCaret (see section 5 for an explanation).
unseen_predictions = predict_model(final_rf, data=data_unseen)
unseen_predictions.head()
The Label and Score columns are added to the data_unseen set. Label is the prediction and Score is the
probability of the prediction. Notice that the predicted results are concatenated to the original dataset
while all the transformations are automatically performed in the background. You can also check the
metrics on these predictions since the actual target column default is available. To do that we will use
the pycaret.utils module. See the example below:
from pycaret.utils import check_metric
check_metric(unseen_predictions['default'], unseen_predictions['Label'], metric = 'Accuracy')
We have now finished the experiment by finalizing the tuned_rf model, which is now stored in the final_rf
variable. We have also used the model stored in final_rf to predict on data_unseen. This brings us to the
end of our experiment, but one question remains: what happens when you have more new data to predict on?
Do you have to go through the entire experiment again? The answer is no, PyCaret's built-in save_model()
function allows you to save the model along with the entire transformation pipeline for later use.
save_model(final_rf,'Final RF Model 11Nov2020')
(TIP : It's always good to include the date in the filename when saving models; it's good for version control.)
To load a saved model at a future date in the same or an alternative environment, we would use PyCaret's
load_model()
function and then easily apply the saved model on new unseen data for
prediction.
saved_final_rf = load_model('Final RF Model 11Nov2020')
Once the model is loaded in the environment, you can simply use it to predict on any new data using the
same predict_model()
function. Below we have applied the loaded model to predict the same
data_unseen
that we used in section 13 above.
new_prediction = predict_model(saved_final_rf, data=data_unseen)
new_prediction.head()
Notice that the results of unseen_predictions
and new_prediction
are identical.
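A quick way to verify this programmatically (a sketch; it assumes both prediction dataframes are still in memory):
# should print True if the two prediction dataframes match exactly
print(unseen_predictions.equals(new_prediction))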
from pycaret.utils import check_metric
check_metric(new_prediction['default'], new_prediction['Label'], metric = 'Accuracy')
This tutorial has covered the entire machine learning pipeline, from data ingestion and pre-processing to
model training, hyperparameter tuning, prediction, and saving the model for later use. We have completed
all of these steps in less than 10 commands which are naturally constructed and very intuitive to
remember, such as create_model(), tune_model(), and compare_models().
Re-creating the entire experiment without PyCaret would have taken well over 100 lines of code in most
libraries.
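As a recap, here is a minimal sketch of the end-to-end workflow used in this tutorial (same data and target column; the saved filename is illustrative):
from pycaret.classification import *
from pycaret.datasets import get_data

data = get_data('credit')                                          # load the sample dataset
clf = setup(data = data, target = 'default', session_id = 123)     # initialize the experiment
best = compare_models()                                            # train and compare all models
tuned = tune_model(best)                                           # tune the best model's hyperparameters
final = finalize_model(tuned)                                      # refit on the complete dataset
save_model(final, 'my_final_pipeline')                             # save model + transformation pipeline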
We have only covered the basics of pycaret.classification. In the following tutorials we will go deeper
into advanced pre-processing, ensembling, generalized stacking and other techniques that allow you to
fully customize your machine learning pipeline and are a must-know for any data scientist.
See you at the next tutorial. Follow the link to Binary Classification Tutorial (CLF102) - Intermediate Level