
Machine Learning Chapter 5: Decision Tree Modeling


    • 5.1 Fundamentals of decision tree modeling
      • 5.1.1 Introduction to Decision Tree Modeling
      • 5.1.2 Basis for decision tree modeling
      • 5.1.3 Code Implementation of Decision Tree Models
    • 5.2 Case Study: Employee Departure Prediction Modeling
      • 5.2.1 Model construction
      • 5.2.2 Model predictions and assessment
      • 5.2.3 Decision Tree Model Visualization and Understanding Decision Tree Elements
    • 5.3 Parameter Tuning: K-fold Cross-Validation & GridSearch
      • 5.3.1 K-fold cross validation
      • 5.3.2 GridSearch (parameter tuning)
      • 5.3.3 Pre- and Post-Pruning of Decision Trees
    • 5.4 Case Practice: Bank Customer Default Prediction Modeling
      • 5.4.1 Model construction
      • 5.4.2 Model predictions and assessment
      • 5.4.3 Visual presentation of the model

5.1 Fundamentals of decision tree modeling

This section first introduces the basic concepts of decision tree modeling and the mathematics behind it, and then describes how to implement a decision tree model in Python, paving the way for building the employee departure prediction model later in the chapter.

5.1.1 Introduction to Decision Tree Modeling

The decision tree model is one of the more easily understood machine learning algorithms. Its basic principle is to work through a series of if/else questions and ultimately arrive at the relevant decision.
The figure below shows a simple demonstration of a typical decision tree model: an employee departure prediction model. The tree first checks whether the employee's satisfaction is less than 5. If the answer is "yes", the employee is predicted to leave; if "no", the tree then checks whether the income is less than 10,000 yuan. If that answer is "yes", the employee is predicted to leave; if "no", the employee is predicted to stay.
[Figure: a simple decision tree for employee departure prediction]
This is just a simple demonstration. The actual employee departure prediction model built later will be based on a larger data set and will be somewhat more complex, but the core principle of the decision tree model is what the figure above shows. Moreover, business practice will not judge departure solely on the two features "satisfaction < 5" and "income < 10,000"; instead, a probability of departure is predicted from multiple features and compared with a threshold, e.g., an employee whose predicted departure probability exceeds 50% is considered likely to leave.

A few important keywords of decision tree modeling are explained here: root node, parent node, child node and leaf node.

Parent node and child node are relative terms: a child node is split off from its parent node according to some rule, and then the child node continues to split as a new parent node until no further split is possible. Root node and leaf node are likewise relative: the root node is the node with no parent, i.e., the initial node, and a leaf node is a node with no children, i.e., a final node. The key to decision tree modeling is how to choose the right nodes for splitting.

For example, in the figure above, the top node "Satisfaction<5" is the root node and also a parent node; it splits into two child nodes, "Leaving" and "Income<10,000". The child node "Leaving" is also a leaf node because it is not split any further and has no children; the other child node "Income<10,000" is also the parent of the two nodes below it, and the final "Leaving" and "Not leaving" nodes are leaf nodes.

In practice, a company will look at its existing data to see which characteristics departing employees share, for example their satisfaction, income, length of service, monthly working hours and number of projects, and then select the corresponding features for node splits, building a decision tree model similar to the one shown above. The decision tree model can then be used to predict employee departure and take appropriate action based on the data analysis results.

The concept of the decision tree itself is not complicated: it mainly reaches its final conclusion through a sequence of logical judgments. The key lies in how to build such a "tree". For example, which feature should the root node use? Choosing "Satisfaction<5" as the root node and choosing "Income<10,000" as the root node will lead to different results. Likewise, since income is a continuous variable, should the node be "Income<10,000" or "Income<100,000"? The next subsection explains the basis of decision tree modeling.

5.1.2 Basis for decision tree modeling

The main concept used as the basis for building a decision tree model is the Gini coefficient. The Gini coefficient (gini) measures the disorder in a system, i.e., the degree of chaos in the system: the higher the Gini coefficient, the more disordered the system. The purpose of building a decision tree model is to reduce the disorder of the system and thus achieve a good classification of the data. The formula for the Gini coefficient is:

$$ \mathrm{gini}(T) = 1 - \sum_{i} p_i^2 $$

where pi is the frequency of category i in sample T, i.e., the number of samples of category i divided by the total number of samples, and the ∑ denotes summation over all categories.

For example, for a sample consisting entirely of departing employees, there is only one category, departing employees, which occurs with frequency 100%, so the Gini coefficient of this system is 1 - 1² = 0, meaning the system is not chaotic at all, or equivalently that it has high "purity". If half of the sample consists of employees who have left and the other half of employees who have not, then there are 2 categories, each with frequency 50%, so the Gini coefficient is 1 - (0.5² + 0.5²) = 0.5, i.e., the level of disorder is high.
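For readers who want to verify these numbers, here is a minimal Python sketch of the Gini calculation (the function name gini is only for illustration):

# A minimal sketch of the Gini coefficient calculation used above.
def gini(frequencies):
    """Gini coefficient of a node, given the class frequencies (which sum to 1)."""
    return 1 - sum(p ** 2 for p in frequencies)

print(gini([1.0]))       # all departing employees -> 0, the system is completely pure
print(gini([0.5, 0.5]))  # half departed, half not  -> 0.5, high disorder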
When a variable used to perform the split is introduced (e.g., "satisfaction<5"), the Gini coefficient after the split is given by:

$$ \mathrm{gini}(T) = \frac{S_1}{S_1 + S_2}\,\mathrm{gini}(T_1) + \frac{S_2}{S_1 + S_2}\,\mathrm{gini}(T_2) $$

where S1 and S2 are the sample sizes of the two sub-samples after the split, and gini(T1) and gini(T2) are the Gini coefficients of the two sub-samples.
For example, an initial sample has 1,000 employees, of whom 400 are known to have left and 600 have not. The Gini coefficient of this system before any split is 1 - (0.4² + 0.6²) = 0.48. The following two splits are then considered for the initial node: 1) split on "satisfaction < 5"; 2) split on "income < 10,000".
Division 1: The initial node of "satisfaction<5" is used for division, and the Gini coefficient after division is 0.3, as shown in the following figure.
[Figure: split on "satisfaction<5", post-split Gini coefficient 0.3]
Division 2: The initial node "Income <10,000" is used for division, and the Gini coefficient after division is 0.45, as shown in the figure below.

It can be seen that the Gini coefficient is 0.48 without splitting, 0.3 when "satisfaction<5" is used as the initial node, and 0.45 when "income<10,000" is used. The lower the Gini coefficient, the lower the disorder of the system and the better the split separates the classes, so "satisfaction<5" is chosen as the initial node. This demonstrates how the initial node is selected; the nodes below it are selected in a similar way.
Similarly, for the continuous variable "income", whether to split on "income < 10,000" or "income < 100,000" is also decided by calculating the post-split Gini coefficient in each case. If there are other variables, such as "length of service" and "monthly working hours", the post-split Gini coefficient of the system is calculated in the same way to decide how to split the nodes, thereby building a more complete decision tree model. A decision tree that uses the Gini coefficient is also called a CART decision tree.

Supplementary knowledge: information entropy
Another classic measure of system disorder: information entropy, is added here for the interested reader.
Information entropy plays essentially the same role as the Gini coefficient: both measure the degree of chaos in the system and are used to choose node splits. The formula for information entropy H(X) is:

$$ H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i $$

where X is a random variable whose possible values are X1, X2, ..., Xn; in an n-class classification problem there are n such values. For example, in the employee departure prediction model, X takes two values, "leave" and "not leave". pi is the probability that X takes the value Xi, and ∑pi = 1. Note also that the logarithm here has base 2, i.e., log2.

Again, for a sample consisting entirely of departing employees, there is only one category, departing employees, with frequency 100%, so the information entropy of the system is -1 × log2(1) = 0, indicating that the system is not chaotic at all. If half of the sample are departed employees and the other half are not, then there are 2 categories, each with frequency 50%, so the information entropy is -(0.5 × log2(0.5) + 0.5 × log2(0.5)) = 1, i.e., the level of disorder is high.
When a variable A used to perform the split is introduced (e.g., "satisfaction<5"), the information entropy after splitting on A, also called the conditional entropy, is given by:

$$ H_A(X) = \frac{S_1}{S_1 + S_2}\,H(X_1) + \frac{S_2}{S_1 + S_2}\,H(X_2) $$

where S1 and S2 are the sample sizes divided into two classes, and H(X1) and H(X2) are the information entropy of each of the two classes after division.
Similar to the Gini coefficient decrease calculated earlier, here we calculate the decrease in information entropy (the entropy of the original system minus the entropy of the system after the split). This decrease is called the entropy gain or information gain; the larger it is, the better, since a larger value means less disorder after the split, i.e., a more accurate classification.
The formula for information gain is:

$$ \mathrm{Gain}(A) = H(X) - H_A(X) $$

To illustrate the concept and use of information entropy with the previous example: the initial sample has 1,000 employees, of whom 400 are known to have left and 600 have not. The information entropy of this system before the split is -(0.4 × log2(0.4) + 0.6 × log2(0.6)) ≈ 0.97, a fairly high level of disorder. Two different splits are then considered for the initial node: 1) split on "satisfaction < 5"; 2) split on "income < 10,000".
Way 1: with "satisfaction<5" as the initial node, the information entropy after the split is 0.65, as shown in the figure below, giving an entropy gain (information gain) of 0.32.
[Figure: split on "satisfaction<5"]
Way 2: with "income<10,000" as the initial node, the information entropy after the split is 0.96, as shown in the figure below, giving an entropy gain (information gain) of 0.046.
[Figure: split on "income<10,000"]
The information gain after splitting according to way 1 (0.32) is greater than that of way 2 (0.046), so the decision tree is split according to way 1, which better reduces the disorder of the system and gives a more reasonable classification. This is the same conclusion as the one reached earlier with the Gini coefficient.
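The entropy and information gain figures above can likewise be checked with a short sketch (the function name is illustrative; the post-split entropy 0.65 for way 1 is taken from the figure):

import math

def entropy(frequencies):
    """Information entropy H(X) for the given class frequencies (which sum to 1)."""
    return -sum(p * math.log2(p) for p in frequencies if p > 0)

H_before = entropy([0.4, 0.6])  # 400 departed / 600 stayed -> about 0.97
gain_way1 = H_before - 0.65     # post-split entropy of way 1 read from the figure
print(round(H_before, 2), round(gain_way1, 2))  # 0.97 0.32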
In decision tree modeling, since the Gini coefficient only involves squaring while information entropy involves the more complex logarithm, decision tree models use the Gini coefficient for the calculation by default, which makes the computation faster.
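As a side note, if you do want to experiment with information entropy as the splitting criterion, Scikit-Learn's DecisionTreeClassifier exposes this through its criterion parameter (the model names below are only illustrative):

from sklearn.tree import DecisionTreeClassifier

model_gini = DecisionTreeClassifier()                        # criterion='gini' is the default
model_entropy = DecisionTreeClassifier(criterion='entropy')  # switch to information entropy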

The amount of data in commercial practice is usually very large, and calculating the Gini coefficient or information entropy for every candidate split by hand is not humanly possible; the machine has to train continuously to find the best split nodes. In Python, the Scikit-Learn library helps to build a decision tree model quickly. If Python was installed through Anaconda as described in Chapter 1, this library is already installed. Next we explain a simple code implementation of the decision tree model.

5.1.3 Code Implementation of Decision Tree Models

Decision tree models can be used for both classification analysis (i.e., predicting the values of categorical variables) and regression analysis (i.e., predicting the values of continuous variables), corresponding to the models DecisionTreeClassifier and DecisionTreeRegressor, respectively.

1. Classification decision tree model (DecisionTreeClassifier)
A simple code demonstration of the classification decision tree model is shown below:

from sklearn.tree import DecisionTreeClassifier
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [1, 0, 0, 1, 1]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

print(model.predict([[5, 5]]))

Line 1 of the code introduces the classification decision tree model: the DecisionTreeClassifier;
X in the 2nd line of code is the feature variable, there are 5 training data, each data has 2 features, such as data [1, 2], which has a value of 1 for the first feature and 2 for the second feature;
The y in line 3 of the code is the target variable, and there are two categories: 0 and 1;
Line 5 introduces the model and sets the random state parameter random_state to the number 0. The number itself has no special meaning and can be replaced with another number. It is a seed parameter that makes the result consistent every time the code is run, and this parameter will be explained in the additional knowledge points in this section;
The 6th line trains the model by fit() function; the last 1 line predicts by predict() function and the prediction results are as follows:

[0]

It can be seen that for the data [5, 5] it is categorized into the category 0.
If more than one data is to be predicted at the same time, it can be written in the following form:

print(model.predict([[5, 5], [7, 7], [9, 9]])) 

The results of the projections are shown below:

[0 0 1]

For ease of understanding, the decision tree is visualized using the decision tree visualization technique that will be covered in Section 5.2.3, as shown below:
[Figure: visualization of the classification decision tree]
First, some basic concepts. X[0] in the graph represents the first feature of the data and X[1] the second; gini is the node's Gini coefficient (taking the root node as an example, it is 1 - ((2/5)² + (3/5)²) = 0.48); samples is the number of samples in the node; value gives the number of samples of each category, e.g., [2, 3] in the root node means 2 samples belong to category 0 and 3 samples to category 1; class is the category assigned to the node, determined by whichever category has more samples in value, e.g., in the root node the number of category-1 samples (3) is greater than that of category-0 samples (2), so the node's class is 1, and so on for the other nodes.

The topmost node, i.e., the root node, is divided into nodes based on whether X[1] is less than or equal to 7. If the condition is satisfied (i.e., True), it is divided into the left child node, otherwise (i.e., False) it is divided into the right node. Take the data [5, 5] as an example, at the root node, it satisfies the condition that X[1] (i.e., the second eigenvalue) is less than or equal to 7, so it is divided to the left child node. At that sub-node, another judgment is made to determine whether X[0] is less than or equal to 2, because the value of X[0] is 5, which does not satisfy the condition, so it is divided into the node to the right of that sub-node, and the category class in that node is 0, so the data [5, 5] is predicted to be in category 0 under this decision tree model.
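If graphviz is not yet installed, the same split information can also be read directly from the fitted model's tree_ attribute. The sketch below reuses the model trained above and prints each node's split feature and threshold; the exact output depends on the fitted tree, so treat it as an optional check rather than required output.

# Inspect the fitted tree structure without graphviz (optional check).
tree = model.tree_
for node in range(tree.node_count):
    if tree.children_left[node] == -1:  # -1 marks a leaf node
        print(f"node {node}: leaf, value = {tree.value[node].ravel()}")
    else:
        print(f"node {node}: split on X[{tree.feature[node]}] <= {tree.threshold[node]:.2f}")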

Additional knowledge: explanation of the role of the random_state parameter
When introducing the decision tree model we set the random_state parameter. The reason is that the decision tree model prioritizes node splits that produce the largest decrease in the Gini coefficient of the whole system, but it is possible (especially when the amount of data is small) that different splits produce exactly the same decrease. The figure below shows the different decision trees obtained from multiple runs without setting the random_state parameter:
[Figure: two different decision trees obtained without setting random_state]

It can be seen that they are divided into nodes in different ways, which will result in different predictions for the same data, for example the data [7, 7] will be predicted as category 0 in the decision tree on the left, while it will be predicted as category 1 in the decision tree on the right.
At this point some readers will ask: why does training the model produce two different trees, and which one is correct? In fact both trees are correct. The reason is that splitting the node on "X[1]<=7" or on "X[0]<=6" produces exactly the same decrease in the Gini coefficient (both equal 0.48 - (0.6 × 0.444 + 0.4 × 0) = 0.2136), so it is reasonable to split the node either way. This phenomenon is largely due to the small amount of data, which makes it easy for different splits to produce the same Gini coefficient decrease; with larger amounts of data the chance of this happening is smaller.

In general, for the same model, different ways of splitting may lead to different final predictions. Setting the random_state parameter (it can be set to 0, 1, 123 or any other number) ensures that the splits are the same every time, so that every run produces the same result. This concept is quite important for beginners, who often find that running the same model repeatedly gives different results and become confused; if this happens, simply set the random_state parameter.
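As an optional experiment illustrating this point (the outcome is not guaranteed, because the tie between equally good splits is broken randomly), you can fit the toy model twice without random_state and compare the predictions for [7, 7]:

from sklearn.tree import DecisionTreeClassifier

X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [1, 0, 0, 1, 1]

# Without random_state, equally good splits are chosen randomly,
# so repeated fits may build different (but equally valid) trees.
model_a = DecisionTreeClassifier().fit(X, y)
model_b = DecisionTreeClassifier().fit(X, y)
print(model_a.predict([[7, 7]]), model_b.predict([[7, 7]]))  # may or may not agree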

2. Regression decision tree model (DecisionTreeRegressor)
In addition to classification, the decision tree can also be used for regression analysis, i.e., for predicting continuous variables, in which case it is called a regression decision tree. A simple code demonstration of the regression decision tree model is shown below (using the prediction of bank customer value as an example):

from sklearn.tree import DecisionTreeRegressor
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [1, 2, 3, 4, 5]

model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(X, y)

print(model.predict([[9, 9]]))

Where X is the feature variable, which has a total of 2 features; y is the target variable, which is a continuous variable; line 5 introduces the model and sets the maximum depth parameter of the decision tree max_depth to 2 and the random state parameter random_state to 0; line 6 trains the model through the fit() function; and the last line makes a prediction through the predict() function. The prediction result is as follows:

[4.5]

It can be seen that for the data [9, 9] the predicted fit is 4.5.
The concept of the regression decision tree model is basically the same as that of the classification decision tree; the biggest difference is that the splitting criterion is no longer information entropy or the Gini coefficient but the mean squared error (MSE), whose formula is:

$$ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2 $$

where n is the number of samples, y(i) is the actual value and ŷ(i) is the fitted (predicted) value.
For ease of understanding, the decision tree is visualized using the decision tree visualization technique that will be covered in Section 5.2.3, as shown below:
[Figure: visualization of the regression decision tree]

In the figure, X[0] denotes the first feature of the data and X[1] the second; mse denotes the mean squared error of the node; samples denotes the number of samples in the node. Note that here value denotes the node's fitted value: in a regression decision tree, the mean of all the data in a node is used as that node's fitted value, and for the final leaf nodes the fitted value is the model's regression prediction.

For example, the root node contains all 5 data points, so its fitted value ŷ is the mean of all the data in the node, i.e., (1+2+3+4+5)/5 = 3, and its mean squared error MSE is calculated as follows, giving the number 2, which is consistent with the result obtained by the program.

$$ \mathrm{MSE} = \frac{(1-3)^2+(2-3)^2+(3-3)^2+(4-3)^2+(5-3)^2}{5} = 2 $$
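A quick sanity check of this value with NumPy (a minimal sketch; the variable names are illustrative):

import numpy as np

y = np.array([1, 2, 3, 4, 5])
fitted = y.mean()                 # the root node's fitted value: 3.0
mse = np.mean((y - fitted) ** 2)  # mean squared error of the root node
print(fitted, mse)                # 3.0 2.0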

The goal of the regression decision tree is to minimize the mean squared error of the final system, and its node splits are chosen with the same idea. For example, splitting the root node on "X[1]<=5" gives the largest decrease in mean squared error (2 - (0.4 × 0.25 + 0.6 × 0.667) = 1.5).
If the depth of the decision tree were not limited, the tree would keep extending downwards until the mean squared error MSE in every leaf node equals 0. Here, because the maximum depth parameter max_depth is set to 2, the tree only goes two levels down from the root node; if this parameter were not set, the node in the lower right corner would continue to split until the MSE of every node is 0. One reason for setting max_depth here is to make the fitting effect easier to demonstrate (the result 4.5 is not an integer, so it looks like a regression result rather than a classification result); another is to prevent the model from overfitting (for the concept of overfitting, see Section 3.2.3 of the book). In practice, max_depth is also usually set mainly to prevent overfitting.
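To see this effect, you can refit the same toy data without limiting the depth (a minimal sketch reusing the X and y above); a fully grown tree drives the MSE of every leaf to 0, so predictions for the training points reproduce their target values exactly:

from sklearn.tree import DecisionTreeRegressor

X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [1, 2, 3, 4, 5]

full_tree = DecisionTreeRegressor(random_state=0)  # no max_depth limit
full_tree.fit(X, y)
print(full_tree.predict(X))  # [1. 2. 3. 4. 5.] - each training point is fitted exactly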
As to why the fit is 4.5 for this data of [9, 9], I believe that it should be readily apparent by looking at the graph above.

In practice, the classification decision tree model is used relatively more often, but both the classification and regression decision tree models are very important: the ensemble models covered in Chapters 8, 9 and 10, namely the random forest, AdaBoost, GBDT, XGBoost and LightGBM models, are all built on top of the decision tree model. Having covered the simple use of decision tree models, in the next section we will use a concrete business case to explain how to build a decision tree model with Python.

Supplementary knowledge: overfitting and underfitting
As shown in the figure below, overfitting means that the model fits the training samples too closely: although it fits the training set data well, it loses the ability to generalize (i.e., it does not predict well on data outside the training set), leading to poor performance on the test set. The opposite of overfitting is underfitting, which means the model does not fit the data well: the data points lie far from the fitted curve because the model has not captured the features of the data well enough.

[Figure: overfitting vs. underfitting]

5.2 Case Study: Employee Departure Prediction Modeling

In this section, we will learn the application of decision tree models in the talent decision-making domain by building an employee exit prediction model, and will explain how to evaluate the decision tree model, and finally present the decision tree model through visualization.

5.2.1 Model construction

The purpose of employee exit prediction modeling is to build a suitable model from existing employee information and exit performance to predict whether employees will leave afterwards.

1. Data reading and pre-processing
First read the employee information and the corresponding departure records (i.e., whether each employee has left), with the following code:

import pandas as pd
df = pd.read_excel('Employee Turnover Prediction Model.xlsx')
df.head()

The results are shown in the following table. There are 15,000 rows of historical data, of which the first 3,571 are departed employees and the last 11,429 are employees who have not left; in the "Separation" column, the number 1 represents departure and 0 represents no departure. Our goal is to build a decision tree model from these historical data to predict the likelihood of employees leaving the company.
[Table: first five rows of the employee data]
Because the mathematical model cannot recognize text content, and in the original data "salary" is divided into the three levels "high", "medium" and "low", the "salary" column needs to be converted to numbers. Here this is done with the replace() function of the pandas library, representing "high" by 2, "medium" by 1 and "low" by 0.

df = df.replace({'Wages': {'Low': 0, 'Medium': 1, 'High': 2}})
df.head()

After text processing, the result is shown below:
[Table: data after numerical processing of the salary column]
In addition to processing text data with the replace() function, Section 11.1 covers two other methods, Get_dummies dummy variable processing and Label Encoding number processing, which interested readers can refer to.
In the data table, "Separation" is used as the target variable and the remaining fields are used as feature variables to predict whether an employee will leave based on his or her characteristics. For ease of demonstration only six feature variables are selected here; in business practice many more would be used. Next we move on to constructing the decision tree model, following the standard steps used to build most machine learning models.

2. Extraction of feature variables and target variables
Firstly, the feature variables and target variables are extracted separately and the code is as follows:

X = df.drop(columns='Separation') 
y = df['Separation']     

The drop() function deletes the "Separation" column and assigns the remaining data to the variable X as the feature variables; then the "Separation" column is extracted using the usual DataFrame column selection and assigned to the variable y as the target variable.

3. Divide the training set and test set
After extracting the feature variables, we need to split the original 15,000 data into a training set and a test set. As the name suggests, the training set is used for training, while the test set is used to test the results of model training.
The code for dividing the training and test sets is as follows:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

The first line of code is to introduce the train_test_split() function from the Scikit-Learn library;
The 2nd line of code then divides the training set and test set by train_test_split() function, where X_train, y_train are the data in the training set and X_test, y_test are the data in the test set.
The first two parameters of the train_test_split() function, X and y, are the feature variables and target variable extracted earlier; test_size is the proportion of data assigned to the test set, here 20%, i.e., 0.2. If there is a large amount of data, it can also be set to 0.1, allocating a relatively smaller share for testing and a larger share for training.
Because the train_test_split() function divides the data randomly each time, if you want every division to produce the same result you can set the random_state parameter. The number 123 used here has no special meaning; it is just a seed that keeps the division the same on every run, and it can be replaced with any other value.
The segmented data is shown below:
[Figure: the training and test sets after splitting]
4. Model training and construction
After dividing into training set and test set, the decision tree model can be introduced from Scikit-Learn library for model training, the code is as follows:

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3, random_state=123) 
model.fit(X_train, y_train)

The first of these lines of code introduces the classification decision tree model from the Scikit-Learn library: the DecisionTreeClassifier;
The second line of code assigns the decision tree model to the variable model, setting the model parameter max_depth to 3, i.e., a maximum tree depth of 3 (the concept of model parameters will be explained in Section 5.3 on parameter tuning), and setting the random state parameter random_state to 123; this number has no special significance and only makes the result of each run consistent;
The third line of code performs the training of the model through the fit() function, and the parameters passed in are the training set data obtained earlier.
At this point a decision tree model has been built. The code for building the decision tree model is summarized as follows:

# 1. Read the data and do simple preprocessing
import pandas as pd
df = pd.read_excel('Employee Turnover Prediction Model.xlsx')
df = df.replace({'Wages': {'Low': 0, 'Medium': 1, 'High': 2}})

# 2. Extract the feature variables and target variable
X = df.drop(columns='Separation')
y = df['Separation']

# 3. Divide the training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# 4. Train and build the model
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3, random_state=123)
model.fit(X_train, y_train)

Once the model is built, it is time to use it to make predictions; this is where the test set prepared earlier comes into play: it is used to make predictions and to evaluate the model's performance.

5.2.2 Model predictions and assessment

This subsection describes how to directly predict whether an employee will leave, how to predict the probabilities of not leaving and leaving, and finally how to evaluate the model properly.

1. Direct prediction of separation
The purpose of building the model is to use it to predict data. Here the test set data are fed into the model to make predictions with the following code, where model is the decision tree model built in the previous subsection.

y_pred = model.predict(X_test)

The results can be viewed by printing y_pred[0:100], as shown below, where 0 and 1 are the predicted results: 0 means no departure is predicted and 1 means departure is predicted.
[Figure: the first 100 predicted values]
Using the knowledge points about creating DataFrames, the predicted values y_pred and the actual values y_test of the test set can be put together. Since y_pred is a one-dimensional numpy.ndarray and y_test is a one-dimensional Series, both are converted to lists with the list() function; the code is as follows:

a = pd.DataFrame()  # Create an empty DataFrame
a['Predicted value'] = list(y_pred)
a['Actual value'] = list(y_test)

The first five rows of the generated DataFrame are printed out as follows:
[Table: predicted vs. actual values for the first five rows of the test set]
You can see that the predictions for the first five rows of the test set are all correct (100% accuracy). To see the overall prediction accuracy, use the following code:

from sklearn.metrics import accuracy_score
score = accuracy_score(y_pred, y_test)

Printing score gives the following result:

0.9573

The prediction accuracy of the whole model on the test set is 0.9573, i.e., of the 3,000 rows in the test set, 2,872 predictions matched the actual results.
In addition, we can also check the model's prediction accuracy using the score() function that comes with the model, with the following code:

model.score(X_test, y_test)

Printing it out, the same result of 0.9573 was obtained.

2. Predicting non-separation & probability of separation
In fact, the classification decision tree model essentially predicts not the exact class 0 or 1 but the probability of belonging to each class. The predicted probabilities of each class can be viewed with the following code:

y_pred_proba = model.predict_proba(X_test)

At this point y_pred_proba holds the predicted probability of each class; it is a two-dimensional array. You can print y_pred_proba directly: the number in the left column is the probability of class 0 and the number in the right column the probability of class 1. You can also load it into a DataFrame for easier viewing with the following code:

b = pd.DataFrame(y_pred_proba, columns=['Probability of non-separation', 'Probability of leaving']) 

The first 5 rows of the resulting table are shown below:
[Table: predicted probabilities of not leaving and leaving]
When directly predicting whether an employee will leave, what actually happens is that the class with the greatest probability is chosen: for example, in the first row the probability of not leaving is 0.9852, which is greater than the probability of leaving, 0.0147, so the prediction is that the employee will not leave.
Careful readers may have noticed that some of the probabilities above are identical, e.g., the first and second employees both have a non-departure probability of 0.9852 and a departure probability of 0.0147; after the model is visualized in the next subsection, the way these probabilities are calculated will become clear.
To view only the probability of leaving, i.e., the second column of y_pred_proba, use the following code. This is the standard way of selecting a column of a two-dimensional array: the ":" before the comma selects all rows, and the number 1 after the comma selects the second column; changing the 1 to 0 would extract the first column, the probability of not leaving.

y_pred_proba[:,1]

3. Assessment of the effectiveness of model predictions
As mentioned in Section 4.3 of Chapter 4 on logistic regression modeling, for classification models we care not only about prediction accuracy but also about two other metrics: the hit rate (the proportion of employees who actually left that are predicted to leave) and the false alarm rate (the proportion of employees who did not actually leave that are predicted to leave); the model is then judged by the ROC curve plotted from the two.
We want the false alarm rate to be as small as possible and the hit rate to be as high as possible for the same threshold, i.e., the ROC curve is as steep as possible and its corresponding AUC value (area under the ROC curve) is as high as possible.
In terms of the Python implementation, the code covered in Section 4.3 can be used to find the hit rate (TPR) and false alarm rate (FPR) at different thresholds, which makes it possible to plot the ROC curve.

from sklearn.metrics import roc_curve
fpr, tpr, thres = roc_curve(y_test, y_pred_proba[:,1])

The first line of code introduces the roc_curve() function. The second line passes in the target variable values of the test set, y_test, and the predicted probability of leaving, then calculates the hit rate and false alarm rate at different thresholds via the roc_curve() function and assigns the results to the variables fpr (false alarm rate), tpr (hit rate) and thres (threshold); at this point fpr, tpr and thres are three one-dimensional arrays. Note that roc_curve() returns a tuple of three elements, of which by default the first is the false alarm rate, the second the hit rate and the third the threshold, so the variables in the second line must be written in the order fpr, tpr, thres.
The false alarm rate and hit rate at different thresholds can be viewed with the relevant code in Section 4.3, which is as follows:

a = pd.DataFrame()  # Create an empty DataFrame
a['Threshold'] = list(thres)
a['False alarm rate'] = list(fpr)
a['Hit rate'] = list(tpr)

At this point, form a is shown in the table below:
[Table: false alarm rate and hit rate at different thresholds]
There are a couple of points to note. The first row indicates that an employee is judged to leave only if his or her predicted probability of leaving is >= 200%; since a probability cannot exceed 100%, nobody is predicted to leave at this threshold, so both the hit rate and the false alarm rate are 0 (Section 4.3.2 of the book explains why such a seemingly meaningless threshold exists). The second row indicates that an employee is judged to leave only when the predicted probability of leaving is >= 100% (which, since probabilities cannot exceed 100%, in fact means exactly 100%); as the table shows, the hit rate is already 24.7% at this extreme threshold, i.e., 24.7% of the employees who actually left are predicted to leave. This will be easier to understand after the model visualization in the next subsection.

The third line indicates that an employee is judged to leave only if the probability that he or she is predicted to leave is >=94.6%, at which point the hit rate is 67.8%, the false alarm rate is 0.82%, and so on for the rest.

With the false alarm rate and hit rate at different thresholds known, the ROC curve can be plotted with the matplotlib library using the following code:

import matplotlib.pyplot as plt
plt.plot(fpr, tpr)
plt.show()

The plotted ROC curve is shown below, and you can see that this ROC curve is still steep.
[Figure: ROC curve of the decision tree model]
The following code is then used to quickly find the AUC value of the model:

from sklearn.metrics import roc_auc_score
score = roc_auc_score(y_test, y_pred_proba[:,1])

The first line of code introduces the roc_auc_score() function. The second line passes in the target variable values of the test set, y_test, and the predicted probability of leaving. The resulting AUC value, when printed, is 0.973, which indicates a fairly good prediction.

4. Characteristic importance assessment
After the model is built, sometimes we would like to know the importance of each feature variable, i.e., which feature variables play a greater role in the model, and in the decision tree model, the feature importance can be viewed by the following line of code:

model.feature_importances_

The following results were obtained and the sum of the importance of these features is 1.

array([0, 0.59810862, 0.14007392, 0.10638659, 0.00456495, 0.15086592])

In the decision tree model, the importance of a feature is determined by its contribution to the overall decrease in the Gini coefficient: the larger the contribution of a feature variable to the model's overall Gini coefficient decrease, the greater its feature importance. For example, if by the time the model has split down to the final leaf nodes the Gini coefficient of the whole system has decreased by 0.3, and the Gini coefficient decreases produced by all the nodes that split on feature A sum to 0.15, then the feature importance of A is 0.15/0.3, i.e., 0.5.
For models with few feature variables, the line of code above is enough to view each feature's importance, but if there are many feature variables, the following code can be used to match the feature importances to the feature names:

features = X.columns  # Get feature names
importances = model.feature_importances_  # Get feature importances

# Display in a 2D table
importances_df = pd.DataFrame()
importances_df['Feature name'] = features
importances_df['Characteristic importance'] = importances
importances_df.sort_values('Characteristic importance', ascending=False)

The first two lines of code obtain the feature names and feature importances; then, using the knowledge about constructing two-dimensional DataFrames from Section 2.2.1, the names and importances are combined into a two-dimensional table; finally the sort_values() function sorts the table by feature importance, where setting the ascending parameter to False sorts in descending order (from largest to smallest).
To add, the same two-dimensional table can also be obtained with the single line of code below. Note that in this case the table is laid out horizontally, i.e., its row index holds the feature name and feature importance, so it needs to be transposed with .T to display vertically.

importances_df = pd.DataFrame([features, importances], index=['Feature name', 'Characteristic importance']).T 

At this point the table shown below is obtained, with feature names and feature importances matched up.
[Table: feature importances sorted in descending order]
We can see that the most important feature variable is the first one, satisfaction, which agrees with common sense: if an employee's job satisfaction is high, the probability of leaving is relatively small, and vice versa. The other important features are the appraisal score and length of service. The feature importance of salary in the model is 0, which means it plays no role. This is partly because we limited the depth of the decision tree to 3 levels (max_depth=3), so the salary feature never gets a chance to be used; if the maximum tree depth were larger, salary might come into play, which is verified in Section 5.3.2. Another important reason is that in this case salary is not a specific value but is divided into the three broad ranges "high", "medium" and "low", so the salary variable contributes less to the decision tree model. If the salary were the employee's actual income, e.g., 10,000 yuan, it should play a bigger role.

5.2.3 Decision Tree Model Visualization and Understanding Decision Tree Elements

If you want to present the decision tree model visually, you can use the graphviz plugin for Python. Because model visualization is mainly used for demonstration and teaching and is less common in real applications, readers interested in installing and using graphviz can refer to the tutorial written by the author at /docs/Dcgw8H6WxgWrc8hq/ or scan the QR code below; the corresponding code and tutorial are also provided in the source code accompanying the book.
[QR code: graphviz installation and usage tutorial]
Here is a brief mention of the core knowledge points of graphviz plugin usage:
The first step is to install the graphviz plugin; its official download page is /download/. On a Windows system, download the msi file and install it. After installation you need to configure the environment variables; for the specific configuration method, refer to the tutorial link given above or scan the QR code.
The second step is to install the graphviz library for Python by running pip install graphviz.
Once installation is complete it can be used; the core code is shown below, which quickly visualizes the decision tree model. Generating a visualization that contains Chinese characters is somewhat more troublesome, but the author has also worked out a solution; interested readers can refer to the tutorial link above or scan the QR code.

from sklearn.tree import export_graphviz
import graphviz
dot_data = export_graphviz(model, out_file=None)  # where model is the name of the model
graph= graphviz.Source(dot_data)  # Visualize decision tree model models
graph.render('Decision Tree Visualization')  # Generate decision tree visualization PDFs

The following figure shows the visual decision tree model generated by graphviz:
[Figure: decision tree visualization generated by graphviz]
It can be seen that the tree structure has only 3 levels below the initial node, which corresponds to the model parameter max_depth set in Section 5.2.1, i.e., the maximum depth of the tree. One reason for setting max_depth is to make the demonstration easier; the other is that if the tree is too deep the model will overfit and its prediction performance will drop.
Here are a few important points to deepen your understanding of the decision tree model based on the diagram above.
Knowledge point 1: Meaning of each element of a node
The meaning of the content of each node in the figure is described here. Apart from the leaf nodes, each node has five elements: the split condition, gini (the node's Gini coefficient), samples (the number of samples in the node), value (the number of samples of each category) and class (the assigned category).
Take the root node as an example: its split condition is whether satisfaction is less than or equal to 4.65; its Gini coefficient is 0.365; the total number of samples it contains is 12,000; in value, the 9120 on the left is the number of samples whose "Separation" value is 0 (not leaving) and the 2880 is the number whose value is 1 (leaving); the class in the last line gives the node's classification, and because this node contains more non-departing employees (9120) than departing employees (2880), its class is "not leaving". However, the class of the root node has no real meaning; what matters is the class of the final leaf nodes. In addition, the final leaf nodes no longer contain a split condition because they are not split any further.

Knowledge point 2: Node division and its basis validation
After splitting, the root node produces two child nodes. In the left child node most samples are departed employees (3,367 people in total, of whom 1,325 have not left and 2,042 have left; the node's Gini coefficient is 0.477), while in the right child node most are employees who have not left (8,633 people in total, of whom 7,795 have not left and 838 have left; the node's Gini coefficient is 0.175). This matches practical experience: low satisfaction means a higher likelihood of leaving.
Using the knowledge from Section 5.1.2, the Gini coefficient of the system after splitting the root node is 3367/12000 × 0.477 + 8633/12000 × 0.175 = 0.260 (so the Gini coefficient decrease is 0.365 - 0.260 = 0.105). This is the optimal solution found by the machine through continuous training and calculation: splitting the root node in any other way would give a larger system Gini coefficient, i.e., a smaller Gini coefficient decrease.
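This weighted calculation is easy to check by hand; the sketch below uses the (rounded) numbers read from the figure:

# Post-split Gini coefficient of the root node (numbers read from the figure).
gini_after = 3367 / 12000 * 0.477 + 8633 / 12000 * 0.175
print(round(gini_after, 3))          # about 0.260
print(round(0.365 - gini_after, 3))  # Gini decrease: about 0.105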

Knowledge point 3: Relationship between feature importance and whole tree
The above point can also be verified here by the code used in the previous section to calculate the characteristic importance of the feature variables:

model.feature_importances_

The printouts are as follows, corresponding to the feature importance of each of the six feature variables.

array([0, 0.59810862, 0.14007392, 0.10638659, 0.00456495, 0.15086592])

It can be seen that the most important factor here is satisfaction. This also better explains why the feature importance of the salary variable is 0: it plays no role in the model. As the visualization shows, none of the split nodes uses the feature variable "salary", so this feature variable plays no role. If max_depth were set larger so that the tree could keep splitting downwards, this feature variable might come into play and its importance would no longer be 0.
In addition, as mentioned before, in the decision tree model the importance of a feature lies in the variable's contribution to the overall decrease in the Gini coefficient. Here we can verify this idea with the visualized decision tree by showing why the importance of the second feature variable, "satisfaction", is 0.598.
First, the overall decrease in the Gini coefficient needs to be calculated. As shown in the figure below, weighting the Gini coefficients of the leaf nodes by their sample counts gives a new system Gini coefficient of 0.0814, so the overall Gini coefficient decrease is 0.365 - 0.0814 = 0.2836.
[Figure: weighted calculation of the system Gini coefficient from the leaf nodes]
Taking "satisfaction" as an example, in the decision tree above it is used at the root node and at the middle node of the lower level. The Gini coefficient decreases it produces there are 0.105 (0.365 - (3367/12000 × 0.477 + 8633/12000 × 0.175) = 0.105) and 0.0646 (1971/12000 × 0.484 - (714/12000 × 0 + 1257/12000 × 0.142) = 0.0646) respectively, and their sum, 0.1696, is the contribution of the feature variable "satisfaction" to the overall model. Dividing it by the overall Gini coefficient decrease gives its feature importance: 0.1696/0.2836 = 0.598, which is consistent with the feature importance obtained from the code.

Knowledge Point 4: Basis for the cessation of leaf division
There are two main reasons why a leaf stops splitting: it cannot be split any further, or it has reached the splitting limits that were set. For example, the Gini coefficients of the leaf nodes in the lower right corner are all 0, which means these leaf nodes already have the highest purity (all the elements in them belong to the same category), so they do not need to split further and indeed cannot. Some leaf nodes whose Gini coefficients have not yet reached 0 do not continue splitting downwards because the maximum depth of the tree is limited to 3. In addition, leaf nodes have no split condition element because they do not need to split further.

Knowledge point 5: The relationship between the probability of not leaving & leaving and leaf nodes
Note that the calculation of the probabilities of not leaving and leaving mentioned in Section 5.2.2 is based on the leaf nodes. If a sample falls into the third leaf node from the left in the bottom row, its probability of not leaving is 0 and its probability of leaving is 100%, so it is judged to leave. If a sample falls into the leftmost leaf node, which contains 1,295 data points of which 70 did not leave and 1,225 left, then a new employee assigned to that leaf has a predicted probability of leaving of 1,225/1,295 = 0.946 and a probability of not leaving of 70/1,295 = 0.054; because the probability of leaving is greater than the probability of not leaving, the employee is judged to leave, and so on for the other leaves.

Knowledge point 6: ROC curves with leaf nodes
In addition, interested readers can calculate the departure probabilities reflected by the remaining leaf nodes (from the leftmost leaf to the rightmost, the departure probabilities are 94.6%, 5.94%, 100%, 7.72%, 1.47%, 100%, 4.58% and 71.4%) and then look back at the thresholds used to plot the ROC curve in the previous subsection: those thresholds were not chosen arbitrarily but are exactly the departure probabilities reflected by these leaf nodes. The ROC curve is plotted by using these departure probabilities as thresholds and computing the hit rate (TPR) and false alarm rate (FPR) at each threshold. Since several leaf nodes here have a departure probability of 100%, this also explains why in Section 5.2.2 the hit rate was still 24.7% with 100% as the threshold.
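You can check this connection yourself with the sketch below, which reuses y_pred_proba and thres from Section 5.2.2. The distinct predicted departure probabilities correspond to the leaf nodes; the thresholds returned by roc_curve() are essentially these values plus the extra "max + 1" threshold in the first row (roc_curve() may also drop some intermediate thresholds by default):

import numpy as np

# Distinct predicted departure probabilities = departure probabilities of the leaf nodes.
leaf_probs = np.unique(y_pred_proba[:, 1])
print(np.round(leaf_probs, 4))

# Thresholds used when plotting the ROC curve (the first one is max probability + 1).
print(np.round(thres, 4))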

This figure makes the operating logic of the decision tree easier to understand: when a new data point arrives, it is judged starting from the root node at the top. If it satisfies satisfaction <= 4.65 it goes to the left node for a further series of judgments, otherwise to the right node, until the new data point ends up in one of the leaf nodes, thus completing the prediction.

Supplementary Knowledge: Decision Tree Visualization - Installation and Use of graphviz Plugin
Here the decision tree visualization technique, i.e., the installation and use of the graphviz plugin, is briefly explained for interested readers. The detailed principles can be found in the PDF document included with the book's source code files; here we only explain the core points.
1. Plug-in installation
First you need to install the graphviz plugin; the download address is /download/. Taking the Windows version as an example, on the download page select the item shown in the box below: Stable 2.38 Windows install packages.
[Figure: graphviz download page]
Then download the graphviz-2.38 msi file as shown below and click it to install. Note the installation path, which will be needed later when configuring the environment variables; the default path is usually C:\Program Files (x86)\Graphviz2.38\, and it is recommended not to change it.
[Figure: downloading the msi installer]
2. Install the graphviz library
Taking Windows as an example, press the Win + R key combination to open the Run box, type cmd and click OK, then in the window that pops up type pip install graphviz, press Enter and wait for the installation to finish. If you are using the Jupyter Notebook editor, you can instead run !pip install graphviz in a code cell.

3. Use of the library
A visual decision tree model can be generated with the following code:

from sklearn.tree import export_graphviz
import graphviz
import os
os.environ['PATH'] += os.pathsep + r'C:\Program Files (x86)\Graphviz2.38\bin'  # Add the graphviz bin folder to the PATH for this script
dot_data = export_graphviz(model, out_file=None, class_names=['0', '1'])  # Convert the trained model into a string in dot format
graph = graphviz.Source(dot_data)  # Turn dot_data into a visualizable graph object
graph.render("result")  # Export to PDF

The first 2 lines import the libraries needed to use graphviz;
Lines 3 and 4 set the environment variable so that the graphviz plugin can actually be used from Python. This is the manual way of setting the environment variable (to set it for the whole system rather than a single code file, refer to the PDF document attached to this chapter's source code files). The graphviz installation path here is C:\Program Files (x86)\Graphviz2.38; configuring the environment variable means adding the bin folder under that installation path to the runtime environment, so if graphviz was installed elsewhere, simply adjust the path accordingly.

Line 5 converts the previously built decision tree model into string format via the export_graphviz() method and assigns it to dot_data; note that the out_file parameter must be set to None so that the result is returned as a string;

Line 6 then converts dot_data into a visualizable graph object;
Line 7 exports the image via the render() method. The code above outputs a PDF file named result by default, which is shown below:
(Figure: the exported decision tree, result.pdf)
In the figure, X[1] represents the 2nd feature variable (satisfaction), X[3] the 4th feature variable (number of projects) and X[5] the 6th feature variable (length of service); gini is the Gini coefficient of the node; samples is the number of samples in the node (for example, the first node, i.e. the root node, contains 12,000 samples, the size of the training set); value gives the number of samples in each category (in the root node, the 9,120 on the left is the number of employees who do not leave and 2,880 the number who leave); and class=0 means the node is classified as a non-leaving node.
If you want to generate the colorful visualization with Chinese labels shown at the beginning of this subsection (graphviz does not support Chinese directly, so some special handling is needed), you can use the following code. It looks a little complicated, but to use it you only need to adjust three arguments of the export_graphviz() function on line 7: model (the model name), feature_names (the feature names) and class_names (the category names).

from sklearn.tree import export_graphviz
import graphviz
import os  # The next line configures the environment variable that lets Python find the graphviz plugin
os.environ['PATH'] += os.pathsep + r'C:\Program Files (x86)\Graphviz2.38\bin'

# Generate dot_data
dot_data = export_graphviz(model, out_file=None, feature_names=X_train.columns, class_names=['No separation', 'Separation'], rounded=True, filled=True)  # rounded is related to the font, filled enables color filling

# Write the generated dot_data into a txt file
f = open('dot_data.txt', 'w')
f.write(dot_data)
f.close()

# Modify the font setting to avoid garbled Chinese characters
import re
f_old = open('dot_data.txt', 'r')
f_new = open('dot_data_new.txt', 'w', encoding='utf-8')
for line in f_old:
    if 'fontname' in line:
        font_re = 'fontname=(.*?)]'
        old_font = re.findall(font_re, line)[0]
        line = line.replace(old_font, 'SimHei')
    f_new.write(line)
f_old.close()
f_new.close()

# Store the generated visualization as a PNG image
os.system('dot -Tpng dot_data_new.txt -o DecisionTreeModel.png')
print('DecisionTreeModel.png has been saved in the folder where the code is located!')

# Store the generated visualization as a PDF
os.system('dot -Tpdf dot_data_new.txt -o DecisionTreeModel.pdf')
print('DecisionTreeModel.pdf has been saved in the folder where the code is located!')

5.3 Parameter Tuning - K-fold Cross Validation & GridSearch Grid Search

Every machine learning model has built-in parameters. For the decision tree model, a very important one mentioned above is max_depth (the maximum depth of the tree); such parameters are also called hyperparameters. Besides max_depth, the decision tree model has other commonly used hyperparameters, such as criterion (the feature selection criterion) and min_samples_leaf (the minimum number of samples in a leaf node). For more parameters, refer to the supplementary knowledge point in this section: hyperparameters of decision tree models.
In most cases the model's default parameters already give reasonably good results and prediction accuracy; however, to make the model even better, the hyperparameters need to be tuned. For example, should max_depth be 3, or the default value None (no depth limit, i.e. splitting continues until every leaf node has a Gini coefficient of 0)? If max_depth is too small the model may underfit, while if it is too large the model easily overfits (see section 4.2.3 in the previous chapter for underfitting and overfitting), so a means of tuning these parameters is needed.
This section introduces a common approach to tuning model parameters: GridSearch grid search, together with the K-fold cross-validation it relies on. Before learning about GridSearch, let's first understand the basic principle of K-fold cross-validation.

5.3.1 K-fold cross validation

In machine learning, because the training and test sets are divided randomly, we sometimes reuse the data repeatedly in order to evaluate the model more reliably and to select the best model; this practice is called cross-validation. Specifically, the original sample data is split and recombined into multiple different training and test sets, with the training set used to train the model and the test set used to evaluate it. Data that is in the training set one time may be in the test set the next time, hence the name cross-validation.

Cross-validation methods fall into three categories: simple cross-validation, K-fold cross-validation and leave-one-out cross-validation. Among them K-fold cross-validation is the most widely used: the data set is randomly divided into K equal parts; each time, K-1 parts are used as the training set to train the model and the remaining part as the test set; after obtaining K models, the average test performance of the K models is taken as the final estimate of the model's performance.

For example, the diagram below illustrates 3-fold cross-validation: the data is randomly divided into 3 equal parts; each time, 2 parts are used as the training set and the remaining part as the test set; this is repeated 3 times, and looking at the 3 different test results together gives a more accurate assessment of the model.
(Figure: 3-fold cross-validation)
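As a minimal sketch of this idea (not the book's code), the 3 folds can also be generated manually with scikit-learn's KFold, assuming X and y are the feature and target data of the employee departure case used earlier in this chapter:

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

kf = KFold(n_splits=3, shuffle=True, random_state=123)  # randomly divide the data into 3 equal parts
for train_index, test_index in kf.split(X):
    X_train_cv, X_test_cv = X.iloc[train_index], X.iloc[test_index]
    y_train_cv, y_test_cv = y.iloc[train_index], y.iloc[test_index]
    clf_cv = DecisionTreeClassifier(max_depth=3)
    clf_cv.fit(X_train_cv, y_train_cv)
    print(accuracy_score(y_test_cv, clf_cv.predict(X_test_cv)))  # test score of each fold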
In general, if the training dataset is relatively small, the value of k is increased so that more data is used for model training in each iteration, at the cost of a longer running time; if the training set is relatively large, the value of k is decreased to reduce the computational cost of repeatedly fitting and evaluating the model on different data blocks, while the average performance still gives an accurate evaluation of the model.
Besides evaluating the model more accurately, another important role of cross-validation is to use these more reliable evaluation results to tune the model's parameters; it is often used together with GridSearch grid search, which is covered in the next subsection.

Additional knowledge: code implementation of K-fold validation
K-fold cross validation can be achieved with the following code and the score of each validation can be obtained:

from sklearn.model_selection import cross_val_score
acc = cross_val_score(model, X, y, cv=5)

The first line of code imports the cross-validation function cross_val_score(); the second line performs cross-validation with it, passing in the model (model), the feature variable data (X), the target variable data (y) and the number of folds (cv). Here cv=5 means 5-fold cross-validation: each time 4/5 of the data is randomly taken for training and the remaining 1/5 for testing (if this parameter is not specified, cv defaults to 3 in older versions of Scikit-Learn). The scoring parameter is not set here, so the default value 'accuracy' is used for scoring.
Printing acc shows the scores obtained from the 5 rounds of cross-validation:

array([0.96666667, 0.96066667, 0.959     , 0.96233333, 0.91366667])

The average of these five scores can be obtained by running the following code:

acc.mean()

The following results were obtained:

0.9524666666666667

The cross-validation above uses accuracy as the evaluation criterion by default; to use the AUC value of the ROC curve as the scoring criterion instead, set the scoring parameter to 'roc_auc' as follows:

acc = cross_val_score(model, X, y, scoring='roc_auc', cv=5)

Once GridSearch grid search has been covered, we will actually no longer need to call the cross_val_score() function directly: the GridSearchCV() function performs cross-validation and parameter tuning at the same time, so a basic understanding of cross_val_score() is sufficient.

5.3.2 GridSearch grid search (parameter tuning)

Grid search is an exhaustive-search approach to parameter tuning: it iterates over all candidate parameter values, builds and evaluates the model for each, and selects the best-performing value as the final result. Taking the decision tree's maximum depth max_depth as an example, we can traverse the candidate values [1, 3, 5, 7, 9] and, using accuracy or the AUC value of the ROC curve as the criterion, search for the most suitable max_depth. If several parameters are tuned at the same time, for example one parameter with four candidate values and another with five, all the combinations can be laid out as a 4 x 5 grid; the traversal process is like searching within this grid, hence the name GridSearch (grid search).

1. Parameter tuning with a single parameter
Here we first demonstrate how parameter tuning works with a grid search over a single parameter (max_depth), using the GridSearchCV() method from the Scikit-Learn library to tune the decision tree model built above.

from sklearn.model_selection import GridSearchCV  
parameters = {'max_depth': [1, 3, 5, 7, 9]}  
model = DecisionTreeClassifier()  
grid_search = GridSearchCV(model, parameters, scoring='roc_auc', cv=5) 

The first line of code imports the GridSearchCV() method from the Scikit-Learn library;
The 2nd line specifies the candidate range of max_depth, the parameter of the decision tree model to be tuned;
Line 3 constructs the decision tree model and assigns it to the variable model;
Line 4 passes the classifier and the candidate parameter values into GridSearchCV() and sets cv=5, i.e. 5-fold cross-validation (the default is 3 in older versions). Setting scoring='roc_auc' means the AUC value of the ROC curve is used as the evaluation criterion of the model; if it is not set, the default value 'accuracy' is used.
Below we pass the data into the grid search model and output the optimal values of the parameters:

grid_search.fit(X_train, y_train)  # Pass in the training set data and start parameter tuning
grid_search.best_params_  # Output the optimal parameter values

Because we specified 5 candidate values for max_depth and 5-fold cross-validation (cv=5), the model is run 5 times for each candidate (25 runs in total); each candidate obtains an average score from the 5-fold cross-validation, the candidates are ranked by this average score, and the optimal parameter value is obtained as follows:

{'max_depth': 7}

That is, for this case the decision tree performs best when its depth is set to 7, as that value yields the highest average score in the 5-fold cross-validation.
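Besides best_params_, the fitted GridSearchCV object exposes a few other attributes that are convenient for inspecting the tuning results. A minimal sketch (not the book's code), assuming grid_search has been fitted as above:

print(grid_search.best_score_)  # best average cross-validation score (here the mean AUC of the best candidate)
print(grid_search.best_estimator_)  # the model refitted on all training data with the best parameters
print(grid_search.cv_results_['mean_test_score'])  # mean test score of every candidate, in the order listed in parameters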

Supplementary knowledge: batch generation of data required for tuning parameters
In addition, if you do not want to type the candidate values one by one, you can use the np.arange() function from section 2.1.3. For example, the following code constructs the values from 1 to 9 with an interval of 2 (in np.arange(1, 10, 2), the first element, 1, is the starting position; the second element, 10, is the ending position, which is not included because the range is left-closed and right-open, so it could also be written as 11; the third element, 2, is the step, i.e. the interval; set it to 1 for denser values).

import numpy as np
np.arange(1, 10, 2)  # Non-Jupyter Notebook editors need the print function to print them out

This way the 2nd line of code in the previous parameter tuning can be abbreviated as:

parameters = {'max_depth': np.arange(1, 10, 2)}

2. Effectiveness test of parameter tuning
Next we rebuild the model with the new parameter and check whether the tuning improved the model by looking at the new model's prediction accuracy and the AUC value of its ROC curve.
The decision tree model is first rebuilt and the training set data is passed into it:

model = DecisionTreeClassifier(max_depth=7)  # Rebuild the model with max_depth=7
model.fit(X_train, y_train) 

Import the data from the test set into the model for prediction and view the overall prediction accuracy via accuracy_score in the Scikit-Learn library:

y_pred = model.predict(X_test)
from sklearn.metrics import accuracy_score
score = accuracy_score(y_pred, y_test)

Printing score shows that the model's prediction accuracy on the test set is 0.982, i.e. the predictions for 2,946 of the 3,000 test samples match the actual results. The original model's accuracy on the test set was 0.957, so the accuracy has improved after tuning. Note that the accuracy could also have decreased, because during tuning we scored by the AUC value of the ROC curve (scoring='roc_auc') rather than by accuracy.
After checking the prediction accuracy, let's check the AUC value of the ROC curve. First view the predicted probability of each class with the following code:

y_pred_proba = model.predict_proba(X_test)

If you want to simply look at the probability of leaving, you can look at the second column of y_pred_proba, i.e., using the following code:

y_pred_proba[:,1]

The following code is then used to quickly find the AUC value of the model:

from sklearn.metrics import roc_auc_score
score = roc_auc_score(y_test.values, y_pred_proba[:, 1])

Printing score gives an AUC value of 0.987, compared with the original 0.973, confirming that the model's performance did improve after tuning.

Supplementary knowledge: change in feature importance as the depth of the decision tree increases
After tuning, the depth of the decision tree increases from 3 to 7, the tree has more child and leaf nodes, and the importance of the features may change. The importance of each feature can be viewed with the following code:

model.feature_importances_

The printout is as follows, giving the feature importance of each of the six feature variables.

array([0.00059222, 0.52728113, 0.13163818, 0.1116004 , 0.07759157,
       0.1512965 ])
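To see more clearly which importance belongs to which feature, the importances can be paired with the column names. A minimal sketch (not the book's code), assuming X_train is the training-set DataFrame used earlier in this chapter:

import pandas as pd

features = pd.DataFrame()
features['Feature'] = X_train.columns  # feature names
features['Importance'] = model.feature_importances_  # the corresponding importances
features.sort_values('Importance', ascending=False)  # list the most important features first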

3. Multi-parameter tuning
Besides single-parameter tuning, GridSearch can tune multiple parameters at the same time. Below we choose three hyperparameters of the DecisionTreeClassifier() model: max_depth (maximum depth), criterion (feature selection criterion) and min_samples_split (the minimum number of samples required for a node to split), and tune them together with the GridSearchCV() method. The meaning of each parameter can be found in this section's supplementary knowledge point: hyperparameters of decision tree models.

from sklearn.model_selection import GridSearchCV

# Specify the candidate range of each parameter of the decision tree classifier
parameters = {'max_depth': [5, 7, 9, 11, 13], 'criterion': ['gini', 'entropy'], 'min_samples_split': [5, 7, 9, 11, 13, 15]}
# Build the decision tree classifier
model = DecisionTreeClassifier()  # No parameters passed in here because they are being tuned

# Grid search
grid_search = GridSearchCV(model, parameters, scoring='roc_auc', cv=5)
grid_search.fit(X_train, y_train)

# Output the optimal parameter values
grid_search.best_params_

The optimal values of the parameters are as follows:

{'criterion': 'entropy', 'max_depth': 11, 'min_samples_split': 13}

Therefore the model is optimal when criterion is set to information entropy ('entropy'), max_depth is set to 11 and min_samples_split (the minimum number of samples required for a node to split) is set to 13. The model is then constructed as follows:

model = DecisionTreeClassifier(criterion='entropy', max_depth=11, min_samples_split=13)

The prediction accuracy at this point is 0.9823 and the AUC value of the ROC curve is 0.988; both values are improved compared with the results of the earlier single-parameter tuning. Interested readers can verify this themselves; the relevant code is also provided in the accompanying source code.
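For readers who want to run the check themselves, here is a minimal sketch of that verification (not the book's code), assuming X_train, y_train, X_test and y_test are the data split earlier in this chapter:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

model = DecisionTreeClassifier(criterion='entropy', max_depth=11, min_samples_split=13)  # rebuild with the tuned parameters
model.fit(X_train, y_train)

print(accuracy_score(model.predict(X_test), y_test))  # prediction accuracy on the test set
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))  # AUC value of the ROC curve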

Note 1: Difference between multi-parameter tuning and separate single-parameter tuning
Multi-parameter tuning differs from tuning each parameter separately. For example, some readers may, to save time, tune the three parameters above with three separate single-parameter searches and then combine the results; this is actually not rigorous. In single-parameter tuning the other parameters are left at their default values, so combinations in which they do not take default values are never considered; in other words, the joint effect of multiple parameters on the model is ignored. In the code example above, multi-parameter tuning considers 5 × 2 × 6 = 60 combinations, whereas 3 single-parameter tunings consider only 5 + 2 + 6 = 13 possibilities.
Therefore, if only one parameter needs to be tuned, then single parameter tuning can be used, and if more than one parameter needs to be tuned, then multi-parameter tuning is recommended.

Note 2: Parameter values that fall on the boundary of the given range
Another point to note: if a parameter value obtained with the GridSearchCV() method lies on the boundary of the given range, there may be values outside the range that make the model even better, so the range should be extended and the tuning repeated. For example, if the best max_depth obtained with the code above were the largest candidate, 13, the truly suitable max_depth might be even larger; we would then re-set the search grid, e.g. the search range of max_depth to [9, 11, 13, 15, 17], and tune the parameters again.

In addition, if you do not want to type the numbers one by one, you can use the np.arange() function from section 2.1.3. For example, the following code constructs the values from 9 to 17 with an interval of 2 (the first element, 9, is the starting position; the second element, 19, is the ending position, which is not included because the range is left-closed and right-open, so it could also be written as 18; the third element, 2, is the step, i.e. the interval):

np.arange(9, 19, 2)

Supplementary Knowledge 1: Pre- and Post-Pruning of Decision Trees
Here we supplement a knowledge point that often comes up with the decision tree model: decision tree pruning, whose purpose is to prevent the constructed tree from overfitting. Pruning is divided into pre-pruning and post-pruning, defined as follows:
Pre-pruning: pruning from the top down, usually via hyperparameters; for example, limiting the maximum depth of the tree (max_depth) removes the nodes that would lie below that depth.
Post-pruning: pruning from the bottom up, mostly based on business requirements; for example, in a default prediction model, two leaf nodes with default probabilities of 45% and 50% may both be regarded as risky, so the two leaf nodes are merged into one.
In practice pre-pruning is the more widely used of the two, and the parameter tuning above in fact already performs some pre-pruning.

Supplementary Knowledge 2: Hyperparameters for Categorical Decision Tree Models
Here are some of the common hyperparameters of the DecisionTreeClassifier() model for classification and their explanations:
1. criterion: the feature selection criterion; takes the value "entropy" (information entropy) or "gini" (Gini coefficient); the default is "gini".
2. splitter: the split-point selection strategy; takes the value "best" or "random". "best" finds the optimal split point among all split points of a feature and suits smaller sample sizes, while "random" finds a locally optimal split point among a random subset of split points and suits very large sample sizes; the default is "best".
3. max_depth: the maximum depth of the decision tree, take the value of int or None, generally less data or features can not be set, if the data or features are more, you can set the maximum depth of the restrictions. Default is None.
4. min_samples_split: the minimum number of samples needed to split the child nodes downwards, default is 2, if the number of samples in the child nodes is less than this value, then stop splitting.
5. min_samples_leaf: the minimum number of samples for a leaf node, default is 1, if less than this value, the leaf node will be pruned along with its siblings (i.e., cull the leaf node and its siblings, and stop splitting).
6. min_weight_fraction_leaf: the smallest sample weight sum of the leaf node, the default is 0, i.e., do not consider the weight problem, if less than this value, the leaf node will be pruned together with its brother nodes (i.e., eliminate the leaf node and its brother nodes, and stop splitting). If the larger sample has missing values or the distribution category of the sample deviates a lot, the sample weight problem needs to be considered.
7. max_features: the maximum value of the number of feature values to be considered when dividing the nodes, default is None, you can pass in int-type or float-type data. If it is float type data, it means percentage.
8. max_leaf_nodes: the maximum number of leaf nodes, default is None, you can pass in int data.
9. class_weight: specifies the class weights; the default is None. It can be set to "balanced", which gives higher weights to the classes with fewer samples, or a dictionary of weights can be passed in. This parameter is mainly used to prevent the trained tree from being biased toward classes that have too many samples in the training set. Besides class_weight, oversampling and undersampling can also be used to handle class imbalance; they are explained in Chapter 11: Data Preprocessing.
10. random_state: when the data set is large or there are many feature variables, a node split may encounter two feature variables with the same information gain or the same reduction in Gini coefficient; by default the decision tree model then picks one of them at random, which can make the generated tree differ slightly between runs. Setting random_state (e.g. to 123) ensures the split result of every node is the same each time the code is run, which matters more when there are many feature variables and the tree is deep. A sketch using several of these hyperparameters follows.
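To make these hyperparameters more concrete, here is a minimal sketch (not the book's code; the parameter values are purely illustrative) of building a classification decision tree with several of them set explicitly:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    criterion='entropy',  # split on information entropy instead of the default 'gini'
    max_depth=5,  # limit the depth of the tree (a form of pre-pruning)
    min_samples_split=10,  # a node needs at least 10 samples before it may split
    min_samples_leaf=5,  # every leaf must keep at least 5 samples
    class_weight='balanced',  # give the minority class a higher weight
    random_state=123  # make the generated tree reproducible
)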

Supplementary Knowledge 3: Application of Tree Models in the Financial Big Data Risk Control Domain
Besides the employee departure prediction model, the decision tree model has plenty of applications in financial big-data risk control. Taking a bank's credit default prediction model as an example, logistic regression models and decision tree models are both commonly used.
Advantages and disadvantages of logistic regression models: they do not need many variables, are not prone to overfitting and generalize well, so the model may only need to be refreshed once a year; however, logistic regression is sometimes not precise enough to effectively weed out potential defaulters.
Advantages and disadvantages of tree models (decision trees, random forests, XGBoost and similar models): they are less stable (a variable can be used over and over again), prone to overfitting and weaker in generalization, so after a while the model may no longer work; however, they fit strongly, discriminate well, and can quickly pick out the bad customers.
Therefore, in practice a logistic-regression-based scorecard model is used as the foundation (stable, updated every six months to a year, but not precise enough and with a limited KS value), supplemented by tree models such as decision trees (less stable, possibly needing monthly updates, but strong-fitting and highly discriminating, able to pick out the bad customers quickly in the first pass).

To summarize, the decision tree model, as a classic machine learning algorithm, has unique advantages, such as insensitivity to outliers and strong interpretability, but it also has shortcomings, such as unstable results and a tendency to overfit. More importantly, the decision tree is the basis of many important ensemble models, such as random forest, AdaBoost, GBDT, XGBoost and LightGBM, which are all built on top of decision trees, so the decision tree model must be mastered well.

5.3.3 Pre- and Post-Pruning of Decision Trees

Pre-pruning: pruning from the top down using parameters, e.g. limiting the maximum depth of the tree, setting thresholds, and so on.
Pre-pruning takes place during tree construction (only the training set is used): a threshold is set (for example, a minimum number of samples in a node, or a minimum reduction in the Gini index), and the current node is split only when the improvement between before and after the split exceeds this threshold; otherwise no split is performed. An example follows.
(Figure: the decision tree model constructed earlier)
This is the decision tree model we constructed earlier. Suppose we require a node to contain at least 150 samples before it may split; then during subsequent growth a node with fewer than 150 samples stops splitting and remains a leaf node.
Construct a new decision tree model and specify that a node must contain at least 150 samples before it may split; the rest of the code is the same as before and is not shown here.

model = DecisionTreeClassifier(min_samples_split=150)

After visualizing the decision tree model, only a portion of it is shown here, because the tree generated with this parameter is too large to display in full.
(Figure: part of the decision tree generated with min_samples_split=150)
We can see that the node circled in the first figure has remained a leaf node during subsequent growth and no longer generates new children, because of the threshold set on the minimum number of samples required for splitting. We also find that in the fourth layer there are some nodes whose sample counts are below the threshold and that have likewise stopped splitting.

Post-pruning: pruning from the bottom up, i.e. modifying the decision tree after it has already been generated. Here we explain one pessimistic pruning method, Pessimistic Error Pruning (PEP), in detail.
PEP pruning replaces a whole subtree with a single leaf node (that is, with the root node of that subtree).
For a leaf that misclassifies E of its N samples, PEP takes the empirical error rate of the leaf to be (E + 0.5)/N, where 0.5 is a correction factor. For a subtree with L leaves, the error count and the instance count of the subtree are the sums of the error counts and instance counts of its leaves, so the error rate of the subtree is

e = (E_1 + E_2 + ... + E_L + 0.5 × L) / (N_1 + N_2 + ... + N_L)

Whether a leaf node may replace a subtree depends on a condition. Let A be the number of misjudgments made by the candidate replacement node; adding the same 0.5 correction, the replacement criterion is

A + 0.5 ≤ (E_1 + ... + E_L + 0.5 × L) + std

where std is the standard deviation of the subtree's error count. During model construction the number of errors made by the subtree can be regarded as a random variable that approximately follows a binomial distribution, so the standard deviation can be computed from the binomial formula and used to decide whether the branch should be cut. The subtree contains N instances, i.e. N trials, each of which errs with probability e, so the error count follows the binomial distribution B(N, e); its mean is Ne, its variance is Ne(1 − e), and the standard deviation is the square root of the variance:

std = sqrt(N × e × (1 − e))
Let's look at an example.
(Figure: an example subtree rooted at T1; each node is labelled with two numbers, from left to right the number of correctly classified and misclassified samples in that node)
We want to try to replace the entire subtree with its root node T1. The subtree has three leaf nodes, i.e. L = 3, and contains 16 samples in total, i.e. N = 16, of which 12 are classified correctly and 4 are misclassified. The subtree error rate is therefore

e = (4 + 0.5 × 3) / 16 = 5.5 / 16 ≈ 0.34

and the standard deviation is

std = sqrt(16 × 0.34 × (1 − 0.34)) ≈ 1.90

where the subtree's error count, 4, is the total number of errors over all of its leaves. If the subtree is replaced by its root node, the root node's number of misjudgments A (read from the figure) plus the 0.5 correction turns out to be smaller than 5.5 + 1.90 ≈ 7.4; because the condition A + 0.5 ≤ ΣE_i + 0.5L + std is therefore fulfilled, we can replace this subtree with its root node.
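To make the criterion concrete, here is a minimal sketch (not the book's code) of the PEP pruning test described above. The per-leaf error and sample counts below are hypothetical values consistent with the totals in the example (the real ones are read from the figure).

import math

def pep_should_prune(leaf_errors, leaf_samples, replacement_errors):
    L = len(leaf_errors)  # number of leaves in the subtree
    E = sum(leaf_errors) + 0.5 * L  # corrected error count of the subtree
    N = sum(leaf_samples)  # total number of samples covered by the subtree
    e = E / N  # corrected error rate of the subtree
    std = math.sqrt(N * e * (1 - e))  # binomial standard deviation of the error count
    return replacement_errors + 0.5 <= E + std  # PEP replacement criterion

# Hypothetical per-leaf counts (3 leaves, 16 samples, 4 errors) and a replacement node with 4 misjudgments
print(pep_should_prune(leaf_errors=[1, 2, 1], leaf_samples=[6, 6, 4], replacement_errors=4))  # True: the subtree may be pruned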

5.4 Case Practice: Bank Customer Default Prediction Modeling

In this section we learn how the decision tree model is applied in finance by building a customer default prediction model; along the way we explain several methods for measuring how well a model predicts, and finally we present the decision tree model visually.

5.4.1 Model construction

The purpose of the customer default prediction model is to build a suitable model from existing customer information and default records, so as to predict whether a customer will default in the future. First we read in the customers' credit data and their repayment performance, i.e. whether they defaulted, using the pandas data-reading knowledge from section 6.2.2; the code is as follows:

import pandas as pd
df = pd.read_excel('Customer Information and Default Performance.xlsx')

The result is shown in the table below. There are 1,000 rows of historical data, of which the first 400 belong to defaulting customers and the last 600 to non-defaulting customers. Because mathematical models in Python cannot work with text directly, the "Gender" and "Whether in default" columns have already been converted to numbers: in the "Gender" column 0 means male and 1 means female, and in the "Whether in default" column 0 means no default and 1 means default. Our goal is to build a decision tree model from these historical data to predict how likely subsequent customers are to default.
(Figure: the first few rows of the customer data)
"Whether in default" is the target variable and the remaining fields are the feature variables: a borrower's characteristics are used to judge whether he or she will default. For ease of demonstration only five feature variables are selected here; in commercial practice many more feature variables are used. Next comes the construction of the decision tree model, which also covers most of the usual steps in building a machine learning model.

1. Extract feature and target variables
The feature variables and the target variable are extracted separately with the following code:

X = df.drop(columns='Whether in default') 
y = df['Whether in default']  

The drop() function described in subsection 6.2.3 removes the "Whether in default" column, and the remaining data is assigned to the variable X as the feature variables. Another way to remove a column is df.drop('Whether in default', axis=1), where axis=1 means operating by column. The "Whether in default" column is then extracted as the target variable via DataFrame column selection and assigned to the variable y. Alternatively, the feature columns can be selected explicitly by name, as shown below:

df.columns  # View all column names

X = df[['Income', 'Age', 'Gender', 'Historical credit limits', 'Number of historical defaults']]  # Select the feature columns explicitly by name
X

2. Divide the training set and test set

# A simple positional split: the first 800 rows as the training set, the last 200 as the test set
X_train = X[0:800]
y_train = y[0:800]

X_test = X[800:]
y_test = y[800:]

After extracting the feature variables, the original 1,000 rows of data need to be split into a training set and a test set. As the names suggest, the training set is used to train the model and the test set is used to test the results of that training.
The split ratio is usually chosen according to the sample size: when the sample size is large, a larger share can be given to the training set; for example, with 100,000 rows of data the training and test sets can be split 9:1. Here there are only 1,000 rows, so we split them 8:2.
The code for dividing the training and test sets is as follows:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

The first line of code imports the train_test_split() function from the Scikit-Learn library. The second line uses it to split the data into a training set and a test set, where X_train and y_train are the training set data and X_test and y_test are the test set data. The first two arguments of train_test_split(), X and y, are the feature variables and target variable extracted earlier; test_size is the proportion of data assigned to the test set, here 20%, i.e. 0.2. The data is split randomly; the split data can be printed out as shown below:
(Figure: the divided training and test set data)
Because train_test_split() splits the data randomly each time the program runs, you can set the random_state parameter if you want every split to produce the same content:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

The number 1 passed to random_state has no special meaning and can be replaced by any other number; it simply acts as a seed so that the data is split into the same content each time.

3. Model training and construction (core)
After dividing into training set and test set, the decision tree model can be introduced from Scikit-Learn library for model training, the code is as follows:

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=3)
clf = clf.fit(X_train, y_train) 

The first line of code imports the classification decision tree model DecisionTreeClassifier from the Scikit-Learn library. The second line assigns the decision tree model to the variable clf and sets the model parameter max_depth to 3, i.e. the maximum depth of the tree is 3; this parameter was explained earlier in this chapter. The third line trains the model via the fit() method, passing in the training set data obtained in the previous step.
At this point a decision tree model has been built. The model-building code is summarized below; as you can see, the code itself is not complex, and the main task is to understand the principles behind it so that the modelling process is clear in your mind.

import pandas as pd
df = pd.read_excel('Customer Information and Default Performance.xlsx')
# 1.Extracting feature and target variables
X= df.drop(columns='Whether in default') 
y = df['Whether in default']   
# 2.Divide the training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# 3.Model Training and Construction
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=3)
clf = clf.fit(X_train, y_train)    

Once the model is built it can be used to make predictions. This is where the previously separated test set comes into play: we use it to make predictions and to evaluate the model's predictive performance.

5.4.2 Model predictions and assessment

This subsection describes how to predict default directly, how to predict the probabilities of non-default & default, and finally how to evaluate a model reasonably.
1. Predicting default directly

y_pred = clf.predict([[0, 10, 1, 10, 1]])  # Predict a single hypothetical customer; the five values follow the order of the feature columns
y_pred

y_pred = clf.predict([[1000000000000, 30, 1, 100000, 0], [1000000000000, 10, 1, 100000, 0]])  # Predict two hypothetical customers at once
y_pred

The purpose of building the model is to use it to predict data. Here the test set data is fed into the model for prediction with the following code, where clf is the decision tree model built in the previous subsection.

y_pred = clf.predict(X_test)

The predicted y_pred is shown below; 0 and 1 are the predicted outcomes, where 0 means no default is predicted and 1 means default is predicted.
(Figure: the predicted values y_pred)
Using the DataFrame-creation knowledge from subsection 6.2.1, the predicted y_pred and the actual y_test of the test set can be put side by side. Since y_pred is a one-dimensional array and y_test is a one-dimensional Series, both are converted to lists with the list() function; the code is as follows:

a = pd.DataFrame()  # Create an empty DataFrame
a['Predicted value'] = list(y_pred)
a['Actual value'] = list(y_test)

The last five rows of the generated DataFrame can be printed with print(a.tail()), as shown below.
(Figure: predicted vs. actual values for the last five rows)
It can be seen that all the data in the test set have been predicted, and the prediction accuracy on these last five rows is 80%. To see the overall prediction accuracy, use the following code:

from sklearn.metrics import accuracy_score
score = accuracy_score(y_pred, y_test)

Printing score shows that the model's prediction accuracy on the whole test set is 0.825, i.e. the predictions for 165 of the 200 test samples match the actual results.

2. Predicting the probabilities of non-default & default
In fact, what the classification decision tree essentially predicts is not an exact 0 or 1 but the probability of belonging to each class. The predicted probability of each class can be viewed with the following code:

y_pred_proba = clf.predict_proba(X_test)

y_pred_proba is the predicted probability of each class and is a two-dimensional array; the table below shows the non-default & default probabilities for the last five rows. The first column is the probability of the first outcome, 0, i.e. non-default, and the second column is the probability of the second outcome, 1, i.e. default; the two probabilities in each row sum to 1.
(Figure: predicted probabilities of non-default & default for the last five rows)
When default was predicted directly earlier, the model was essentially choosing the class with the larger probability. In the last row, for example, the default probability is 0.75, which is larger than the non-default probability of 0.25, so the customer is predicted to default. For a binary classification problem the default threshold is 0.5, because if one class's probability exceeds 0.5 it is necessarily larger than the other's; in practical applications the threshold can be adjusted as needed, for example by treating any customer whose default probability exceeds 0.3 as a likely defaulter.
Careful readers may have noticed that some of the probabilities above are identical, e.g. the non-default probabilities of the two customers in the middle are both 0.56; the model visualization in the next subsection will make clear how these probabilities are calculated.
To view only the default probability, i.e. the second column of y_pred_proba, use the following code. This selects a column of a two-dimensional array: the ":" before the comma selects all rows, and the number 1 after the comma selects the second column; changing the 1 to 0 would extract the first column, the non-default probability.

y_pred_proba[:,1]
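As an aside, the threshold adjustment mentioned above can be written in one line. A minimal sketch (not the book's code), assuming y_pred_proba has been computed as above:

y_pred_30 = (y_pred_proba[:, 1] > 0.3).astype(int)  # treat a default probability above 0.3 as a predicted default
y_pred_30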

3. Assessment of the effectiveness of model predictions
Accuracy was used above to measure the model's predictive performance, but in business practice accuracy is generally not used as the criterion for evaluating a model, because it can be unreliable. For example, if 10 out of 100 customers default and the model simply predicts that no customer will default, it fails to pick out a single defaulting customer, yet its prediction accuracy still reaches 90%; such a high accuracy obviously says nothing about the model's quality. In business practice we care more about the following two indicators:

True positive rate (hit rate): TPR = TP / (TP + FN)
False positive rate (false alarm rate): FPR = FP / (FP + TN)

The meanings of TP, FP, TN and FN come from the confusion matrix: TP is an actual defaulter predicted to default, FN is an actual defaulter predicted not to default, FP is an actual non-defaulter predicted to default, and TN is an actual non-defaulter predicted not to default.
The true positive rate is the proportion of all customers who actually default that the model predicts to default (the hit rate for catching bad customers), also called the recall; the false positive rate is the proportion of all customers who do not actually default that the model nevertheless predicts to default (the false alarm rate for wrongly flagging good customers).
Take the example above: 10 of the 100 customers default and the model predicts that no customer will default. As shown below, the model's false alarm rate (FPR) is 0, i.e. not a single good customer is wrongly flagged, but its hit rate (TPR) is also 0, i.e. not a single bad customer is caught, so even a 90% prediction accuracy is meaningless here.
(For this example: TP = 0, FN = 10, FP = 0, TN = 90.)
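As an illustration of these definitions, here is a minimal sketch (not the book's code) that reproduces the example above with scikit-learn's confusion_matrix: 100 customers, 10 actual defaulters, and a model that predicts that no one defaults.

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1] * 10 + [0] * 90)  # 1 = default, 0 = no default
y_hat = np.zeros(100, dtype=int)  # the model predicts "no default" for everyone

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
tpr = tp / (tp + fn)  # hit rate: 0/10 = 0
fpr = fp / (fp + tn)  # false alarm rate: 0/90 = 0
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 90/100 = 0.9
print(tpr, fpr, accuracy)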
For a good customer default prediction model, we want the hit rate (TPR) to be as high as possible, i.e. to catch as many bad customers as possible, and the false alarm rate (FPR) to be as low as possible, i.e. not to wrongly flag good customers. The two, however, tend to move together: if the threshold is raised, e.g. a customer is only treated as a defaulter when the predicted default probability exceeds 90%, the false alarm rate will be very low, but so will the hit rate; if the threshold is lowered, e.g. to 10%, the hit rate will be very high, but so will the false alarm rate. To measure a model's quality, data scientists therefore plot the hit rate against the false alarm rate at different thresholds, producing the ROC curve shown below.
(Figure: ROC curve)
The horizontal axis of the ROC curve is the false alarm rate (FPR) and the vertical axis is the hit rate (TPR); under the same threshold conditions we want the hit rate to be as high as possible and the false alarm rate as low as possible.
For example, suppose a test has 100 samples in total, of which 20 customers default, and at a threshold of 20% (i.e. a customer is treated as a defaulter when the predicted default probability exceeds 20%) both model A and model B predict 40 defaulters. If 20 of the 40 customers flagged by model A really default, its hit rate is 20/20 = 100% and its false alarm rate is 20/80 = 25%; if only 15 of the 40 flagged by model B really default, its hit rate is 15/20 = 75% and its false alarm rate is 25/80 = 31.25%. Model A is then the better model at this threshold. So for different models, we want a higher hit rate and a lower false alarm rate under the same threshold conditions.

If the false alarm rate is viewed as the cost and the hit rate as the benefit, then under the same threshold we want the false alarm rate (cost) to be as small as possible and the hit rate (benefit) as large as possible. On the ROC curve this means the curve should be as steep as possible: the closer the curve is to the upper-left corner, the higher the hit rate and the lower the false alarm rate at the same threshold, and the better the model. Put another way, a perfect model would have a false alarm rate (FPR) close to 0 and a hit rate (TPR) close to 1 at every threshold, which shows up on the plot as a curve hugging the point (0, 1), i.e. a very steep curve.

The quality of a model can also be measured numerically with the AUC value (Area Under Curve), the area under the ROC curve. This area usually ranges from 0.5 to 1, where 0.5 corresponds to random guessing and 1 to a perfect model. In commercial practice, where there are many disturbing factors, an AUC of 0.75 or above is generally acceptable, and 0.85 or above indicates a very good model. In the Python implementation, the following code computes the false alarm rate (FPR) and hit rate (TPR) at different thresholds, from which the ROC curve can be plotted.

from sklearn.metrics import roc_curve
fpr, tpr, thres = roc_curve(y_test.values, y_pred_proba[:,1])

The first line of code imports the roc_curve() function. The second line passes in the actual values of the test-set target variable, y_test, and the predicted default probabilities; roc_curve() calculates the false alarm rate and hit rate at different thresholds and assigns the results to the variables fpr, tpr and thres, each of which is a one-dimensional array.
The three can be merged into a two-dimensional data table with the following code:

a = pd.DataFrame()  # Create an empty DataFrame
a['Threshold'] = list(thres)
a['False alarm rate'] = list(fpr)
a['Hit rate'] = list(tpr)

At this point it is possible to view the false alarm and hit rates at different thresholds, as shown in the table below:
(Table figure: threshold, false alarm rate and hit rate)
It can be seen that the higher the threshold, the lower the false alarm rate, but the corresponding hit rate decreases.
Knowing the false alarm rate and hit rate at different thresholds, the ROC curve can be plotted using the data visualization knowledge from section 7.2, with the following code:

import matplotlib.pyplot as plt
plt.plot(fpr, tpr)
plt.show()

The plotted ROC curve is shown below:
(Figure: the plotted ROC curve)
The AUC value of the model can also be found quickly with the following code:

from sklearn.metrics import roc_auc_score
score = roc_auc_score(y_test.values, y_pred_proba[:,1])

The first line of code imports the roc_auc_score() function; the second passes in the actual values of the test-set target variable, y_test, and the predicted default probabilities. The resulting AUC value, printed out, is 0.846, which can be considered a fairly good prediction.

5.4.3 Visual presentation of the model

If you want to display the decision tree model visually, you can use the graphviz plugin for Python; the figure below shows the decision tree model trained above. Because model visualization is mainly used for demonstration and teaching rather than in real practice, interested readers can refer to the supplementary knowledge point on installing and using graphviz earlier in this chapter, or to the following URL: /docs/lUYMJX0TEjoncFZk /.

(Figure: the visualized decision tree model)

It can be seen that the tree has only 3 levels below the initial node, which corresponds to the model parameter max_depth, the maximum depth of the tree, set when the model was built. One reason for setting max_depth is to make the demonstration easier; the other is that if the tree is too deep, the model overfits and its predictive performance declines. Overfitting means that the model fits the training samples too closely and therefore performs poorly on the test data; the opposite is underfitting, where the model fails to fit the data well, the data lie far from the fitted curve, and the model does not capture the data's features well enough to fit it properly. The figure below gives a more intuitive picture of overfitting and underfitting.
(Figure: overfitting vs. underfitting)

To recap the meaning of each element in the nodes of the figure: apart from the leaf nodes, every node contains four items: the splitting rule, gini (the node's current Gini coefficient), samples (the current number of samples) and value (the number of samples in each category). Taking the root node as an example, its splitting rule is whether the number of historical defaults is less than 0.5, its current Gini coefficient is 0.482, and it contains 800 samples in total; the left value, 475, is the number of customers whose "Whether in default" is 0, i.e. non-defaulting customers, and the right value, 325, is the number whose "Whether in default" is 1, i.e. defaulting customers.
After the root node splits, two child nodes are created. The left child contains mostly non-defaulting customers (484 in total, of which 381 do not default and 103 default), while the right child contains mostly defaulting customers (316 in total, of which 94 do not default and 222 default); this matches the intuition that customers with fewer historical defaults are less likely to default. Using the knowledge from subsection 5.1.2, the Gini coefficient of the system after the root-node split is 484/800 × 0.335 + 316/800 × 0.418 ≈ 0.3678. This split is the optimum found by the machine through repeated training and calculation; splitting the root node in any other way would give a larger system Gini coefficient. A quick numerical check of this weighted Gini calculation is sketched below.
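A minimal sketch (not the book's code) that recomputes the child-node Gini coefficients and the weighted Gini coefficient of the system from the class counts shown in the visualized tree:

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

left = [381, 103]  # non-default / default counts in the left child node
right = [94, 222]  # non-default / default counts in the right child node

weighted = 484 / 800 * gini(left) + 316 / 800 * gini(right)
print(round(gini(left), 3), round(gini(right), 3), round(weighted, 4))  # ≈ 0.335, 0.418, 0.3678

The optimality of this split can also be cross-checked by calculating the feature importance of the feature variables with the following code: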

clf.feature_importances_

The printouts are as follows, corresponding to the feature importance of each of the five feature variables.

array([0.        , 0.35084118, 0.        , 0.15794548, 0.49121334])

It can be seen that the last feature variable, the number of historical defaults, has the highest importance here. This also explains why the first feature variable (income) and the third (gender) have a feature importance of 0: these two variables play no role in the model. The same can be seen in the visualization, where no node is split on either of these two feature variables. If max_depth were set larger so that the tree could keep splitting downward, these two variables might come into play and their importance would no longer be 0. Some readers may also wonder why the importance of income is lower than that of age; this is because the data in this case come from premium customers, i.e. a higher-income group, so the income gap between defaulting and non-defaulting customers is not large and the feature importance of income is therefore not high.
Some of the final leaf nodes have finished splitting; for example, the Gini coefficients of the leaf nodes in the lower-left and lower-right corners are 0, meaning these leaves are already pure (all of their samples belong to the same class) and need no further splitting. Leaves whose Gini coefficients have not reached 0 stop splitting because the maximum depth of the tree is limited to 3, and leaf nodes show no splitting rule because they do not split any further. The non-default & default probabilities mentioned in subsection 5.4.2 are calculated from the leaf nodes: a sample assigned to the leaf in the lower-right corner has a non-default probability of 0 and a default probability of 100%; a sample assigned to the leaf to its left has a non-default probability of 381/443 = 0.86 and a default probability of 62/443 = 0.14; the leaf whose non-default probability is 82/147 = 0.56 (the one the two customers with identical probabilities seen earlier fall into) has a default probability of 65/147 = 0.44; a leaf further to the left has a default probability of 21/28 = 0.75; and so on for the rest. In addition, interested readers can work out the default probabilities reflected by the remaining leaf nodes and compare them with the thresholds used when plotting the ROC curve in the previous subsection: the thresholds were not chosen arbitrarily but are exactly the default probabilities reflected by these different leaf nodes, and the ROC curve is plotted by using these default probabilities as thresholds and computing the hit rate (TPR) and false alarm rate (FPR) at each one.
Through this figure we can better understand the operating logic of the decision tree: when a new sample arrives, it is judged starting from the root node at the top; if it satisfies "number of historical defaults <= 0.5" it goes to the left node for the subsequent series of judgments, otherwise to the right node, and finally the sample lands in one of the leaf nodes, completing the prediction for that sample.
As a classic machine learning algorithm, the decision tree model has unique advantages, such as insensitivity to outliers and strong interpretability, but it also has shortcomings, such as unstable results and a tendency to overfit. Commercial practice therefore often uses an ensemble model built on top of the decision tree: the random forest, which combines many decision trees and gives more stable results that are less prone to overfitting. Due to space limitations, the random forest model is not covered in detail in this book; for details, see the author's next book, "Python Big Data Analytics and Machine Learning Business Cases".