Chapter 3: Linear Regression Models
- 3.1 Univariate linear regression
- 3.1.1 Mathematical Principles of Univariate Linear Regression
- 3.1.2 Code Implementation of Univariate Linear Regression
- 3.1.3 Case Study: Linear Regression Modeling of Years of Service and Earnings in Different Industries
- 3.2 Linear regression model assessment
- 3.2.1 Programming implementation of model evaluation
- 3.2.2 Mathematical principles of model assessment
- 3.3 Multiple linear regression
- 3.3.1 Mathematical principles and code implementation of multiple linear regression
- 3.3.2 Case Study: Customer Value Prediction Modeling
- 3.3.3 Feature Importance Mining
- Supplementary: 3.3.4 Splitting the data into training and test sets
- Solving equations
- Fitting data and finding function expressions
- 2.1 Approach 1: Polynomial fitting
- 2.2 Approach 2: Specific function fitting
- Practice problem: Python four-parameter logistic fitting
- Exercise: Model Evaluation
- 3.4 Course-related resources
This chapter focuses on one of the most basic and classic models in machine learning: the linear regression model. It starts with univariate linear regression and introduces two classic cases, an income prediction model across different industries and a customer value prediction model, to consolidate the knowledge points learned.
3.1 Univariate linear regression
Linear regression modeling uses linear fitting to explore the laws behind data. As shown in the figure below, building a linear regression model means finding the trend line (also known as the regression line) behind the scattered points (also known as sample points); with this regression line we can then perform simple predictive analysis or causal analysis.
In linear regression, we predict the response variable (also known as the dependent variable) from the feature variables (also known as the independent variables). Based on the number of feature variables, linear regression models can be divided into univariate linear regression and multiple linear regression (MLR). For example, predicting income from a single feature variable, years of working experience, is univariate linear regression, while predicting income from multiple feature variables such as years of working experience, industry, and city is multiple linear regression. This section explains the univariate linear regression model first.
3.1.1 Mathematical Principles of Univariate Linear Regression
The univariate linear regression model, also known as the simple linear regression model, can be expressed by the following equation:
y = a*x + b
where y is the dependent variable, x is the independent variable, a denotes the regression coefficient, and b denotes the intercept. In the figure below, y_i is the actual value and ŷ_i is the predicted value. The purpose of univariate linear regression is to fit a line that brings the predicted values as close as possible to the actual values; if most of the points fall on the fitted line, the linear regression model fits well.
So how do we measure how close the actual values are to the predicted values? Mathematically, we measure it by the sum of the squares of the differences between the two, known as the residual sum of squares, where Σ denotes summation:

RSS = Σ (y_i - ŷ_i)^2

As a side note, in the field of machine learning this residual sum of squares is also called the loss function of the regression model (more on loss functions can be found in the supplementary knowledge points of Chapter 9 of this book).
Obviously we want this sum to be as small as possible, so that the actual and predicted values are as close as possible. The mathematical way to find a minimum is differentiation: when the derivative is 0, the residual sum of squares reaches its minimum. Substituting ŷ_i = a*x_i + b, the residual sum of squares can be rewritten as:

Σ (y_i - (a*x_i + b))^2
By taking the derivatives of this residual sum of squares with respect to a and b and setting the derivatives to 0, the coefficient a and the intercept b of the linear regression model can be obtained. This is the mathematical principle of linear regression, known academically as the method of least squares. The specific derivative formulas are given in the supplementary knowledge points below. In Python there are specialized libraries that solve for the coefficient a and intercept b, so we do not need to work through the mathematics by hand; this is explained in the next subsection.
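For reference, solving the two derivative equations gives the standard closed-form least squares solution (a textbook result, stated here so that readers can check the library's output by hand):

a = Σ (x_i - x̄)(y_i - ȳ) / Σ (x_i - x̄)^2
b = ȳ - a*x̄

where x̄ and ȳ are the means of the x values and y values respectively.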
Supplementary Knowledge: Least Squares Demonstration
We take the following data, three sample points, as an example to demonstrate the least squares method for interested readers:
Assuming that the fitted equation of the linear regression model is y = a*x + b, the residual sum of squares, or loss function, is defined as:

L = Σ (y_i - (a*x_i + b))^2
The purpose of fitting is to make this residual sum of squares as small as possible, i.e., to make the true and predicted values as close as possible. According to the rules for finding extreme values in calculus, taking the derivatives of the residual sum of squares with respect to a and b and setting them to 0 gives the extreme value of the residual sum of squares, from which the fitted coefficient a and intercept b can be obtained.
For example, to find the minimum of the function y = x^2:

y = x^2
y' = 2x
y' = 0  ->  x = 0
Based on the chain rule for derivatives of composite functions in calculus (for example, since the derivative of x^2 is 2x, the derivative of f = (2x+1)^2 is f' = 2*(2x+1)*2), taking the derivatives of the residual sum of squares with respect to a and b respectively and setting them to 0 yields the following two equations:

-2 * Σ x_i * (y_i - (a*x_i + b)) = 0
-2 * Σ (y_i - (a*x_i + b)) = 0
Substituting the x and y values of the sample points and dividing both sides of each equation by 2 gives a system of two linear equations in a and b. Simplifying further, it is easy to solve for a = 1 and b = 2.
That is, the fitted line is y = x + 2, which passes through all three sample points exactly. Interested readers can verify that the residual sum of squares indeed takes its minimum value, 0, in this case. In addition, for multiple linear regression, the method and process for deriving the coefficients and the intercept are the same as above, except that the equations become a system of multivariate equations.
3.1.2 Code Implementation of Univariate Linear Regression
With Python's Scikit-learn library we can easily build a univariate linear regression model. If you installed Python via Anaconda, as described earlier, the Scikit-learn library is already installed by default. Here is a simple example of building a univariate linear regression model in Python.
1. Plotting scatterplots
The first step is to plot a few scatter points with the Matplotlib library introduced earlier; the code is as follows:
import matplotlib.pyplot as plt
X = [[1], [2], [4], [5]]
Y = [2, 4, 6, 8]
plt.scatter(X, Y)
plt.show()
One small note here: the set of independent variables X needs to be written as a two-dimensional structure, i.e., a large list containing small lists. This matches the logic of the multiple regression introduced later, because in multiple regression one dependent variable y may correspond to several independent variables x. For example, for a linear regression with three feature variables, the set of independent variables X needs to be written in a form similar to the following:
X = [[1, 2, 3], [2, 4, 5], [4, 6, 8], [5, 7, 9]]
The set of dependent variables Y, on the other hand, only needs to be written as a one-dimensional structure. The scatter plot at this point is shown below:
2. Introduce Scikit-learn library to build the model
With the raw data, we can quickly build a linear regression model by introducing the Scikit-learn library with the following code:
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X,Y)
The first line of code imports the LinearRegression module from the Scikit-learn library; the second line constructs an initial linear regression model and names it regr; the third line completes the model fitting through the fit() function. At this point regr is already a trained linear regression model.
3. Model predictions
Since regr is a trained model, we can now use it to make predictions. For example, if the independent variable is 1.5, the predict() function returns the corresponding dependent variable y when x = 1.5; the code is as follows:
y = regr.predict([[1.5]])
Note that the independent variable here still has to be written as a two-dimensional structure, for the same reason as when plotting the scatter plot earlier. The prediction result y obtained at this point is a one-dimensional array, shown below:
[2.9]
In addition, if you want to predict more than one independent variable at the same time, you can use the following code:
y = regr.predict([[1.5], [2.5], [4.5]])
The predictions at this point are as follows:
[2.9 4.3 7.1]
4. Model visualization
We can also display the built model visually with the following code:
plt.scatter(X, Y)
plt.plot(X, regr.predict(X))
plt.show()
The running effect is shown in the figure below. The univariate linear regression model is the straight line through the middle of the points, and the principle behind it is the least squares method described in subsection 3.1.1.
5. Linear regression equation construction
The coefficients and intercepts of the line at this point can be obtained by coef_ and intercept_ with the following code:
print('The coefficient a is:' + str(regr.coef_[0]))
print('The intercept b is:' + str(regr.intercept_))
What regr.coef_ returns is an array, so regr.coef_[0] is used to select its element; and because that element is a number, the str() function is needed to convert it to a string for concatenation. The results are as follows:
The coefficient a is:1.4000000000000004
The intercept b is:0.7999999999999989
Then the linear regression equation obtained from the one-way linear regression at this point can be expressed in the following form:
y = 1.4*x + 0.8
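As an optional cross-check of subsection 3.1.1 (this snippet is an aside, not part of the original example), the closed-form least squares formulas can be applied directly to the sample data with NumPy; it should reproduce the coefficient 1.4 and intercept 0.8 obtained above:

import numpy as np

x = np.array([1, 2, 4, 5])
y = np.array([2, 4, 6, 8])

# Closed-form least squares solution for y = a*x + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
print(a, b)  # expected: 1.4 0.8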
3.1.3 Case Study: Linear Regression Modeling of Years of Service and Earnings in Different Industries
After understanding the basic mathematical principles of the univariate linear regression model and its standard code implementation, let's look at a practical case: building an income prediction model with a univariate linear regression model.
1. Case background
Generally speaking, income increases with years of service, and the rate of income growth varies across industries. This subsection explores the effect of years of service on income through a univariate linear regression model, i.e., builds an income prediction model, and also compares the income prediction models of several industries to analyze the characteristics of each industry.
2. Read data
Here we first take the currently popular IT industry as an example, selecting the monthly salaries of 100 IT engineers in Beijing with 0 to 8 years of service. The data is read with the following code, where head() is used to show the first five rows of data.
import pandas
df = pandas.read_excel('IT Industry Income Statement.xlsx')
df.head()  # In editors other than Jupyter Notebook, use print(df.head()) to display the data
The data at this point is as follows:
At this point in time, length of service is the independent variable and salary is the dependent variable, and the independent and dependent variables are selected through the following code:
X = df[['Length of service']]
Y = df['Salary']
Here the independent variable X must be written as a two-dimensional data structure, for the reasons mentioned earlier; the dependent variable Y can simply be written as a one-dimensional data structure, although the model would also work if it were written as the two-dimensional structure df[['Salary']].
The scatterplot can be plotted at this point with the following code:
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.scatter(X,Y)
plt.xlabel('Length of service')
plt.ylabel('Salary')
plt.show()
The 2nd line of code is used to display Chinese characters; SimHei refers to the Heiti (bold) font. Axis labels are then added through xlabel() and ylabel(), and running the code gives the result shown below:
3. Model building
According to the knowledge points in subsection 3.1.2, the linear regression model can be built by the following code:
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X,Y)
4. Model visualization
According to the knowledge points in subsection 3.1.2, the linear regression model can be presented visually by the following code:
plt.scatter(X,Y)
plt.plot(X, regr.predict(X), color='red')  # color='red' sets the line color to red
plt.xlabel('Length of service')
plt.ylabel('Salary')
plt.show()
At this point the running effect is shown in the following figure:
5. Linear regression equation construction
We can also check the slope factor a and intercept b of the line by using the knowledge points from the previous subsection, as in the following code:
print('The coefficient a is:' + str(regr.coef_[0]))
print('The intercept b is:' + str(regr.intercept_))
The results of the run are as follows:
The coefficient a is:2497.1513476046866
The intercept b is:10143.131966873787
So the univariate linear regression equation at this point is approximately:
y = 2497*x + 10143
The complete code is shown below:
# 1.Read data
import pandas
df= pandas.read_excel('IT Industry Income Statement.xlsx')
print(df.head())
X = df[['Length of service']]
Y = df['Salary']
# 2.Model Training
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X,Y)
# 3.Model Visualization
from matplotlib import pyplot as plt
plt.scatter(X,Y)
plt.plot(X, regr.predict(X), color='red')  # color='red' sets the line color to red
plt.xlabel('Length of service')
plt.ylabel('Salary')
plt.show()
# 4.Linear regression equation construction
print('The coefficient a is:' + str(regr.coef_[0]))
print('The intercept b is:' + str(regr.intercept_))
Supplementary Knowledge: Model Optimization - Univariate Polynomial Regression Model
The univariate linear regression model actually has an advanced version: the univariate polynomial regression model, the most common form of which is the univariate quadratic regression model, with the following format:
y = a*x^2 + b*x + c
The reason we also look at the univariate polynomial regression model is that the trend line that really fits the data is sometimes not a straight line but a curve. For example, in the chart below, the curve produced by a univariate quadratic regression model fits the trend behind the scatter points much better.
So how do we build a univariate quadratic regression model in code? Start by generating the quadratic term data with the following code:
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_ = poly_reg.fit_transform(X)
The first line of code imports the module used to add polynomial terms: PolynomialFeatures;
The 2nd line of code sets the highest-degree term to quadratic, in preparation for generating x^2 below;
Line 3 of the code transforms the original X into a new two-dimensional array X_, which contains both the newly generated quadratic term data (x^2) and the original linear term data (x). Interested readers can print X_ to see the effect: a two-dimensional array as shown in the figure below, where the first column is the constant term 1 (in fact the 0th power of x, x^0), which has no special meaning and no effect on the model; the second column is the original linear term data (the 1st power of x, x^1); and the third column is the generated quadratic term data (the 2nd power of x, x^2), i.e., the newly generated set of x^2 values.
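For illustration, applying the same transformation to the small sample X = [[1], [2], [4], [5]] from subsection 3.1.2 (used here only as a demonstration, with a separate variable name so the case-study data is untouched) produces the columns x^0, x^1, x^2:

from sklearn.preprocessing import PolynomialFeatures
X_demo = [[1], [2], [4], [5]]  # the small sample from subsection 3.1.2
print(PolynomialFeatures(degree=2).fit_transform(X_demo))
# [[ 1.  1.  1.]
#  [ 1.  2.  4.]
#  [ 1.  4. 16.]
#  [ 1.  5. 25.]]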
After generating the quadratic term data, a univariate quadratic regression model can be obtained with the same code as before:
regr = LinearRegression()
regr.fit(X_, Y)
The chart above can then be plotted with the scatter() and plot() functions as before; note that the predict() function is now passed X_:
plt.scatter(X,Y)
plt.plot(X, regr.predict(X_), color='red')
plt.show()
By similar means, we can obtain the coefficients a,b and the constant term c of the quadratic regression equation at this point in time with the following code:
print(regr.coef_) # Get the coefficients a and b
print(regr.intercept_) # Get the constant term c
The results are as follows:
[ 0. -743.68080444 400.80398224]
13988.159332096873
The coefficient term now contains three numbers. The first, 0, corresponds to the constant-term column generated earlier in X_, which, as mentioned before, has no effect; -743.68 is the coefficient of the linear term in X_, i.e., coefficient b; 400.8 is the coefficient of the quadratic term in X_, i.e., coefficient a; and 13988 is the constant term c. The univariate quadratic regression equation is therefore:
y = 400.8*x^2 - 743.68*x + 13988
Using the same method, we can obtain the relationship between length of service and salary in the finance, automobile manufacturing, and food service industries. The univariate quadratic regression models for these four industries are shown below:
As can be seen from the four charts above, the IT and finance industries currently offer stronger growth and more room for advancement after a period of work, while the automobile manufacturing and food service industries are relatively weaker. However, these figures are based on data from only 100 people in each industry, so the conclusions are for reference only.
3.2 Linear regression model assessment
After the model is built, we also need to evaluate it. Here we mainly use three values as criteria: R-squared (commonly written R^2 in statistics), Adj. R-squared (adjusted R^2), and the P-value. R-squared and Adj. R-squared measure how well the linear fit matches the data, while the P-value measures the significance of the feature variables.
3.2.1 Programming implementation of model evaluation
Because the mathematics of model evaluation is relatively complex, and this book is intended to be case-based and practice-oriented, supplemented by mathematical derivations, this subsection first explains how model evaluation is carried out in practice, and the next subsection presents the relevant mathematical principles for interested readers.
From a practical point of view, we only need to remember: the higher the R-squared or Adj. R-squared, the better the fit of the model; and the lower the P-value, the higher the significance of the feature variable, i.e., the more likely it is genuinely correlated with the predictor variable. (R-squared and Adj. R-squared take values in the range 0-1; the P-value is essentially a probability and also takes values in the range 0-1.)
In Python, these three parameters can be viewed with the following code:
import statsmodels.api as sm
X2 = sm.add_constant(X)
est = sm.OLS(Y, X2).fit()
print(est.summary())  # In Jupyter Notebook, est.summary() on its own also displays the result
Line 1 of the code imports the library used to evaluate linear regression models, statsmodels, abbreviated as sm;
Line 2 adds a constant term to the original feature variable X via the add_constant() function and assigns the result to X2, so that the model y = a*x + b has a constant term, i.e., the intercept b. Note that the Scikit-learn library used earlier did not require this step;
Line 3 of the code builds the linear regression equations for Y and X2 via the OLS() and fit() functions;
The fourth line of code prints out the data information of the model as shown below:
First, note that coef in the lower-left corner gives the coefficients of the constant term and the feature variable, i.e., the intercept b and the slope coefficient a; you can see they are consistent with the values solved for previously.
For model evaluation, we usually focus on the R-squared, Adj. R-squared, and P-value information in the box above. Here the R-squared is 0.855 and the Adj. R-squared is 0.854, indicating a good linear fit. There are two P-values, one for the constant term (const) and one for the feature variable (length of service); both are approximately 0, so both are significantly correlated with the predictor variable (salary), i.e., genuinely correlated rather than correlated by chance.
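If you prefer to read these metrics programmatically rather than from the printed summary, the statsmodels result object also exposes them as attributes; a small sketch, continuing from the est object fitted above:

print(est.rsquared)       # R-squared
print(est.rsquared_adj)   # Adj. R-squared
print(est.pvalues)        # P-values of the constant term and each feature variable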
Interested readers can also see the model evaluation results by setting it up as a quadratic equation according to the additional knowledge points in Section 3.1.3, with the following code:
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_ = poly_reg.fit_transform(X)
import statsmodels.api as sm
X2 = sm.add_constant(X_) # X_, which contains the x^2 term, is passed in here
est= sm.OLS(Y, X2).fit()
est.summary() # In editors other than Jupyter Notebook, write print(est.summary())
At this point the printout obtained an R-squared of 0.931, which is indeed an improvement over the previous 0.855.
Supplementary knowledge: another code implementation for obtaining R-squared values
Above, we evaluated the linear regression model by importing the statsmodels library. Is there a more general way to obtain the R-squared value? Because we will use the GBDT model in Chapter 9 and the XGBoost and LightGBM models in Chapter 10 for regression analysis, a more general method of obtaining the R-squared value is needed; the code is as follows:
from sklearn.metrics import r2_score
r2 = r2_score(Y, regr.predict(X))
Here Y contains the true values and regr.predict(X) the predicted values. Printing r2 gives 0.855, consistent with the result obtained previously through the statsmodels library.
3.2.2 Mathematical principles of model assessment
Above we demonstrated how to view the R-squared, Adj. R-squared, and P-value through Python; below we explain, for interested readers, what these three values mean mathematically.
1. Understanding R-squared
First, look at R-squared. To understand R-squared, three new concepts are needed: the total sum of squares TSS (Total Sum of Squares), the residual sum of squares RSS (Residual Sum of Squares), and the explained sum of squares ESS (Explained Sum of Squares). For ease of understanding, they are illustrated in the figure below, where Yi is the actual value, Yfitted is the predicted value, and Ymean is the average of all the scatter points (the scatter points themselves are not drawn, to keep the image simple).
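Written as formulas (these are the standard definitions of the three quantities and of R-squared, stated here for reference):

TSS = Σ (Yi - Ymean)^2
RSS = Σ (Yi - Yfitted)^2
ESS = Σ (Yfitted - Ymean)^2
R-squared = 1 - RSS/TSS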
As shown above, the formula for R-squared is 1 - RSS/TSS. For a well-fitting linear regression model, we want the actual values to lie as close as possible to the fitted line, i.e., the residual sum of squares RSS to be as small as possible and hence R-squared to be as large as possible. When RSS tends to 0, the actual values essentially fall on the fitted line, the fit is very good, and R-squared tends to 1. In practice, therefore, the larger the R-squared (the closer to 1), the better the fit. However, higher is not always better: if the fit is too close, overfitting may occur (see the supplementary knowledge in this subsection for more on overfitting).
For example, consider again the data from the supplementary knowledge of subsection 3.1.1, whose fitted equation we showed there to be y = x + 2:
Here Ymean is the average of all the scatter points. In this demonstration case, because the data set is small and the fit is perfect, the predicted value Yfitted equals the actual value Yi for every point. Substituting the data into the sums of squares above, as shown below, gives the R-squared value.
Here the R-squared value is 1, i.e., a perfect fit. Furthermore, we can see that RSS + ESS = TSS, i.e., residual sum of squares + explained sum of squares = total sum of squares, which matches what is shown in the schematic above.
Supplementary knowledge: overfitting and underfitting
As shown in the figure below, overfitting means that the model fits the training samples too closely: although it fits the training set well, it loses the ability to generalize (i.e., it does not predict well on data outside the training set), resulting in poor performance on new data sets. The opposite of overfitting is underfitting, which means the model does not fit well enough: the data points are far from the fitted curve, because the model fails to capture the features of the data.
2. Understanding Adj. R-squared
Adj. R-squared is an improved version of R-squared. Its purpose is to prevent the selection of too many feature variables (mainly relevant for the multiple linear regression covered in the next section) from inflating R-squared. Because of the mathematics behind linear regression, adding a feature variable always increases R-squared, even if the added variable does not actually help the model. To penalize having too many feature variables, Adj. R-squared is introduced: on top of the original R-squared, it additionally takes the number of feature variables into account.
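Its formula (the standard adjusted R-squared definition) is:

Adj. R-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1)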
Here n is the number of samples and k is the number of feature variables. It can be seen that a larger number of feature variables k has a negative effect on Adj. R-squared, so one should not add too many feature variables just to chase a high R-squared. Because the number of feature variables is taken into account, Adj. R-squared reflects the goodness of fit of the linear model more accurately.
For example, for the example given in the previous subsection R-squared, the procedure and results of the calculation of Adj. R-squared are shown below:
It can be seen that for a perfectly fitted linear equation, Adj. R-squared and R-squared are the same, both equal to 1. If the fit is not perfect, for example if R-squared is 0.9, the procedure and result for calculating Adj. R-squared are shown below:
It can be seen that the Adj. R-squared is indeed smaller than the R-squared at this point.
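As a small sketch, the adjustment can also be computed directly in Python; the function below simply implements the formula above, and the example values n = 100 and k = 1 are illustrative assumptions rather than numbers taken from the figures:

def adjusted_r2(r2, n, k):
    # n: number of samples, k: number of feature variables
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.9, 100, 1))  # hypothetical n and k; prints a value slightly below 0.9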
3. Understanding the P-value
The P-value relates to the concept of hypothesis testing in statistics. The null hypothesis here is that the feature variable is not significantly correlated with the predictor variable, and the P-value is the probability of obtaining the observed sample result, or a more extreme one, when the null hypothesis is true.
If that probability is larger, i.e., the P-value is larger, the null hypothesis is more likely to be true, i.e., there is more likely to be no significant correlation;
If that probability is smaller, i.e., the P-value is smaller, the null hypothesis is less likely to be true, i.e., there is more likely to be a significant correlation. So the smaller the P-value, the stronger the evidence of a significant correlation.
Typically, we will use 0.05 as the threshold, when the P-value is less than 0.05, it is considered that the characteristic variable is significantly related to the predictor variable, and the larger the P-value, it means that there is little relationship between the independent variable and the dependent variable.
In the next chapter: logistic regression modeling, section 4.2.3, we will also introduce the evaluation of the model's predictive effectiveness by dividing the test and training sets.
3.3 Multiple linear regression
The essential principle of multiple linear regression is the same as that of univariate linear regression, but because multiple linear regression can take into account the effects of multiple factors on the predictor variables, it is more widely used in business practice.
3.3.1 Mathematical principles and code implementation of multiple linear regression
The principle of the multiple linear regression model is actually similar to that of the one-way linear regression, and its form can be expressed by the following equation:
y = k0 + k1*x1 + k2*x2 + k3*x3 + ...
Here x1, x2, x3, ... are the different feature variables, k1, k2, k3, ... are the coefficients in front of these feature variables, and k0 is the constant term. Building the multiple linear regression model likewise means mathematically calculating appropriate coefficients so that the residual sum of squares shown in the figure below is minimized, where yi is the actual value and ŷi is the predicted value.
Mathematically, the coefficients can be solved with the least squares method or gradient descent; the specific steps can be found in the supplementary knowledge points of section 3.1.1. This section mainly explains how to implement the model in Python. The core code is actually the same as for univariate linear regression, as follows.
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X,Y)
The difference from the univariate linear regression code is that X here contains information on multiple feature variables. Many richer cases can be handled with multiple linear regression, such as predicting income from length of service, region, industry, and other factors, or predicting house prices from house size, location, proximity to the subway, and so on. In the next subsection we explain a model that is common in the banking industry: predicting customer value with a multiple linear regression model.
3.3.2 Case Study: Customer Value Prediction Modeling
Using multiple linear regression models we can predict the value of a customer based on a number of factors, and once the model has been built, different business strategies can be constructed for customers of different values.
1. Case background
Here we use credit card customers to explain the specific meaning of customer value prediction: it refers to estimating how much profit a customer can bring over a future period of time. The profit may come from the credit card's annual fee, cash withdrawal fees, installment fees, overseas transaction fees, and so on. After analyzing customer value, services such as marketing, telephone support, collections, and product consulting can be differentiated for high-value customers, which helps further tap the value of these customers and increase their loyalty.
2. Read data
The relevant data is read with the following code. There are just over 100 sets of existing customer value data here, and some of the data has already undergone simple preprocessing.
import pandas
df = pandas.read_excel('Customer Value Data Sheet.xlsx')
df.head() # Display the first 5 rows of data
The data at this point is shown below. A few notes: the customer value here is the value over 1 year, i.e., the revenue the customer can bring to the bank within 1 year; education has been preprocessed, where 2 indicates a high school degree, 3 a bachelor's degree, and 4 a postgraduate degree; for gender, 0 indicates female and 1 indicates male (the relevant knowledge points of data preprocessing can be found in Chapter 11 of this book).
At this point, the last 5 columns are independent variables, and "customer value" is the dependent variable, and the following code is used to select the independent variables and dependent variables:
X = df[['Historical loan amount', 'Number of loans', 'Academic qualifications', 'Monthly income', 'Gender']]
Y = df['Customer value']
Here again, the independent variable X must be written as a two-dimensional data structure, which is much easier to understand here because it needs to contain multiple feature variables; while the dependent variable Y can be written as a one-dimensional data structure.
3. Model building
According to the knowledge points in subsection 3.1.2, the linear regression model can be built by the following code:
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X,Y)
4. Linear regression equation construction
Using the knowledge points from the previous subsections, we can also view the coefficients ki in front of the feature variables and the constant term k0, with the following code:
print('The coefficients are:' + str(regr.coef_))
print('The constant term coefficient k0 is:' + str(regr.intercept_))
The results of the run are as follows:
The coefficients are:[5.71421731e-02 9.61723492e+01 1.13452022e+02 5.61326459e-02
1.97874093e+00]
The constant term coefficient k0 is:-208.4200407997355
Here regr.coef_ returns an array of coefficients corresponding to the coefficients in front of the different feature variables, i.e., k1, k2, k3, k4, and k5, so the multiple linear regression equation at this point is approximately:
y = -208 + 0.057*x1 + 96*x2 + 113*x3 + 0.056*x4 + 1.97*x5
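With the model built, we could also predict the value of a new customer by passing its feature values to predict() in the same column order as X. The numbers below are purely hypothetical illustration values, not taken from the data set:

new_customer = [[100000, 2, 3, 8000, 1]]  # hypothetical: loan amount, number of loans, education, monthly income, gender
print(regr.predict(new_customer))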
The complete code is shown below:
# 1.Read data
import pandas
df= pandas.read_excel('Customer Value Data Sheet.xlsx')
df.head()
X = df[['Historical loan amount', 'Number of loans', 'Academic qualifications', 'Monthly income', 'Gender']]
Y = df['Customer value']
# 2.Model Training
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X,Y)
# 3.Linear regression equation construction
print('The coefficients are:' + str(regr.coef_))
print('The constant term coefficient k0 is:' + str(regr.intercept_))
5. Model evaluation
Using the method of model evaluation from Section 3.2 we can also perform model evaluation for multiple linear regression with the following code:
import statsmodels.api as sm # Import the library used to evaluate linear regression models
X2= sm.add_constant(X)
est = sm.OLS(Y, X2).fit()
print(est.summary())
The results of this run are shown below:
It can be seen that the overall R-squared of the model is 0.571 and the Adj. R-squared is 0.553. The overall fit is not particularly good, probably because of the small amount of data, but it is acceptable given the data available. Looking at the P-values, most of the feature variables have small P-values and are indeed significantly correlated with the predictor variable, customer value, while the P-value of the gender feature reaches 0.951, i.e., it is not significantly correlated with the predictor variable, which matches empirical intuition. In subsequent modeling, the gender feature variable can therefore be removed.
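As a sketch of that follow-up step (no results are claimed here; separate variable names are used so the earlier X and regr are left untouched), the gender column can simply be dropped before refitting:

X_drop = df[['Historical loan amount', 'Number of loans', 'Academic qualifications', 'Monthly income']]  # 'Gender' removed
regr_drop = LinearRegression()
regr_drop.fit(X_drop, Y)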
In addition, modeling is possible here because the customer value of existing customers is known. If the customer value were unknown (i.e., the predictor variable were unknown), the problem would belong to unsupervised machine learning, and customer value could not be predicted directly; however, the relevant knowledge in Chapter 13 of this book could be used to segment customers into groups. Interested readers can skip ahead to that chapter.
This concludes the explanation of multiple linear regression modeling. There is still much to explore about multiple linear regression, such as how to perform data preprocessing and how to address multicollinearity, which will be explained in Chapter 11: Data Preprocessing and Feature Engineering. The advantages of the linear regression model are that it is highly interpretable and fast to train; its disadvantage is that the model is relatively simple and mainly suited to linear data, which is a significant limitation. In real business practice, linear regression models are used relatively less often; Chapters 9 and 10 will explain how to use ensemble models such as GBDT, XGBoost, and LightGBM to build regression prediction models, so a basic understanding of the linear regression model is enough here as an introductory knowledge point for machine learning.
3.3.3 Feature Importance Mining
from sklearn.preprocessing import MinMaxScaler
X_new = MinMaxScaler().fit_transform(X)  # Scale each feature variable to the range 0-1
X_new  # In editors other than Jupyter Notebook, use print(X_new)
Supplementary: 3.3.4 Splitting the data into training and test sets
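As a minimal sketch of the idea (assuming the same X and Y as in the customer value case above and using scikit-learn's train_test_split; the 20% test size and the random_state value are illustrative choices), the data can be split into a training set and a test set, the model fitted on the training set, and its fit evaluated on the test set:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hold out 20% of the samples as the test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
regr = LinearRegression()
regr.fit(X_train, y_train)
print(r2_score(y_test, regr.predict(X_test)))  # R-squared on the test set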
Additional points of knowledge
Solving equations
/ty123/p/
Solving a univariate cubic equation
import sympy as sp
x = sp.Symbol('x')
f = x**3 + 2*x**2-1
res = sp.solve(f)
At this point res is the result of the solution. res is a list of 3 solutions as follows:
[-1, -1/2 + sqrt(5)/2, -sqrt(5)/2 - 1/2]
If you want to see integers or decimals, you can use the int() or float() function:
float(res[1]) # Print the second solution
Get the results below:
0.6180339887498949
If you want to keep two decimal places, you can use the round() function; the code is as follows:
round(float(res[1]), 2)
Get the result as follows, you can print it with the print() function:
0.62
For the rest of the regular equations, see the first link above, which also solves multivariate equations.
Fitting data and finding function expressions
/changdejie/article/details/83089933
2.1 Approach 1: Polynomial fitting
1. The first approach is polynomial fitting. Mathematically, many functions can be approximated well by polynomials (for example via Taylor expansion), so a polynomial is a reasonable general-purpose fitting form. A specific example is given below.
import numpy as np
import matplotlib.pyplot as plt
# Define x and y scatter coordinates
x= [10,20,30,40,50,60,70,80]
x = np.array(x)
print('x is :\n',x)
num = [174,236,305,334,349,351,342,323]
y = np.array(num)
print('y is :\n',y)
# Fit with a 3rd-degree polynomial
f1= np.polyfit(x, y, 3)
print('f1 is :\n',f1)
p1 = np.poly1d(f1)
print('p1 is :\n',p1)
# Could also use yvals = np.polyval(f1, x)
yvals = p1(x) # Fitted y-values
print('yvals is :\n',yvals)
# Plotting
plot1= plt.plot(x, y, 's',label='original values')
plot2 = plt.plot(x, yvals, 'r',label='polyfit values')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc=4) # Specify the lower right corner of the location of the legend
plt.title('polyfitting')
plt.show()
The following results were obtained:
2.2 Approach 2: Specific function fitting
2. The second approach is to specify a particular functional form (which can be anything, as long as you can write it as the func below) and use least squares to fit it and find the function's coefficients, as follows. Here np.sqrt(x) takes the square root of x and np.square(x) squares x; for powers, use **, e.g., 2**3 = 8.
# Use curve_fit
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# Custom function form: a*sqrt(x)*(b*x^2 + c)
def func(x, a, b, c):
    return a*np.sqrt(x)*(b*np.square(x)+c)
# Define x, y scatter coordinates
x= [20,30,40,50,60,70]
x = np.array(x) # Converting to a numpy array lets func() be evaluated element-wise on x later
num= [453,482,503,508,498,479]
y = np.array(num)
# Nonlinear least squares fitting
popt, pcov = curve_fit(func, x, y)
# popt holds the fitted coefficients
print(popt)
a = popt[0]
b = popt[1]
c = popt[2]
yvals = func(x,a,b,c) # Fitting y-values
print('popt:', popt)
print('Factor a:', a)
print('Factor b:', b)
print('Factor c:', c)
print('Coefficient pcov:', pcov)
print('Coefficient yvals:', yvals)
# Plotting
plot1= plt.plot(x, y, 's',label='original values')
plot2 = plt.plot(x, yvals, 'r',label='polyfit values')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc=4) # Specify the lower right corner of the location of the legend
plt.title('curve_fit')
plt.show()
The generated results are as follows:
Practice problem: Python four-parameter logistic fitting
The following data are available
X = [0.13, 0.18, 0.24, 0.32, 0.42, 0.56, 0.75]
Y = [49.78, 49.48, 48.79, 47.22, 43.90, 37.69, 28.42]
Find a,b,c,d in the equation: y=(a-d)/(1+(x/c)^b)+d
Answers:
# Use curve_fit
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# Customized functions
def func(x, a, b, c, d):
    return (a-d)/(1+(x/c)**b) + d
# Test data 1 (converted to numpy arrays so func() can be evaluated element-wise)
x = np.array([0.13, 0.18, 0.24, 0.32, 0.42, 0.56, 0.75])
y = np.array([49.78, 49.48, 48.79, 47.22, 43.90, 37.69, 28.42])
# Nonlinear least squares fitting
popt, pcov = curve_fit(func, x, y)
# popt holds the fitted coefficients
print(popt)
a = popt[0]
b = popt[1]
c = popt[2]
d = popt[3]
yvals = func(x, a, b, c, d) # Fitting y-values
print('popt:', popt)
print('Factor a:', a)
print('Factor b:', b)
print('Factor c:', c)
print('Factor d:', d)
print('Coefficient pcov:', pcov)
print('Coefficient yvals:', yvals)
# Plotting
plot1= plt.plot(x, y, 's',label='original values')
plot2 = plt.plot(x, yvals, 'r',label='polyfit values')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc=4) # Specify the lower right corner of the location of the legend
plt.title('curve_fit')
plt.show()
The results are obtained as shown below:
If the test data is as follows, it can be replaced with the relevant data:
# Test data2
x = [900, 300, 100, 33.3, 11.1, 3.7, 1.23, 0]
y = [4.2, 3.994, 3.3865, 1.8915, 0.7275, 0.302, 0.143, 0.066]
The result of the acquisition is shown below:
Exercise: Model Evaluation
The fit of the regression model, R^2, can be calculated with the following code:
from sklearn.metrics import r2_score
r2 = r2_score(y, yvals)
Printing r2 yields the fit as shown below:
0.9996809509783059
3.4 Course-related resources
To contact the author, add the following WeChat: huaxz001.
The author's website:
Yutao Wang's related courses are available through the following channels:
JD link: [/Search?keyword=Wang Yutao] (search for "Wang Yutao"); the books can also be purchased on Taobao and Dangdang. To join the study exchange group, add the following WeChat: huaxz001 (please state the reason).
Various courses can be viewed by searching for Yutao Wang on NetEase Cloud and 51CTO.