A project I have been working on recently happened to use logistic regression, so I took the opportunity to study it systematically. This blog post records my study notes, project ideas, and code. The underlying math is covered by many websites and books, so I won't repeat it here; the main goal is to record how I implemented it.
1. Introduction to Logistic Regression
The logistic regression algorithm learns a hypothesis function that maps sample features to sample labels by training on the positive and negative samples in the data. It is a typical linear classifier and a type of generalized linear model. It is highly interpretable and widely used. For example, e-commerce companies use it to determine whether users will choose a certain payment method, financial companies use it to assign users to credit levels, the tourism industry uses it to estimate the probability of hotel customer churn, the medical industry uses it to find risk factors for a certain disease, and the advertising industry uses it to estimate click-through rates.
Advantages: the output has a natural probabilistic interpretation, and the algorithm can be regularized to avoid overfitting; the model is easy to update with new data via stochastic gradient descent; it is computationally cheap, very fast, and has low storage requirements.
Disadvantages: it tends to underfit, so classification accuracy may not be high; it performs poorly on multivariate or nonlinear decision boundaries; logistic regression is the classification counterpart of linear regression and inherits its basic assumptions, so the algorithm performs well only when the data is (approximately) linearly separable, i.e. the classes can be separated by some decision hyperplane.
2. Python implementation
In Python, you can use the linear_model submodule of sklearn (scikit-learn) to call the LogisticRegression class. The syntax is sketched below, followed by the parameter meanings:
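A minimal sketch of the constructor call, spelled out with the default values described below (these match the older scikit-learn release this post is based on; newer versions changed some defaults, e.g. solver is now 'lbfgs'):

from sklearn.linear_model import LogisticRegression

# All arguments shown explicitly at their documented defaults
clf = LogisticRegression(penalty='l2', dual=False, tol=1e-4, C=1.0,
                         fit_intercept=True, intercept_scaling=1,
                         class_weight=None, random_state=None,
                         solver='liblinear', max_iter=100,
                         multi_class='ovr', verbose=0,
                         warm_start=False, n_jobs=1)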
penalty: the regularization penalty term, str type; optional values are 'l1' and 'l2', default 'l2'. It specifies the norm used for the penalty. The newton-cg, sag and lbfgs solvers only support the L2 norm. The L1 norm assumes the model parameters follow a Laplace distribution, while the L2 norm assumes they follow a Gaussian distribution. The norm simply adds a constraint on the parameters so that the model does not overfit. Whether adding the constraint actually helps cannot be answered in advance; one can only say that, in theory, the constrained model should generalize better.
dual: whether to solve the dual formulation, bool type, default False. The dual formulation is only implemented for the L2 penalty with the liblinear solver. dual is usually set to False when the number of samples is greater than the number of features.
tol: the convergence tolerance of the model, float type, default 1e-4. Once the solution improves by less than this amount, iteration stops and the current solution is treated as optimal.
C: the inverse of the regularization strength λ, float type, default 1.0. Must be a positive float. As in SVMs, smaller values mean stronger regularization.
fit_intercept: whether to fit an intercept (bias) term, bool type, default True.
intercept_scaling: only useful when the solver is 'liblinear' and fit_intercept is True. Float type, default 1; it mainly reduces the influence of the artificially appended constant column of 1s in the X matrix.
class_weight: the class weights used in the classification model, given as a dictionary or the string 'balanced'. By default it is not set (None), i.e. all classes are weighted equally. If you do set it, you can either pass 'balanced' to let the library compute the weights itself, or enter the weight of each class yourself. For example, for a 0/1 binary model we can set class_weight={0:0.9, 1:0.1}, giving class 0 a weight of 90% and class 1 a weight of 10%. If class_weight is 'balanced', the library computes the weights from the training sample sizes: the more samples a class has, the lower its weight, and the fewer samples, the higher its weight. With 'balanced', the class weights are computed as n_samples / (n_classes * np.bincount(y)), where n_samples is the number of samples and n_classes is the number of classes; np.bincount(y) gives the number of samples in each class, e.g. for y=[1,0,0,1,1], np.bincount(y)=[2,3].
class_weight actually works as follows:
In classification models, we often encounter two types of problems:
The first is when misclassification is very costly. For example, when distinguishing legitimate users from illegitimate ones, misclassifying an illegitimate user as legitimate is very costly. We would rather misclassify some legitimate users as illegitimate, since those cases can then be checked manually, than let illegitimate users pass as legitimate. In this case we can appropriately increase the weight of the illegitimate-user class.
The second is when the classes are highly imbalanced. For example, suppose we have 10,000 binary samples of legitimate and illegitimate users, with 9,995 legitimate users and only 5 illegitimate ones. If we ignore the weights, the model can simply predict every test sample as a legitimate user: the accuracy is theoretically 99.95%, but the result is meaningless. In this case we can choose 'balanced' to let the library automatically increase the weight of the illegitimate-user samples. Raising the weight of a class means that, compared with ignoring weights, more samples will be classified into the high-weight class, which addresses both of the problems above. A usage sketch follows.
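A minimal sketch of the two ways of setting class_weight, assuming (hypothetically) that class 1 is the rare, costly "illegitimate user" class; the manual weights are illustrative only:

from sklearn.linear_model import LogisticRegression

# Manually up-weight the rare/costly class (class 1 here stands for the illegitimate users)
clf_manual = LogisticRegression(class_weight={0: 0.1, 1: 0.9})

# Or let the library compute weights as n_samples / (n_classes * np.bincount(y))
clf_balanced = LogisticRegression(class_weight='balanced')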
random_state: random number seed, int type, optional, default None; it only has an effect when the solver is 'sag' or 'liblinear'.
solver: the optimization algorithm, with five optional values: newton-cg, lbfgs, liblinear, sag and saga; default is liblinear. The solver parameter determines how the logistic regression loss function is optimized. The five algorithms are:
liblinear: implemented with the open-source liblinear library; it uses coordinate descent internally to iteratively optimize the loss function.
lbfgs: a quasi-Newton method, which uses the second-order derivative matrix of the loss function (the Hessian) to iteratively optimize the loss function.
newton-cg: also a member of the Newton family; it likewise uses the Hessian of the loss function to iteratively optimize it.
sag: stochastic average gradient descent, a variant of gradient descent. It differs from plain gradient descent in that each iteration only uses a subset of the samples to compute the gradient, which is suitable when there is a large amount of data.
saga: a linearly convergent stochastic optimization algorithm, a variant of sag that also supports the L1 penalty.
Description of each optimization algorithm:
liblinear is suitable for small datasets, while sag and saga are suitable for large datasets because they are faster.
For multi-class problems, only newton-cg, sag, saga and lbfgs can handle the multinomial loss; liblinear is limited to one-vs-rest (OvR). That is, with liblinear, a multi-class problem must be solved by taking one class as the positive class and all remaining classes together as the negative class, and repeating this for every class.
The three optimizers newton-cg, sag and lbfgs all require the first- or second-order derivatives of the loss function to be continuous, so they cannot be used with L1 regularization (whose penalty is not continuously differentiable) and only support L2 regularization. liblinear and saga support both L1 and L2 regularization.
Also, since sag only uses a subset of the samples in each gradient iteration, don't choose it when the sample size is small; when the sample size is very large, e.g. more than 100,000, sag is the first choice. However, sag cannot be used with L1 regularization, so if you have a large sample and need L1 regularization you have to make a trade-off: either reduce the sample size by subsampling and keep L1, fall back to L2 regularization, or switch to saga, which does support L1.
From the description above, you might think that since newton-cg, lbfgs and sag have so many restrictions, we could simply choose liblinear whenever the sample is not large. Wrong, because liblinear has its own weakness! Logistic regression comes in binary and multinomial forms. For multinomial logistic regression, two common schemes are one-vs-rest (OvR) and many-vs-many (MvM), and MvM is generally more accurate than OvR. The frustrating thing is that liblinear only supports OvR, not MvM. So if we need a relatively accurate multinomial logistic regression, we cannot choose liblinear, which also means we cannot use L1 regularization in that case. A sketch of the common solver/penalty combinations follows.
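A sketch of the solver/penalty combinations implied by the notes above (all other parameters left at their defaults):

from sklearn.linear_model import LogisticRegression

clf_small_l1 = LogisticRegression(penalty='l1', solver='liblinear')  # small data, L1 regularization
clf_small_l2 = LogisticRegression(penalty='l2', solver='lbfgs')      # small or medium data, L2 regularization
clf_large_l2 = LogisticRegression(penalty='l2', solver='sag')        # very large data, L2 regularization
clf_large_l1 = LogisticRegression(penalty='l1', solver='saga')       # very large data, L1 regularization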
max_iter: the maximum number of iterations for the algorithm to converge, int type, default 100. It only takes effect when the solver is newton-cg, sag or lbfgs.
multi_class: the classification scheme, str type. When the dependent variable has more than two classes, this parameter specifies how the multi-class problem is solved. Optional values are 'ovr' and 'multinomial', default 'ovr'. ovr is the one-vs-rest (OvR) scheme mentioned above, and multinomial corresponds to many-vs-many (MvM). For binary logistic regression there is no difference between ovr and multinomial; the difference only matters for multinomial logistic regression.
The difference between OvR and MvM
The idea of OvR is very simple: no matter how many classes the logistic regression has, we treat it as a set of binary logistic regressions. Concretely, for the classification decision of class K, we take all samples of class K as positive examples and all other samples as negative examples, then fit a binary logistic regression to obtain the classifier for class K. The classifiers for the other classes are obtained in the same way.
MvM is more complicated, so here we explain it with the special case of one-vs-one (OvO). If the model has T classes, we pick two classes of samples at a time, say T1 and T2, put all samples labelled T1 or T2 together, take T1 as the positive class and T2 as the negative class, and fit a binary logistic regression to obtain the model parameters. In total we need T(T-1)/2 classifiers (for example, T = 4 classes requires 4×3/2 = 6 classifiers).
As you can see, OvR is relatively simple but its classification performance is usually worse (for most sample distributions; for some distributions OvR may actually be better). MvM is more accurate, but not as fast as OvR. If ovr is chosen, all four loss-function optimizers liblinear, newton-cg, lbfgs and sag can be used; if multinomial is chosen (i.e. softmax classification), only newton-cg, lbfgs and sag are available. A sketch of both settings follows.
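A sketch of the two multi-class settings (with the scikit-learn version this post is based on; newer releases deprecate the multi_class parameter):

from sklearn.linear_model import LogisticRegression

# One-vs-rest: any of the listed solvers works, including liblinear
clf_ovr = LogisticRegression(multi_class='ovr', solver='liblinear')

# Multinomial (softmax): requires newton-cg, lbfgs or sag
clf_mvm = LogisticRegression(multi_class='multinomial', solver='lbfgs')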
verbose: log verbosity, int type, default 0, meaning the training process is not printed. When set to 1, results are printed occasionally; when greater than 1, output is printed for every sub-model.
warm_start: warm start parameter, bool type, default False, meaning each fit starts from scratch. If True, the next fit reuses the solution of the previous call as its initialization.
n_jobs: the number of parallel jobs, i.e. the number of CPU cores used when fitting the model, int type, default 1. With 1, the program runs on one CPU core; with 2, on two cores; with -1, it uses all CPU cores.
The parameter descriptions above are reproduced from /kingzone_2008/article/details/81067036
3. Project implementation
The main purpose of this project is to find cities that are favorable for this industry, to be used as candidate cities in the future. The training samples consist of macroeconomic indicators plus an industry-related indicator; the prediction samples contain only macroeconomic indicators, and the goal is to screen out the industry-positive cities based on those macroeconomic indicators. The overall idea is simple: first, label the training samples as industry-positive or industry-negative according to the industry indicator and train a logistic regression model to find the relationship between industry-positive cities and the macroeconomic indicators; then apply the model to the prediction samples to output the industry-positive cities as candidates.
The complete code is as follows:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression,LogisticRegression,Ridge,RidgeCV,Lasso, LassoCV
from sklearn.model_selection import train_test_split,GridSearchCV,cross_val_score,cross_validate
from sklearn import metrics as mt
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Read data
data_path = 'C:\\Users\\90539\\PycharmProjects\\data\\'  # File path
file_name = ''  # File name; the dependent variable is in the last column
type_id = ['B']  # Brand to analyse (the analysis is done per brand)
data = pd.read_excel(data_path + file_name, index_col='ccom_id').dropna()  # Drop records with missing values
data1_type = data[data['type_id'].isin(type_id)]  # Keep only the records of the selected brand
data2_type = data1_type[data1_type['year1'] > 2016]  # Use data from 2017 onward as the training sample
data_type = data2_type.drop(['year1', 'year', 'type_id'], 1)  # Drop the time and brand columns
# Remove strongly correlated independent variables (of each pair, remove the one with the larger mean absolute correlation coefficient)
corrX = data_type.corr()  # Correlation coefficient matrix
colnames = data_type.columns
colnames3 = list()  # Stores the variable names to be removed
thred_corr = 0.5  # Correlation threshold: if two independent variables correlate above 0.5, keep only one
for j in range(corrX.shape[1] - 1):  # Remove variables with a correlation coefficient greater than 0.5
    for i in range(j + 1, corrX.shape[0] - 1):
        if abs(corrX.iloc[i, j]) >= thred_corr:
            if abs(corrX.iloc[i, :]).mean() < abs(corrX.iloc[:, j]).mean():
                # Remove the one with the larger mean absolute correlation coefficient
                colnames3.append(colnames[j])
            else:
                colnames3.append(colnames[i])
            break
colnames4 = colnames.drop(list(set(colnames3)))
data2_type = data_type[colnames4]
# Remove collinear variables. A VIF above 10 is usually taken to indicate strong collinearity;
# to keep fewer variables, this post uses a VIF threshold of 4.
vif = [round(variance_inflation_factor(data2_type.values, i), 2) for i in range(data2_type.shape[1])]  # Variance inflation factors
thred_vif = 4  # VIF threshold
data3_type = data2_type.copy()
while sum(list(map(lambda x: x >= thred_vif, vif[:-1]))) > 0:  # The last column is the dependent variable, so it is excluded
    colnames = data3_type.columns[:-1]
    for i in range(len(vif) - 1):  # Remove variables with strong collinearity, one at a time
        if vif[i] >= thred_vif:
            data3_type = data3_type.drop(colnames[i], 1)
            vif = [round(variance_inflation_factor(data3_type.values, i), 2) for i in range(data3_type.shape[1])]  # Recompute the VIFs
            break
# Many variables appear in duplicate (one city-wide version and one town-wide version) but are not caught by the two filters above, so they are filtered manually here
corrX3 = data3_type.corr()
colnames5 = data3_type.columns[:-1]
colnames6 = pd.Index([])  # Stores, for each duplicated pair, the name of the variable to keep
for i in range(len(colnames5)):
    for j in range(i + 1, len(colnames5)):
        if colnames5[j].replace('_ub_incr', '') == colnames5[i].replace('_incr', ''):
            # print(colnames5[j]); print("and"); print(colnames5[i])
            if abs(corrX3.iloc[i, :]).mean() < abs(corrX3.iloc[:, j]).mean():
                colnames6 = colnames6.append(pd.Index([colnames5[i]]))  # Keep the one with the smaller mean absolute correlation
            else:
                colnames6 = colnames6.append(pd.Index([colnames5[j]]))
            break
data4_type = data3_type[colnames6.to_list() + [data3_type.columns[-1]]]
# Binarize the dependent variable in the last column: values greater than thred_incr mark a city as industry-positive
data6_type = data4_type.copy()
thred_incr = 0
for i in range(data4_type.shape[0]):
    if data4_type.iloc[i, -1] > thred_incr:
        value = 1
    else:
        value = 0
    data6_type.iloc[i, -1] = value
# First divide the data into training and testing samples
data_train, data_test = train_test_split(data6_type, test_size=0.2, random_state=1234)  # random_state fixes the seed so every run produces the same split
col_names = data_train.columns
X = data_train[col_names[:-1]] # independent variable
y = data_train[col_names[-1]] # dependent variable
# Logistic regression
model = LogisticRegression()
model.fit(X, y)
coef = model.coef_ # Regression coefficient
coef_regression = pd.Series(index=['Intercept'] + X.columns.tolist(), data=[model.intercept_[0]] + coef.tolist()[0])
names = pd.read_excel(data_path + 'economic indicator name.xlsx', index_col="short_name").dropna()  # Chinese names of the economic indicators, only used to make the printout clearer
names2 = []
for x in coef_regression.index:
    for y in names.index:
        if y == x.replace("_incr", ""):
            names2.append(names.loc[y].iloc[-1])  # The indicator's Chinese name is in the last column
            break
coef_regression2 = pd.Series(index=['Intercept'] + names2, data=[model.intercept_[0]] + coef.tolist()[0])
print("regression coefficients are: "); print(coef_regression2); print("\t")
# Test sample
X_test = data_test[col_names[:-1]]
y_test = data_test[col_names[-1]]
predictY = model.predict(X_test)  # predicted values
# Use confusion matrix to evaluate the quality of the model
acu = mt.accuracy_score(y_test, predictY)
sen = mt.recall_score(y_test, predictY, pos_label=1)
spe = mt.recall_score(y_test, predictY, pos_label=0)  # The negative class is coded 0 here
print('Model accuracy is: %.2f%%' % (acu * 100))
print('Positive coverage is: %.2f%%' % (sen * 100))
print('Negative coverage is: %.2f%%' % (spe * 100)); print('\t')
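# Optionally, the full confusion matrix can also be inspected (a sketch added here,
# not part of the original script; rows are the true classes, columns the predictions)
conf_mat = mt.confusion_matrix(y_test, predictY, labels=[1, 0])
print('Confusion matrix (rows: true 1/0, columns: predicted 1/0):'); print(conf_mat)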
# Finally import the prediction sample for actual prediction
data_new = data[data['year1'] == 2016]  # Use the 2016 samples for prediction; they were left out of training because this part of the data is relatively small
data_new2 = data_new.drop(['year1', 'year', 'type_id'], 1) # Delete irrelevant variables
colnames = coef_regression.index[1:]  # Skip the intercept term, so start from index 1
X_pred = data_new2[colnames]
predictY = model.predict(X_pred)
result = pd.Series(index=X_pred.index.tolist(), data=predictY.tolist())
incr_ccom = result[result >= 1].index  # IDs of the industry-positive cities
The approach taken in this project is fairly simple and has many shortcomings; it is recorded mainly for future reference, and criticism and corrections are welcome.