Experimental Background:
This analysis uses real transaction data from 2010/01/12 to 2011/12/09 from an online e-commerce company in the UK, so the data types and values are messy and realistic.
The company's main business is selling customized gifts. The purpose of this analysis is to segment the company's customers so that the business department can run different promotions (marketing initiatives or offers) for different customer groups and thereby increase sales.
Experimental content:
Data preprocessing:
1. Introduce observational data:
#import all necessary package
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
Data source: /carrie1/ecommerce-data
#load data
df=pd.read_csv(r'E:\test\ecommerce-data\',encoding='ISO-8859-1')
df.head()
There are 8 columns in total. From the first 5 rows displayed, we can see that the same CustomerID may have purchased multiple items on the same invoice (InvoiceNo), with one row per item.
2. Clean up the data:
# Observational data
df.info()
Note that InvoiceDate is stored as an object; it is better to convert it to a datetime type to make later calculations easier.
# Convert with pd.to_datetime, specifying an explicit format string so the dates are parsed correctly
df['InvoiceDate']=pd.to_datetime(df.InvoiceDate,format='%m/%d/%Y %H:%M')
3. Check for null values
df.isnull().sum()
You can see that CustomerID (CID) has quite a few null values. Since the rows containing nulls were not shown above, let's list them separately for analysis:
Nan_rows=df[df.isnull().any(axis=1)]
Nan_rows.tail(5)
Since CustomerID cannot be inferred from other columns or filled in by interpolation based on the existing data, these rows are treated as dirty data and deleted:
df=df.dropna(subset=['CustomerID'])
df.isnull().sum()
After removing the rows with a null CustomerID, the null count for Description also drops to 0. This means that every row with a NaN Description also had a NaN CustomerID.
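As a quick sanity check on that claim (a sketch, reusing the Nan_rows slice captured before the dropna), we can verify that every row missing a Description was also missing its CustomerID:
# Sketch: every row with a NaN Description should also have a NaN CustomerID
missing_desc = Nan_rows[Nan_rows['Description'].isnull()]
print(missing_desc['CustomerID'].isnull().all())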
4. Duplicate value checking
df.duplicated().sum()
There are just over 5,000 duplicate rows, which has little impact, so we choose to remove them:
df.drop_duplicates(inplace=True)
Now that the data has been cleaned, we can observe its characteristics.
Data analysis and observation
1. Find out how many countries there are:
df.Country.unique()
Customers come from many countries, so is there a strong relationship between country and the number of buying customers? We can count the customers per country, sort the result, and visualize it:
# First drop duplicate CustomerIDs so that each customer is counted only once
order_unique_cus=df.drop_duplicates(['CustomerID'], keep='first', inplace=False)
# Count the number of unique customers per country and keep the top 5
order_sum=order_unique_cus.groupby(by='Country').UnitPrice.count()
order_sum=order_sum.sort_values(ascending=False).head(5)
plt.figure(figsize=(20,5))
sns.barplot(x=order_sum.index,y=order_sum.values,palette="Greens_r")
plt.ylabel('Counts')
plt.title('Num of customers in different countries',fontsize=15,color='r',fontweight='bold')
plt.xticks(rotation=45)
plt.show()
Clearly the UK accounts for the bulk of the buyers. Calculated as a percentage:
order_unique_cus.loc[order_unique_cus.Country=='United Kingdom'].shape[0]/order_unique_cus.shape[0]
In summary, since the main consumer group is in the United Kingdom and buying habits differ from country to country, the subsequent RFM analysis is conducted only for UK customers.
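One way to apply that UK-only restriction would be a simple filter like the sketch below; df_uk is a hypothetical variable name, and the later snippets keep working on df, so you would either reassign df or substitute df_uk where needed.
# Sketch: keep only UK transactions for the RFM step
df_uk = df[df['Country'] == 'United Kingdom'].copy()
df_uk.shape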
df.describe()
We can see that both Quantity and UnitPrice contain extreme values, so their distributions should be heavily long-tailed.
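To eyeball that long-tail claim, a quick sketch with log-scaled histograms (positive values only, since the negative quantities are discussed next) could look like this:
# Sketch: log-scaled histograms of Quantity and UnitPrice to show the long tails
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df[df['Quantity'] > 0]['Quantity'].plot(kind='hist', bins=100, log=True, ax=axes[0], title='Quantity')
df[df['UnitPrice'] > 0]['UnitPrice'].plot(kind='hist', bins=100, log=True, ax=axes[1], title='UnitPrice')
plt.show()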
It was also found that, surprisingly, Quantity contains negative values; let's pull those rows out and take a look:
#option1:
df[df['Quantity']<0].head(5)
#option2:
df1=df.query("Quantity<0").head(5)
We find that when Quantity is negative, the InvoiceNo starts with 'C'; after checking, a leading 'C' indicates a cancelled order.
There are also rows where StockCode is 'D', whose Description indicates that the line is a discount.
These rows complicate the analysis because orders are cancelled for many reasons (bought by mistake, no longer wanted, etc.). They can be analysed later when focusing on churn; for now we simply remove them (see the quick check below before dropping).
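As a quick check on those two observations (a sketch using the same column names as the dataset), we can count how many rows have a 'C'-prefixed InvoiceNo and peek at the StockCode 'D' rows:
# Sketch: cancelled invoices start with 'C'; compare their count with the negative-quantity count
cancelled = df[df['InvoiceNo'].astype(str).str.startswith('C')]
print(len(cancelled), (df['Quantity'] < 0).sum())
# Discount lines carry StockCode 'D'
print(df[df['StockCode'] == 'D'].head())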
# Use drop to remove the rows with Quantity < 0, located by their index
df.drop(df[ df.Quantity<0].index, inplace=True)
df.describe()
Now that the data is basically "clean", we can start the RFM analysis:
RFM analysis of UK customers
First we choose days as the smallest unit of time and create a new column date.
import datetime
df['date']=df['InvoiceDate'].apply(lambda x: x.date())
The date column has been created, and we keep the original InvoiceDate for easy comparison later. The definition of RFM will not be repeated here; below is how each parameter is computed in this experiment:
R (recency): the most recent purchase. We take the last day in the data (the max date) as "today" and subtract the date of each customer's last purchase.
F (frequency): the number of purchases per customer.
M (monetary): total spend per customer, i.e. UnitPrice * Quantity summed.
# Create the total-spend column
df['M']=df['UnitPrice']*df['Quantity']
# Build an RFM pivot table from the columns to be analysed
rfm=df.pivot_table(index='CustomerID',values=['date','InvoiceNo','M'],aggfunc={'date':'max','InvoiceNo':'count','M':'sum'})
# Rename the InvoiceNo column to F
rfm.rename(columns={'InvoiceNo':'F'},inplace=True)
# Compute recency and convert the result to a numeric number of days
rfm['R']=(rfm.date.max()-rfm.date)/np.timedelta64(1,'D')
# Keep only the 3 columns needed for the analysis
rfm1=rfm[['R','F','M']]
rfm1
Next, we perform the RFM analysis using **clustering (KMeans)**.
An important parameter in clustering is K, the number of clusters into which the data is best divided. Too small or too large a K leads to underfitting or overfitting, so the choice of K is crucial. The elbow method is used here:
The core of the elbow method is to observe the SSE (sum of squared errors). As the number of clusters grows, the partition becomes finer and the SSE naturally decreases, but beyond a certain K the improvement becomes marginal. So we plot the SSE for K from 1 to 9 and select the number of clusters at the inflection point (the "elbow").
# Try K from 1 to 9 and record the SSE for each
from sklearn.cluster import KMeans

SSE=[]
for k in range(1,10):
    estimator=KMeans(n_clusters=k)
    estimator.fit(rfm1)
    SSE.append(estimator.inertia_)
x=range(1,10)
plt.plot(x,SSE,'x-')
plt.show()
The elbow appears at K=3, so we take 3 as the number of clusters. Next we can fit KMeans with this value:
# Fit KMeans with the chosen 3 clusters
kmeans=KMeans(n_clusters=3,random_state=0).fit(rfm1)
# Get the cluster label of each point
rfm1['cluster']=kmeans.labels_
rfm1.sort_values(by='cluster')
We can see that all customers have been assigned to 3 clusters. Next, let's explore the characteristics of each cluster.
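Before reading the box plots, a quick numeric summary per cluster can back up the visual interpretation; this is a sketch assuming rfm1 already carries the cluster column added above.
# Sketch: mean and median of R, F, M per cluster, plus cluster sizes
print(rfm1.groupby('cluster')[['R','F','M']].agg(['mean','median']))
print(rfm1['cluster'].value_counts())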
R & Cluster
sns.boxplot(x=rfm1.cluster,y=rfm1.R)
plt.title('boxplot across R and Cluster',fontsize=15,color='r',fontweight='bold')
plt.savefig(r'E:\test\ecommerce-data\boxplot across R and ')
First, since R represents the most recent purchase (days since the last order), it should be as small as possible. From the box plot:
- The first category has a wide spread and very high extreme values, making it the "worst" performer.
- The third category also has higher extreme values than the second, along with some outliers.
- The second category performs best, with the lowest extremes, median, and mean; these customers are our "golden VIPs".
Ranking the categories from best to worst on R: 2 > 3 > 1.
F & Cluster
sns.boxplot(x=rfm1.cluster,y=rfm1.F)
plt.title('boxplot across F and Cluster',fontsize=15,color='r',fontweight='bold')
plt.savefig(r'E:\test\ecommerce-data\boxplot across F and ')
Since F stands for purchase frequency, a larger F means more purchases. From the box plot:
- The first category has a few individual extreme values, but its median and mean are generally low, so its performance is still not encouraging.
- The second category sits at an intermediate level, with moderate performance.
- The third category performs best, with not only many large values but also the highest median and mean, so it still occupies a "golden VIP" seat.
Ranking the categories from best to worst on F: 3 > 2 > 1.
M & Cluster
sns.boxplot(x=rfm1.cluster,y=rfm1.M)
plt.title('boxplot across M and Cluster',fontsize=15,color='r',fontweight='bold')
plt.savefig(r'E:\test\ecommerce-data\boxplot across M and ')
Finally, M is the total amount spent, which the merchant cares about most: the more customers spend, the more the merchant earns. From the box plot:
- The first category again has the lowest values.
- The third category is intermediate, with spending generally between 50,000 and 100,000.
- The second category is the most striking; it is likely where the extreme values we saw earlier came from, with a median approaching 200,000.
Ranking the categories from best to worst on M: 2 > 3 > 1.
To summarize: the second category contains the customers whose value we most need to preserve; we can list them as our "golden VIPs" and focus on their retention. The third category comes next, and the first category is the comparatively ordinary users.
However, clustering has only given us three broad categories.
Depending on business requirements, if we want to score each metric the way classic RFM does, so that marketing can be targeted with a finer segmentation, this can be achieved with a different approach.
Of course, the exact cut-off point for each metric should be discussed with the business in real work; in the absence of business experience, grouping by quartiles is the most defensible choice.
My approach is to first define two different functions, because among the three parameters R should be as low as possible, while F and M should be as high as possible.
# For each new calculation I prefer creating a new variable, so the original table stays intact
rfm2=rfm[['R','F','M']]
# Compute the quartiles of the raw RFM data (to split it into 4 equal parts)
quantile=rfm2.quantile(q=[0.25,0.5,0.75])

# R scoring: smaller (more recent) values fall into lower quartile bins
def Rscore(x,col):
    if x<=quantile[col][0.25]:
        return 1
    elif x<=quantile[col][0.5]:
        return 2
    elif x<=quantile[col][0.75]:
        return 3
    else:
        return 4

# F and M scoring: the quartile bins run in the opposite direction
def FMscore(x,col):
    if x<=quantile[col][0.25]:
        return 4
    elif x<=quantile[col][0.5]:
        return 3
    elif x<=quantile[col][0.75]:
        return 2
    else:
        return 1
Next, apply the functions to the RFM table:
rfm2['r_score']=rfm2['R'].apply(Rscore,args=('R',))
rfm2['f_score']=rfm2['F'].apply(FMscore,args=('F',))
rfm2['m_score']=rfm2['M'].apply(FMscore,args=('M',))
At this point, there are two ways to proceed:
The first is direct concatenation: join the three scores as strings (e.g. 1, 1, 1 becomes '111') and then decide the segment from the combined code.
The second is to combine the scores of each group into a single number using weights.
The second method is used in this analysis, but the implementation of the first is also shown below:
# Direct concatenation: the scores are numeric, so convert them to strings before joining
rfm2['str_score']=rfm2['r_score'].apply(str)+rfm2['f_score'].apply(str)+rfm2['m_score'].apply(str)
# Or use map to convert each column to str directly
rfm2['rfmscore']=rfm2['r_score'].map(str)+rfm2['f_score'].map(str)+rfm2['m_score'].map(str)
The second, weighted approach:
# I set the weight of M to 0.5 as most important, R to 0.2, and F to 0.3
rfm2['rfm_total_score']=rfm2['r_score']*0.2+rfm2['f_score']*0.3+rfm2['m_score']*0.5
After getting the total scores, we assign each customer to a class:
bins=rfm2['rfm_total_score'].quantile(q=[0,0.125,0.25,0.375,0.5,0.625,0.75,0.875,1])
bins[0] =0
labels = ['Lost customers','General maintenance clients','General Development Clients','Potential customers','Critical retention of clients','Critical retention of customers','Key development clients','Significant Value Customers']
rfm2['level'] = pd.cut(rfm2.rfm_total_score,bins,labels=labels)
Visualization:
# Make sure Chinese characters in the plot are rendered correctly instead of as empty squares
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(15,8))
# Prepare the table to be visualized
cus=rfm2.groupby('level').count().sort_values('R')
x=cus.index
y=cus.R
# Draw and style the bar chart
plt.bar(x,y,alpha=0.9, facecolor = 'lightskyblue', edgecolor = 'white')
plt.xticks(x,rotation=60,fontsize=15)
plt.title('The Number Of Customer \n In Different Type',fontsize=20,color='r',fontweight='bold')
plt.ylabel('Num of Customer',fontsize=15)
plt.xlabel('Level',fontsize=15)
plt.savefig(r'E:\test\ecommerce-data\The number of customer in different ')
From this we can see that key development clients account for the vast majority, which suggests the outlook for this e-commerce business is quite optimistic. While making sure the significant-value customers are retained, the company can step up marketing efforts to turn more key development clients into loyal customers and expand the business.
Results
We used clustering and quantile binning respectively to achieve RFM customer segmentation, and after combining the results with the business context we arrived at some observations for marketing. Since this dataset is quite comprehensive, we could also go on to analyze the company from other dimensions, including using more algorithms to test which parameter has the biggest impact on total sales, in order to provide more complete and practical recommendations.
PS: This is the second data analysis report I have written. If you find any errors, you are welcome to leave a message in the comments section so we can improve together! If not, feel free to give it a like as encouragement! hehehe
The link below is my first data analysis report, which focuses on visualizing the 2019 Airbnb NYC listings; you are welcome to head over and check it out as well. (Shameless self-promotion, haha):
/MRmaand12/article/details/105788430