
K-Means Clustering Algorithm Example

The so-called clustering problem is: given a set of elements D, where each element has n observable attributes, use some algorithm to divide D into k subsets such that the dissimilarity between elements within each subset is as low as possible, while the dissimilarity between elements in different subsets is as high as possible. Each subset is called a cluster.
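In k-means specifically, dissimilarity is usually measured by squared Euclidean distance, and the algorithm can be seen as (approximately) minimizing the within-cluster sum of squares. A standard formulation, not spelled out in the original, is:

\min_{C_1,\dots,C_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2, \quad \text{where } \mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x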

Clustering differs from classification. Classification is learning from examples: the categories must be known before classification, and each element is asserted to map to one of them. Clustering is observational learning: the categories, and even the number of categories, may be unknown before clustering. It is a type of unsupervised learning. Clustering is widely used in fields such as statistics, biology, database technology, and marketing, and many algorithms exist for it.

K-Means algorithm example

Example: the following is a set of users' age data. Set the K value to 2 to cluster the users; 16 and 22 are randomly selected as the initial centroids of the two clusters.

Data_Age = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65];

CenterId1 = 16, CenterId2 = 22

(1) Calculate distance and divide data

The users are grouped for the first time by computing the distance between every user's age value and each initial centroid. The distance used is the Euclidean distance; the smaller the distance, the higher the similarity between the two users' ages.

The first iteration:

Data_Age = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65];

Distance(16) = [1, 1, 0, 3, 3, 4, 4, 5, 6, 12, 19, 24, 25, 26, 27, 28, 44, 45, 49];

Distance(22) = [7, 7, 6, 3, 3, 2, 2, 1, 0, 6, 13, 18, 19, 20, 21, 22, 38, 39, 43];

Group1_(16) = [15, 15, 16]; Mean = 15.33

Group2_(22) = [19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]; Mean = 36.25

(The two 19s are equidistant from both centroids, at distance 3; ties here are assigned to the second group.)
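The first-iteration distances and group assignments above can be reproduced in a few lines of MATLAB (a minimal sketch; the variable names d1, d2, Group1, Group2 are ours):

Data_Age = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65];
d1 = abs(Data_Age - 16);       % distance of every age to centroid 16
d2 = abs(Data_Age - 22);       % distance of every age to centroid 22
Group1 = Data_Age(d1 < d2);    % ages closer to 16 -> [15 15 16]
Group2 = Data_Age(d1 >= d2);   % ages closer to 22 (ties go to 22, as above)
mean(Group1)                   % returns 15.33 (rounded)
mean(Group2)                   % returns 36.25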

(2) Use the mean as the new centroid

Take the mean of the data in each of the two groups as the new centroid, then repeat the previous step: compute the distance from each data point to the new centroids and assign each point to the cluster whose centroid is closer.
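In code, this update step is just taking each group's mean as the next centroid (continuing the sketch above; our naming, not the author's code):

CenterId1 = mean(Group1);   % 15.33 becomes the new centroid of group 1
CenterId2 = mean(Group2);   % 36.25 becomes the new centroid of group 2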

The second iteration:

Data_Age = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65];

New centroids: 15.33, 36.25

Distance(15.33) = [0.33, 0.33, 0.67, 3.67, 3.67, 4.67, 4.67, 5.67, 6.67, 12.67, 19.67, 24.67, 25.67, 26.67, 27.67, 28.67, 44.67, 45.67, 49.67];

Distance(36.25) = [21.25, 21.25, 20.25, 17.25, 17.25, 16.25, 16.25, 15.25, 14.25, 8.25, 1.25, 3.75, 4.75, 5.75, 6.75, 7.75, 23.75, 24.75, 28.75];

Group1_(15.33) = [15, 15, 16, 19, 19, 20, 20, 21, 22]; Mean = 18.56

Group2_(36.25) = [28, 35, 40, 41, 42, 43, 44, 60, 61, 65]; Mean = 45.90

The third iteration:

Data_Age = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65];

New centroids: 18.56, 45.90

Distance(18.56) = [3.56, 3.56, 2.56, 0.44, 0.44, 1.44, 1.44, 2.44, 3.44, 9.44, 16.44, 21.44, 22.44, 23.44, 24.44, 25.44, 41.44, 42.44, 46.44];

Distance(45.90) = [30.90, 30.90, 29.90, 26.90, 26.90, 25.90, 25.90, 24.90, 23.90, 17.90, 10.90, 5.90, 4.90, 3.90, 2.90, 1.90, 14.10, 15.10, 19.10];

Group1_(18.56) = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28]; Mean = 19.50

Group2_(45.90) = [35, 40, 41, 42, 43, 44, 60, 61, 65]; Mean = 47.89

The fourth iteration:

Data_Age = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65];

New centroids: 19.50, 47.89

Distance(19.50) = [4.5, 4.5, 3.5, 0.5, 0.5, 0.5, 0.5, 1.5, 2.5, 8.5, 15.5, 20.5, 21.5, 22.5, 23.5, 24.5, 40.5, 41.5, 45.5];

Distance(47.89) = [32.89, 32.89, 31.89, 28.89, 28.89, 27.89, 27.89, 26.89, 25.89, 19.89, 12.89, 7.89, 6.89, 5.89, 4.89, 3.89, 12.11, 13.11, 17.11];

Group1_(19.50) = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28]; Mean = 19.50 (unchanged)

Group2_(47.89) = [35, 40, 41, 42, 43, 44, 60, 61, 65]; Mean = 47.89 (unchanged)

(3) Algorithm stop condition

Iteratively compute the distance from each data point to the new centroids until the new centroids are equal to the previous centroids; the algorithm then stops. Here the fourth iteration leaves the centroids (19.50 and 47.89) unchanged, so the clustering is complete.
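Putting the three steps together, the whole hand computation can be run as a loop that stops when the centroids no longer change (a minimal 1-D sketch under the same setup; the variable names are ours):

Data_Age = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65];
c1 = 16; c2 = 22;                        % initial centroids
while true
    d1 = abs(Data_Age - c1);             % distance of every age to centroid 1
    d2 = abs(Data_Age - c2);             % distance of every age to centroid 2
    new1 = mean(Data_Age(d1 < d2));      % mean of the points closer to c1
    new2 = mean(Data_Age(d1 >= d2));     % mean of the points closer to c2
    if new1 == c1 && new2 == c2          % stopping condition: centroids unchanged
        break;
    end
    c1 = new1; c2 = new2;
end
fprintf('Final centroids: %.2f, %.2f\n', c1, c2);   % prints 19.50, 47.89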

MATLAB's built-in kmeans function

The kmeans function in MATLAB partitions an N*P matrix X into K classes so that the distance between objects within the same class is as small as possible and the distance between different classes is as large as possible.

How to use:

Idx = kmeans(X, K)

[Idx, C] = kmeans(X, K)

[Idx, C, sumD] = kmeans(X, K)

[Idx, C, sumD, D] = kmeans(X, K)

Input and output parameters:

X --- the N*P data matrix

K --- the number of classes into which X is divided; must be an integer

Idx --- an N*1 vector storing the cluster index of each point

C --- a K*P matrix storing the K cluster centroid positions

sumD --- a 1*K vector storing, for each class, the sum of the distances from all points in that class to the class centroid

D --- an N*K matrix storing the distance from each point to every centroid

[...] = kmeans(..., 'Param1', 'Val1', 'Param2', 'Val2', ...)

The main parameters that can be set are:

1. 'Distance' --- the distance measure

'sqEuclidean' --- squared Euclidean distance (the default)

'cityblock' --- sum of absolute differences, i.e. the L1 distance

'cosine' --- one minus the cosine of the included angle between points (treated as vectors)

'correlation' --- one minus the sample correlation between points (treated as sequences of values)

'Hamming' --- percentage of bits that differ (only suitable for binary data)

2. 'Start' --- the method for choosing the initial centroid positions

'sample' --- select K observations from X at random

'uniform' --- select K points uniformly at random from the range of X

'cluster' --- perform a preliminary clustering on a random 10% subsample of X (this preliminary phase is itself initialized with 'sample')

matrix --- a user-supplied K*P matrix of initial centroid positions

3. 'Replicates' --- the number of times to repeat the clustering, each with a new set of initial centroids; must be an integer
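Since 'Start' accepts an explicit K*P matrix, kmeans can be seeded with the same initial centroids (16 and 22) as the hand computation above. A sketch (the variable names are ours; the default squared Euclidean distance gives the same 1-D assignments as the absolute distance used earlier):

ages = [15 15 16 19 19 20 20 21 22 28 35 40 41 42 43 44 60 61 65]';   % N*1 data matrix
seeds = [16; 22];                         % K*P (here 2*1) matrix of initial centroids
[Idx, C] = kmeans(ages, 2, 'Start', seeds);
disp(C')                                  % should converge to about 19.50 and 47.89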

MATLAB code:

% Basic idea of the K-means algorithm: randomly choose K initial cluster centers
% and assign each sample point to its nearest center.
% Then recompute each cluster's centroid as the mean of its points to obtain the
% new cluster centers. Iterate until the cluster centers move less than a given threshold.

% Generate 200 random points: four Gaussian clouds of 50 points each
X = [randn(50,2)+[-ones(50,1), ones(50,1)]; randn(50,2)+[ones(50,1), ones(50,1)]; ...
     randn(50,2)+[ones(50,1), -ones(50,1)]; randn(50,2)+[-ones(50,1), -ones(50,1)]];


opts = statset('Display', 'final');

[Idx, Ctrs, SumD, D] = kmeans(X, 4, 'Replicates', 4, 'Options', opts);

% Plot the points assigned to cluster 1.
% X(Idx==1, 1) holds the first coordinate of the cluster-1 samples; X(Idx==1, 2) holds the second
plot(X(Idx==1,1), X(Idx==1,2), 'r.', 'MarkerSize', 14);
hold on;
plot(X(Idx==2,1), X(Idx==2,2), 'b.', 'MarkerSize', 14);
plot(X(Idx==3,1), X(Idx==3,2), 'g.', 'MarkerSize', 14);
plot(X(Idx==4,1), X(Idx==4,2), 'y.', 'MarkerSize', 14);

% Plot the cluster centers; 'kx' draws them as black crosses
plot(Ctrs(:,1), Ctrs(:,2), 'kx', 'MarkerSize', 14, 'LineWidth', 4);

legend('Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4', 'Centroids', 'Location', 'NW');
grid on;

