What are the methods to choose the value of K in k-means algorithm?
Question
There are I guess at least 2 methods
in progress
1
Statistics
55 years
7 Answers
2955 views
Grand Master 1
Answers ( 7 )
The k-means algorithm tries to find the optimum no of clusters.
To determine this, it makes use of WSS (Within-Cluster Sum of Squares) metric,
which is calculated as squared distance of each point from the centroid of the cluster the point belongs to.
We generally find the optimum no of clusters from the elbow plot which is a plot of
the no of clusters against the WSS. This plot take the shape of an arm and the optimum no of clusters
is the point where the arm bends at the elbow.
When you start with one cluster, the WSS is highest and it goes on decreasing as the no of clusters increases.
There will be a point where adding an additional cluster won’t give a substantial reduction in WSS.
That point should be your value of k i.e. optimum no of clusters.
I am familiar with two ways to find K values in K-means.
1) Elbow Method – It is based on the concept that the total intra-cluster variation [or total within-cluster sum of square (WSS)] is minimized. The total WSS measures the compactness of the clustering and we want it to be as small as possible.
we compute the cluster using k-means for different k values by varying k from 1 to say 10 clusters. For each K, we calculate the total WSS (within-cluster sum of square distance). Then we plot the curve of WSS with respect to the K value. The location of the lowest WSS (knee point in a curve) is considered for the optimal value of K (respective K value).
You can see the sample plot of the elbow method in the above image uploaded by swaplaw007.
2) The average silhouette method – it measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
We compute cluster using k-means algorithm for different k values vary from 1 to n. For each k value, we calculate the average silhouette of observations. Plot the curve of average silhouette score with their respective k values. The location where the average silhouette is maximum is considered as optimal value of K.
At the End, varying K value depends on the problem statement, you are solving. Each problem has different requirements based on business requirements. Knowing the domain of the problem, can help to decide the range of K when you are handling large dataset.
The methods to choose the value of k in k mean algorithms are :-
1. Silhoutte coefficient : is a measure of how close each data points in one cluster to the points in another cluster
which is equal to b-a/max(b-a) where b is the distance of data point in one cluster to the centroid of another cluster
and the value of b is expected to be high. In sc we are going by the purity of clusters based on the distance.
2. Elbow Analysis : This is used for validation and interpretation whether by increasing the number of clusters we are getting any benefit or not means if there is any change in the variance as variance explained increases or not if not then we will not increase the number of clusters further.
There is a popular method known as elbow method which is used to determine the optimal value of K to perform the K-Means Clustering Algorithm. The basic idea behind this method is that it plots the various values of cost with changing k. As the value of K increases, there will be fewer elements in the cluster. So average distortion will decrease. The lesser number of elements means closer to the centroid. So, the point where this distortion declines the most is the elbow point.
1. ELBOW METHOD:
The basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation [or total within-cluster sum of square (WSS)] is minimized. The total WSS measures the compactness of the clustering and we want it to be as small as possible. The Elbow method looks at the total WSS as a function of the number of clusters. The optimal number of clusters can be defined as follow:
1) Compute clustering algorithm (e.g., k-means clustering) for different values of k
2) For each k, calculate the total within-cluster sum of square (WSS)
3) Plot the curve of WSS according to the number of clusters k
4) The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters
2. AVERAGE SILHOUTEE METHOD:
Average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.
1) Compute clustering algorithm
2) For each k, calculate the average silhouette of observations (avg.sil)
3) Plot the curve of avg.sil according to the number of clusters k
4) The location of the maximum is considered as the appropriate number of clusters
Elbow method
Recall that, the basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation [or total within-cluster sum of square (WSS)] is minimized. The total WSS measures the compactness of the clustering and we want it to be as small as possible.
The Elbow method looks at the total WSS as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t improve much better the total WSS.
The optimal number of clusters can be defined as follow:
1. Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of square (wss).
3. Plot the curve of wss according to the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.
the elbow method is sometimes ambiguous. An alternative is the average silhouette method.
Average silhouette method:
it measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
Average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximize the average silhouette over a range of possible values for k.
The algorithm is similar to the elbow method and can be computed as follow:
1. Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters.
2. For each k, calculate the average silhouette of observations (avg.sil).
3. Plot the curve of avg.sil according to the number of clusters k.
4. The location of the maximum is considered as the appropriate number of clusters.
Gap statistic method:
The gap statistic compares the total within intra-cluster variation for different values of k with their expected values under null reference distribution of the data. The estimate of the optimal clusters will be value that maximize the gap statistic (i.e, that yields the largest gap statistic). This means that the clustering structure is far away from the random uniform distribution of points.
I am aware of elbow method : We can plot the within-cluster variation as per different values of ‘k’. Since clusters should be homogenous, the variation should be minimized. It is seen happening at the elbow of the curve.