k-means Clustering: Elbow Method To Find Optimal k


Introduction


k-means clustering is one of the simplest unsupervised learning algorithm. It is the most commonly used clustering algorithm. It works in a manner that tries to locate centers of a cluster. We say this center as the centroid. The algorithm initiate by assigning each data point to the nearest centroid. Lastly, the algorithm stops when there is no further change in clusters.

Clustering algorithms partition datasets into distinct groups. We say these groups as clusters. Each group contains very similar items. These algorithms group together data points with similar characteristics. It is one of the unsupervised learning approaches. Normally, applied to the data which lacks in label information. Due to this, we don't know what is the right output.

Clustering:


Basically, in supervised learning, the user is already knowing the type of output, How it gonna be. It also knows about class labels. The user is handy with the target variable from a given dataset. Conversely, unsupervised learning, we are left with is just a bunch of sets of features. And, we are totally unknown about what target variable would be. What class labels of the dataset are going to? what happens, is that we identify the underlying of that data. We try to find a group of data that have many things in common. Thus we try to figure out clusters out of the dataset.

k-means:


k-means clustering is a supervised learning algorithm. It is one of the simple and popular algorithms. As the name suggests, it forms the clusters of data in a dataset.

Let's assume, we have a dataset, pretty much scattered around the plane XY. Whereas X and Y axes represent two different features of data points.
Image contains a scatter plot of dataset having a smaller number of data points wirh only two features.
Dataset before applying k-means Clustering
You may think that the underlying dataset is so simple. We can form clusters by just looking at the plot. So, why we need k-means. But, the plot is representing sample data on a very small scale. Actually, datasets of real file problems are much bigger and complex.

 Now the objective is to identify the clusters in this dataset. But, the data is unlabeled and we don't have any information on target variables. All we firstly want to do is to identify some sort of structure into the data. Normally, the user can separate the data points just looking towards the scatter plot. But, we can do this only for smaller datasets. Even, we cannot do that so accurately. After all, it would be a basic visual analysis. But, k-means helps us to identify these clusters.

k in k-means:


k of k-means is a parameter. Before heading towards the execution of the algorithm the user needs to tell the algorithm, the value of K. The user specifies the value of k. Suppose the k in a specific case is 2. So initially the goal will be to identify the two random points. We consider those 2 points as the centroid of those two clusters. The centroid is basically a point lying at the very center of geometric figures. But for us, it is the point central to all other points in clusters.
Scatter plot showing two distinct formed out of given dataset.
Formation of two clusters after the k-means


Now, as we have k=2 so the number of centroids and hence the clusters are two. Let's say, k is 'n' then there will be n random points acting as centers. The user can place These centroids anywhere at the start. Firstly, the task is to identify the distance between centroid and data points. Hence, if a point is closer to one of our centroid than the other, we say it belongs to that first centroid. As if some other data point is closer to the second centroid then that data point will belong to it. So, after getting this process does. The user will be left with two imperfect clusters. Both are based on our centroids.

Redefining the Clusters:


Next, the user tries to improve these clusters. The goal is to make'em better and more defined at every stage. The way we do this is by redefining the cluster. We try to adjust the centroid for both clusters. The user redefines centroids in a manner that they will lie exactly at the center of the cluster. Physically, we place them at the center of gravity of the respective cluster. After the redefinition, we again perform the same earlier process. The user measures distances of data points again and groups them accordingly. Thus we get a new cluster. Because of new centroids. We keep on repeating this process sequentially. Just we redefine centroid, recalculate the distance of the centroid from individual data point, and readjust the clusters.

This process will continue until no data point changes the cluster. After reaching this particular point. We did it! Therefore, even if the user tries to recompute, none of these data points going to reposition. They do not change their position with respect to the cluster. Now, you can say that Clustering is complete. And, the user obtains the final clusters.

k-means Clustering: Elbow Method To Find Optimal k


Here, the user needs to know the importance of K. The user is going to provide k to your algorithm. As the number of features increases, it becomes harder to visualize the scatterplot for that data. So, it becomes very important to supply the efficient number of clusters. And, the optimal k. To obtain the optimal value of k we use the Elbow method.

Elbow Method:


The technique to get the correct k is the Elbow method. At the start, this method proceeds with some value of k. In this method, the user tries to calculate the SSE. SSE is the Sum of Squared Errors. It means, for each cluster the user tries to compute the distance of individual data points from the centroid. After getting these distance values, the user makes the square of each and add together.
figure showing the line plot of  k values against SSE
Line plot to get the optimal value of k

 For the first cluster, the user computes SSE1 for the second SSE2and as per the user assumes k value, it computes up to SSEn. And after summing up all SSE for each cluster the user will get total SSE.

With that number, the user draws a plot with SSE at Y-axis and different k values at X-axis. After analyzing and observing the plot carefully, the user will get to know that, with increasing k, SSE reduces. And for some value of k, you will get SSE value nearest to 0. Thus error will keep on reducing with an increase in k. As per the name of the method, the elbow, the user needs to find the same shape in the line graph. That point will hold an optimal k value for the dataset.

At a glance


  • k-means is easy to understand.
  • It is easy to implement, clustering algorithm.
  • The k-means algorithm runs fast compared to other clustering algorithms.
  • It can be used parallel with decomposition methods like NMF and PCA.
  • It is easily scalable to handle very large datasets.
  • The k-means algorithm fails to identify clusters with complex shapes.

Powered by Blogger.