Let’s understand some of the factors that can impact the final clusters that you obtain from the K-means algorithm. This would also give you an idea about the issues that you must keep in mind before you start to make clusters to solve your business problem.
Thus, the major practical considerations involved in K-Means clustering are:
- The number of clusters that you want to divide your data points into, i.e. the value of K has to be pre-determined.
- The choice of the initial cluster centres can have an impact on the final cluster formation.
- The clustering process is very sensitive to the presence of outliers in the data.
- Since the distance metric used in the clustering process is the Euclidean distance, you need to bring all your attributes on the same scale. This can be achieved through standardisation.
- The K-Means algorithm does not work with categorical data.
- The process may not converge in the given number of iterations. You should always check for convergence.
You will understand some of these issues in detail and also see the ways to deal with them when you implement the K-means algorithm in Python.
Having understood the approach of choosing K for the K-Means algorithm, we will now look at silhouette analysis or silhouette coefficient. Silhouette coefficient is a measure of how similar a data point is to its own cluster (cohesion) compared to other clusters (separation).
So to compute silhouette metric, we need to compute two measures i.e. a(i) and b(i) where,
- a(i) is the average distance from its own cluster(Cohesion).
- b(i) is the average distance from the nearest neighbour cluster(Separation).
Now, let’s look at how to combine cohesion and separation to compute the silhouette metric.
You can read more about K-Mode clustering here, We will be covering it in detail in the next section.
K-Means algorithm
Arrange the steps of k-means algorithm in the order in which they occur:
- Randomly selecting the cluster centroids
- Updating the cluster centroids iteratively
- Assigning the cluster points to their nearest center
1-3-2✓ CorrectFeedback:First the cluster centers are pre-decided. Then all the points are assigned to their nearest cluster center and then the center is recalculated as the mean of all the points which fall in that cluster. Then the clustering is repeated with the new centers and the centers are updated according to the new cluster points.
Thanks for this detailed explaination.