Blog

Langages of Data Science

The languages of Data Science 

For anyone just getting started on their data science journey, the range of technical options can be overwhelming. There is a dizzying amount of choice when it comes to programming languages. Each has its own strengths and weaknesses and there is no one right answer to the question of which one you should learn first. The answer to that question depends largely on your needs, the problems you are trying to solve, and who you are solving them for. 

Python, R, and SQL are the languages that we recommend you consider first and foremost. But there are so many others that have their own strengths and features.  Scala, Java, C++, and Julia are some of the most popular. Javascript, PHP, Go, Ruby, and Visual Basic all have their own unique use cases as well. 

The language you choose to learn will depend on the things you need to accomplish and the problems you need to solve. It will also depend on what company you work for, what role you have, and the age of your existing application. We’ll explore the answers to this question as we dive into the popular languages in the data science industry.  There are many roles available for people who are interested in getting involved in data science. Business Analyst Database Engineer Data Analyst Data Engineer Data Scientist Research Scientist Software Engineer Statistician Product Manager Project Manager and many more.


Practical Consideration in K Means Algorithm

Let’s understand some of the factors that can impact the final clusters that you obtain from the K-means algorithm. This would also give you an idea about the issues that you must keep in mind before you start to make clusters to solve your business problem.

Thus, the major practical considerations involved in K-Means clustering are:

  • The number of clusters that you want to divide your data points into, i.e. the value of K has to be pre-determined.
  • The choice of the initial cluster centres can have an impact on the final cluster formation.
  • The clustering process is very sensitive to the presence of outliers in the data.
  • Since the distance metric used in the clustering process is the Euclidean distance, you need to bring all your attributes on the same scale. This can be achieved through standardisation.
  • The K-Means algorithm does not work with categorical data.
  • The process may not converge in the given number of iterations. You should always check for convergence.

You will understand some of these issues in detail and also see the ways to deal with them when you implement the K-means algorithm in Python.

Having understood the approach of choosing K for the K-Means algorithm, we will now look at silhouette analysis or silhouette coefficient. Silhouette coefficient is a measure of how similar a data point is to its own cluster (cohesion) compared to other clusters (separation).

So to compute silhouette metric, we need to compute two measures i.e. a(i) and b(i) where,

  • a(i) is the average distance from its own cluster(Cohesion).
  • b(i) is the average distance from the nearest neighbour cluster(Separation). 

Now, let’s look at how to combine cohesion and separation to compute the silhouette metric.

You can read more about K-Mode clustering here, We will be covering it in detail in the next section.

K-Means algorithm

Arrange the steps of k-means algorithm in the order in which they occur:

  1. Randomly selecting the cluster centroids
  2. Updating the cluster centroids iteratively
  3. Assigning the cluster points to their nearest center

1-3-2✓ CorrectFeedback:First the cluster centers are pre-decided. Then all the points are assigned to their nearest cluster center and then the center is recalculated as the mean of all the points which fall in that cluster. Then the clustering is repeated with the new centers and the centers are updated according to the new cluster points.


Steps of the Algorithm

Let’s go through the K-Means algorithm using a very simple example. Let’s consider a set of 10 points on a plane and try to group these points into, say, 2 clusters. So let’s see how the K-Means algorithm achieves this goal.

[Note: If you don’t know what is meant by Euclidean distance, you’re advised to go through this link]

Before moving ahead, think about the following problem. Let’s say you have the data of 10 students and their marks in Biology and Math (as shown in the plot below). You want to divide them into two clusters so that you can see what kind of students are there in the class.

The y-axis shows the marks in Biology, and the x-axis shows the marks in Math.

Imagine two clusters dividing this data — one red and the other yellow. How many points would each cluster have?

Fig 1: Random points to be divided into 2 clusters
Fig 1: Random points to be divided into 2 clusters

Centroid

The K-Means algorithm uses the concept of the centroid to create K clusters. Before you move ahead, it will be useful to recall the concept of the centroid.

In simple terms, a centroid of n points on an x-y plane is another point having its own x and y coordinates and is often referred to as the geometric centre of the n points.

For example, consider three points having coordinates (x1, y1), (x2, y2) and (x3, y3). The centroid of these three points is the average of the x and y coordinates of the three points, i.e.

(x1 + x2 + x3 / 3, y1 + y2 + y3 / 3).

Similarly, if you have n points, the formula (coordinates) of the centroid will be:

(x1+x2…..+xn / n, y1+y2…..+yn / n). 

So let’s see how the K-Means algorithm achieves this goal.

Each time the clusters are made, the centroid is updated. The updated centroid is the centre of all the points which fall in the cluster associated with the centroid. This process continues till the centroid no longer changes, i.e. the solution converges.

Thus, you can see that the K-means algorithm is a clustering algorithm that takes N data points and groups them into K clusters. In this example, we had N =10 points and we used the K-means algorithm to group these 10 points into K = 2 clusters.

Fig 2: Final cluster
Fig 2: Final cluster

Download the Excel file below. It is designed to give you the hands-on practice of the k-means clustering algorithm. The file contains a set of 10 points (with x and y coordinates in column A and B respectively) and two initial centres 1 and 2 (in columns F and G). Answer the questions below based on the Excel file.

K Means Algorithm

In the previous segment, we learned about K-means clustering and how the algorithm works using a simple example. We learned about how assignment and optimization work in K Means clustering, Now in this lecture, we will look at K-means more algorithmically. We will be learning how the K Means algorithm proceeds with the assignment step and then with the optimization step and will also be looking at the cost of function for the K-means algorithm.

Let’s understand the K-means algorithm in more detail.

From the previous lecture, we understood that the algorithm’s inner-loop iterates over two steps:

  1. Assign each observation Xi to the closest cluster centroid μk
  2. Update each centroid to the mean of the points assigned to it.

In the next lecture, we will learn about the Kmeans cost function and will also see how to compute the cost function for each iteration in the K-means algorithm.

So the cost function for the K-Means algorithm is given as: 

J=∑ni=1||Xi−μk(i)||2=∑Kk=1∑iϵCk||Xi−μk||2

Now in the next video, we will learn what exactly happens in the assignment step? and we will also look at how to assign each data point to a cluster using the K-Means algorithm assignment step.

[Note: At 1:43 where the Prof explains the optimization step, the values in the column –  μ1 and μ2 should be X1 and X2 ]

In the assignment step, we assign every data point to K clusters. The algorithm goes through each of the data points and depending on which cluster is closer, in our case, whether the green cluster centroid or the blue cluster centroid; It assigns the data points to one of the 2 cluster centroids.

The equation for the assignment step is as follows:

Zi=argmin||Xi−μk||2

Now having assigned each data point to a cluster, now we need to recompute the cluster centroids. In the next lecture, Prof.Dinesh will explain how to recompute the cluster centroids or the mean of each cluster.

In the optimization step, the algorithm calculates the average of all the points in a cluster and moves the centroid to that average location.

The equation for optimization is as follows:

μk=1nk∑i:zi=kXi

The process of assignment and optimization is repeated until there is no change in the clusters or possibly until the algorithm converges.

[Note – The definition of Silhouette score contains an error in the link shared above]

In the next segment, we will learn how to look K-Means algorithm as a coordinate descent problem. We will also learn about the constraint of the K-Means cost function and how to achieve global minima.

K Means++ Algorithm

We looked in the previous segment that for K-Means optimisation problem, the algorithm iterates between two steps and tries to minimise the objective function given as,

Zi=argmin||Xi−μk||2

To choose the cluster centres smartly, we will learn about K-Mean++ algorithm. K-means++ is just an initialisation procedure for K-means. In K-means++ you pick the initial centroids using an algorithm that tries to initialise centroids that are far apart from each other.

To summarise, In K-Means++ algorithm,

  1. We choose one centre as one of the data points at random.
  2. For each data point Xi, We compute the distance between Xi and the nearest centre that had already been chosen.
  3. Now, we choose the next cluster centre using the weighted probability distribution where a point X is chosen with probability proportional to d(X)2 .
  4. Repeat Steps 2 and 3 until K centres have been chosen.

Visualising the K Means Algorithm

Let’s see the K-Means algorithm in action using a visualization tool. This tool can be found on naftaliharris.com. You can go to this link after watching the video below and play around with the different options available to get an intuitive feel of the K-Means algorithm.

Suppose you have implemented k-means and to check that it is running correctly, you plot the cost function J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_k)J(c(1),…,c(m),μ1​,…,μk​) as a function of the number of iterations. Your plot looks like this:

The cost function is bumpy but trends downward

What does this mean?

The learning rate is too large.

The algorithm is working correctly.

The algorithm is working, but kk is too large.

It is not possible for the cost function to sometimes increase. There must be a bug in the code.

Correct


Centroid

The next concept that is crucial for understanding how clustering generally works is the idea of centroids. If you remember your high school geometry, centroids are essentially the centre points of triangles. Similarly, in the case of clustering, centroids are the center points of the clusters that are being formed.

 

Now before going to the formula part, here is an intuition for the need of a centroid. Imagine you have the following clusters of the marks of a group of students in Mathematics and Biology and someone asks you to explain them. From a glance, you can easily interpret the 4 clusters that are being formed.

 

 

 

 

So the four clusters that are being formed are as follows:

 

Cluster 1: Students who have scored high marks in Bio, but poor marks in Maths
Cluster 2: Students who have scored average  marks in Bio and  Maths
Cluster 3: Students who have scored high marks in both Bio and Maths
Cluster 4: Students who have scored high marks in Maths, but poor marks in Bio

 

Now the above representation is fine and correct, but it is missing one crucial information – the numerical order. For example, when you want to compare two clusters say Cluster 1 and Cluster 2 can you say by how much marks on average do the students from Cluster 1 outperform or underperform the Cluster 2 students in a particular subject just by taking a look at the above visualisation alone? Is it by 10 marks? Or 15?

 

This is where the concept of Centroids comes in handy. Listen to the following lecture to understand its importance and how it is calculated.

 

Therefore, as mentioned in the video, the Centroids are essentially the cluster centers of a group of observations that help us in summarising the cluster’s properties. Thus as you saw in the video, the centroid value in the case of clustering is essentially the mean of all the observations that belong to a particular cluster. For example, in the dataset that you saw here,

 

 

 

The centroid is calculated by computing the mean of each and every column/dimension that you have and then ordering them in the same way as above.

Therefore, Height-mean = ((175+165+183+172)/)/4  = 173.75
Weight-mean = ((83+74+98+80))/4 = 83.75
Age – mean = ((22+25+24+24))/4 =23.75

Thus the centroid of the above group of observations is (173.75, 83.75 and 23.75)


K-Means clustering

Euclidean Distance

In the previous segments, you got an idea about how clustering works – it groups the objects on the basis of their similarity or closeness to each other.

 

Now, the next important thing is to get into the nitty-gritty of how clustering algorithms generally work. You will learn about the 2 types of clustering methods – K-means and Hierarchical and how they go about doing the clustering process.

 

We have learnt that clustering works on the basis of grouping the observations which are the most similar to each other. What does this exactly mean?

 

In simple terms, the algorithm needs to find data points whose values are similar to each other and therefore these points would then belong to the same cluster. The method in which any clustering algorithm goes about doing that is through the method of finding something called a “distance measure”. The distance measure that is used in K-means clustering is called the Euclidean Distance measure. Let’s look at the following lecture to understand how this value is calculated.

The Euclidean Distance between the 2 points is measured as follows: If there are 2 points X and Y having n dimensions

 

 

Then the Euclidean Distance D is given as

 

 

The idea of distance measure is quite intuitive. Essentially, the observations which are closer or more similar to each other would have a low Euclidean distance and the observations which are farther or less similar to each other would have a higher Euclidean distance. So can you now guess how the Clustering process would work based on the Euclidean distance?


Copyright © 2022 @rajeevranjansinha.com