Blog

Data Ecosystem

The various elements that interact with one another in order to produce, manage, store, organize, analyze and share data

To put it simply, an ecosystem is a group of elements that interact with one another. Ecosystems can be large, like the jungle in a tropical rainforest or the Australian outback.

Or tiny, like tadpoles in a puddle or bacteria on your skin. And just like the kangaroos and koala bears in the Australian outback, data lives inside its own ecosystem too.

Data ecosystems are made up of various elements that interact with one another in order to produce, manage, store, organize, analyze, and share data.

These elements include hardware and software tools, and the people who use them.

Data can also be found in something called the cloud. The cloud is a place to keep data online, rather than on a computer hard drive.

So instead of storing data somewhere inside your organization’s network, that data is accessed over the internet.

Exciting… right? More info at https://medium.com/@raj.ranjan.sinha/data-ecosystem-b1790b0cfb5


Data Lifecycle

The data life cycle is a framework that outlines the stages that data goes through from its initial creation or capture to its eventual deletion or archival. Here are the typical steps in the data life cycle:

1. Data Generation/Capture:

  • This is the initial stage where data is created or captured. It can come from various sources such as sensors, user input, transactions, or any other form of data generation.

2. Data Ingestion:

  • Once data is generated, it needs to be collected and stored in a structured manner. This might involve processes like extract, transform, and load (ETL) to prepare it for further processing.

3. Data Storage:

  • After ingestion, data needs to be stored in a secure and accessible location. This could be in a database, data warehouse, or other types of storage systems.

4. Data Processing:

  • This step involves manipulating, cleaning, and transforming the raw data into a format that is suitable for analysis. It may include tasks like data normalization, aggregation, and filtering.

5. Data Analysis:

  • Once the data is prepared, it can be analysed to extract meaningful insights. This is where various statistical and machine learning techniques are applied to uncover patterns, trends, or relationships within the data.

6. Data Visualization and Reporting:

  • The results of the analysis are often communicated through visualizations or reports. This step helps in presenting the insights in a format that is easy to understand and interpret.

7. Data Archiving/Retirement:

  • Over time, certain data may become less relevant for current analyses but may still need to be retained for legal or compliance reasons. Archiving involves moving data to a long-term storage solution.

8. Data Deletion/Disposition:

  • Eventually, there may come a point where data is no longer needed and can be safely deleted. This step is crucial to ensure that unnecessary data is not taking up resources and to maintain compliance with data protection regulations.

It’s worth noting that some variations of the data life cycle might include additional steps or break down these steps further depending on the specific needs and requirements of an organisation or project. Additionally, with the advent of big data technologies and advanced analytics, the processes involved in each step may become more complex and sophisticated.


Languages of Data Science

The languages of Data Science 

For anyone just getting started on their data science journey, the range of technical options can be overwhelming. There is a dizzying amount of choice when it comes to programming languages. Each has its own strengths and weaknesses and there is no one right answer to the question of which one you should learn first. The answer to that question depends largely on your needs, the problems you are trying to solve, and who you are solving them for. 

Python, R, and SQL are the languages that we recommend you consider first and foremost. But there are many others that have their own strengths and features. Scala, Java, C++, and Julia are some of the most popular. JavaScript, PHP, Go, Ruby, and Visual Basic all have their own unique use cases as well.

The language you choose to learn will depend on the things you need to accomplish and the problems you need to solve. It will also depend on what company you work for, what role you have, and the age of your existing application. We’ll explore the answers to this question as we dive into the popular languages in the data science industry.

There are many roles available for people who are interested in getting involved in data science: Business Analyst, Database Engineer, Data Analyst, Data Engineer, Data Scientist, Research Scientist, Software Engineer, Statistician, Product Manager, Project Manager, and many more.


Practical Considerations in the K-Means Algorithm

Let’s understand some of the factors that can impact the final clusters that you obtain from the K-means algorithm. This would also give you an idea about the issues that you must keep in mind before you start to make clusters to solve your business problem.

Thus, the major practical considerations involved in K-Means clustering are:

  • The number of clusters that you want to divide your data points into, i.e. the value of K has to be pre-determined.
  • The choice of the initial cluster centres can have an impact on the final cluster formation.
  • The clustering process is very sensitive to the presence of outliers in the data.
  • Since the distance metric used in the clustering process is the Euclidean distance, you need to bring all your attributes onto the same scale. This can be achieved through standardisation.
  • The K-Means algorithm does not work directly with categorical data, because means and Euclidean distances are not defined for categories.
  • The process may not converge in the given number of iterations. You should always check for convergence.
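To make the scaling point concrete, here is a minimal sketch of standardisation (z-scores) in plain Python. The `standardise` helper and the income and age values are made up for illustration; in practice you would more likely use a library scaler.

```python
from statistics import mean, pstdev


def standardise(column):
    """Rescale a list of numbers to zero mean and unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


# Hypothetical attributes on very different scales.
income = [30000, 45000, 60000, 75000]
age = [25, 35, 45, 55]

income_scaled = standardise(income)
age_scaled = standardise(age)
```

Before scaling, a Euclidean distance between two customers would be dominated almost entirely by income. After scaling, both attributes contribute on comparable terms.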

You will understand some of these issues in detail and also see the ways to deal with them when you implement the K-means algorithm in Python.

Having understood the approach of choosing K for the K-Means algorithm, we will now look at silhouette analysis, also known as the silhouette coefficient. The silhouette coefficient is a measure of how similar a data point is to its own cluster (cohesion) compared to other clusters (separation).

So to compute the silhouette metric for a data point i, we need to compute two measures, a(i) and b(i), where:

  • a(i) is the average distance from point i to the other points in its own cluster (cohesion).
  • b(i) is the average distance from point i to the points in the nearest neighbouring cluster (separation).

Now, let’s look at how to combine cohesion and separation to compute the silhouette metric.
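The standard way to combine the two is s(i) = (b(i) − a(i)) / max(a(i), b(i)), which lies between −1 and 1; values close to 1 mean the point sits well inside its own cluster. Below is a small sketch of that formula for a single point. The function name, its arguments, and the sample coordinates are illustrative assumptions, and returning 0 for a point that is alone in its cluster is a common convention rather than something stated above.

```python
import math


def silhouette(point, own_cluster, other_clusters):
    """Silhouette coefficient of one point.

    own_cluster: the other points in the same cluster (excluding `point`).
    other_clusters: a list of clusters, each a non-empty list of points.
    """
    if not own_cluster:
        # Convention (assumed here): a singleton cluster scores 0.
        return 0.0

    def average_distance(p, cluster):
        return sum(math.dist(p, q) for q in cluster) / len(cluster)

    a = average_distance(point, own_cluster)                         # cohesion
    b = min(average_distance(point, c) for c in other_clusters)      # separation
    return (b - a) / max(a, b)


# Hypothetical example: a tight cluster near the origin, another far away.
score = silhouette((0, 0), [(0, 1), (1, 0)], [[(10, 0), (0, 10)]])
```

Here a = 1 and b = 10, so the score works out to 0.9, which indicates a well-separated point.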

You can read more about K-Modes clustering here; we will be covering it in detail in the next section.

K-Means algorithm

Arrange the steps of k-means algorithm in the order in which they occur:

  1. Randomly selecting the cluster centroids
  2. Updating the cluster centroids iteratively
  3. Assigning the cluster points to their nearest center

Answer: 1-3-2 (correct). Feedback: First, the initial cluster centers are selected. Then all the points are assigned to their nearest cluster center, and each center is recalculated as the mean of all the points which fall in that cluster. The assignment is then repeated with the new centers, and the centers are updated again according to the new cluster points.


Steps of the Algorithm

Let’s go through the K-Means algorithm using a very simple example. Let’s consider a set of 10 points on a plane and try to group these points into, say, 2 clusters. So let’s see how the K-Means algorithm achieves this goal.

[Note: If you don’t know what is meant by Euclidean distance, you’re advised to go through this link]

Before moving ahead, think about the following problem. Let’s say you have the data of 10 students and their marks in Biology and Math (as shown in the plot below). You want to divide them into two clusters so that you can see what kind of students are there in the class.

The y-axis shows the marks in Biology, and the x-axis shows the marks in Math.

Imagine two clusters dividing this data — one red and the other yellow. How many points would each cluster have?

Fig 1: Random points to be divided into 2 clusters

Centroid

The K-Means algorithm uses the concept of the centroid to create K clusters. Before you move ahead, it will be useful to recall the concept of the centroid.

In simple terms, a centroid of n points on an x-y plane is another point having its own x and y coordinates and is often referred to as the geometric centre of the n points.

For example, consider three points having coordinates (x1, y1), (x2, y2) and (x3, y3). The centroid of these three points is the average of the x and y coordinates of the three points, i.e.

((x1 + x2 + x3) / 3, (y1 + y2 + y3) / 3).

Similarly, if you have n points, the formula (coordinates) of the centroid will be:

((x1 + x2 + … + xn) / n, (y1 + y2 + … + yn) / n).
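The centroid formula is short enough to write out directly. This is a minimal sketch for points on an x-y plane; the function name and the three sample points are just for illustration.

```python
def centroid(points):
    """Return the centroid (mean x, mean y) of a non-empty list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return (mean_x, mean_y)


# Three hypothetical points.
centre = centroid([(1, 2), (3, 4), (5, 9)])  # (3.0, 5.0)
```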

So let’s see how the K-Means algorithm achieves this goal.

Each time the clusters are made, the centroid is updated. The updated centroid is the centre of all the points which fall in the cluster associated with the centroid. This process continues till the centroid no longer changes, i.e. the solution converges.

Thus, you can see that the K-means algorithm is a clustering algorithm that takes N data points and groups them into K clusters. In this example, we had N = 10 points and we used the K-means algorithm to group these 10 points into K = 2 clusters.

Fig 2: Final cluster

Download the Excel file below. It is designed to give you hands-on practice with the k-means clustering algorithm. The file contains a set of 10 points (with x and y coordinates in columns A and B respectively) and two initial centres 1 and 2 (in columns F and G). Answer the questions below based on the Excel file.

K Means Algorithm

In the previous segment, we learned about K-means clustering and how the algorithm works using a simple example. We learned how assignment and optimization work in K-means clustering. In this lecture, we will look at K-means more algorithmically: we will learn how the algorithm proceeds with the assignment step and then with the optimization step, and we will also look at the cost function for the K-means algorithm.

Let’s understand the K-means algorithm in more detail.

From the previous lecture, we understood that the algorithm’s inner-loop iterates over two steps:

  1. Assign each observation Xi to the closest cluster centroid μk
  2. Update each centroid to the mean of the points assigned to it.

In the next lecture, we will learn about the K-means cost function and will also see how to compute the cost function for each iteration in the K-means algorithm.

So the cost function for the K-Means algorithm is given as: 

$$J = \sum_{i=1}^{n} \lVert X_i - \mu_{k(i)} \rVert^2 = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert X_i - \mu_k \rVert^2$$
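In words, the cost J is the sum of squared distances from every point to the centroid of the cluster it is assigned to. A minimal sketch of that calculation follows, with points and centroids represented as tuples and assignments as a list of cluster indices; these representations, the function name, and the sample data are assumptions made for the example.

```python
def kmeans_cost(points, centroids, assignments):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0
    for point, k in zip(points, assignments):
        total += sum((p - m) ** 2 for p, m in zip(point, centroids[k]))
    return total


# Hypothetical data: two points near centroid 0, one point exactly on centroid 1.
points = [(0, 0), (2, 0), (10, 0)]
centroids = [(1, 0), (10, 0)]
assignments = [0, 0, 1]

cost = kmeans_cost(points, centroids, assignments)  # 1 + 1 + 0 = 2
```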

In the next video, we will learn what exactly happens in the assignment step, and we will also look at how each data point is assigned to a cluster in this step of the K-Means algorithm.

[Note: At 1:43 where the Prof explains the optimization step, the values in the column –  μ1 and μ2 should be X1 and X2 ]

In the assignment step, we assign every data point to one of the K clusters. The algorithm goes through each of the data points and, depending on which cluster centroid is closer (in our case, the green or the blue cluster centroid), assigns the data point to one of the 2 clusters.

The equation for the assignment step is as follows:

$$Z_i = \arg\min_{k} \lVert X_i - \mu_k \rVert^2$$
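The assignment equation says: for each point, pick the index k of the centroid with the smallest squared distance. A small sketch of that step is shown here. The helper names and the sample coordinates are illustrative, and breaking ties in favour of the lowest cluster index is an assumption of this sketch.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(point, centroids[k]))
        for point in points
    ]


# Hypothetical data with two obvious groups.
labels = assign([(0, 0), (1, 1), (9, 9), (10, 10)], [(0, 0), (10, 10)])
```

Using the squared distance instead of the distance itself does not change which centroid is closest, and it avoids taking a square root for every comparison.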

Having assigned each data point to a cluster, we now need to recompute the cluster centroids. In the next lecture, Prof. Dinesh will explain how to recompute the cluster centroids, i.e. the mean of each cluster.

In the optimization step, the algorithm calculates the average of all the points in a cluster and moves the centroid to that average location.

The equation for optimization is as follows:

$$\mu_k = \frac{1}{n_k} \sum_{i : Z_i = k} X_i$$
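Here n_k is the number of points currently assigned to cluster k. The update can be sketched as below. The formula is undefined when a cluster has no points, so this sketch simply keeps the previous centroid in that case, which is one possible choice and not something prescribed above. The function name and data are illustrative.

```python
def update_centroids(points, assignments, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for k, old_centroid in enumerate(centroids):
        members = [p for p, z in zip(points, assignments) if z == k]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            updated.append(old_centroid)
            continue
        # zip(*members) groups the coordinates dimension by dimension.
        updated.append(tuple(sum(coords) / len(members) for coords in zip(*members)))
    return updated


# Hypothetical data: two points in cluster 0, two in cluster 1.
new_centroids = update_centroids(
    [(0, 0), (2, 0), (10, 0), (10, 4)],
    [0, 0, 1, 1],
    [(5, 5), (6, 6)],
)
```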

The process of assignment and optimization is repeated until there is no change in the clusters, i.e. until the algorithm converges.
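Putting the two steps together gives the full loop. The sketch below is one simple way to write it, not a reference implementation: it stops when the assignments no longer change, caps the number of iterations with a hypothetical `max_iters` parameter, and records the cost after every iteration. The 10 points and the two starting centroids are invented for the example.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iters=100):
    """Alternate assignment and optimization until the assignments stop changing."""
    centroids = list(initial_centroids)
    assignments = None
    cost_history = []
    converged = False
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            converged = True
            break
        assignments = new_assignments
        # Optimization step: each centroid moves to the mean of its points.
        for k in range(len(centroids)):
            members = [p for p, z in zip(points, assignments) if z == k]
            if members:  # an empty cluster keeps its old centroid (assumption)
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
        cost_history.append(
            sum(squared_distance(p, centroids[z]) for p, z in zip(points, assignments))
        )
    return centroids, assignments, cost_history, converged


# Ten hypothetical points forming two groups, with deliberately poor starting centroids.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),
          (8, 8), (8, 9), (9, 8), (9, 9), (8.5, 8.5)]
final_centroids, labels, cost_history, converged = kmeans(points, [(1, 1), (2, 2)])
```

Returning the `converged` flag makes it easy to check whether the loop stopped because the clusters settled or because it ran out of iterations. The recorded costs should never increase from one iteration to the next, since both steps can only lower the cost or leave it unchanged.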

[Note – The definition of Silhouette score contains an error in the link shared above]

In the next segment, we will learn how to look at the K-Means algorithm as a coordinate descent problem. We will also learn about the constraints of the K-Means cost function and how to achieve the global minimum.

K Means++ Algorithm

We saw in the previous segment that, for the K-Means optimisation problem, the algorithm iterates between two steps and tries to minimise the cost function J. The assignment step is given as,

$$Z_i = \arg\min_{k} \lVert X_i - \mu_k \rVert^2$$

To choose the cluster centres smartly, we will learn about the K-Means++ algorithm. K-Means++ is just an initialisation procedure for K-Means. In K-Means++, you pick the initial centroids using an algorithm that tries to place them far apart from each other.

To summarise, in the K-Means++ algorithm:

  1. We choose one centre as one of the data points at random.
  2. For each data point Xi, we compute the distance d(Xi) between Xi and the nearest centre that has already been chosen.
  3. We choose the next cluster centre using a weighted probability distribution, where a point Xi is chosen with probability proportional to d(Xi)^2.
  4. We repeat Steps 2 and 3 until K centres have been chosen.
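The four steps above can be sketched with the standard library's weighted sampling. The function name, the optional `rng` argument (added so that runs can be made repeatable), and the sample points are assumptions for illustration; the sketch also assumes the data contains at least K distinct points.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centres from `points` using the K-Means++ procedure."""
    rng = rng or random.Random()
    # Step 1: the first centre is a data point chosen uniformly at random.
    centres = [rng.choice(points)]
    # Step 4: repeat steps 2 and 3 until k centres have been chosen.
    while len(centres) < k:
        # Step 2: squared distance from each point to its nearest chosen centre.
        weights = [min(squared_distance(p, c) for c in centres) for p in points]
        # Step 3: sample the next centre with probability proportional to that value.
        centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres


# Hypothetical data; the seed only makes the example repeatable.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
centres = kmeans_pp_init(points, 3, rng=random.Random(0))
```

A point that is already a centre has weight zero, so it cannot be picked again, and points far from every existing centre are the most likely to be chosen next.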

Visualising the K Means Algorithm

Let’s see the K-Means algorithm in action using a visualization tool. This tool can be found on naftaliharris.com. You can go to this link after watching the video below and play around with the different options available to get an intuitive feel of the K-Means algorithm.

Suppose you have implemented k-means and to check that it is running correctly, you plot the cost function $J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_k)$ as a function of the number of iterations. Your plot looks like this:

[Plot: the cost function is bumpy but trends downward]

What does this mean?

  • The learning rate is too large.
  • The algorithm is working correctly.
  • The algorithm is working, but k is too large.
  • It is not possible for the cost function to sometimes increase. There must be a bug in the code. (Correct)


Copyright © 2023 @rajeevranjansinha.com