Learning Data Science: Day 23 - Unsupervised Learning, Clustering, and k-Means Algorithm

In most of our previous stories, we have talked about supervised learning. Today, we are going to talk about its counterpart, unsupervised learning.

Unsupervised Learning

In supervised learning, we usually have two datasets. The first is a pool of the available information, and the second is the list of classes. Here is an example. Let’s say we have several news articles; these make up the information dataset. For the classification, we would also have the category of each article, such as news, sports, or business. These categories are what we call the classes.

In unsupervised learning, we don’t have the list of classes. Instead, we only have the information dataset. The idea of unsupervised learning is to find patterns in unlabeled data. A common way to do that is to group similar data points together, which is called clustering. Unsupervised learning is usually used when labeled data is hard to find or just too expensive to produce, because it removes the need for labels.

The provided image is one example of unsupervised learning. The learning method doesn’t have any target classes, so it identifies the groups based on the patterns in the data. One of the advantages of unsupervised learning is that we can often find new classes that we didn’t know about before.

Example

Google image search is one example. If we search for the term “data”, Google image search will give several suggestions for similar topics. Those suggestions can seem kind of random, which is because they come from unsupervised learning.

Example of Unsupervised Learning application

K-means algorithm

One of the examples of unsupervised learning is the k-means algorithm. Let’s check how K-means works.

Initialization

First, we need to pick k random data points. We choose the value of k ourselves. Those data points will then act as the initial centers of the clusters. Here is an example of the initialization of k-means. In this case, k is equal to 2.

Notice the blue and red “X” marks. That’s where we first initialize the centers of our clusters. At this point, the cluster centers are not meaningful yet, because they are not in the right positions. Naturally, we want the centers of the clusters to be somewhere around (-1, -1) and (1, 1).
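Here is a minimal sketch of the initialization step in plain Python. The function name `init_centers`, the `seed` parameter, and the sample points are made up for illustration; they are not from any library.

```python
import random

def init_centers(points, k, seed=0):
    """Pick k distinct data points at random to act as the initial cluster centers."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return rng.sample(points, k)

# Hypothetical toy data: one group near (-1, -1) and one near (1, 1).
points = [(-1.2, -0.8), (-0.9, -1.1), (-1.0, -1.3),
          (1.1, 0.9), (0.8, 1.2), (1.3, 1.0)]

centers = init_centers(points, k=2)
print(centers)
```

Because the centers are chosen at random, both of them can land in the same group at first. That is fine, since the update step is what moves them to better positions.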

Update

First, based on those k initial points, we identify which center each data point is closest to. This gives us something like the boundary line in the middle picture. After that, we move each of the k points to the mean of the data points assigned to it, like in the picture on the right.

Then we repeat the process until the k points stop moving and sit at the centers of their clusters (convergence), just like in the picture below.
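The assign-then-move loop described above can be sketched in plain Python as follows. This is only an illustration under a few assumptions: the helper names `assign`, `update`, and `kmeans` are my own, distance is Euclidean, and "convergence" here simply means the centers no longer change between two iterations.

```python
import math
import random

def assign(points, centers):
    """For each point, return the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its old center
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers

def kmeans(points, k, max_iter=100, seed=0):
    """Repeat assignment and update until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialization: k random data points
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:  # convergence
            break
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical toy data: one group near (-1, -1) and one near (1, 1).
points = [(-1.2, -0.8), (-0.9, -1.1), (-1.0, -1.3),
          (1.1, 0.9), (0.8, 1.2), (1.3, 1.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Keep in mind that the result depends on the random starting points, so in practice it is common to run the algorithm several times and keep the best run.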

K-means is very sensitive to outliers. In the update step, we take the mean of a cluster’s points as the next center of that cluster. If there are outliers, the mean gets pulled toward them and will not lead to an appropriate center for the cluster. One solution to this problem is to use the median instead of the mean.

In k-means, choosing the number of clusters (k) is important. So, how do we choose the right k?
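Before we get to choosing k, here is a quick sketch of the outlier issue described above. The cluster below is made-up data with one obvious outlier, and the helper `center` is a hypothetical name. It compares the coordinate-wise mean with the coordinate-wise median (the variant that uses medians is usually called k-medians).

```python
from statistics import mean, median

# Hypothetical cluster: four points near (1, 1) plus one outlier at (10, 10).
cluster = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (10.0, 10.0)]

def center(points, reducer):
    """Apply the reducer (mean or median) to each coordinate separately."""
    return tuple(reducer(coord) for coord in zip(*points))

mean_center = center(cluster, mean)      # pulled toward the outlier
median_center = center(cluster, median)  # stays with the bulk of the points

print("mean center:  ", mean_center)
print("median center:", median_center)
```

The mean center ends up far away from every point in the cluster, while the median center stays close to where most of the points are.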

Knee or Elbow Method

One of the methods for choosing k is the Knee or Elbow method. The important thing to note is that we only increase k when doing so gives a big improvement in how well the model fits the data. This method needs a bit of intuition. To give a better picture, here is an illustration.

The scattered data points in the top right corner are all of the available data points. The horizontal axis represents the number of clusters, while the vertical axis represents the within-group sum of squares. We can see that the difference between k=1 and k=2 is significant. Increasing from k=2 to k=3 is still significant. What about increasing k further? It no longer seems as significant. Since increasing to k=4 does not help much, we can just stop at k=3.
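To see where such a curve comes from, here is a rough sketch that computes the within-group sum of squares for several values of k. Everything in it is an assumption for illustration: the toy data has three made-up groups, the function names `kmeans_wcss` and `elbow_curve` are hypothetical, and each k is fitted with several random restarts so that one unlucky initialization does not distort the curve.

```python
import math
import random

def kmeans_wcss(points, k, rng, max_iter=100):
    """Run one k-means fit and return its within-group sum of squares."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: put each point in the group of its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)

def elbow_curve(points, max_k, restarts=25, seed=0):
    """For each k from 1 to max_k, keep the best (lowest) score over the restarts."""
    rng = random.Random(seed)
    return {k: min(kmeans_wcss(points, k, rng) for _ in range(restarts))
            for k in range(1, max_k + 1)}

# Hypothetical toy data with three visible groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (10, 0), (10, 1), (11, 0), (11, 1)]

curve = elbow_curve(points, max_k=5)
for k, score in curve.items():
    print(k, round(score, 2))
```

On data like this, the score should drop sharply up to k=3 and only slightly after that, which is the "elbow" we are looking for.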

Cross Validation

To check how well the clustering solution applies to unseen data, we can use cross-validation. Because cross-validation is already explained in other stories, I’m not going to explain how to do it in this story.

Wrap Up

Today we have talked about the introduction to unsupervised learning and clustering. We also talked about one of its famous algorithms, k-means. Tomorrow, we will explore more about the k-means algorithm and unsupervised learning. Thank you for reading.
