What is Unsupervised Learning?
In unsupervised machine learning, there is no supervisor to provide any sort of guidance. Unsupervised learning algorithms are handy in scenarios where we do not have the luxury, as in supervised learning, of pre-labeled training data, and we still want to extract useful patterns from the input data.
Examples of unsupervised machine learning algorithms include K-means clustering, hierarchical clustering, and association rule mining.
In regression, we train the machine to predict a future value. In classification, we train the machine to classify an unknown object into one of the categories defined by us. In short, we have been training machines so that they can predict Y for our data X. Given a huge data set whose categories are not known in advance, it would be difficult to train the machine using supervised learning. What if the machine could look at and analyze big data running into several gigabytes and terabytes and tell us how many distinct categories the data contains?
As an example, consider voter data. By considering some inputs from each voter (these are called features in AI terminology), we can let the machine predict how many voters would vote for political party X, how many for party Y, and so on. Thus, in general, we are asking the machine, given a huge set of data points X, "What can you tell me about X?" Or it may be a question like "What are the five best groups we can make out of X?" Or it could even be "Which three features occur together most frequently in X?"
This is exactly what unsupervised learning is all about.
Algorithms for Unsupervised Learning
Let us now discuss one of the widely used clustering algorithms in unsupervised machine learning.
k-means clustering
The 2000 and 2004 Presidential elections in the United States were close — very close. The largest percentage of the popular vote that any candidate received was 50.7% and the lowest was 47.9%. If a small percentage of voters had switched sides, the outcome of the election would have been different. There are small groups of voters who, when properly appealed to, will switch sides. These groups may not be huge, but with such close races, they may be big enough to change the outcome of the election. How do you find these groups of people? How do you appeal to them with a limited budget? The answer is clustering.
Let us understand how it is done.
- First, you collect information on people, either with or without their consent: any sort of information that might give some clue about what is important to them and what will influence how they vote.
- Then you put this information into some sort of clustering algorithm.
- Next, for each cluster (it would be smart to choose the largest one first) you craft a message that will appeal to these voters.
- Finally, you deliver the campaign and measure to see if it's working.
Clustering is a type of unsupervised learning that automatically forms clusters of similar things. It is like automatic classification. You can cluster almost anything, and the more similar the items are in the cluster, the better the clusters are. In this chapter, we are going to study one type of clustering algorithm called k-means. It is called k-means because it finds ‘k’ unique clusters, and the center of each cluster is the mean of the values in that cluster.
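To make the "center of each cluster is the mean" idea concrete, here is a minimal, illustrative sketch of k-means in Python using NumPy. The function name simple_kmeans and the toy data are made up for illustration, and the sketch skips practical details such as handling empty clusters or trying multiple random initializations.

```python
import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: assign points to the nearest center,
    then move each center to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest center (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each center as the mean of the points assigned to it
        # (note: an empty cluster is not handled in this sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so we have converged
        centers = new_centers
    return labels, centers

# Example: two obvious groups of 2-D points
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.1], [7.9, 8.3], [8.2, 7.9]])
labels, centers = simple_kmeans(X, k=2)
print(labels)   # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
print(centers)  # the mean of the points in each cluster
```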
Cluster Identification
Cluster identification tells an algorithm, "Here's some data. Now group similar things together and tell me about those groups." The key difference from classification is that in classification you know what you are looking for, while in clustering you do not.
Clustering is sometimes called unsupervised classification because it produces the same result as classification does but without having predefined classes.
Based on the ML tasks, unsupervised learning algorithms can be divided into the following broad classes: Clustering, Association, Dimensionality Reduction, and Anomaly Detection.
Clustering
Clustering methods are among the most useful unsupervised ML methods. These algorithms are used to find similarity and relationship patterns among data samples and then cluster those samples into groups that are similar based on their features. A real-world example of clustering is grouping customers by their purchasing behavior, as sketched below.
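As a small illustration of the customer-grouping example, the following sketch assumes scikit-learn is available; the feature names and values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical purchasing-behavior features per customer:
# [yearly_spend, visits_per_month, avg_basket_size]
customers = np.array([
    [200,  1,  2],
    [250,  2,  3],
    [1800, 8, 12],
    [1700, 9, 11],
    [900,  4,  6],
    [950,  5,  7],
])

# Group the customers into 3 clusters based on these features
model = KMeans(n_clusters=3, n_init=10, random_state=42)
segments = model.fit_predict(customers)

print(segments)                # cluster id for each customer
print(model.cluster_centers_)  # average behavior of each segment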
Association
Another useful unsupervised ML method is association, which is used to analyze large datasets to find patterns that represent interesting relationships between various items. It is also termed association rule mining or market basket analysis, and it is mainly used to analyze customer shopping patterns.
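Full association rule mining (e.g. the Apriori algorithm) involves more machinery, but as a minimal sketch of the market-basket idea, the following code counts how often pairs of items are bought together and reports their support. The transactions are hypothetical.

```python
from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions (one set of items per customer visit)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]

# Count how often each pair of items appears in the same transaction
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# "Support" of a pair = fraction of transactions containing both items
for pair, count in pair_counts.most_common(3):
    support = count / len(transactions)
    print(pair, f"support={support:.2f}")
```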
Dimensionality Reduction
As the name suggests, this unsupervised ML method is used to reduce the number of feature variables for each data sample by selecting a set of principal or representative features.
A question that arises here is: why do we need to reduce the dimensionality? The reason is the problem of feature space complexity, which arises when we start analyzing and extracting millions of features from data samples. This problem is generally referred to as the "curse of dimensionality". PCA (Principal Component Analysis), singular value decomposition, and discriminant analysis are some of the popular algorithms for this purpose.
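As a short sketch of dimensionality reduction with PCA, assuming scikit-learn is available, the code below projects hypothetical four-feature samples onto two principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical samples, each with four (correlated) features
X = np.array([
    [2.5, 2.4, 1.2, 0.5],
    [0.5, 0.7, 0.3, 0.2],
    [2.2, 2.9, 1.1, 0.6],
    [1.9, 2.2, 0.9, 0.4],
    [3.1, 3.0, 1.4, 0.7],
])

# Keep only the 2 principal components that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (5, 2) -- fewer feature variables per sample
print(pca.explained_variance_ratio_)  # how much variance each component retains
```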
Anomaly Detection
This unsupervised ML method is used to find occurrences of rare events or observations that generally do not occur. Using the learned knowledge, anomaly detection methods are able to differentiate between anomalous and normal data points.
Some unsupervised algorithms, such as clustering and K-nearest neighbors, can detect anomalies based on the data and its features, as in the sketch below.
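Here is a minimal, distance-based sketch of the K-nearest-neighbors idea for anomaly detection, assuming scikit-learn is available: points that lie far from their nearest neighbors get a high anomaly score. The data and the threshold are purely illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Mostly "normal" points plus one obvious outlier (the last point)
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2],
              [1.2, 1.0], [9.0, 9.0]])

# Score each point by the average distance to its 2 nearest neighbors
nn = NearestNeighbors(n_neighbors=3).fit(X)  # 3 = the point itself + 2 neighbors
distances, _ = nn.kneighbors(X)
scores = distances[:, 1:].mean(axis=1)       # drop the zero self-distance

# Flag points whose score is far above the typical score (threshold is illustrative)
threshold = scores.mean() + 2 * scores.std()
print(scores)
print(np.where(scores > threshold)[0])       # index of the suspected anomaly
```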