Agglomerative clustering is a hierarchical clustering algorithm that starts with each data point as its own cluster and iteratively merges the closest clusters until a stopping criterion is reached. It is a bottom-up approach that produces a dendrogram, which is a tree-like diagram that shows the hierarchical relationship between the clusters. The algorithm can be implemented using the scikit-learn library in Python.
Implementation in Python
We will use the iris dataset for demonstration. The first step is to import the necessary libraries and load the dataset.
import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.datasets import load_iris from sklearn.cluster import AgglomerativeClustering from scipy.cluster.hierarchy import dendrogram, linkage iris = load_iris() X = iris.data y = iris.target
The next step is to create a linkage matrix that contains the distances between each pair of clusters. We can use the linkage function from the scipy.cluster.hierarchy module to create the linkage matrix.
Z = linkage(X, ''ward'')
The ”ward” method is used to calculate the distances between the clusters. It minimizes the variance of the distances between the clusters being merged.
We can visualize the dendrogram using the dendrogram function from the same module.
plt.figure(figsize=(7.5, 3.5)) plt.title("Iris Dendrogram") dendrogram(Z) plt.show()
The resulting dendrogram (see the following plot) shows the hierarchical relationship between the clusters. We can see that the algorithm has merged the closest clusters first, and the distance between the clusters increases as we move up the tree.
The final step is to apply the clustering algorithm and extract the cluster labels. We can
use the AgglomerativeClustering class from the sklearn.cluster module to apply the
algorithm.
model = AgglomerativeClustering(n_clusters=3) model.fit(X) labels = model.labels_
The n_clusters parameter specifies the number of clusters to be extracted from the data. In this case, we have specified n_clusters=3 because we know that the iris dataset has three classes.
We can visualize the resulting clusters using a scatter plot.
plt.figure(figsize=(7.5, 3.5)) plt.scatter(X[:, 0], X[:, 1], c=labels) plt.xlabel("Sepal length") plt.ylabel("Sepal width") plt.title("Agglomerative Clustering Results") plt.show()
The resulting plot shows the three clusters identified by the algorithm. We can see that the algorithm has successfully separated the data points into their respective classes.
Example
Here is the complete implementation of Agglomerative Clustering in Python −
import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.datasets import load_iris from sklearn.cluster import AgglomerativeClustering from scipy.cluster.hierarchy import dendrogram, linkage # Load the Iris dataset iris = load_iris() X = iris.data y = iris.target Z = linkage(X, ''ward'') # Plot the dendogram plt.figure(figsize=(7.5, 3.5)) plt.title("Iris Dendrogram") dendrogram(Z) plt.show() # create an instance of the AgglomerativeClustering class model = AgglomerativeClustering(n_clusters=3) # fit the model to the dataset model.fit(X) labels = model.labels_ # Plot the results plt.figure(figsize=(7.5, 3.5)) plt.scatter(X[:, 0], X[:, 1], c=labels) plt.xlabel("Sepal length") plt.ylabel("Sepal width") plt.title("Agglomerative Clustering Results") plt.show()
Advantages of Agglomerative Clustering
Following are the advantages of using Agglomerative Clustering −
-
Produces a dendrogram that shows the hierarchical relationship between the clusters.
-
Can handle different types of distance metrics and linkage methods.
-
Allows for a flexible number of clusters to be extracted from the data.
-
Can handle large datasets with efficient implementations.
Disadvantages of Agglomerative Clustering
Following are some of the disadvantages of using Agglomerative Clustering −
-
Can be computationally expensive for large datasets.
-
Can produce imbalanced clusters if the distance metric or linkage method is not appropriate for the data.
-
The final result may be sensitive to the choice of distance metric and linkage method used.
-
The dendrogram may be difficult to interpret for large datasets with many clusters.