Wednesday, 12 February 2025

Agglomerative and Divisive Clustering in Hierarchical Clustering

Hierarchical clustering is a clustering technique that builds a hierarchy of clusters. It does not require specifying the number of clusters in advance and is particularly useful for understanding the structure of data. It is mainly divided into Agglomerative Clustering (Bottom-Up Approach) and Divisive Clustering (Top-Down Approach).

1. Agglomerative Clustering (Bottom-Up Approach)

Agglomerative clustering starts with each data point as an individual cluster and iteratively merges the closest clusters until only one cluster remains.

Steps in Agglomerative Clustering:

  1. Initialize each data point as a separate cluster.
  2. Compute pairwise distances between clusters.
  3. Merge the two closest clusters based on a linkage criterion.
  4. Repeat steps 2-3 until all points belong to a single cluster or until a desired number of clusters is reached.
  5. Cut the dendrogram at the chosen level to obtain the final clusters.
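
The short loop below is a minimal sketch of these steps, assuming Euclidean distance and single linkage (minimum pairwise distance; the linkage options are described next). The toy array is made up purely for illustration; real code would use scipy or scikit-learn as in the example at the end of this post.

import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6]])

# Step 1: every point starts as its own cluster (stored as lists of indices).
clusters = [[i] for i in range(len(X))]

while len(clusters) > 1:
    # Step 2: single-linkage distance between every pair of clusters.
    best = (None, None, float('inf'))
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(np.linalg.norm(X[a] - X[b])
                    for a in clusters[i] for b in clusters[j])
            if d < best[2]:
                best = (i, j, d)
    i, j, d = best
    # Step 3: merge the two closest clusters (j > i, so deleting j is safe).
    print(f"merging {clusters[i]} and {clusters[j]} at distance {d:.2f}")
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
    # Step 4: repeat until one cluster remains (or a desired count is reached).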

Types of Linkage Methods:

  • Single Linkage: Merges clusters based on the minimum distance between points.
  • Complete Linkage: Uses the maximum distance between points.
  • Average Linkage: Considers the average distance between all pairs of points in clusters.
  • Ward’s Method: Minimizes variance within clusters.
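
The choice of linkage can change the final clusters. A quick way to see this is to pass different linkage values to scikit-learn's AgglomerativeClustering (the toy array is the same one used in the example at the end of this post):

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# 'ward' (the default) minimizes within-cluster variance; the other options
# correspond to the distance definitions listed above.
for method in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=method).fit_predict(X)
    print(f"{method:>8} linkage -> labels {labels}")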

Advantages of Agglomerative Clustering:

  1. No need to predefine the number of clusters.
  2. Suitable for small to medium-sized datasets.
  3. Produces a dendrogram, which helps in deciding the optimal number of clusters.

Disadvantages:

  1. Computationally expensive for large datasets (the pairwise distance matrix alone needs O(n²) memory, and standard implementations take up to O(n³) time).
  2. Sensitive to noise and outliers.

2. Divisive Clustering (Top-Down Approach)

Divisive clustering starts with all data points in a single cluster and recursively splits the cluster into smaller clusters until each data point is its own cluster.

Steps in Divisive Clustering:

  1. Consider all data points as a single cluster.
  2. Use a clustering algorithm (e.g., k-Means) to divide the cluster into two sub-clusters.
  3. Recursively repeat step 2 on each cluster until each point is in its own cluster or the stopping criterion is met.
  4. The resulting dendrogram is then cut at an appropriate level to define the final clusters.
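
Since divisive clustering is rarely available out of the box, a common practical approximation is bisecting k-Means, which follows the steps above by repeatedly splitting a cluster in two. The sketch below is a minimal illustration of that idea; divisive_clustering is a made-up helper (not a library function), and it simply keeps splitting the largest cluster until the target count is reached.

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

def divisive_clustering(X, target_clusters=2):
    # Step 1: start with a single cluster containing every point (as indices).
    clusters = [np.arange(len(X))]
    while len(clusters) < target_clusters:  # simple stopping criterion
        # Choose the largest cluster and split it with 2-means (step 2).
        idx = max(range(len(clusters)), key=lambda c: len(clusters[c]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        # Step 3: keep both halves and continue splitting.
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

for i, members in enumerate(divisive_clustering(X, target_clusters=2)):
    print(f"cluster {i}: points {members.tolist()}")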

Advantages of Divisive Clustering:

  1. Can be more accurate in some cases, since it considers the global structure first and does not suffer from early, irreversible merges.
  2. Can be more meaningful when the data naturally splits into a few broad groups before finer subdivisions.

Disadvantages:

  1. Computationally very expensive (an exhaustive search considers O(2ⁿ) possible splits of a cluster).
  2. Not widely implemented in standard libraries.
  3. Requires a predefined stopping criterion for splitting.

Comparison with k-Means: k-Means is faster and scales to large datasets, but the number of clusters must be chosen up front; hierarchical clustering is slower, yet it also yields a dendrogram that reveals the nested structure of the data. The example below runs both on the same toy data and compares the resulting labels.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering, KMeans
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Agglomerative Clustering
clustering = AgglomerativeClustering(n_clusters=2).fit(X)
print("Agglomerative Clustering Labels:", clustering.labels_)

# K-Means Clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print("K-Means Clustering Labels:", kmeans.labels_)

# Compare the results (note: cluster label numbers are arbitrary, so identical partitions can still differ in numbering)
print("Are the cluster labels the same?", np.array_equal(clustering.labels_, kmeans.labels_))


Z = linkage(X, 'ward')  # build the hierarchy using Ward linkage

dendrogram(Z)  # plot the dendrogram

plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point')
plt.ylabel('Distance')
plt.show()



max_dist = 3  # Example maximum distance. Adjust as needed based on the dendrogram.
clusters = fcluster(Z, max_dist, criterion='distance')
num_clusters = len(set(clusters))

print(f"Number of clusters (at max distance {max_dist}): {num_clusters}")


Output:

Agglomerative Clustering Labels: [1 1 1 0 0 0]
K-Means Clustering Labels: [1 0 1 0 0 0]
Are the cluster labels the same? False

[Dendrogram plot: "Hierarchical Clustering Dendrogram"]

Number of clusters (at max distance 3): 4
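
The cut above used a distance threshold. If the number of clusters is known instead, the same linkage matrix Z can be cut into a fixed number of groups with criterion='maxclust' (a short continuation of the code above):

# Continuing from the Ward linkage matrix Z built earlier.
labels_from_tree = fcluster(Z, t=2, criterion='maxclust')
print("Labels from cutting the dendrogram into 2 clusters:", labels_from_tree)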
