Hierarchical clustering is a technique that builds a nested hierarchy of clusters. It does not require specifying the number of clusters in advance and is particularly useful for understanding the structure of data. It comes in two main variants: Agglomerative Clustering (Bottom-Up Approach) and Divisive Clustering (Top-Down Approach).
1. Agglomerative Clustering (Bottom-Up Approach)
Agglomerative clustering starts with each data point as an individual cluster and iteratively merges the closest clusters until only one cluster remains.
Steps in Agglomerative Clustering:
1. Initialize each data point as a separate cluster.
2. Compute pairwise distances between clusters.
3. Merge the two closest clusters based on a linkage criterion.
4. Repeat steps 2-3 until all points belong to a single cluster or until a desired number of clusters is reached.
5. Cut the dendrogram at the chosen level to obtain the final clusters.
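A minimal sketch of these steps in Python, assuming a small 2-D toy dataset (the array X below is illustrative and not from the original post):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative toy data: two loose groups of 2-D points (assumed, not from the post)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Steps 1-4: start from singleton clusters and repeatedly merge the closest pair;
# linkage() returns the full merge history as an (n-1) x 4 matrix
Z = linkage(X, method='ward')

# Step 5: "cut" the hierarchy to obtain a flat clustering, here asking for 2 clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 1 2 2 2] for this toy data
```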
Types of Linkage Methods:
- Single Linkage: Merges clusters based on the minimum distance between any pair of points in the two clusters.
- Complete Linkage: Uses the maximum distance between any pair of points in the two clusters.
- Average Linkage: Considers the average distance over all pairs of points across the two clusters.
- Ward’s Method: Merges the pair of clusters that gives the smallest increase in total within-cluster variance.
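The choice of linkage changes which clusters count as "closest" and therefore which merges happen first. A short sketch comparing the four methods on the same illustrative toy data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Same illustrative toy data as above (assumed, not from the post)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Fit the same 2-cluster model with each linkage criterion and compare the labels
for linkage_method in ('single', 'complete', 'average', 'ward'):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage_method)
    print(linkage_method, model.fit_predict(X))
```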
Advantages of Agglomerative Clustering:
- No need to predefine the number of clusters.
- Suitable for small to medium-sized datasets.
- Produces a dendrogram, which helps in deciding the optimal number of clusters.
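To see how the dendrogram supports that choice, it can be plotted directly from the linkage matrix; a sketch using SciPy and Matplotlib on the same assumed toy data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Illustrative toy data (assumed, not from the post)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

Z = linkage(X, method='ward')
dendrogram(Z)                      # each horizontal link is one merge; its height is the merge distance
plt.xlabel('Sample index')
plt.ylabel('Merge distance')
plt.show()
```

Cutting the tree below a tall link (a large jump in merge distance) is the usual visual rule for choosing the number of clusters.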
Disadvantages:
- Computationally expensive for large datasets: the pairwise distance matrix alone needs O(n²) memory, and the standard algorithm takes between O(n²) and O(n³) time depending on the linkage and implementation.
- Sensitive to noise and outliers.
2. Divisive Clustering (Top-Down Approach)
Divisive clustering starts with all data points in a single cluster and recursively splits the cluster into smaller clusters until each data point is its own cluster.
Steps in Divisive Clustering:
1. Consider all data points as a single cluster.
2. Use a clustering algorithm (e.g., k-Means) to divide the cluster into two sub-clusters.
3. Recursively repeat step 2 on each sub-cluster until each point is in its own cluster or the stopping criterion is met.
4. Cut the resulting dendrogram at an appropriate level to define the final clusters.
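Divisive clustering is rarely available out of the box, but the bisecting strategy above can be sketched with a short recursive function built on k-Means. The function name, the depth-based stopping rule, and the toy data are illustrative choices, not from the original post:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, indices=None, depth=0, max_depth=2):
    """Recursively bisect a cluster with 2-means until a stopping rule is met.

    Returns a list of clusters, each given as an array of row indices into X.
    The max_depth stopping rule is an illustrative choice.
    """
    if indices is None:
        indices = np.arange(len(X))
    # Stopping criterion: depth limit reached or cluster too small to split further
    if depth >= max_depth or len(indices) < 2:
        return [indices]
    # Split the current cluster into two sub-clusters with k-Means (k = 2)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    return (divisive_clustering(X, left, depth + 1, max_depth) +
            divisive_clustering(X, right, depth + 1, max_depth))

# Illustrative toy data (assumed, not from the post)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)
for i, cluster in enumerate(divisive_clustering(X)):
    print(f"Cluster {i}: rows {cluster.tolist()}")
```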
Advantages of Divisive Clustering:
- Often more accurate at the top levels of the hierarchy, since it does not commit to early, possibly erroneous merges.
- Can be more meaningful when the data has a natural top-down (coarse-to-fine) structure.
Disadvantages:
- Computationally very expensive: an exhaustive search over all possible splits is O(2^n), so practical implementations rely on heuristics such as bisecting k-Means.
- Not widely implemented in standard libraries.
- Requires a predefined stopping criterion for splitting.
Comparison with k-Means: k-Means is faster and scales better, but requires the number of clusters to be specified up front; hierarchical clustering is slower but exposes the full cluster hierarchy through the dendrogram, which gives more insight into the data's structure.
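The code listing that produced the output below is not included in the post; the following is a minimal sketch of a comparison that prints results in the same format. The toy data, random_state, and distance threshold of 3 are assumptions, so the exact label values may differ from those shown:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative toy data (assumed, not from the post)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

agg_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("Agglomerative Clustering Labels:", agg_labels)
print("K-Means Clustering Labels:", km_labels)
print("Are the cluster labels the same?", np.array_equal(agg_labels, km_labels))

# Cut the hierarchy at a maximum merge distance of 3 and count the resulting clusters
Z = linkage(X, method='ward')
cut_labels = fcluster(Z, t=3, criterion='distance')
print("Number of clusters (at max distance 3):", len(np.unique(cut_labels)))
```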
Output:
Agglomerative Clustering Labels: [1 1 1 0 0 0]
K-Means Clustering Labels: [1 0 1 0 0 0]
Are the cluster labels the same? False
Number of clusters (at max distance 3): 4