Let's implement multiple clustering algorithms on the Wholesale Customer dataset and evaluate them using Silhouette Score and Davies-Bouldin Index.
Clustering Methods to Implement:
- k-Means Clustering (partition-based)
- DBSCAN: Density-Based Spatial Clustering of Applications with Noise (density-based)
- Gaussian Mixture Model (GMM) (probabilistic, model-based)
- Hierarchical (Agglomerative) Clustering (connectivity-based)
Evaluation Metrics:
- Silhouette Score: Measures how similar each point is to its own cluster compared with the nearest other cluster; ranges from -1 to 1, higher is better.
- Davies-Bouldin Index: Measures the average ratio of within-cluster scatter to between-cluster separation; lower is better.
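As a quick illustration of how the two metrics behave (a toy sketch on synthetic blobs, not part of the Wholesale workflow), scikit-learn exposes both directly:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Three well-separated synthetic blobs
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels_demo = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_demo)

# Well-separated clusters: silhouette close to 1, Davies-Bouldin close to 0
print("Silhouette:", silhouette_score(X_demo, labels_demo))
print("Davies-Bouldin:", davies_bouldin_score(X_demo, labels_demo))
```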
Step-1: Mount the drive
from google.colab import drive
drive.mount('/content/drive')
Step-2: Read the Wholesale_customers_data.csv dataset
Step-3: Preprocessing: select the relevant spending features and standardize them for better clustering performance
from sklearn.preprocessing import StandardScaler # Import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Step-4: Initialize a dictionary to store the evaluation results
clustering_results = {}
Step-5: Define a function to evaluate clustering results
from sklearn.metrics import silhouette_score, davies_bouldin_score

def evaluate_clustering(labels, X_scaled):
    # Both metrics are undefined for a single cluster (e.g. DBSCAN marking everything as noise)
    if len(set(labels)) > 1:
        silhouette = silhouette_score(X_scaled, labels)
        db_index = davies_bouldin_score(X_scaled, labels)
    else:
        silhouette = -1  # sentinel value: undefined for a single cluster
        db_index = -1    # sentinel value: undefined for a single cluster
    return silhouette, db_index
Step-6: k-Means Clustering
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)
clustering_results['k-Means'] = evaluate_clustering(kmeans_labels, X_scaled)
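The choice of n_clusters=3 is an assumption; a quick sanity check (sketched here on stand-in blob data, since this sweep is not in the original post) is to try several values of k and compare silhouette scores, keeping the k with the highest score:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Stand-in data; in the notebook, X_scaled would be used instead
X_scaled_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Sweep k and report the silhouette score for each clustering
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled_demo)
    print(k, silhouette_score(X_scaled_demo, labels))
```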
Step-7: DBSCAN Clustering
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=1.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
clustering_results['DBSCAN'] = evaluate_clustering(dbscan_labels, X_scaled)
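eps=1.5 is a hand-picked value; a common heuristic (sketched below on stand-in data, not part of the original post) is to sort each point's distance to its k-th nearest neighbor, with k matching min_samples, and choose eps near the "knee" of that curve:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs

# Stand-in data; in the notebook, X_scaled would be used instead
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Distance from each point to its 5th nearest neighbor (matching min_samples=5)
nn = NearestNeighbors(n_neighbors=5).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_dist = np.sort(distances[:, -1])

# Plotting k_dist shows a knee; eps is typically set just below it.
# The largest k-distances belong to likely outliers/noise points.
print(k_dist[-10:])
```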
Step-8: Gaussian Mixture Model (GMM)
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(X_scaled)
clustering_results['GMM'] = evaluate_clustering(gmm_labels, X_scaled)
Step-9: Hierarchical (Agglomerative) Clustering
from sklearn.cluster import AgglomerativeClustering
hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
hierarchical_labels = hierarchical.fit_predict(X_scaled)
# Evaluate Hierarchical Clustering
clustering_results['Hierarchical'] = evaluate_clustering(hierarchical_labels, X_scaled)
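AgglomerativeClustering only returns flat labels; to inspect the merge structure behind Ward linkage, scipy's hierarchy module can build the linkage matrix and cut the tree (a sketch on stand-in data, not from the original post; scipy's dendrogram(Z) would visualize the merge heights):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Stand-in data; in the notebook, X_scaled would be used instead
X_demo, _ = make_blobs(n_samples=60, centers=3, random_state=42)

# Ward linkage matrix: each row records one merge and its distance
Z = linkage(X_demo, method='ward')

# Cut the tree into at most 3 flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(sorted(set(labels)))
```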
Step-10: Convert results to a DataFrame for easy comparison
import pandas as pd  # if pandas is not already imported
clustering_eval_df = pd.DataFrame.from_dict(
    clustering_results, orient='index', columns=['Silhouette Score', 'Davies-Bouldin Index']
)
Step-11: Compare the clustering algorithms
print("Clustering Evaluation Metrics (Including Hierarchical):")
display(clustering_eval_df)
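To summarize the comparison programmatically (a small sketch using stand-in scores; in the notebook this operates on clustering_eval_df directly), the best performer by each metric can be pulled from the DataFrame:

```python
import pandas as pd

# Stand-in results with the same shape as clustering_eval_df
clustering_eval_df = pd.DataFrame.from_dict(
    {'k-Means': (0.45, 0.90), 'DBSCAN': (0.30, 1.20)},
    orient='index', columns=['Silhouette Score', 'Davies-Bouldin Index']
)

# Higher silhouette is better; lower Davies-Bouldin is better
print("Best by Silhouette:", clustering_eval_df['Silhouette Score'].idxmax())
print("Best by Davies-Bouldin:", clustering_eval_df['Davies-Bouldin Index'].idxmin())
```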
Final output