Sunday, 16 February 2025

Prepare a Classification Model using the Naive Bayes Classifier

The Naive Bayes classifier is a simple but effective probabilistic learning algorithm based on Bayes' theorem with strong independence assumptions among the features. Despite the "naive" in its name, Naive Bayes has proved to be a very capable classifier, not only for text-related tasks such as spam filtering and sports-event classification, but also for predicting medical diagnoses.

  • Naïve Bayes is a classification algorithm for categorical variables, based on the well-known Bayes theorem.
  • It is used mostly in high-dimensional text classification.
  • The Naïve Bayes classifier is a simple probabilistic classifier with very few parameters, so the resulting models can be trained and can make predictions faster than many other classification algorithms.
  • It is a probabilistic classifier, i.e., it predicts the class of an object from the probability of that object belonging to each class.
  • The Naïve Bayes algorithm is used in spam filtering, sentiment analysis, article classification, and many other applications.

 



The most popular types differ based on the distributions of the feature values (a short scikit-learn sketch follows this list). Some of these include:

  • Gaussian Naïve Bayes (GaussianNB): This is a variant of the Naïve Bayes classifier, used with Gaussian distributions—i.e. normal distributions—and continuous variables. The model is fitted by estimating the mean and standard deviation of each feature within each class.
  • Multinomial Naïve Bayes (MultinomialNB): This type of Naïve Bayes classifier assumes that the features are from multinomial distributions. This variant is useful when using discrete data, such as frequency counts, and it is typically applied within natural language processing use cases, like spam classification.
  • Bernoulli Naïve Bayes (BernoulliNB): This is another variant of the Naïve Bayes classifier, which is used with Boolean variables—that is, variables with two values, such as True and False or 1 and 0.
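
As a quick, illustrative sketch (an addition to these notes; the toy arrays X_cont, X_counts and X_bool are made-up stand-ins for continuous features, word counts and Boolean features), the three variants are used in the same way in scikit-learn:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])  # two classes

# Continuous features -> GaussianNB (fits a per-class mean and standard deviation per feature)
X_cont = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])
print(GaussianNB().fit(X_cont, y).predict([[3.1, 4.1]]))

# Discrete frequency counts (e.g. word counts) -> MultinomialNB
X_counts = np.array([[2, 0, 1], [3, 0, 0], [0, 4, 1], [0, 3, 2]])
print(MultinomialNB().fit(X_counts, y).predict([[0, 2, 1]]))

# Boolean present/absent features -> BernoulliNB
X_bool = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]])
print(BernoulliNB().fit(X_bool, y).predict([[0, 1, 1]]))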

Numerical Example

We have the following 4 records (each with 6 attributes plus the output):

  Record   Sky     Temperature   Humid    Wind     Water   Forest   Output
  1        sunny   warm          normal   strong   warm    same     yes
  2        sunny   warm          high     strong   warm    same     yes
  3        rainy   cold          high     strong   warm    change   no
  4        sunny   warm          high     strong   cool    change   yes

We want to build a Naïve Bayes classifier that predicts \text{Output} \in \{\text{yes}, \text{no}\} given the attributes.

Prior Probabilities

Let:

  • C = \text{Output}
  • C = \text{yes} or C = \text{no}

Count how many yes vs. no in the dataset:

  • yes: 3 records (Rows 1, 2, 4)
  • no: 1 record (Row 3)

Hence:

P(\text{yes}) = \frac{3}{4} = 0.75, \quad P(\text{no}) = \frac{1}{4} = 0.25.
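
As a quick sanity check (added here; not part of the original hand calculation), the same priors can be reproduced with collections.Counter over the Output column of the table above:

from collections import Counter

outputs = ["yes", "yes", "no", "yes"]           # Output column of the 4 records
counts = Counter(outputs)
priors = {c: counts[c] / len(outputs) for c in counts}
print(priors)                                   # {'yes': 0.75, 'no': 0.25}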

 

Likelihoods (Conditional Probabilities)

Naïve Bayes requires computing P(\text{Attribute Value} \mid \text{Class}). We look at each attribute-value pair for each class.

Sky

  • Sky = sunny or rainy.

Class = yes
There are 3 “yes” examples:

  1. Row 1: (sunny, warm, normal, strong, warm, same)
  2. Row 2: (sunny, warm, high, strong, warm, same)
  3. Row 4: (sunny, warm, high, strong, cool, change)

All 3 have Sky = sunny, so:

P(\text{sky} = \text{sunny} \mid \text{yes}) = \frac{3}{3} = 1.0, \quad P(\text{sky} = \text{rainy} \mid \text{yes}) = \frac{0}{3} = 0.0.

Class = no
There is 1 “no” example, Row 3: (rainy, cold, high, strong, warm, change).

That single example has Sky = rainy, so:

P(\text{sky} = \text{rainy} \mid \text{no}) = \frac{1}{1} = 1.0, \quad P(\text{sky} = \text{sunny} \mid \text{no}) = \frac{0}{1} = 0.0.





Checking Row 1

For Row 1 (sunny, warm, normal, strong, warm, same), the remaining likelihoods for the "yes" class are P(\text{temp}=\text{warm} \mid \text{yes}) = \frac{3}{3}, P(\text{humid}=\text{normal} \mid \text{yes}) = \frac{1}{3}, P(\text{wind}=\text{strong} \mid \text{yes}) = \frac{3}{3}, P(\text{water}=\text{warm} \mid \text{yes}) = \frac{2}{3}, and P(\text{forest}=\text{same} \mid \text{yes}) = \frac{2}{3}. Therefore

P(\text{yes}) \cdot \prod_i P(x_i \mid \text{yes}) = 0.75 \cdot 1 \cdot 1 \cdot \tfrac{1}{3} \cdot 1 \cdot \tfrac{2}{3} \cdot \tfrac{2}{3} \approx 0.111,

while the corresponding product for "no" is 0 because P(\text{sky}=\text{sunny} \mid \text{no}) = 0. Hence Row 1 is classified as yes.

Checking Other Rows

  • Row 2 (sunny, warm, high, strong, warm, same) is also yes in the dataset. A similar calculation yields a nonzero value for yes and 0 for no.
  • Row 3 (rainy, cold, high, strong, warm, change) is no. If you compute with “yes,” some attribute probability is 0 (e.g., P(\text{sky}=\text{rainy} \mid \text{yes}) = 0 or P(\text{temp}=\text{cold} \mid \text{yes}) = 0), making the product 0 for yes. For no, the product is nonzero, so we pick no.
  • Row 4 (sunny, warm, high, strong, cool, change) ends up with a nonzero value for yes and 0 for no.

Hence, all 4 records are classified correctly by these computations.

Python code to implement the above example is as follows:

import numpy as np

# Given dataset (encoded manually)
data = {
    "Sky": ["sunny", "sunny", "rainy", "sunny"],
    "Temperature": ["warm", "warm", "cold", "warm"],
    "Humid": ["normal", "high", "high", "high"],
    "Wind": ["strong", "strong", "strong", "strong"],
    "Water": ["warm", "warm", "warm", "cool"],
    "Forest": ["same", "same", "change", "change"],
    "Output": ["yes", "yes", "no", "yes"]
}

# Unique class labels
classes = ["yes", "no"]

# Encode categorical variables
from collections import defaultdict

encoder = defaultdict(dict)
for column in data.keys():
    unique_vals = list(set(data[column]))
    for i, val in enumerate(unique_vals):
        encoder[column][val] = i

# Encode dataset
encoded_data = {col: [encoder[col][val] for val in values] for col, values in data.items()}

def compute_prior(y):
    priors = {}
    total = len(y)
    for c in classes:
        priors[c] = y.count(c) / total
    return priors

def compute_likelihoods(X, y):
    likelihoods = {}
    for feature in X.keys():
        likelihoods[feature] = {}
        for c in classes:
            likelihoods[feature][c] = {}
            class_count = y.count(c)
            for value in set(X[feature]):
                count = sum(1 for i in range(len(y)) if X[feature][i] == value and y[i] == c)
                likelihoods[feature][c][value] = count / class_count if class_count > 0 else 0
    return likelihoods

# Compute priors and likelihoods
y = data["Output"]   # class labels kept as strings ("yes"/"no")
X = {key: val for key, val in encoded_data.items() if key != "Output"}
priors = compute_prior(y)
likelihoods = compute_likelihoods(X, y)

def predict(sample):
    posterior_probs = {}
    for c in classes:
        posterior_probs[c] = priors[c]
        for feature, value in sample.items():
            if value in likelihoods[feature][c]:
                posterior_probs[c] *= likelihoods[feature][c][value]
            else:
                posterior_probs[c] *= 0  # unseen attribute value for this class contributes zero probability (no Laplace smoothing)
    return max(posterior_probs, key=posterior_probs.get)

# Example test case
sample = {"Sky": "sunny", "Temperature": "warm", "Humid": "normal", "Wind": "strong", "Water": "warm", "Forest": "same"}
encoded_sample = {key: encoder[key][value] for key, value in sample.items()}
prediction = predict(encoded_sample)

print(f"Predicted Output for {sample}: {prediction}")

Wednesday, 12 February 2025

Comparing Different Clustering Algorithms: k-Means, DBSCAN, GMM, and Hierarchical Clustering

Let's implement multiple clustering algorithms on the Wholesale Customer dataset and evaluate them using Silhouette Score and Davies-Bouldin Index.

Clustering Methods to Implement:

  1. k-Means Clustering (Partition-based)
  2. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) (Density-based)
  3. GMM (Gaussian Mixture Model) (Probabilistic-based)
  4. Hierarchical Clustering (Connectivity-based)

Evaluation Metrics:

  • Silhouette Score: Measures how well separated the clusters are (higher is better; definition just below).
  • Davies-Bouldin Index: Measures the ratio of within-cluster scatter to between-cluster separation (lower is better; definition just below).
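
For reference (a short addition to these notes), these are the quantities that scikit-learn's silhouette_score and davies_bouldin_score report. For each point i, let a(i) be its mean distance to the other points in its own cluster and b(i) its mean distance to the points of the nearest other cluster; the silhouette score is the mean over all points of

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \in [-1, 1] \qquad (\text{higher is better}),

and the Davies-Bouldin index for k clusters with centroids c_i and mean within-cluster distances \sigma_i is

DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \qquad (\text{lower is better}).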

Step-1: Mount the drive

from google.colab import drive
drive.mount('/content/drive')


Step-2: Read the Wholesale_customers_data.csv dataset and select the relevant features for clustering
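
A minimal sketch for this step (the Drive path and the assumption that the file has the standard UCI columns, including Channel and Region, are mine; adjust them to your copy of the dataset):

import pandas as pd

# Path below is an assumed location inside the mounted Drive -- change it as needed
df = pd.read_csv('/content/drive/MyDrive/Wholesale_customers_data.csv')

# Keep the annual spending columns for clustering; Channel and Region are
# categorical identifiers in the UCI version of this dataset, so drop them
X = df.drop(columns=['Channel', 'Region'])
print(X.head())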

 

 Step-3:   Standardize the dataset for better clustering performance

from sklearn.preprocessing import StandardScaler # Import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


Step-4: Store results in a dictionary for evaluation


clustering_results = {}

Step-5: Prepare a function to evaluate clustering results

from sklearn.metrics import silhouette_score, davies_bouldin_score  # metrics used below

def evaluate_clustering(labels, X_scaled):
    if len(set(labels)) > 1:  # Ensure we have more than 1 cluster
        silhouette = silhouette_score(X_scaled, labels)
        db_index = davies_bouldin_score(X_scaled, labels)
    else:
        silhouette = -1  # Undefined for a single cluster
        db_index = -1  # Undefined for a single cluster
    return silhouette, db_index

Step-6: k-Means Clustering


from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)
clustering_results['k-Means'] = evaluate_clustering(kmeans_labels, X_scaled)

 Step-7: DBSCAN Clustering


from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=1.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
clustering_results['DBSCAN'] = evaluate_clustering(dbscan_labels, X_scaled)

Step-8: Gaussian Mixture Model (GMM)


from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(X_scaled)
clustering_results['GMM'] = evaluate_clustering(gmm_labels, X_scaled)

Step-9: Hierarchical (Agglomerative) Clustering

from sklearn.cluster import AgglomerativeClustering

hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
hierarchical_labels = hierarchical.fit_predict(X_scaled)

# Evaluate Hierarchical Clustering
clustering_results['Hierarchical'] = evaluate_clustering(hierarchical_labels, X_scaled)

Step-10:  Convert results to DataFrame for easy comparison


import pandas as pd

clustering_eval_df = pd.DataFrame.from_dict(
    clustering_results, orient='index', columns=['Silhouette Score', 'Davies-Bouldin Index']
)

 Step-11: Comparing Different Clustering Algorithms


print("Clustering Evaluation Metrics (Including Hierarchical):")
display(clustering_eval_df)

 Final output 


Agglomerative and Divisive Clustering in Hierarchical Clustering

Hierarchical clustering is a clustering technique that builds a hierarchy of clusters. It does not require specifying the number of clusters in advance and is particularly useful for understanding the structure of data. It is mainly divided into Agglomerative Clustering (Bottom-Up Approach) and Divisive Clustering (Top-Down Approach).

1. Agglomerative Clustering (Bottom-Up Approach)

Agglomerative clustering starts with each data point as an individual cluster and iteratively merges the closest clusters until only one cluster remains.

Steps in Agglomerative Clustering:

  1. Initialize each data point as a separate cluster.
  2. Compute pairwise distances between clusters.
  3. Merge the two closest clusters based on a linkage criterion.
  4. Repeat steps 2-3 until all points belong to a single cluster or until a desired number of clusters is reached.
  5. Cut the dendrogram at the chosen level to obtain the final clusters.

Types of Linkage Methods (compared in the short code sketch after this list):

  • Single Linkage: Merges clusters based on the minimum distance between points.
  • Complete Linkage: Uses the maximum distance between points.
  • Average Linkage: Considers the average distance between all pairs of points in clusters.
  • Ward’s Method: Minimizes variance within clusters.
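
As a small illustration (an addition using scikit-learn's AgglomerativeClustering and the same toy array as the example further below), the linkage criterion is just a parameter, so the four methods can be compared directly:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Same data, four different linkage criteria
for link in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    print(f"{link:>8} linkage -> {labels}")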

Advantages of Agglomerative Clustering:

  1. No need to predefine the number of clusters.
  2. Suitable for small to medium-sized datasets.
  3. Produces a dendrogram, which helps in deciding the optimal number of clusters.

Disadvantages:

  1. Computationally expensive for large datasets (O(n²) complexity).
  2. Sensitive to noise and outliers.

2. Divisive Clustering (Top-Down Approach)

Divisive clustering starts with all data points in a single cluster and recursively splits the cluster into smaller clusters until each data point is its own cluster.

Steps in Divisive Clustering (sketched in code after this list):

  1. Consider all data points as a single cluster.
  2. Use a clustering algorithm (e.g., k-Means) to divide the cluster into two sub-clusters.
  3. Recursively repeat step 2 on each cluster until each point is in its own cluster or the stopping criterion is met.
  4. The resulting dendrogram is then cut at an appropriate level to define the final clusters.
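
Divisive clustering is not provided directly by scikit-learn, but the recursion in steps 1-3 can be sketched with repeated 2-means splits (an illustrative simplification; always splitting the largest cluster and stopping at a fixed number of leaves are assumptions of this sketch):

import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, max_clusters=3):
    """Top-down clustering: repeatedly bisect the largest cluster with 2-means."""
    clusters = [np.arange(len(X))]                     # start with one cluster holding every index
    while len(clusters) < max_clusters:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        if len(idx) < 2:                               # a singleton cannot be split further
            clusters.append(idx)
            break
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
for i, idx in enumerate(divisive_clustering(X, max_clusters=3)):
    print(f"Cluster {i}: points {idx.tolist()}")

Recent scikit-learn releases also provide a BisectingKMeans estimator that automates this kind of top-down splitting.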

Advantages of Divisive Clustering:

  1. More accurate in some cases, as it doesn't suffer from early erroneous merges.
  2. Can be more meaningful when the natural structure of data is divisive in nature.

Disadvantages:

  1. Computationally very expensive (O(2^n) complexity).
  2. Not widely implemented in standard libraries.
  3. Requires a predefined stopping criterion for splitting.

Comparison with k-Means: k-Means is faster but requires predefining the number of clusters, while hierarchical clustering is slower but provides more insights.

from sklearn.cluster import AgglomerativeClustering, KMeans
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Agglomerative Clustering
clustering = AgglomerativeClustering(n_clusters=2).fit(X)
print("Agglomerative Clustering Labels:", clustering.labels_)

# K-Means Clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print("K-Means Clustering Labels:", kmeans.labels_)

# Compare the results (you can add more sophisticated comparison methods)
print("Are the cluster labels the same?", np.array_equal(clustering.labels_, kmeans.labels_))


Z = linkage(X, 'ward')  # linkage matrix using Ward's method

dendrogram(Z)  # plot the dendrogram

plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point')
plt.ylabel('Distance')
plt.show()



max_dist = 3  # Example maximum distance. Adjust as needed based on the dendrogram.
clusters = fcluster(Z, max_dist, criterion='distance')
num_clusters = len(set(clusters))

print(f"Number of clusters (at max distance {max_dist}): {num_clusters}")


Output:

Agglomerative Clustering Labels: [1 1 1 0 0 0]
K-Means Clustering Labels: [1 0 1 0 0 0]
Are the cluster labels the same? False

 


Number of clusters (at max distance 3): 4
