Saturday, 22 February 2025

Classify data using the k-NN algorithm

Introduction

k‑Nearest Neighbors (KNN) is one of the simplest and most intuitive algorithms in machine learning. Despite its simplicity, KNN is powerful enough to solve both classification and regression tasks, making it a popular choice for many practical applications. This article will walk you through the principles behind KNN, how it works, and tips for getting the best performance from the algorithm.

How KNN Works

At its core, KNN is based on the idea that similar data points are likely to be near each other. For classification, KNN assigns a class label to an unknown data point by looking at the 'k' closest labeled points in the training set and using a majority vote. For regression, it predicts a value based on the average (or weighted average) of the k-nearest neighbors.

Key Steps in the KNN Algorithm:

  1. Choose the number of neighbors (k): Decide how many nearby points you want to consider.
  2. Measure distance: Compute the distance between the new data point and all the points in the training dataset. Common distance metrics include Euclidean, Manhattan, or Minkowski distance.
  3. Find the nearest neighbors: Select the k points in the training data that are closest to the new point.
  4. Vote on the output:
    • Classification: Use a majority vote among the k neighbors to decide the class.
    • Regression: Compute the average of the neighbors' values.
  5. Assign the class/estimate the value: The new point is classified or given a predicted value based on the above step.
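These steps map almost line-for-line onto code. Below is a minimal from-scratch sketch of KNN classification (the toy training points and the function name knn_predict are made up purely for illustration; in practice you would normally use a library implementation such as scikit-learn's KNeighborsClassifier, which appears later in this post):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: majority vote among the neighbors' labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset: two features, two classes
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [3.0, 3.2], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # expected: 0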

Distance Metrics

Choosing a distance metric is crucial because it affects how neighbors are determined. Some common choices are:

  • Euclidean Distance: The straight-line distance between two points.
  • Manhattan Distance: The sum of the absolute differences of their coordinates.
  • Minkowski Distance: A generalization of both Euclidean and Manhattan distances.

The choice of distance metric can influence the algorithm’s performance, especially in datasets where the scale of features varies.
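As a quick, self-contained illustration (the two sample points are arbitrary), all three metrics are available in SciPy:

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.5])

print(distance.euclidean(a, b))        # straight-line distance
print(distance.cityblock(a, b))        # Manhattan (city-block) distance
print(distance.minkowski(a, b, p=3))   # Minkowski; p=1 is Manhattan, p=2 is Euclidean

Because all of these metrics are scale-sensitive, it is common to standardize the features (for example with scikit-learn's StandardScaler) before fitting KNN.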

Choosing the Value of k

The parameter k is essential in KNN:

  • Small k (e.g., k = 1 or 2): The model becomes highly sensitive to noise and may overfit, as it considers very local patterns.
  • Large k: The model might be too smooth, leading to underfitting by averaging over too many points.

A common strategy is to test multiple values of k and select the one that provides the best performance using methods like cross-validation. Visualizing performance metrics (e.g., accuracy, error rate) as a function of k often helps in finding a balance.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
from sklearn.model_selection import train_test_split

# Load the Iris dataset and select two features for visualization
iris = datasets.load_iris()
X = iris.data[:, 2:4]  # petal length and petal width
y = iris.target

# Create color maps for plotting
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold  = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Evaluate different k values from 1 to 20
k_values = range(1, 21)
test_scores = []

for k in k_values:
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    score = knn.score(X_test, y_test)
    test_scores.append(score)

# Determine best k based on test set accuracy
best_k = k_values[np.argmax(test_scores)]
print("Best k:", best_k)
print("Test set accuracy with best k:", np.max(test_scores))

# Plot accuracy as a function of k
plt.figure(figsize=(6, 4))
plt.plot(k_values, test_scores, marker='o')
plt.title("k-NN Accuracy for Different k Values")
plt.xlabel("k (Number of Neighbors)")
plt.ylabel("Test Set Accuracy")
plt.xticks(k_values)
plt.show()

# Visualizing decision boundaries using best k
h = 0.02  # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict class for each point in the mesh
knn = neighbors.KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')  # shading='auto' avoids a shape-mismatch error in newer Matplotlib
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cmap_bold, edgecolor='k', s=50)
plt.title(f"Decision Boundaries (k = {best_k})")
plt.xlabel("Petal length")
plt.ylabel("Petal width")
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()


Output:

  Best k: 2
  Test set accuracy with best k: 1.0
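Note that the script above selects k by accuracy on a single test split, which is why an extreme result such as a perfect score with a very small k can appear. As mentioned earlier, cross-validation usually gives a more stable choice. A minimal sketch, reusing the variables from the script above:

from sklearn.model_selection import cross_val_score

cv_scores = []
for k in k_values:
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation on the training split only
    cv_scores.append(cross_val_score(knn, X_train, y_train, cv=5).mean())

best_k_cv = k_values[int(np.argmax(cv_scores))]
print("Best k by 5-fold CV:", best_k_cv)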

Variations of KNN

k‑Nearest Neighbors (KNN) isn't just one fixed algorithm; it comes in several variations tailored to different needs and data characteristics. Here are some common types and adaptations:

1. Standard KNN

  • KNN for Classification:
    Predicts the class of a new sample by taking a majority vote among its k nearest neighbors.
    Example: In the Iris dataset, a new flower’s species might be determined by looking at the species of its closest flowers.

  • KNN for Regression:
    Estimates a continuous value (e.g., house price) by averaging the values of the k nearest neighbors.
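A minimal regression sketch using scikit-learn's KNeighborsRegressor (the synthetic data here is made up for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 1-D problem: y is a noisy function of x
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(40)

reg = KNeighborsRegressor(n_neighbors=5)   # prediction = average of the 5 nearest targets
reg.fit(X, y)
print(reg.predict([[2.5]]))                # estimated value for a new x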

2. Weighted KNN

  • Concept:
    Instead of giving equal influence to each neighbor, weighted KNN assigns higher weights to closer neighbors.
  • Advantage:
    This can improve performance when closer data points are more relevant than those further away.
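In scikit-learn this only takes the weights parameter; a small sketch comparing the two settings on Iris (the choice of k = 7 is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

uniform  = KNeighborsClassifier(n_neighbors=7, weights='uniform')    # every neighbor counts equally
weighted = KNeighborsClassifier(n_neighbors=7, weights='distance')   # closer neighbors count more

print("uniform :", cross_val_score(uniform, X, y, cv=5).mean())
print("weighted:", cross_val_score(weighted, X, y, cv=5).mean())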

3. Condensed KNN

  • Purpose:
    Reduces the size of the training dataset by selecting a subset of points that are most informative.
  • Benefit:
    It speeds up prediction and reduces storage without sacrificing much accuracy.
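Condensing is not part of scikit-learn itself, but the third-party imbalanced-learn package provides it (assuming that package is installed; CondensedNearestNeighbour below is its class, not scikit-learn's). A minimal sketch on Iris:

from sklearn.datasets import load_iris
from imblearn.under_sampling import CondensedNearestNeighbour

X, y = load_iris(return_X_y=True)

# Keep only a small, informative subset of the training points
cnn = CondensedNearestNeighbour(random_state=42)
X_small, y_small = cnn.fit_resample(X, y)
print("original size:", X.shape[0], "-> condensed size:", X_small.shape[0])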

4. Edited KNN

  • Process:
    Removes noisy or misclassified instances from the training data, effectively “editing” the dataset.
  • Outcome:
    Smoother decision boundaries and improved overall performance by reducing the impact of outliers.
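Editing is likewise available in imbalanced-learn as EditedNearestNeighbours (again a third-party class, shown here only as a sketch). It drops points whose own nearest neighbors mostly disagree with their label:

from sklearn.datasets import load_iris
from imblearn.under_sampling import EditedNearestNeighbours

X, y = load_iris(return_X_y=True)

# Remove points that are misclassified by their 3 nearest neighbors
enn = EditedNearestNeighbours(n_neighbors=3)
X_clean, y_clean = enn.fit_resample(X, y)
print("original size:", X.shape[0], "-> edited size:", X_clean.shape[0])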

5. Approximate KNN

  • Challenge Addressed:
    When datasets are very large, calculating distances to every training point can be computationally intensive.
  • Solutions:
    Speed up the neighbor search, either exactly or approximately, with specialized data structures and algorithms such as:
    • KD-Trees: Efficient for low to moderate dimensional data.
    • Ball Trees: Better suited for higher-dimensional spaces.
    • Locality Sensitive Hashing (LSH): Provides faster approximate searches at the cost of some accuracy.
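In scikit-learn the algorithm parameter chooses the search structure. Note that KD-trees and ball trees return exact neighbors, just faster; truly approximate methods such as LSH live in separate libraries (e.g. Annoy or FAISS). A small sketch:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for algo in ('brute', 'kd_tree', 'ball_tree'):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    knn.fit(X, y)
    print(algo, knn.score(X, y))   # same predictions, different search structure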

6. Fuzzy KNN

  • Approach:
    Instead of assigning a hard class label, fuzzy KNN provides a probability or membership degree for each class.
  • Advantage:
    This method can capture uncertainty, making it useful in cases where data points may not belong clearly to a single class.
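scikit-learn does not ship the classic fuzzy KNN algorithm, but predict_proba on an ordinary KNN classifier returns the (optionally distance-weighted) fraction of neighbors belonging to each class, which illustrates the same soft-membership idea (the query point below is an arbitrary Iris-like measurement):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=7, weights='distance')
knn.fit(X, y)

# Per-class membership degrees instead of a single hard label
print(knn.predict_proba([[6.1, 2.8, 4.7, 1.2]]))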

 

Advantages and Disadvantages

Advantages:

  • Simplicity: Easy to understand and implement.
  • No Training Phase: KNN is a lazy learner; it simply stores the training data and defers all computation to prediction time, so new data can be incorporated without retraining a model.
  • Versatility: Can be used for both classification and regression.

Disadvantages:

  • Computationally Intensive: Prediction time can be slow because the algorithm calculates the distance to every training point.
  • Sensitive to the Curse of Dimensionality: With a high number of features, the distance metric can become less meaningful, affecting performance.
  • Storage Requirement: Must store all training data, which may not be feasible for very large datasets.

Applications of KNN

KNN is widely used in various domains, such as:

  • Pattern Recognition: Handwriting detection, image classification, and facial recognition.
  • Recommendation Systems: Finding similar users or products.
  • Anomaly Detection: Identifying outliers in data by checking the distance to the nearest neighbors (see the short sketch after this list).
  • Medical Diagnosis: Classifying patient conditions based on similar previous cases.
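For the anomaly-detection use case, a simple sketch scores each point by the distance to its k-th nearest neighbor (a larger distance means more anomalous); the data and the value of k here are made up for illustration:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Mostly clustered points plus one obvious outlier
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[8.0, 8.0]]])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbor
distances, _ = nn.kneighbors(X)
scores = distances[:, -1]                         # distance to the k-th true neighbor

print("most anomalous point:", X[np.argmax(scores)])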
