Sunday, 19 January 2025

Exploring Dimensionality Reduction with PCA

Data science is packed with techniques, and the one we examine in this article is dimensionality reduction. Reducing the number of features lowers the complexity of our models, makes them more interpretable, and can even improve their performance. In this post we'll explore Principal Component Analysis (PCA), one of the most popular methods for dimensionality reduction, using the Wine dataset as an example.

Introduction to PCA

PCA (Principal Component Analysis) is a statistical technique that transforms a dataset into a new coordinate system. It finds the directions, referred to as principal components, along which the variance of the data is greatest. PCA lets us discover patterns in high-dimensional data and visualize it in lower dimensions (e.g., 2D or 3D).
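To make this concrete, here is a minimal sketch of the mechanics on toy NumPy data (not the Wine dataset or the post's own code): center the data, diagonalize the covariance matrix, and take the top eigenvectors as the new axes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # toy data: 100 samples, 3 features
Xc = X - X.mean(axis=0)                 # center each feature at zero

cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigensolver for symmetric matrices

order = np.argsort(eigvals)[::-1]       # sort by variance, largest first
components = eigvecs[:, order].T        # rows = principal components
X_2d = Xc @ components[:2].T            # project onto the top two directions
```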

Dataset Overview

The Wine dataset is a classic benchmark dataset containing:

  • Features: 13 numerical attributes, representing the chemical composition of different wines.
  • Target: A label indicating the wine's cultivar (class).

This high-dimensional dataset is ideal for demonstrating the power of PCA.

Steps in PCA Analysis

1. Apply PCA to Reduce Dimensions

The Wine dataset was reduced to 3 principal components. Each principal component is a linear combination of the original features, constructed to maximize variance. By retaining only a few components, we compress the data while preserving its structure.
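As a quick sanity check of the "linear combination" claim (a sketch assuming scikit-learn, whose `PCA` exposes the combination weights as `components_`), the transformed scores can be reproduced by projecting the centered data onto those weight vectors:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_wine().data)
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# Each row of pca.components_ holds the weights of one linear combination;
# projecting the centered data onto those weights reproduces the PCA scores.
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(manual, X_pca))  # True
```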

2. Visualize Principal Components

We created scatter plots to visualize the data in 2D and 3D. Points were colored by their class labels, revealing clusters that correspond to the wine cultivars.

3. Analyze Variance Explained

The explained variance measures how much information (variance) each principal component retains. The cumulative variance helps determine how many components are sufficient to represent the data.

Key Visualizations

Histogram of Principal Components

The histograms show the distribution of values in the first few principal components, highlighting the compressed nature of the transformed data.

2D Visualization

A scatter plot of the first two principal components reveals clear separations between wine cultivars. This suggests that much of the data's structure can be captured in just two dimensions.

3D Visualization

Adding a third principal component provides an additional dimension for separating the classes. This 3D scatter plot offers a richer view of the clusters.

Explained Variance Analysis

The scree plot and cumulative variance plot demonstrate how much variance is captured by each component:

  • The first few components capture the majority of the variance.
  • For example, the first two components captured 85% of the variance, making them sufficient for most analyses.

Insights and Observations

  1. Dimensionality Reduction: Using PCA, we reduced 13 features to just 2 or 3 components while retaining most of the dataset's variance.
  2. Visualization: The transformed data revealed clear separations between wine cultivars, even in 2D.
  3. Explained Variance: The first three components captured over 90% of the variance, demonstrating PCA's efficiency.
 
The implementation steps are as follows.

Mount the drive
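If you are running the notebook in Google Colab, Drive can be mounted like so. It is optional here, since the Wine dataset ships with scikit-learn rather than being read from Drive:

```python
# Mount Google Drive in a Colab notebook (optional for this tutorial,
# because the Wine dataset is bundled with scikit-learn)
from google.colab import drive
drive.mount('/content/drive')
```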



a) Load the predefined dataset, apply PCA, and visualize it.
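A minimal sketch of this step, assuming scikit-learn's built-in `load_wine` loader and standardizing the features before PCA (a common, but not mandatory, preprocessing choice):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wine dataset: 178 samples, 13 features, 3 cultivar classes
wine = load_wine()
X, y = wine.data, wine.target

# Standardize so that large-scale features don't dominate the variance
X_scaled = StandardScaler().fit_transform(X)

# Reduce 13 dimensions to 3 principal components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
print("Transformed shape:", X_pca.shape)  # (178, 3)

# Histograms of the three principal components
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for i, ax in enumerate(axes):
    ax.hist(X_pca[:, i], bins=20)
    ax.set_title(f"PC{i + 1}")
plt.tight_layout()
plt.show()
```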

b) Visualize principal components in a 2D plot.
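One way this plot could be produced; the first few lines simply repeat the setup from step (a) so the snippet runs on its own:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

wine = load_wine()
y = wine.target
X_pca = PCA(n_components=3).fit_transform(
    StandardScaler().fit_transform(wine.data))

# Scatter plot of the first two principal components, colored by cultivar
for label in np.unique(y):
    mask = y == label
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1],
                label=wine.target_names[label])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Wine cultivars in the first two principal components")
plt.legend()
plt.show()
```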


c) Visualize principal components in a 3D plot.
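A similar sketch for the 3D view, using matplotlib's built-in `3d` projection:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

wine = load_wine()
y = wine.target
X_pca = PCA(n_components=3).fit_transform(
    StandardScaler().fit_transform(wine.data))

# 3D scatter plot of the first three principal components
fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(111, projection="3d")
for label in np.unique(y):
    mask = y == label
    ax.scatter(X_pca[mask, 0], X_pca[mask, 1], X_pca[mask, 2],
               label=wine.target_names[label])
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
ax.legend()
plt.show()
```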
 


 
 
 
d) Analyze variance explained by each principal component.
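A sketch of this step, fitting PCA with all 13 components so the full variance spectrum is visible. The variable names `explained_variance` and `cumulative_variance` below match the discussion that follows:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Keep all 13 components to examine the full variance spectrum
pca = PCA().fit(X_scaled)
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

# Scree plot (per-component variance) with the cumulative curve overlaid
components = range(1, len(explained_variance) + 1)
plt.bar(components, explained_variance, label="Per component")
plt.plot(components, cumulative_variance, marker="o", color="red",
         label="Cumulative")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.legend()
plt.show()
```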
 

 
`explained_variance`: This array shows the proportion of variance in the original dataset that is explained by each principal component. For example, if `explained_variance[0]` is 0.4, the first principal component accounts for 40% of the total variance in the data. If all components are retained, the elements of `explained_variance` sum to 1 (or very close to 1 due to numerical precision); if only a subset of components is kept, the sum equals the fraction of the total variance that subset retains. Each subsequent principal component explains no more variance than the preceding one, as PCA orders components by the amount of variance they capture.

`cumulative_variance`: This array shows the cumulative proportion of variance explained by the principal components up to a given point. `cumulative_variance[0]` is the same as `explained_variance[0]`; `cumulative_variance[1]` is the sum of `explained_variance[0]` and `explained_variance[1]`, representing the total variance explained by the first two principal components; `cumulative_variance[2]` adds the third, and so on. This is crucial for understanding how many principal components are needed to retain a sufficient amount of information from the original dataset. For example, if `cumulative_variance[2]` is 0.95, the first three principal components explain 95% of the total variance in the original data.
 
In the context of dimensionality reduction, `explained_variance` and `cumulative_variance` help you decide how many principal components to keep: the goal is to reduce the dimensionality of the data while retaining most of the variance. You can choose a threshold for cumulative variance (e.g., 95% or 99%) and select the number of components needed to reach it, as shown below. In this code, only the first three principal components are plotted, but the number of components worth keeping depends on your analysis goals and the cumulative variance retained.
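For instance, selecting the smallest number of components that clears a chosen threshold takes a single `argmax` call over the array from step (d). The 95% figure here is just an example:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cumulative_variance = np.cumsum(
    PCA().fit(StandardScaler().fit_transform(load_wine().data))
    .explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
threshold = 0.95  # example cutoff; 99% is another common choice
n_keep = int(np.argmax(cumulative_variance >= threshold)) + 1
print(f"{n_keep} components retain "
      f"{cumulative_variance[n_keep - 1]:.1%} of the variance")
```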



The scree plot visualizes the variance explained by each principal component, and the cumulative curve shows the running total. Together they help determine the optimal number of components to retain for dimensionality reduction.
 
Look for an "elbow point" in the scree plot, where the slope starts to flatten out. This indicates the point of diminishing returns: adding more principal components provides only a small increase in explained variance. The number of components at or just before the elbow is often a good choice for dimensionality reduction.
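The elbow is usually judged by eye, but a rough numeric stand-in is to stop once the marginal gain from an extra component falls below a cutoff. The 5% cutoff here is an arbitrary assumption, not part of the original analysis:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

explained_variance = PCA().fit(
    StandardScaler().fit_transform(load_wine().data)).explained_variance_ratio_

# Heuristic elbow: first component whose marginal variance gain is "small"
cutoff = 0.05  # assumption: treat gains below 5% as diminishing returns
elbow = int(np.argmax(explained_variance < cutoff))
print(f"Elbow heuristic suggests keeping about {elbow} components")
```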



