Monday, 9 December 2024

Essential Data Transformation Techniques for Machine Learning: Log Transformation, Scaling, and Encoding

Raw data usually needs several transformations before analysis, both to make it better suited for modeling and to satisfy the assumptions that machine learning algorithms make. Below I briefly cover each transformation technique, with examples of how to apply it to datasets such as the Car Price or Iris dataset. Before executing the following code, watch the video here

1. Mount Google Drive
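A minimal sketch of this step, assuming the notebook runs in Google Colab (the `/content/drive` mount point is Colab's convention; outside Colab the helper simply does nothing):

```python
def mount_drive():
    """Mount Google Drive when running inside Colab; no-op elsewhere."""
    try:
        from google.colab import drive  # available only in the Colab runtime
    except ImportError:
        return False
    drive.mount('/content/drive')       # files then appear under /content/drive/MyDrive
    return True

mounted = mount_drive()
```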


 2. Read the Dataset
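A sketch of this step with pandas. The Drive path is illustrative, not the post's actual path; as a fallback, scikit-learn's bundled copy of Iris is reshaped into the same column layout the post describes:

```python
import pandas as pd

try:
    # Illustrative path; adjust to wherever the CSV sits in your Drive.
    df = pd.read_csv('/content/drive/MyDrive/iris.csv')
except FileNotFoundError:
    # Fallback: build the same columns from scikit-learn's bundled copy.
    from sklearn.datasets import load_iris
    iris = load_iris(as_frame=True)
    df = iris.frame.rename(columns={
        'sepal length (cm)': 'sepal_length',
        'sepal width (cm)': 'sepal_width',
        'petal length (cm)': 'petal_length',
        'petal width (cm)': 'petal_width',
    })
    df['species'] = iris.target_names[iris.target]
    df = df.drop(columns='target')
```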


3. Print the Top 5 Rows to Get to Know the Data
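This step is just a call to `df.head()`. A self-contained sketch, using a few illustrative rows in the Iris layout rather than the full file:

```python
import pandas as pd

# Five illustrative rows in the same layout as the Iris dataset.
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal_width':  [3.5, 3.0, 3.2, 3.1, 3.6],
    'petal_length': [1.4, 1.4, 1.3, 1.5, 1.4],
    'petal_width':  [0.2, 0.2, 0.2, 0.2, 0.2],
    'species':      ['setosa'] * 5,
})
print(df.head())  # top 5 rows: column names and value ranges at a glance
```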

The dataset contains the following columns:

  • sepal_length, sepal_width, petal_length, petal_width: Numerical features.
  • species: Categorical feature.

4. Log Transformation

  • Purpose:

    • Logarithmic transformations are used to reduce skewness in numerical data, particularly when the data is highly skewed or has outliers.
    • It compresses large values while expanding smaller ones, bringing the distribution closer to a normal distribution.
  • How it works:

    • We use the formula: transformed value = log(1 + x)
    • Adding 1 ensures the log function handles zero values without errors.
    • This is applied to numerical columns: sepal_length, sepal_width, petal_length, and petal_width.
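The step above can be sketched with NumPy's `log1p`, which computes log(1 + x) directly (the sample rows are illustrative; the column names follow the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width':  [3.5, 3.0, 3.2],
    'petal_length': [1.4, 1.4, 1.3],
    'petal_width':  [0.2, 0.2, 0.2],
    'species':      ['setosa'] * 3,
})
num_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
# log1p(x) = log(1 + x): handles zeros safely, compresses large values.
df[num_cols] = np.log1p(df[num_cols])
```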

5. Scaling

Scaling adjusts the range and distribution of numerical data to make it suitable for machine learning models and statistical analysis.

a. Min-Max Scaling

  • Purpose:

    • Rescales the values of numerical features to lie within a specific range, typically [0, 1].
    • Useful for algorithms sensitive to the scale of the data, such as gradient-based optimization algorithms.
  • How it works:

    • For a value x, the formula is: x_scaled = (x − min(x)) / (max(x) − min(x))
    • This transformation ensures all values are within the range [0, 1].
  • Applied to:

    • Numerical columns after log transformation.
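A sketch of min-max scaling with scikit-learn's `MinMaxScaler` (the sample values are illustrative; in the post's pipeline this runs on the log-transformed numerical columns):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 6.4],
    'sepal_width':  [3.5, 3.0, 3.2, 2.9],
})
scaler = MinMaxScaler()  # maps each column onto [0, 1]
df[df.columns] = scaler.fit_transform(df)
```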

 

b. Standardization (Z-Score Normalization)

  • Purpose:

    • Centers the data around a mean of 0 and scales it to have a standard deviation of 1.
    • Suitable when the data is assumed to follow a normal distribution or when preserving relative distances is critical.
  • How it works:

    • For a value x, the formula is: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.
  • Applied to:

    • Numerical columns after log transformation.
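A sketch of standardization with scikit-learn's `StandardScaler` (again with illustrative sample values; in the post's pipeline this runs on the log-transformed columns):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 6.4],
    'sepal_width':  [3.5, 3.0, 3.2, 2.9],
})
scaler = StandardScaler()  # per column: subtract the mean, divide by the std
df[df.columns] = scaler.fit_transform(df)
```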

 

6. Feature Encoding

Feature encoding is used to convert categorical variables into numerical formats that can be used by machine learning models.

a. One-Hot Encoding

  • Purpose:

    • Converts a categorical variable into multiple binary (0/1) columns, where each column represents a category.
    • Ensures no ordinal relationship is assumed between categories, making it suitable for nominal data.
  • How it works:

    • For a categorical variable species with three unique values: setosa, versicolor, and virginica:
      • Create binary columns: species_versicolor and species_virginica (dropping the first category to avoid redundancy).
      • A row with species = setosa is represented as: [species_versicolor = 0, species_virginica = 0]
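The behavior described above can be sketched with pandas' `get_dummies` (a tiny illustrative frame stands in for the full dataset):

```python
import pandas as pd

df = pd.DataFrame({'species': ['setosa', 'versicolor', 'virginica']})
# drop_first=True drops species_setosa, avoiding the redundant column;
# a setosa row is then all zeros across the remaining dummy columns.
encoded = pd.get_dummies(df, columns=['species'], drop_first=True)
```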


 

b. Label Encoding

  • Purpose:

    • Converts a categorical variable into a single numerical column by assigning each category a unique integer label.
    • Suitable for ordinal categorical variables where the order of categories has meaning.
  • How it works:

    • For species:
      • Assign labels such as:
        • setosa = 0
        • versicolor = 1
        • virginica = 2
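The label assignment above can be sketched with scikit-learn's `LabelEncoder`, which numbers the categories in alphabetical order, matching the mapping listed:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'species': ['setosa', 'versicolor', 'virginica', 'setosa']})
le = LabelEncoder()  # assigns integers in alphabetical order of categories
df['species'] = le.fit_transform(df['species'])
```

Note that for truly ordinal variables where the alphabetical order is wrong, an explicit mapping (e.g. a dict passed to `Series.map`) is the safer choice.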


