Essential Data Transformation Techniques for Machine Learning: Log Transformation, Scaling, and Encoding ~ TUTORIALTPOINT

Raw data usually needs several transformations before analysis, both to make it better suited for modeling and to satisfy the assumptions of machine learning algorithms. I'll go over each transformation method briefly below, along with examples of how to apply it to datasets such as the Car Price or Iris datasets.

1. Mount the Google Drive
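The notebook cell for this step is not reproduced in the post. As a minimal sketch, assuming the code runs in Google Colab (where `google.colab.drive.mount` is available), mounting could look like the following. The helper name `mount_drive_if_colab` is hypothetical, and `/content/drive` is only the conventional mount point. Outside Colab the import fails and the helper reports that nothing was mounted.

```python
def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True if mounted."""
    try:
        # google.colab only exists inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        return False
    drive.mount(mount_point)  # prompts for authorization in Colab
    return True


mounted = mount_drive_if_colab()
print("Drive mounted:", mounted)
```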

2. Read the Dataset
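The file path used in the original notebook is not given, so the sketch below reads a few Iris-style rows from an inline string instead. It uses only the standard `csv` module; `SAMPLE_CSV` and `read_iris` are hypothetical names. In a notebook you would more likely pass the file path on the mounted Drive to `pandas.read_csv`.

```python
import csv
import io

# Illustrative stand-in for the real CSV file (a few Iris-style rows).
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

NUMERIC_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def read_iris(handle):
    """Read rows as dicts, converting the numerical features to float."""
    rows = []
    for row in csv.DictReader(handle):
        for name in NUMERIC_COLUMNS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


# With a real file, use open(path, newline="") instead of io.StringIO(...).
rows = read_iris(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows read")
```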

3. Print the top 5 rows to understand the data

The dataset contains the following columns:

sepal_length, sepal_width, petal_length, petal_width: Numerical features.
species: Categorical feature.
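A quick way to confirm this column layout is to print the first five rows and check which columns parse as numbers. The sketch below does this with the standard library on a small illustrative sample (`SAMPLE_CSV` is a hypothetical stand-in for the real file); with pandas, `DataFrame.head()` and `DataFrame.dtypes` give the same information.

```python
import csv
import io

# Illustrative stand-in for the real CSV file (a few Iris-style rows).
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Print the header and the first five rows.
columns = list(rows[0].keys())
print(columns)
for row in rows[:5]:
    print([row[name] for name in columns])


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# A column is treated as numerical if every value parses as a float.
numerical = [c for c in columns if all(is_number(r[c]) for r in rows)]
categorical = [c for c in columns if c not in numerical]
print("numerical:", numerical)
print("categorical:", categorical)
```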

4. Log Transformation

Purpose:
- Logarithmic transformations are used to reduce right (positive) skew in numerical data, particularly when the data is highly skewed or has large outliers.
- The transformation compresses large values much more strongly than small ones, which can bring a right-skewed distribution closer to a normal distribution.
How it works:
- We use the formula: $x_{\text{log}} = \log(1 + x)$
- Adding 1 ensures the log function handles zero values without errors.
- This is applied to numerical columns: sepal_length, sepal_width, petal_length, and petal_width.
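The formula above can be sketched with `math.log1p`, which computes $\log(1 + x)$ directly. The values below are illustrative and the helper name `log_transform` is hypothetical; on a DataFrame column, `numpy.log1p` would be the usual equivalent.

```python
import math

# Illustrative sepal_length values.
sepal_length = [5.1, 4.9, 7.0, 6.4, 6.3, 5.8]


def log_transform(values):
    """Apply log(1 + x) to each value; requires x > -1."""
    return [math.log1p(x) for x in values]


transformed = log_transform(sepal_length)
print([round(v, 4) for v in transformed])

# log(1 + 0) is 0, so zero values pass through without an error.
print(log_transform([0.0]))
```

The transformation is monotonic, so the ordering of the values is unchanged; only the spacing between them shrinks as the values grow.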

5. Scaling

Scaling adjusts the range and distribution of numerical data to make it suitable for machine learning models and statistical analysis.

a. Min-Max Scaling

Purpose:
- Rescales the values of numerical features to lie within a specific range, typically [0, 1].
- Useful for algorithms sensitive to the scale of the data, such as gradient-based optimization algorithms.
How it works:
- For a value $x$, the formula is: $x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}$, where the minimum and maximum are taken over the whole column.
- This transformation ensures all values are within the range [0, 1].
Applied to:
- Numerical columns after log transformation.
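As a minimal sketch of min-max scaling applied after the log transformation, using illustrative petal_length values: the helper name `min_max_scale` is hypothetical, and returning zeros for a constant column is my own assumption to avoid dividing by zero. In a scikit-learn workflow, `sklearn.preprocessing.MinMaxScaler` is the usual tool for this.

```python
import math


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to all zeros.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Illustrative petal_length values, log-transformed first.
petal_length = [1.4, 1.4, 4.7, 4.5, 6.0, 5.1]
logged = [math.log1p(x) for x in petal_length]

scaled = min_max_scale(logged)
print([round(v, 4) for v in scaled])
```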

b. Standardization (Z-Score Normalization)

Purpose:
- Centers the data around a mean of 0 and scales it to have a standard deviation of 1.
- Suitable when the data is assumed to follow an approximately normal distribution, or when features measured on different scales must contribute comparably to distance calculations.
How it works:
- For a value $x$, the formula is: $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation of the column.
Applied to:
- Numerical columns after log transformation.
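Standardization can be sketched the same way with the `statistics` module. This example uses the population standard deviation (`statistics.pstdev`), which matches what `sklearn.preprocessing.StandardScaler` computes; note that pandas' `Series.std()` defaults to the sample standard deviation, so results differ slightly. The helper name `standardize` and the constant-column handling are assumptions.

```python
import math
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # Assumption: a constant column is mapped to all zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Illustrative sepal_width values, log-transformed first.
sepal_width = [3.5, 3.0, 3.2, 3.2, 3.3, 2.7]
logged = [math.log1p(x) for x in sepal_width]

z_scores = standardize(logged)
print([round(z, 4) for z in z_scores])
```

After the transformation the column has mean 0 and standard deviation 1, up to floating-point rounding.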

6. Feature Encoding

Feature encoding is used to convert categorical variables into numerical formats that can be used by machine learning models.

a. One-Hot Encoding

Purpose:
- Converts a categorical variable into multiple binary (0/1) columns, where each column represents a category.
- Ensures no ordinal relationship is assumed between categories, making it suitable for nominal data.
How it works:
- For the categorical variable species, which has three unique values (setosa, versicolor, and virginica):
  - Create binary columns: species_versicolor and species_virginica (dropping the first category to avoid redundancy).
  - A row with species = setosa is represented as: $[\text{species\_versicolor} = 0,\ \text{species\_virginica} = 0]$
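The drop-first scheme described above can be sketched in plain Python. The helper name `one_hot_drop_first` is hypothetical, and taking the alphabetically first category as the dropped one is an assumption; with pandas, `pandas.get_dummies(..., drop_first=True)` produces the same kind of columns.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode values, dropping the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is encoded as all zeros
    return [
        {f"{prefix}_{category}": int(value == category) for category in kept}
        for value in values
    ]


species = ["setosa", "versicolor", "virginica", "setosa"]
encoded = one_hot_drop_first(species, prefix="species")
for label, row in zip(species, encoded):
    print(label, row)
```

Because setosa is the dropped category, it is identified by both remaining columns being 0, so no information is lost.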

b. Label Encoding

Purpose:
- Converts a categorical variable into a single numerical column by assigning each category a unique integer label.
- Suitable for ordinal categorical variables where the order of categories has meaning.
How it works:
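- For species (used here only for illustration; species is nominal, so the integer order carries no real meaning):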
  - Assign labels such as:
    - setosa = 0
    - versicolor = 1
    - virginica = 2
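A minimal sketch of this mapping, assuming labels are assigned in sorted order of the category names (which is also how `sklearn.preprocessing.LabelEncoder` orders its classes). The helper name `label_encode` is hypothetical.

```python
def label_encode(values):
    """Map each category to an integer, assigned in sorted order."""
    classes = sorted(set(values))
    mapping = {category: index for index, category in enumerate(classes)}
    return [mapping[value] for value in values], mapping


species = ["setosa", "virginica", "versicolor", "setosa"]
codes, mapping = label_encode(species)
print(mapping)
print(codes)
```

Keeping the mapping alongside the codes makes it possible to translate predictions back to category names later.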

Monday, 9 December 2024
