Raw data usually needs several transformations before analysis, both to make it better suited for modeling and to satisfy the assumptions that machine learning algorithms make. I'll briefly go over each transformation method below, along with examples of how to use it on datasets like the Car Price or Iris datasets. Before executing the following code, watch the video here
1. Mount Google Drive
2. Read the Dataset
3. Print the top 5 rows to get to know the data
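The original code cells for steps 1 to 3 are not reproduced here, so the following is only a minimal sketch of what they might look like. It assumes pandas, and it uses a hypothetical inline CSV holding a few rows in the shape of the Iris dataset so that it runs outside Colab; the variable names `csv_text` and `df` are illustrative.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file on Google Drive: a few rows in the
# shape of the Iris dataset, kept inline so the sketch runs anywhere.
# In Colab you would typically mount Drive first with
#   from google.colab import drive; drive.mount('/content/drive')
# and pass the path of your own file to pd.read_csv instead.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

# Step 2: read the dataset into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Step 3: print the first five rows and the column types.
print(df.head())
print(df.dtypes)
```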
The dataset contains the following columns:
- sepal_length, sepal_width, petal_length, petal_width: Numerical features.
- species: Categorical feature.
4. Log Transformation
Purpose:
- Logarithmic transformations are used to reduce skewness in numerical data, particularly when the data is highly skewed or has outliers.
- It compresses large values much more than small ones, which can bring a right-skewed distribution closer to a normal distribution.
How it works:
- We use the formula: $x' = \log(x + 1)$
- Adding 1 ensures the log function handles zero values without errors.
- This is applied to numerical columns: sepal_length, sepal_width, petal_length, and petal_width.
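As a rough illustration of the log transformation described above, the sketch below applies log(x + 1) to the four numerical columns. It assumes NumPy and pandas and uses `np.log1p`, which computes the natural log of 1 + x; the small DataFrame and the names `num_cols` and `df_log` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows in the shape of the Iris dataset.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 5.8],
    "sepal_width": [3.5, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 4.7, 6.0, 5.1],
    "petal_width": [0.2, 1.4, 2.5, 1.9],
    "species": ["setosa", "versicolor", "virginica", "virginica"],
})

num_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Work on a copy so the raw values stay available for comparison.
df_log = df.copy()

# log1p(x) = ln(1 + x), so a value of 0 maps to 0 instead of raising an error.
df_log[num_cols] = np.log1p(df[num_cols])

print(df_log.head())
```

Keeping the transformed values in a separate DataFrame makes it easy to compare skewness before and after, for example with `df[num_cols].skew()` and `df_log[num_cols].skew()`.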
5. Scaling
Scaling adjusts the range and distribution of numerical data to make it suitable for machine learning models and statistical analysis.
a. Min-Max Scaling
Purpose:
- Rescales the values of numerical features to lie within a specific range, typically [0, 1].
- Useful for algorithms sensitive to the scale of the data, such as gradient-based optimization algorithms.
How it works:
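- For a value $x$, the formula is: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, where $x_{\min}$ and $x_{\max}$ are the minimum and maximum of the column.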
- This transformation ensures all values are within the range [0, 1].
Applied to:
- Numerical columns after log transformation.
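The min-max formula can be sketched directly in pandas, applied to the log-transformed columns as described above. This is an illustrative sketch rather than the post's original code: the sample values and the helper name `min_max_scale` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows in the shape of the Iris numerical columns.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 5.8],
    "sepal_width": [3.5, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 4.7, 6.0, 5.1],
    "petal_width": [0.2, 1.4, 2.5, 1.9],
})

# Log-transform first, since scaling is applied after the log step.
logged = np.log1p(df)


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1] using its own minimum and maximum."""
    col_min = frame.min()
    col_max = frame.max()
    # Note: a constant column would give 0 / 0 here and produce NaN.
    return (frame - col_min) / (col_max - col_min)


scaled = min_max_scale(logged)
print(scaled)
```

In a modeling workflow the minimum and maximum would normally be computed on the training data only and then reused for new data, which is the behavior that scikit-learn's `MinMaxScaler` provides.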
b. Standardization (Z-Score Normalization)
Purpose:
- Centers the data around a mean of 0 and scales it to have a standard deviation of 1.
- Commonly used when the data is assumed to be roughly normally distributed, or when relative distances between observations should be preserved.
How it works:
- For a value $x$, the formula is: $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation of the column.
Applied to:
- Numerical columns after log transformation.
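A minimal sketch of standardization on the log-transformed columns follows. It assumes pandas and uses the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` follows; the sample values and variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows in the shape of the Iris numerical columns.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 5.8],
    "sepal_width": [3.5, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 4.7, 6.0, 5.1],
    "petal_width": [0.2, 1.4, 2.5, 1.9],
})

# Log-transform first, since scaling is applied after the log step.
logged = np.log1p(df)

# z = (x - mean) / std, computed column by column.
# ddof=0 is the population standard deviation; pandas defaults to ddof=1.
mean = logged.mean()
std = logged.std(ddof=0)
standardized = (logged - mean) / std

print(standardized)
```

Unlike min-max scaling, the result is not bounded to a fixed range: each column ends up with mean 0 and standard deviation 1, but individual values can fall well outside [-1, 1].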
6. Feature Encoding
Feature encoding is used to convert categorical variables into numerical formats that can be used by machine learning models.
a. One-Hot Encoding
Purpose:
- Converts a categorical variable into multiple binary (0/1) columns, where each column represents a category.
- Ensures no ordinal relationship is assumed between categories, making it suitable for nominal data.
How it works:
- For a categorical variable species with three unique values (setosa, versicolor, and virginica):
  - Create binary columns: species_versicolor and species_virginica (dropping the first category to avoid redundancy).
  - A row with species = setosa is represented as: species_versicolor = 0, species_virginica = 0.
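One way to produce this encoding is `pandas.get_dummies` with `drop_first=True`; the sketch below assumes that approach, which may differ from the post's original code. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows: one numerical column plus the species column.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "species": ["setosa", "versicolor", "virginica"],
})

# drop_first=True drops the first category in sorted order (setosa),
# leaving species_versicolor and species_virginica.
# dtype=int gives 0/1 integers instead of booleans.
encoded = pd.get_dummies(df, columns=["species"], drop_first=True, dtype=int)

print(encoded)
```

The setosa row has 0 in both new columns, so the dropped category is still recoverable: it is the case where every indicator is 0.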
b. Label Encoding
Purpose:
- Converts a categorical variable into a single numerical column by assigning each category a unique integer label.
- Suitable for ordinal categorical variables where the order of categories has meaning.
How it works:
- For species:
  - Assign labels such as: setosa = 0, versicolor = 1, virginica = 2
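The label assignment above can be sketched with `pandas.factorize`, which is an assumption on my part rather than the post's original code (scikit-learn's `LabelEncoder` is another common choice). With `sort=True` the categories are numbered in alphabetical order. Note that species has no natural order, so the integers here act only as identifiers.

```python
import pandas as pd

# Hypothetical sample rows containing only the species column.
df = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

# factorize returns the integer codes and the unique categories.
# sort=True numbers the categories in sorted order:
# setosa = 0, versicolor = 1, virginica = 2.
codes, classes = pd.factorize(df["species"], sort=True)

df["species_label"] = codes

print(df)
print(list(classes))
```

Keeping `classes` around makes it possible to map the integers back to names later, for example with `classes[codes]`.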