Data preprocessing is an especially important phase in any machine learning workflow. Handling missing values, removing duplicates, and correcting data formats are essential steps for improving model performance. This guide gives detailed procedures for approaching these challenges, using real examples such as the Titanic dataset and the Loan Prediction dataset.
Before performing operations on a dataset, we first need to read the file and display its content. The following procedure shows how to do that.
Mount your Google Drive
To mount Google Drive, log into Google Colab with the same email ID that owns the Drive, then run the mount code in a notebook cell.
Grant the required permissions when prompted. After a successful mount, the message "Mounted at /content/drive" appears at the bottom of the code cell.
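The mount step can be sketched as follows. This is a minimal sketch that assumes a Google Colab runtime; the helper name mount_drive is hypothetical and only wraps the drive.mount() call so that the snippet does not fail when run outside Colab.

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True on success.

    mount_drive is a hypothetical helper used for illustration only.
    """
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import drive
    except ImportError:
        print("Not running in Google Colab; skipping the mount step.")
        return False
    # Opens the permission prompt, then prints "Mounted at /content/drive"
    drive.mount(mount_point)
    return True


mounted = mount_drive()
```

In a Colab notebook, the two lines inside the helper (the import and the drive.mount() call) are all that is usually needed.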
Read your file using its full absolute path.
Use read_csv() to read a .csv dataset. The function is available in the pandas package.
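A minimal sketch of the read step is shown below. To keep it runnable anywhere, it first writes a tiny stand-in CSV to a temporary folder; the file name sample_titanic.csv and its three rows are assumptions for illustration, not the real Titanic file. In Colab, the path would typically start with /content/drive/MyDrive/ followed by your own file name.

```python
import os
import tempfile

import pandas as pd

# Stand-in for a file stored on Drive (hypothetical sample data).
sample_csv = "PassengerId,Survived,Age\n1,0,22\n2,1,38\n3,1,\n"
csv_path = os.path.join(tempfile.gettempdir(), "sample_titanic.csv")
with open(csv_path, "w") as f:
    f.write(sample_csv)

# read_csv() parses the file into a DataFrame; the empty Age cell becomes NaN.
df = pd.read_csv(os.path.abspath(csv_path))
print(df.shape)
print(df)
```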
Viewing the sample content of the dataset
You can use the head() and tail() functions to view sample content of your dataset. By default, head() displays the first five rows, while tail() displays the last five rows.
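As a small illustration, the sketch below builds a 12-row DataFrame with made-up values (an assumption standing in for the real dataset) and takes samples from both ends. Passing a number, as in head(3), changes how many rows are returned.

```python
import pandas as pd

# Hypothetical 12-row sample standing in for the real dataset.
df = pd.DataFrame({
    "PassengerId": range(1, 13),
    "Age": [22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20],
})

first_rows = df.head()    # first five rows by default
last_rows = df.tail()     # last five rows by default
top_three = df.head(3)    # an explicit row count is also accepted

print(first_rows)
print(last_rows)
```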
1. Detect and Handle Missing Values
Missing data can arise due to various reasons, such as errors in data collection or incomplete records. It's important to manage missing values effectively.
Steps to Handle Missing Values:
Step 1: Detect Missing Values
Use pandas to identify missing values in the dataset.
The isnull() method in pandas is used to detect missing values in a DataFrame or Series. Missing values are usually represented as NaN (Not a Number) or None. This method returns a DataFrame or Series of the same shape, with True where values are missing and False otherwise.
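The detection step can be sketched like this. The small DataFrame and its column names are assumptions modelled on the Titanic dataset; chaining sum() onto isnull() is a common way to get a per-column count of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in a numeric and a text column.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

mask = df.isnull()                     # same shape as df, True where missing
missing_per_column = df.isnull().sum() # number of missing values per column

print(mask)
print(missing_per_column)
```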
Step 2: Handle Missing Values
Depending on the column type, handle missing values using:
Mean: Suitable for numeric columns without extreme outliers.
The fillna() method fills NaN values with a specified value; here, that value is the mean of the column. After filling, the Age column no longer contains NaN, which you can cross-check by counting the missing values again.
At the time of reading the file, the Age column had 177 NaN values; after filling, the count is zero.
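A minimal sketch of mean filling and the cross-check, assuming a small made-up Age column rather than the full Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical Age column with two gaps.
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 30.0]})

missing_before = df["Age"].isnull().sum()

# mean() skips NaN by default, so it averages 22, 26 and 30.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Cross-check: the count of missing values should now be zero.
missing_after = df["Age"].isnull().sum()
print(missing_before, missing_after)
```

Assigning the result back to the column, rather than filling in place, keeps the operation explicit.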
Median: Useful for numeric columns with outliers.
Outliers in numeric columns are extreme values that differ significantly from the majority of the data. They can distort statistical calculations such as the mean and hurt the performance of machine learning models. The median is far less sensitive to them, which makes it a safer fill value.
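The effect of an outlier can be seen in the following sketch, which assumes a made-up Fare column with one extreme value. The mean is pulled far above the typical fares, while the median stays close to them.

```python
import numpy as np
import pandas as pd

# Hypothetical Fare column: one gap and one extreme outlier (500).
df = pd.DataFrame({"Fare": [7.0, 8.0, np.nan, 9.0, 500.0]})

mean_fare = df["Fare"].mean()      # dragged upward by the outlier
median_fare = df["Fare"].median()  # stays near the typical values

# Fill the gap with the median instead of the mean.
df["Fare"] = df["Fare"].fillna(median_fare)
print(mean_fare, median_fare)
```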
Mode: Ideal for categorical columns.
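For a categorical column, filling with the mode might look like the sketch below. The Embarked values are made up for illustration. Note that mode() returns a Series, because several values can tie for most frequent, so the first entry is selected.

```python
import pandas as pd

# Hypothetical categorical column with one missing entry.
df = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q", "S"]})

# mode() returns a Series; [0] picks the first (here the only) mode.
most_common = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common)
print(most_common)
```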
Interpolation: Good for filling gaps in sequential data.
Interpolation is a method of estimating missing values within a dataset based on the known values of surrounding data points. It is especially useful for handling missing values in numeric and time-series data where trends or patterns exist.
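A minimal sketch of interpolation on an evenly spaced, made-up series is shown below. It relies on the default linear method of interpolate(), which places each missing value on a straight line between its known neighbours.

```python
import numpy as np
import pandas as pd

# Hypothetical sequential readings with gaps.
s = pd.Series([10.0, np.nan, 30.0, np.nan, np.nan, 60.0])

# Default method is linear: gaps are filled along a straight line
# between the surrounding known values.
filled = s.interpolate()
print(filled)
```

Other methods exist (for example time-based interpolation on a datetime index), but linear is a reasonable starting point when the data follows a steady trend.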
2. Identify and Remove Duplicate Rows
Duplicate rows can skew analysis and reduce model accuracy. Removing them ensures clean data.
Steps to Remove Duplicate Rows:
Step 1: Identify Duplicates
Use the duplicated() method to detect duplicates. It marks every row that repeats an earlier row as True.
Step 2: Remove Duplicates
Drop duplicate rows using the drop_duplicates() method.
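Both steps can be sketched together on a small made-up DataFrame (the names and ages are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data in which the third row repeats the first.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cara"],
    "Age": [22, 38, 22, 26],
})

# Step 1: True for each row that repeats an earlier row.
dup_mask = df.duplicated()
dup_count = dup_mask.sum()

# Step 2: keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)
print(dup_count)
print(deduped)
```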
3. Correct Inconsistent Data Formats
Inconsistent data formats can hinder data analysis and model performance. Ensure uniformity in your dataset.
Steps to Correct Data Formats:
Step 1: Categorical Data Formats
Standardize categorical values for consistency:
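One possible way to standardize labels is to trim whitespace and convert everything to lower case, as in this sketch. The Gender column and its inconsistent spellings are assumptions for illustration; the right normalization depends on how the real column is inconsistent.

```python
import pandas as pd

# Hypothetical column with mixed case and stray spaces.
df = pd.DataFrame({"Gender": ["Male", "male ", "FEMALE", " Female", "male"]})

# Strip surrounding whitespace, then lower-case every label.
df["Gender"] = df["Gender"].str.strip().str.lower()
print(df["Gender"].tolist())
```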
Cross-check the above operation by listing the distinct values that remain in the column.
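A cross-check might look like the following sketch, which compares the number of distinct labels before and after cleaning on the same kind of made-up Gender column, then counts each remaining label with value_counts().

```python
import pandas as pd

# Hypothetical column with five spellings of two categories.
df = pd.DataFrame({"Gender": ["Male", "male ", "FEMALE", " Female", "male"]})

labels_before = df["Gender"].nunique()

df["Gender"] = df["Gender"].str.strip().str.lower()

labels_after = sorted(df["Gender"].unique())
counts = df["Gender"].value_counts()
print(labels_before, labels_after)
print(counts)
```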
Step 2: Numeric Data Formats
Ensure numeric columns have appropriate data types:
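The sketch below shows two common conversions on made-up columns named after the Loan Prediction dataset (the values are assumptions). pd.to_numeric() with errors="coerce" turns unparseable entries into NaN, while astype() suits a column whose values are all clean.

```python
import pandas as pd

# Hypothetical columns stored as text.
df = pd.DataFrame({
    "LoanAmount": ["120", "66", "n/a", "141"],
    "Loan_Amount_Term": ["360", "360", "180", "360"],
})

# Entries that cannot be parsed ("n/a") become NaN instead of raising.
df["LoanAmount"] = pd.to_numeric(df["LoanAmount"], errors="coerce")

# Every value here is a clean integer string, so astype() is enough.
df["Loan_Amount_Term"] = df["Loan_Amount_Term"].astype("int64")

print(df.dtypes)
```

After coercion, any NaN values introduced this way can be handled with the missing-value techniques from section 1.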