Monday, 2 December 2024

Comprehensive Guide to Handling Missing Values, Duplicates, and Data Format Issues in Machine Learning

Data pre-processing is a critical phase of any machine learning workflow. Handling missing values, removing duplicates, and correcting data formats are essential steps for improving model performance. This guide describes how to approach each of these tasks, using real examples such as the Titanic dataset and the Loan Prediction dataset.

Before performing any operations on a dataset, we first need to read the file and display its contents. The following procedure shows how.

Mount your Google Drive

To mount Google Drive, log into Google Colab with the same email ID you use for Google Drive and run the mount command in a code cell.
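A minimal sketch of the mount step, assuming the notebook runs in Google Colab. The google.colab module exists only inside Colab, so the import is guarded here to keep the cell harmless elsewhere; the `mounted` flag is just an illustrative name.

```python
# Mount Google Drive inside a Colab notebook.
# Assumption: google.colab is importable only inside Colab, so the
# import is guarded and the mount is skipped everywhere else.
try:
    from google.colab import drive
except ImportError:
    drive = None

if drive is not None:
    drive.mount("/content/drive")  # asks for permission on first run
    mounted = True
else:
    mounted = False
    print("Not running in Google Colab; skipping the Drive mount.")
```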

Grant the requested permissions. After a successful mount, the message "Mounted at /content/drive" appears at the bottom of the code cell.

Read your file using its full absolute path.

Use read_csv() to read a .csv dataset. The function is available in the pandas package.
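As a rough sketch, reading a CSV file might look like the following. The file written here is a tiny hypothetical stand-in so the example runs anywhere; in Colab you would instead pass the absolute path of your own file under /content/drive (that path and the column names are assumptions, not taken from the original post).

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a file stored on Drive. In Colab the path
# would be something under /content/drive instead of a temp directory.
csv_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(csv_path, "w") as f:
    f.write("PassengerId,Age,Embarked\n1,22,S\n2,,C\n3,26,S\n")

# read_csv() returns a DataFrame; empty fields are read as NaN.
df = pd.read_csv(csv_path)
print(df)
```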

View a sample of the dataset

Use the head() and tail() functions to view a sample of the dataset. By default, head() displays the first five rows, while tail() displays the last five.
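A short illustration on a made-up ten-row DataFrame (the column names are placeholders). Both functions also accept a row count, for example head(3).

```python
import pandas as pd

# Toy DataFrame with ten rows (illustrative data only).
df = pd.DataFrame({"id": range(1, 11), "score": range(10, 110, 10)})

first_rows = df.head()    # first five rows by default
last_rows = df.tail()     # last five rows by default
first_three = df.head(3)  # an explicit row count is also accepted

print(first_rows)
print(last_rows)
```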



 

1. Detect and Handle Missing Values

Missing data can arise for various reasons, such as errors in data collection or incomplete records. Many machine learning algorithms cannot handle missing values directly, so they should be detected and dealt with before training.

Steps to Handle Missing Values:

  • Step 1: Detect Missing Values
    Use pandas to identify missing values in the dataset:


     

 The isnull() method in pandas is used to detect missing values in a DataFrame or Series. Missing values are usually represented as NaN (Not a Number) or None. This method returns a DataFrame or Series of the same shape with True where values are missing and False otherwise.
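The detection step can be sketched as follows, using a small made-up DataFrame whose column names merely echo the Titanic dataset. Chaining sum() onto isnull() gives the number of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy data with missing entries (values are illustrative only).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

missing_mask = df.isnull()              # True where a value is missing
missing_per_column = df.isnull().sum()  # count of missing values per column

print(missing_mask)
print(missing_per_column)
```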

Step 2: Handle Missing Values
Depending on the column type, handle missing values using:

Mean: Suitable for numeric columns without extreme outliers. 

 
The fillna() method replaces NaN values with a specified value; here, that value is the mean of the column. After filling, the Age column no longer contains NaN values. You can cross-check this by counting the missing values again.

When the file was first read, the Age column had 177 NaN values; after filling, the count is zero.
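A minimal sketch of the mean fill and the cross-check, on a toy Age column rather than the real Titanic data. Assigning the result back to the column avoids relying on inplace=True.

```python
import numpy as np
import pandas as pd

# Toy Age column with two missing values (illustrative data only).
df = pd.DataFrame({"Age": [20.0, np.nan, 30.0, np.nan, 40.0]})
missing_before = df["Age"].isnull().sum()

# mean() skips NaN by default, so it is computed from 20, 30 and 40.
mean_age = df["Age"].mean()
df["Age"] = df["Age"].fillna(mean_age)

# Cross-check: the column should no longer contain NaN.
missing_after = df["Age"].isnull().sum()
print(missing_before, missing_after)
```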

Median: Useful for numeric columns with outliers.


 Outliers in numeric columns are extreme values that differ significantly from the majority of the data. They can distort statistics such as the mean and degrade the performance of machine learning models. The median is far less sensitive to them, which makes it the safer fill value for such columns.
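To see why the median is preferred when outliers are present, consider this sketch on a hypothetical Income column with one extreme value. The mean is pulled far above the typical values, while the median stays close to them.

```python
import numpy as np
import pandas as pd

# Toy column with one extreme outlier (illustrative data only).
df = pd.DataFrame({"Income": [30.0, 32.0, np.nan, 35.0, 1000.0]})

mean_income = df["Income"].mean()      # dragged upward by the outlier
median_income = df["Income"].median()  # stays near the typical values

df["Income"] = df["Income"].fillna(median_income)
print(mean_income, median_income)
```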

Mode: Ideal for categorical columns.
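For a categorical column, a mode fill might look like this sketch. The Embarked values are made up. Note that mode() returns a Series, because several values can tie for most frequent, so the first entry is selected.

```python
import pandas as pd

# Toy categorical column with missing entries (illustrative data only).
df = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# mode() ignores missing values and returns a Series; take the first entry.
most_common = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common)

print(df["Embarked"].value_counts())
```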


 Interpolation: Good for filling gaps in sequential data.

 Interpolation is a method of estimating missing values within a dataset based on the known values of surrounding data points. It is especially useful for handling missing values in numeric and time-series data where trends or patterns exist.
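A small sketch of linear interpolation on a hypothetical sequence of temperature readings. Each gap is filled with values spaced evenly between its known neighbours; pandas supports other methods as well, and linear is simply the default.

```python
import numpy as np
import pandas as pd

# Toy sequential readings with gaps (illustrative data only).
df = pd.DataFrame({"temperature": [10.0, np.nan, 14.0, np.nan, np.nan, 20.0]})

# Linear interpolation fills each gap evenly between its known neighbours.
df["temperature"] = df["temperature"].interpolate(method="linear")
print(df)
```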

2. Identify and Remove Duplicate Rows

Duplicate rows can skew analysis and reduce model accuracy. Removing them ensures clean data.

Steps to Remove Duplicate Rows:

  • Step 1: Identify Duplicates
    Use the duplicated() method to detect duplicates:

  • Step 2: Remove Duplicates
    Drop duplicate rows using:
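Both steps can be sketched on a toy DataFrame (the names and amounts are made up). By default, duplicated() marks every repeat of an earlier row and drop_duplicates() keeps the first occurrence.

```python
import pandas as pd

# Toy data in which the third row repeats the first (illustrative only).
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid"],
    "amount": [100, 200, 100, 300],
})

# Step 1: flag rows that repeat an earlier row.
dup_mask = df.duplicated()
print(dup_mask.sum(), "duplicate row(s) found")

# Step 2: drop the repeats and renumber the index.
df_clean = df.drop_duplicates().reset_index(drop=True)
print(df_clean)
```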

3. Correct Inconsistent Data Formats

Inconsistent data formats can hinder data analysis and model performance. Ensure uniformity in your dataset.

Steps to Correct Data Formats:

 Step 1: Categorical Data Formats 

Standardize categorical values for consistency:


 Cross-check the result of the above operation.
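One common way to standardize a categorical column is to trim whitespace and unify letter case, then cross-check with unique(). This is a sketch on a hypothetical Gender column, not necessarily the exact operation used in the original post.

```python
import pandas as pd

# Toy column with inconsistent spelling of the same categories.
df = pd.DataFrame({"Gender": [" Male", "male", "FEMALE", "Female ", "female"]})

# Trim surrounding whitespace and convert everything to lower case.
df["Gender"] = df["Gender"].str.strip().str.lower()

# Cross-check: only the standardized labels should remain.
labels = sorted(df["Gender"].unique())
print(labels)
```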


Step 2: Numeric Data Formats
Ensure numeric columns have appropriate data types:
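A sketch of one way to fix numeric types, assuming the numbers were read as text. The column names echo the Loan Prediction dataset, but the values are made up. With errors="coerce", pd.to_numeric() turns entries that cannot be parsed into NaN instead of raising an error, so they can then be handled like any other missing value.

```python
import pandas as pd

# Toy columns in which numbers were stored as strings (illustrative only).
df = pd.DataFrame({
    "LoanAmount": ["120", "66", "n/a", "141"],
    "Dependents": ["0", "1", "2", "1"],
})

# Unparseable entries such as "n/a" become NaN, giving a float column.
df["LoanAmount"] = pd.to_numeric(df["LoanAmount"], errors="coerce")

# A column known to hold only whole numbers can be cast directly.
df["Dependents"] = df["Dependents"].astype(int)

print(df.dtypes)
```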

 
