Tuesday, 28 January 2025

Learn techniques to detect and handle outliers.

Understanding Outliers in Data Analysis

Understanding Outliers

  • Outliers are data items/objects that deviate significantly from the norm.
  • Identifying outliers is crucial in statistics and data analysis because they can significantly affect statistical results.

Causes of Outliers

  • Measurement errors: Errors in data collection or measurement processes can lead to outliers.
  • Sampling errors: Issues with the sampling process can lead to outliers.
  • Natural variability: Inherent variability in certain phenomena can lead to outliers.
  • Data entry errors: Human errors during data entry can introduce outliers.
  • Experimental errors: Anomalies may occur due to uncontrolled factors, equipment malfunctions, or unexpected events.
  • Sampling from multiple populations: Data is inadvertently combined from multiple populations with different characteristics.
  • Intentional outliers: Outliers are intentionally introduced to test the robustness of statistical methods.

Program-1: Visualize outliers using box plots and scatter plots.

The dataset used in this article is the Diabetes dataset, which ships with the scikit-learn (sklearn) library.
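A minimal sketch of how the dataset might be loaded follows. The name df_diabetics matches the snippets later in this article; building it with pd.DataFrame from load_diabetes() is an assumption, since the original loading code is not shown here.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the Diabetes dataset that ships with scikit-learn.
diabetes = load_diabetes()

# Build a DataFrame with one column per feature
# (age, sex, bmi, bp, s1 ... s6). The feature values are
# mean-centered and scaled, so they are small numbers around zero.
df_diabetics = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

print(df_diabetics.shape)
print(df_diabetics.head())
```

Because the features are already scaled, thresholds for columns such as 'bmi' are small decimals rather than raw clinical values.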


Visualizing and Removing Outliers Using Box Plot

A box plot captures the summary of the data effectively and efficiently with only a simple box and whiskers. It summarizes sample data using the 25th, 50th, and 75th percentiles. One can get insights (quartiles, median, and outliers) into the dataset just by looking at its box plot.

# Box plot of the 'bmi' column
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df_diabetics['bmi'])
plt.show()

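In the above graph, the points drawn beyond the upper whisker are outliers. Since the features of this dataset are mean-centered and scaled, these points correspond to 'bmi' values above roughly 0.12. Now, we will remove them.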
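The removal step can be sketched as a simple threshold filter. The helper name remove_above_threshold and the toy values are hypothetical and used only for illustration; the cutoff 0.12 is the value this article uses for 'bmi' in its scatter plot section.

```python
import pandas as pd

def remove_above_threshold(df, column, threshold):
    """Return a copy of df keeping only rows where df[column] <= threshold."""
    return df[df[column] <= threshold].copy()

# Toy data for illustration only (not the Diabetes dataset).
toy = pd.DataFrame({'bmi': [-0.05, 0.01, 0.03, 0.06, 0.16]})

no_outliers = remove_above_threshold(toy, 'bmi', threshold=0.12)
print("Before:", toy.shape)
print("After:", no_outliers.shape)
```

With the real data, the same call would be remove_above_threshold(df_diabetics, 'bmi', 0.12), after which the box plot can be redrawn on the result to confirm that the points beyond the whisker are gone.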



Visualizing and Removing Outliers Using Scatter plot

A scatter plot is used when you have paired numerical data, when the dependent variable has multiple values for each value of the independent variable, or when you want to determine the relationship between two variables. It can also be used for outlier detection.

To draw a scatter plot, one needs two variables that are related to each other. Here, body mass index and average blood pressure are used, whose column names are 'bmi' and 'bp' respectively.

import matplotlib.pyplot as plt

# Scatter plot of 'bmi' against 'bp'
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df_diabetics['bmi'], df_diabetics['bp'])
ax.set_xlabel('bmi (body mass index)')
ax.set_ylabel('bp (average blood pressure)')
plt.show()

 

Looking at the graph, most of the data points form a single cluster, while a few points lie away from it toward the right edge of the plot, where 'bmi' is highest. Those isolated points can be regarded as outliers.

As an approximation, all data points with 'bmi' greater than 0.12 and 'bp' less than 0.8 are treated as outliers. The following code can fetch the exact positions of all the points that satisfy these conditions.

Removal of Outliers in BMI and BP Column Combined

Here, NumPy's np.where() function is used to find the positions (indices) where the condition (df_diabetics['bmi'] > 0.12) & (df_diabetics['bp'] < 0.8) is true in the DataFrame df_diabetics. The condition checks for outliers where 'bmi' is greater than 0.12 and 'bp' is less than 0.8. Because the condition is one-dimensional, the output is a tuple holding a single array of row positions of the outliers in the DataFrame.
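The lookup and removal described above can be sketched as follows. The toy DataFrame stands in for df_diabetics and its values are made up for illustration; dropping rows through df.index is one possible approach, not necessarily the one the original code used.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_diabetics (values are made up for illustration).
df = pd.DataFrame({
    'bmi': [0.01, 0.13, -0.02, 0.15, 0.05],
    'bp':  [0.02, 0.04, -0.01, 0.09, 0.03],
})

# np.where on a 1-D Boolean condition returns a tuple holding one array
# with the positions where the condition is True.
outlier_positions = np.where((df['bmi'] > 0.12) & (df['bp'] < 0.8))
print(outlier_positions)

# Map the positions to index labels and drop those rows.
df_clean = df.drop(df.index[outlier_positions[0]])
print("Before:", df.shape)
print("After:", df_clean.shape)
```

Mapping positions to labels with df.index keeps the removal correct even when the DataFrame index is not a plain 0..n-1 range.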

 

 Output: [Figure: scatter plot after outlier removal]

 

The outliers have been removed successfully.

Z-score

The Z-score is also called the standard score. It tells how far a data point is from the mean, measured in standard deviations. After choosing a threshold value, one can use the Z-scores of the data points to define the outliers.

Z-score = (data_point - mean) / standard deviation

In this example, we are calculating the Z scores for the ‘age’ column in the DataFrame df_diabetics using the zscore function from the SciPy stats module. The resulting array z contains the absolute Z scores for each data point in the ‘age’ column, indicating how many standard deviations each value is from the mean.

 

from scipy import stats
import numpy as np
z = np.abs(stats.zscore(df_diabetics['age']))
print(z)


To define an outlier, a threshold value is chosen, which is generally 3.0, because about 99.7% of the data points lie within +/- 3 standard deviations of the mean under a Gaussian distribution.
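The 99.7% figure can be checked numerically. This is a small sketch using the standard library's statistics.NormalDist; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of the distribution within +/- 3 and +/- 2 standard deviations.
within_3_sd = standard_normal.cdf(3) - standard_normal.cdf(-3)
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(round(within_3_sd, 4))
print(round(within_2_sd, 4))
```

A lower threshold such as 2 flags more points, since a larger share of a Gaussian sample falls outside +/- 2 standard deviations than outside +/- 3.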

Removal of Outliers with Z-Score

Let's remove the rows whose absolute Z-score is greater than 2.

In this example, we set a threshold value of 2 and then use NumPy's np.where() to identify the positions (indices) in the Z-score array z where the absolute Z-score is greater than the threshold. The code prints the positions of the outliers in the 'age' column based on the Z-score criterion.
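A minimal sketch of this step is shown below. The toy 'age' values are made up for illustration, and the Z-score is computed by hand with the population standard deviation (ddof=0), which is also the default used by scipy.stats.zscore.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'age' column (values are made up for illustration).
df = pd.DataFrame({'age': [30, 31, 32, 33, 34, 35, 36, 37, 38, 90]})

# Absolute Z-score: distance from the mean in standard deviations.
z = np.abs((df['age'] - df['age'].mean()) / df['age'].std(ddof=0))

threshold_z = 2
outlier_positions = np.where(z > threshold_z)
print("Outlier positions:", outlier_positions[0])

# Drop the flagged rows.
df_clean = df.drop(df.index[outlier_positions[0]])
print("Before:", df.shape)
print("After:", df_clean.shape)
```

With the real data, z would instead come from np.abs(stats.zscore(df_diabetics['age'])) as computed earlier in this section.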

 

IQR (Inter Quartile Range)

The IQR (Inter Quartile Range) approach is one of the most commonly used and trusted approaches to finding outliers in the research field.

IQR = Quartile3 – Quartile1

Syntax : numpy.percentile(a, q, axis=None, out=None, ...)
Parameters :

  • a : input array.
  • q : percentile value to compute, between 0 and 100.

In this example, we are calculating the interquartile range (IQR) for the ‘bmi’ column in the DataFrame df_diabetics . It first computes the first quartile (Q1) and third quartile (Q3) using the midpoint method, then calculates the IQR as the difference between Q3 and Q1, providing a measure of the spread of the middle 50% of the data in the ‘bmi’ column.
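The calculation can be sketched on a small array. The toy values are made up for illustration, and the method='midpoint' keyword assumes NumPy 1.22 or newer (older releases used the keyword interpolation instead).

```python
import numpy as np

# Toy values for illustration only (not the 'bmi' column).
values = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# First and third quartiles using the midpoint method.
Q1 = np.percentile(values, 25, method='midpoint')
Q3 = np.percentile(values, 75, method='midpoint')

# The IQR is the spread of the middle 50% of the data.
IQR = Q3 - Q1
print(Q1, Q3, IQR)
```

For the article's data, values would be replaced by df_diabetics['bmi'].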

 

 

 Following common statistical practice, the IQR is scaled up by 0.5 (new_IQR = IQR + 0.5*IQR = 1.5*IQR) to set the limits: lower limit = Q1 - 1.5*IQR and upper limit = Q3 + 1.5*IQR. For a Gaussian distribution, these limits enclose the data within about 2.7 standard deviations of the mean.
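The 2.7 standard deviation figure can be verified for a standard Gaussian distribution. This sketch uses the standard library's statistics.NormalDist; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Quartiles of the standard normal distribution.
q1 = standard_normal.inv_cdf(0.25)
q3 = standard_normal.inv_cdf(0.75)
iqr = q3 - q1

# Limits at 1.5 * IQR beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Share of a Gaussian sample that falls inside the limits.
coverage = standard_normal.cdf(upper_limit) - standard_normal.cdf(lower_limit)

print(round(lower_limit, 3), round(upper_limit, 3))
print(round(coverage, 4))
```

Since the standard deviation is 1 here, the printed limits are expressed directly in standard deviations.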

 

Outlier Removal in Dataset using IQR

In this example, we are using the interquartile range (IQR) method to detect and remove outliers in the ‘bmi’ column of the diabetes dataset. It calculates the upper and lower limits based on the IQR, identifies outlier indices using Boolean arrays, and then removes the corresponding rows from the DataFrame, resulting in a new DataFrame with outliers excluded. The before and after shapes of the DataFrame are printed for comparison. 
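The procedure described above can be sketched as a small helper. The function name remove_iqr_outliers, the parameter k, and the toy values are hypothetical; method='midpoint' assumes NumPy 1.22 or newer.

```python
import numpy as np
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Return a copy of df without rows outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = np.percentile(df[column], 25, method='midpoint')
    q3 = np.percentile(df[column], 75, method='midpoint')
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Boolean array marking the rows that fall inside the limits.
    keep = (df[column] >= lower) & (df[column] <= upper)
    return df[keep].copy()

# Toy data for illustration only (not the Diabetes dataset).
toy = pd.DataFrame({'bmi': [1, 2, 3, 4, 5, 6, 7, 8, 30]})

print("Old shape:", toy.shape)
cleaned = remove_iqr_outliers(toy, 'bmi')
print("New shape:", cleaned.shape)
```

With the real data, the call would be remove_iqr_outliers(df_diabetics, 'bmi'). Filtering with a Boolean mask avoids tracking row positions separately for the upper and lower limits.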

