Wednesday, 15 January 2025

Data Visualization: Use visualization techniques to explore data patterns.

Data visualization is a powerful tool for exploring datasets, uncovering patterns, and communicating insights effectively. In this experiment, we focus on analyzing a salary dataset that contains two key features: Years of Experience and Salary. By leveraging various visualization techniques, we aim to understand the distribution of numerical variables, investigate relationships between them, and provide actionable insights.

This analysis is designed to illustrate the value of visual storytelling in data science. Whether you're a beginner exploring the world of analytics or an experienced data scientist, this experiment offers practical methods to make data come alive.

Objective

The primary objective of this experiment is to:

  1. Analyze Data Patterns: Understand how salary correlates with years of experience.
  2. Leverage Visualization Techniques: Use histograms, scatter plots, heatmaps, bar plots, and pie charts to explore the dataset.
  3. Simplify Interpretation: Provide insights that are easy to interpret and actionable for stakeholders.

Dataset Overview

The dataset consists of two columns:

  • YearsExperience: Represents the number of years an individual has worked.
  • Salary: Reflects the corresponding annual salary for the given experience level.

While simple, this dataset provides an excellent platform to demonstrate key visualization techniques applicable across industries.

Visualization Techniques

We employed the following methods to explore and understand the data:

  1. Histograms:

    • Used to analyze the distribution of numerical variables such as YearsExperience and Salary.
    • These visualizations help identify data skewness, clusters, and outliers.
  2. Scatter Plots:

    • Highlighted the relationship between YearsExperience and Salary.
    • These plots help us examine if there’s a linear correlation or other trends between the variables.
  3. Heatmaps:

    • Illustrated the correlation between numerical features.
    • Useful for identifying strong or weak relationships between variables.
  4. Bar Plots:

    • Visualized aggregated salaries grouped by experience levels (Junior, Mid, Senior).
    • This method simplifies comparison across categories.
  5. Pie Charts:

    • Represented the proportional distribution of employees across experience levels.
    • Ideal for understanding categorical distributions.

Outcome

Through this visualization experiment, we aim to extract meaningful insights such as:

  • The general trend of salary growth with increasing years of experience.
  • The range and variability of salaries in the dataset.
  • The proportional representation of employees across different experience levels.

These insights can inform decisions in hiring, salary benchmarking, and workforce planning. Ultimately, this experiment underscores the importance of visualization in making complex data accessible and impactful.

To plot the following graphs:

1. Mount the drive.

2. Load the dataset into the "data" variable.

3. Import the following libraries:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
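Steps 1 and 2 can be sketched as follows. The Drive path and file name in the comments are assumptions — replace them with your own; for a self-contained run, a small stand-in frame with the same two columns is constructed instead.

```python
import pandas as pd

# In Google Colab, mount Drive and read the CSV
# (the path and file name below are assumptions — adjust to your setup):
# from google.colab import drive
# drive.mount('/content/drive')
# data = pd.read_csv('/content/drive/MyDrive/salary_data.csv')

# Self-contained stand-in with the same shape as the salary dataset:
data = pd.DataFrame({
    'YearsExperience': [1.1, 2.0, 3.2, 4.5, 5.9, 7.1, 8.7, 9.6],
    'Salary': [39343, 43525, 54445, 61111, 81363, 98273, 109431, 112635],
})
print(data.head())
```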


1. Histogram: Create histograms for numerical variables
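This step can be sketched as follows. A small stand-in frame is constructed so the snippet runs on its own; with the real dataset, use the data variable loaded earlier.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the loaded dataset (replace with the real 'data' frame)
data = pd.DataFrame({
    'YearsExperience': [1.1, 2.0, 3.2, 4.5, 5.9, 7.1, 8.7, 9.6],
    'Salary': [39343, 43525, 54445, 61111, 81363, 98273, 109431, 112635],
})

# One histogram per numerical column; the bin count is a tunable choice
axes = data.hist(figsize=(10, 4), bins=10, edgecolor='black')
plt.suptitle('Distributions of Numerical Variables')
plt.tight_layout()
plt.show()
```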


 

Interpretation of Histograms:


Each histogram represents the distribution of a single numerical variable from the dataset. The x-axis represents the range of values for that variable, and the y-axis shows the frequency or count of data points falling within each bin. You can observe the following from the histograms:

  • Central tendency: Where the data is clustered. A peak indicates a common value.
  • Spread/dispersion: How widely the data is distributed. A wide spread indicates high variability, while a narrow spread shows low variability.
  • Skewness: The symmetry of the distribution. A perfectly symmetrical distribution will be roughly mirrored around the central peak. Skewness can be positive (right-skewed, longer tail to the right), negative (left-skewed, longer tail to the left), or close to zero (symmetrical).
  • Outliers: Values that are significantly far from the rest of the data points. They may be seen as isolated bars in the tail of the distribution.

The histograms provide a visual summary of the numerical features in your data, revealing valuable insights about their distributions and potential relationships. For example, if you see a heavily skewed distribution, you might need to consider data transformations before using it in machine learning models.
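The skewness check described above can also be done numerically — a sketch, using a small stand-in frame in place of the loaded dataset:

```python
import pandas as pd

# Stand-in for the loaded dataset (replace with the real 'data' frame)
data = pd.DataFrame({
    'YearsExperience': [1.1, 2.0, 3.2, 4.5, 5.9, 7.1, 8.7, 9.6],
    'Salary': [39343, 43525, 54445, 61111, 81363, 98273, 109431, 112635],
})

# Sample skewness per numerical column: values near 0 suggest symmetry,
# positive values a right skew, negative values a left skew
skew = data.skew(numeric_only=True)
print(skew)
```

If a column's skewness is large in magnitude, a transformation (e.g. a log transform for right-skewed salaries) is worth considering before modeling.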

2. Plot a scatter plot to study relationships between variables

 

plt.figure(figsize=(8, 6))
sns.scatterplot(data=data, x='YearsExperience', y='Salary', color='blue')
plt.title('Scatter Plot: Years of Experience vs Salary', fontsize=14)
plt.xlabel('Years of Experience', fontsize=12)
plt.ylabel('Salary', fontsize=12)
plt.show()


Output



 Interpretation of Scatter Plot:

  1. The scatter plot visualizes the relationship between 'YearsExperience' and 'Salary'.
  2. Each point represents a data entry, with its horizontal position determined by 'YearsExperience' and its vertical position by 'Salary'.

Key Observations from the Scatter Plot:

  • Trend: Observe the general trend of the data points. Do they show a positive correlation (upward trend), a negative correlation (downward trend), or no clear correlation? In the case of salary vs. years of experience, you'd typically expect a positive correlation – more experience tends to correspond to a higher salary.
  • Strength of Relationship: How closely do the data points follow the trend line? A strong positive correlation would show points tightly clustered around an upward-sloping line. A weak positive correlation would show points more dispersed but generally following an upward trend. No correlation would show a random distribution of points with no discernible pattern.
  •  Outliers: Are there any data points that deviate significantly from the general trend? These are potential outliers that might warrant further investigation. They could represent exceptional cases, errors in data collection, or other anomalies.
  • Clusters or Groups: Are there any distinct clusters or groups of data points within the scatter plot? These might suggest different subgroups or categories within the data, possibly indicating different salary scales or career paths.

 Example interpretations:

  • Strong Positive Correlation: If the points form a tight, upward-sloping cluster, it indicates a strong positive relationship between years of experience and salary. This suggests that as years of experience increase, salary also tends to increase significantly.
  • Weak Positive Correlation: If the points are more dispersed, but still generally follow an upward trend, it indicates a weaker relationship. Increases in experience are associated with salary increases, but the relationship is not as consistent.
  • No Correlation: If the points are randomly scattered with no discernible pattern, it indicates no relationship between years of experience and salary. Experience does not appear to be a good predictor of salary in this case.
  •  In the context of salary prediction, a clear positive correlation is desirable because it means you can use years of experience to predict salary with reasonable accuracy. However, you should also analyze other potential factors that influence salaries (e.g., education, job title, industry) to create a more robust prediction model.
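The strength of the relationship seen in the scatter plot can be quantified with the Pearson correlation coefficient — a sketch, using a small stand-in frame in place of the loaded dataset:

```python
import pandas as pd

# Stand-in for the loaded dataset (replace with the real 'data' frame)
data = pd.DataFrame({
    'YearsExperience': [1.1, 2.0, 3.2, 4.5, 5.9, 7.1, 8.7, 9.6],
    'Salary': [39343, 43525, 54445, 61111, 81363, 98273, 109431, 112635],
})

# Pearson r lies in [-1, 1]; values near +1 indicate a strong
# positive linear relationship
r = data['YearsExperience'].corr(data['Salary'])
print(f'Pearson r = {r:.3f}')
```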

3. Build heatmaps to analyze feature correlations

plt.figure(figsize=(6, 5))
# Select only numerical features for correlation calculation
numerical_data = data.select_dtypes(include=['number'])
corr = numerical_data.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Feature Correlation Heatmap', fontsize=14)
plt.show()


Output



Interpretation of Heatmap:

  1. The heatmap visualizes the correlation matrix of the numerical features in your dataset.
  2. Each cell in the heatmap represents the correlation coefficient between two features.
  3. The color intensity indicates the strength and direction of the correlation:
  • Shades of red: strong positive correlation. As one feature increases, the other tends to increase.
  • Shades of blue: strong negative correlation. As one feature increases, the other tends to decrease.
  • Lighter colors (closer to white): weak or no correlation. Changes in one feature do not have a strong relationship with changes in the other.
  4. The diagonal of the heatmap always shows a perfect positive correlation (1.00) because each feature is perfectly correlated with itself.
Example Interpretation:

  • Suppose you see a dark red cell between "YearsExperience" and "Salary" in the heatmap. This indicates a strong positive correlation: as years of experience increase, salary tends to increase significantly.
  • Conversely, a dark blue cell between two variables suggests that an increase in one corresponds to a decrease in the other.
  • A cell near white implies a weak or no linear relationship between the two features.

Key Considerations:

  1. Correlation vs. Causation: Correlation does not imply causation. Even if two variables are highly correlated, it does not necessarily mean that one causes the other; there may be other underlying factors influencing both.
  2. Strength of Correlation: The color intensity helps you gauge the strength of the linear relationship. However, relationships can be non-linear, and a heatmap only captures linear ones.
  3. Feature Selection: The heatmap helps with feature selection. If two features are highly correlated, consider using only one of them in your model to avoid redundancy and multicollinearity, which can hurt model performance and interpretability.

 In summary, the heatmap provides a quick and effective way to assess the relationships between numerical variables, aiding in feature engineering, model selection, and overall data understanding.
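The feature-selection idea can be sketched as a simple filter that flags pairs of features whose absolute correlation exceeds a threshold. The 0.9 cutoff and the redundant MonthsExperience column below are illustrative assumptions, not part of the original dataset:

```python
import pandas as pd

# Stand-in frame, plus a deliberately redundant column for illustration
data = pd.DataFrame({
    'YearsExperience': [1.1, 2.0, 3.2, 4.5, 5.9, 7.1, 8.7, 9.6],
    'Salary': [39343, 43525, 54445, 61111, 81363, 98273, 109431, 112635],
})
data['MonthsExperience'] = data['YearsExperience'] * 12  # perfectly correlated

corr = data.corr(numeric_only=True).abs()
threshold = 0.9  # arbitrary cutoff for "highly correlated"
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
for a, b, r in pairs:
    print(f'{a} vs {b}: |r| = {r:.2f} -> consider keeping only one')
```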

4. Visualize categorical data using bar plots and pie charts

# Since the dataset does not contain categorical data, we'll categorize 'YearsExperience' into bins for demonstration
data['ExperienceLevel'] = pd.cut(data['YearsExperience'], bins=[0, 2, 5, 10], labels=['Junior', 'Mid', 'Senior'])

# Bar plot for total salary by experience level
plt.figure(figsize=(8, 6))
sns.barplot(data=data, x='ExperienceLevel', y='Salary', estimator=sum, errorbar=None)  # errorbar=None replaces the deprecated ci=None in seaborn >= 0.12
plt.title('Total Salary by Experience Level', fontsize=14)
plt.xlabel('Experience Level', fontsize=12)
plt.ylabel('Total Salary', fontsize=12)
plt.show()

# Pie chart for distribution of experience levels
experience_distribution = data['ExperienceLevel'].value_counts()
experience_distribution.plot.pie(autopct='%1.1f%%', figsize=(7, 7), title='Experience Level Distribution')
plt.ylabel('')  # Remove y-axis label for better appearance
plt.show()

 Output



