Monday, 16 December 2024

Build a prediction model to perform logistic regression.

Logistic regression aims to solve classification problems. It does this by predicting categorical outcomes, unlike linear regression, which predicts a continuous outcome.

Logistic regression is a statistical method used for binary classification tasks. It predicts the probability of an event occurring, such as whether an email is spam or not, or whether a customer will churn or not. Logistic regression is a powerful tool that can be used to model complex relationships between variables, and it is widely used in a variety of fields, including economics, finance, sociology, and psychology.

The formula for logistic regression is:

P(y = 1) = 1 / (1 + e^(-(β0 + β1X1 + β2X2 + ... + βpXp)))

where:

  • P(y = 1) is the probability that the event occurs
  • β0 is the intercept
  • β1, β2, ..., βp are the regression coefficients
  • X1, X2, ..., Xp are the explanatory variables
The regression coefficients represent the change in the log odds of the event occurring for a one-unit increase in the corresponding explanatory variable, holding all other explanatory variables constant.
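For instance, here is a minimal numeric sketch of how the formula and the log-odds interpretation play out; the coefficient values below are purely illustrative assumptions, not fitted values:

import numpy as np

# Hypothetical coefficients, for illustration only
beta0, beta1 = -1.5, 0.7
x = 2.0

log_odds = beta0 + beta1 * x        # linear predictor (log odds)
p = 1 / (1 + np.exp(-log_odds))     # logistic (sigmoid) function
print(p)                            # probability that y = 1

# A one-unit increase in x multiplies the odds of the event by e^beta1
print(np.exp(beta1))                # odds ratio for x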

In the simplest case there are two outcomes; this is called binomial logistic regression, an example of which is predicting whether a tumor is malignant or benign. When there are more than two outcomes to classify, it is called multinomial logistic regression. A common example of multinomial logistic regression is predicting the species of an iris flower among 3 different species.

Here we will be using basic logistic regression to predict a binomial variable. This means it has only two possible outcomes.
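As a minimal sketch, the code below trains a binomial logistic regression with scikit-learn. The built-in breast-cancer dataset (malignant vs. benign) and the parameter values are assumptions chosen only as a convenient binary example, not the blog's original code:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Binary target: malignant vs. benign tumours
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=5000)    # higher max_iter so the solver converges
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # predicted classes (0 or 1)
y_prob = model.predict_proba(X_test)[:, 1]   # predicted probability of class 1
print(model.score(X_test, y_test))           # accuracy on the test set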

What is a Confusion Matrix and why do you need it?

Well, it is a performance measurement for machine learning classification problems where the output can be two or more classes. For two classes, it is a table with 4 different combinations of predicted and actual values.


It is extremely useful for measuring Recall, Precision, Specificity, Accuracy, and most importantly AUC-ROC curves.

Let’s understand TP, FP, FN, TN in terms of pregnancy analogy.

True Positive:

Interpretation: You predicted positive and it’s true.

You predicted that a woman is pregnant and she actually is.

True Negative:

Interpretation: You predicted negative and it’s true.

You predicted that a man is not pregnant and he actually is not.

False Positive: (Type 1 Error)

Interpretation: You predicted positive and it’s false.

You predicted that a man is pregnant but he actually is not.

False Negative: (Type 2 Error)

Interpretation: You predicted negative and it’s false.

You predicted that a woman is not pregnant but she actually is.

Recall

Recall = TP / (TP + FN)

From all the actual positive classes, this tells us how many we predicted correctly.

Recall should be as high as possible.

Precision

Precision = TP / (TP + FP)

From all the classes we have predicted as positive, this tells us how many are actually positive.

Precision should be as high as possible.


Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

From all the classes (positive and negative), this tells us how many we predicted correctly. In this case, it will be 4/7.

Accuracy should be as high as possible.

F-measure

F-measure = 2 × (Precision × Recall) / (Precision + Recall)

It is difficult to compare two models when one has low precision and high recall, or vice versa. To make them comparable, we use the F-score, which measures Recall and Precision at the same time. It uses the harmonic mean in place of the arithmetic mean, punishing extreme values more.

Now, we will compute these metrics for our model.
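A sketch of how these quantities can be computed with scikit-learn, assuming y_test and y_pred come from a fitted binary classifier such as the logistic regression sketch above:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# y_test and y_pred are assumed to come from the earlier fitted classifier
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))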

Applications of logistic regression:

  • Predicting spam emails: Logistic regression can be used to predict whether an email is spam or not based on factors such as the sender, subject line, and content of the email.
  •  Modeling customer churn: Logistic regression can be used to model the likelihood of a customer churning, or canceling their service, based on factors such as their demographics, usage patterns, and satisfaction levels.  
  • Detecting fraudulent transactions: Logistic regression can be used to detect fraudulent transactions in real-time based on factors such as the transaction amount, location, and time of day.

Limitations of logistic regression:

  • Linearity: Logistic regression assumes that the relationship between the explanatory variables and the log odds of the event occurring is linear. If the relationship is nonlinear, logistic regression will not be accurate.

  • Multicollinearity: Multicollinearity occurs when two or more explanatory variables are highly correlated with each other. Multicollinearity can make it difficult to interpret the regression coefficients.

  • Overfitting: Overfitting occurs when the model fits the training data too closely and does not generalize well to new data. Overfitting can be reduced by using regularization techniques such as L1 or L2 regularization (see the sketch below).
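For illustration, a minimal scikit-learn sketch of both regularization variants; the parameter values are arbitrary assumptions:

from sklearn.linear_model import LogisticRegression

# L2 regularization; a smaller C means stronger regularization
l2_model = LogisticRegression(penalty="l2", C=0.1, max_iter=5000)

# L1 regularization needs a solver that supports it, e.g. liblinear or saga
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")

# Either model is then fitted as usual, e.g. l2_model.fit(X_train, y_train)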

Build a prediction model for simple linear regression

Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear (a sloped straight line), hence the name Simple Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. However, the independent variable can be measured on continuous or categorical values.

The Simple Linear Regression algorithm has two main objectives:

  • Model the relationship between the two variables. Such as the relationship between Income and expenditure, experience and Salary, etc.
  • Forecasting new observations. Such as Weather forecasting according to temperature, Revenue of a company according to the investments in a year, etc.

Simple Linear Regression Model:

The Simple Linear Regression model can be represented using the below equation:

y = a0 + a1x + ε

Where,

a0 = the intercept of the regression line (it can be obtained by putting x = 0)
a1 = the slope of the regression line, which tells whether the line is increasing or decreasing.
ε = the error term (for a good model it will be negligible)

Problem Statement example for Simple Linear Regression:

Here we are taking a dataset that has two variables: Salary (dependent variable) and Experience (independent variable). The goals of this problem are:

  • Find out whether there is any correlation between these two variables.
  • Find the best-fit line for the dataset.
  • Determine how the dependent variable changes as the independent variable changes.

 

1. Import the necessary libraries
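The original code appears only as a screenshot, so the following is a sketch of the imports this walkthrough relies on:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression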


2. Import the necessary dataset and extract the two variables
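A sketch of loading the data; the file name Salary_Data.csv and the column layout (YearsExperience, Salary) are assumptions based on the description:

dataset = pd.read_csv("Salary_Data.csv")   # assumed file name; adjust to your location
x = dataset.iloc[:, :-1].values            # independent variable: years of experience
y = dataset.iloc[:, -1].values             # dependent variable: salary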


Check for confirmation of the two variables
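For example, a quick confirmation that both variables were extracted as expected:

print(dataset.head())     # first rows of the raw data
print(x.shape, y.shape)   # shapes of the extracted variables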

3. Now split the dataset into a Training set and a Test set

Display all the splits
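A sketch of the split; the test size of 1/3 and random_state=0 are assumptions, since the original code is a screenshot:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

# Display all the splits
print(x_train, y_train)
print(x_test, y_test)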



4. Now train the algorithm on your dataset. If training is successful, it will generate a model.
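A minimal sketch of the training step, assuming the variables from the previous steps:

regressor = LinearRegression()
regressor.fit(x_train, y_train)               # fit the Simple Linear Regression model
print(regressor.intercept_, regressor.coef_)  # a0 (intercept) and a1 (slope)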


Once fit() returns, the model has been generated. In the above code, we used the fit() method to fit our Simple Linear Regression object to the training set. To the fit() function we passed x_train and y_train, our training data for the independent and dependent variables. We fitted the regressor object to the training set so that the model can learn the correlations between the predictor and target variables.

5. Prediction of test set result:

Our model has now been trained on a dependent variable (Salary) and an independent variable (Experience), so it is ready to predict the output for new observations. In this step, we provide the test dataset (new observations) to the model and check whether it can predict the correct output.

We will create a prediction vector y_pred, which contains the predictions for the test dataset, and x_pred, which contains the predictions for the training set.

y_pred = regressor.predict(x_test)
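And, as described above, the prediction on the training set (used later for drawing the regression line):

x_pred = regressor.predict(x_train)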


6. Visualizing the Training set results:

Now in this step, we will visualize the training set result. To do so, we will use the scatter() function of the pyplot library, which we have already imported in the pre-processing step. The scatter () function will create a scatter plot of observations.

On the x-axis we plot the employees' Years of Experience, and on the y-axis their Salary. To the function we pass the real values of the training set, i.e. the years of experience x_train, the training-set salaries y_train, and the color of the observations. Here we use red for the observations, but any color can be chosen.

Now, we need to plot the regression line, so for this, we will use the plot() function of the pyplot library. In this function, we will pass the years of experience for training set, predicted salary for training set x_pred, and color of the line.

Next, we give the plot a title. For this, we use the title() function of the pyplot library and pass the name "Salary vs Experience (Training Dataset)".
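Putting the above together, a sketch of the training-set plot (colors follow the description above):

plt.scatter(x_train, y_train, color="red")   # actual training observations
plt.plot(x_train, x_pred, color="blue")      # fitted regression line
plt.title("Salary vs Experience (Training Dataset)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()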

In the above plot, we can see the real values observations in red dots and predicted values are covered by the blue regression line. The regression line shows a correlation between the dependent and independent variable.

The good fit of the line can be observed by calculating the difference between actual values and predicted values. But as we can see in the above plot, most of the observations are close to the regression line, hence our model is good for the training set.

7. Visualizing the Test set results:

In the previous step, we have visualized the performance of our model on the training set. Now, we will do the same for the Test set. The complete code will remain the same as the above code, except in this, we will use x_test, and y_test instead of x_train and y_train.
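A sketch of the test-set plot; note that the regression line is still the one fitted on the training set:

plt.scatter(x_test, y_test, color="red")   # actual test observations
plt.plot(x_train, x_pred, color="blue")    # regression line fitted on the training set
plt.title("Salary vs Experience (Test Dataset)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()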


 

 In the above plot, there are observations given by the red color, and prediction is given by the blue regression line. As we can see, most of the observations are close to the regression line, hence we can say our Simple Linear Regression is a good model and able to make good predictions.

8. Evaluating Model Performance / Performance evaluation metrics

Regression metrics serve as quantitative measures to assess the performance of regression models by evaluating the disparity between predicted and actual values.

Let’s explore some of the most commonly used regression metrics:


 

1. Mean Squared Error (MSE)

MSE calculates the average squared difference between predicted and actual values:

MSE = (1/n) × Σ (y_i − ŷ_i)²

where y_i represents the actual value, ŷ_i represents the predicted value, and n is the number of observations.

MSE measures the average squared error, with higher values indicating more significant discrepancies between predicted and actual values.

MSE penalizes more significant errors due to squaring, making it sensitive to outliers. It is commonly used due to its mathematical properties but may be less interpretable than other metrics.

2. Root Mean Squared Error (RMSE)

RMSE is the square root of the MSE and measures the average magnitude of errors.

RMSE = √MSE

RMSE shares a similar interpretation to MSE but is in the same units as the dependent variable, making it more interpretable.

RMSE is preferred when large errors are particularly undesirable, because squaring gives those errors extra weight; unlike MAE, it does not dampen the impact of outliers.

3. Mean Absolute Error (MAE)

MAE computes the average absolute difference between predicted and actual values.

MAE = (1/n) × Σ |y_i − ŷ_i|

It measures the average magnitude of errors, with higher values indicating larger discrepancies between predicted and actual values.

MAE is less sensitive to outliers than MSE but may not adequately penalize large errors.

4. R-squared (R²)

R² measures the proportion of variance in the dependent variable explained by the independent variables.

R² = 1 − SSR / SST, where SSR is the sum of squared residuals and SST is the total sum of squares.

R² ranges from 0 to 1, with higher values indicating a better fit of the model to the data. However, it does not provide information about the goodness of individual predictions.

R² may artificially increase with more independent variables, and a high R² does not necessarily imply a good model fit.

Now, we will implement these metrics in Python for the above simple linear regression model.
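A sketch of these metrics with scikit-learn, assuming y_test and y_pred from the steps above:

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("R-squared:", r2)
print("Mean Absolute Error:", mae)
print("Root Mean Squared Error:", rmse)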


The following is the output for the above code.

  • R-squared: 0.9749154407708353
  • Mean Absolute Error: 3426.4269374307078
  • Root Mean Squared Error: 4585.4157204675885

Interpretation of Metrics for Simple Linear Regression

1. R-squared (R² = 0.9749):

  • What it means:
    R-squared measures the proportion of variance in the dependent variable (target) that the independent variable (predictor) explains.

    • A value of 0.9749 indicates that 97.49% of the variance in the target variable is explained by the regression model.
    • This suggests the model fits the data very well.
  • Consideration:
    While the R² value is high, it does not confirm the model's predictions are error-free. For a more comprehensive understanding, consider other metrics like MAE and RMSE.


2. Mean Absolute Error (MAE = 3426.43):

  • What it means:
    MAE is the average of the absolute errors between the actual and predicted values.

    • On average, the model's predictions are 3426.43 units away from the actual values.
  • Interpretation:
    This provides an intuitive measure of the typical error magnitude, but it does not indicate how large individual errors can get.

    • If the target variable values are in the range of tens of thousands, an MAE of 3426 may be acceptable. However, if the target values are much smaller, this error might be significant.

3. Root Mean Squared Error (RMSE = 4585.42):

  • What it means:
    RMSE is the square root of the average squared differences between predicted and actual values.

    • It penalizes larger errors more than MAE, making it more sensitive to outliers.
    • An RMSE of 4585.42 units means the typical prediction error is around this magnitude.
  • Interpretation:
    The RMSE value is higher than the MAE, indicating the presence of some larger prediction errors. If reducing these larger errors is crucial (e.g., in a financial or medical context), the model may need improvement.


4. Overall Model Performance:

  • High R² with significant MAE and RMSE:
    Although the model explains a significant portion of the variance in the target variable (R² = 0.9749), the absolute error metrics (MAE and RMSE) suggest that the model's predictions still have notable deviations from actual values.

    • Possible Reasons:
      • The dataset might have some outliers or noise that increase the prediction errors.
      • The linear model might not perfectly capture the relationship, especially if the actual relationship is nonlinear.
  • Use Case Context:
    The acceptability of MAE and RMSE depends on the range of the target variable and the application's requirements:

    • If the target variable values range from, say, 100,000 to 500,000, these error values may be small enough for practical purposes.
    • For smaller target ranges, the model might require further refinement (e.g., adding more predictors, trying a nonlinear model, or addressing outliers).

Recommendations:

  1. Check residuals:
    Plot the residuals (actual minus predicted values) to identify patterns, outliers, or systematic deviations (see the sketch after this list).

  2. Normalize errors:
    If the target values vary significantly, consider calculating normalized errors, such as Mean Absolute Percentage Error (MAPE), to provide a more relative perspective.

  3. Model refinement:
    If error values are large relative to the data range or application requirements, consider improving the model by:

    • Adding more relevant predictors.
    • Trying polynomial or nonlinear regression if relationships aren't linear.
    • Addressing outliers or noise in the dataset.
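As an illustration of the first two recommendations, here is a sketch of a residual plot and MAPE, assuming y_test, y_pred, and plt from the steps above (mean_absolute_percentage_error requires scikit-learn 0.24 or newer and returns a fraction, not a percentage):

from sklearn.metrics import mean_absolute_percentage_error

residuals = y_test - y_pred
plt.scatter(y_pred, residuals, color="red")   # residual for each prediction
plt.axhline(0, color="blue")                  # zero-error reference line
plt.title("Residuals vs Predicted Salary")
plt.xlabel("Predicted Salary")
plt.ylabel("Residual")
plt.show()

mape = mean_absolute_percentage_error(y_test, y_pred)
print("MAPE:", mape)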


Monday, 9 December 2024

Essential Data Transformation Techniques for Machine Learning: Log Transformation, Scaling, and Encoding

Several transformation methods are needed to convert raw data into a form that is better suited for modeling and to ensure that the assumptions of machine learning algorithms are satisfied. Below, I briefly describe each transformation method, along with examples of how to apply it to datasets such as the Car Price or Iris datasets. Before executing the following code, watch the video here

 1. Mount the Google Drive
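The original code is a screenshot; in Colab the mount typically looks like this:

from google.colab import drive
drive.mount('/content/drive')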


 2. Read the Dataset
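A sketch of reading the Iris data; the file path below is an assumption and should point to wherever the CSV lives in your Drive:

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/iris.csv')   # assumed path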


3. Print the top 5 rows to get to know the data
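For example:

print(df.head())   # sepal_length, sepal_width, petal_length, petal_width, species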

The dataset contains the following columns:

  • sepal_length, sepal_width, petal_length, petal_width: Numerical features.
  • species: Categorical feature.

4. Log Transformation

  • Purpose:

    • Logarithmic transformations are used to reduce skewness in numerical data, particularly when the data is highly skewed or has outliers.
    • It compresses large values while expanding smaller ones, bringing the distribution closer to a normal distribution.
  • How it works:

    • We use the formula: log-transformed value = log(1 + x)
    • Adding 1 ensures the log function handles zero values without errors.
    • This is applied to the numerical columns sepal_length, sepal_width, petal_length, and petal_width (see the sketch below).
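A minimal sketch of the log transformation, assuming the DataFrame df from the steps above:

import numpy as np

num_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
df_log = df.copy()
df_log[num_cols] = np.log1p(df_log[num_cols])   # log(1 + x) for every numerical column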

5. Scaling

Scaling adjusts the range and distribution of numerical data to make it suitable for machine learning models and statistical analysis.

a. Min-Max Scaling

  • Purpose:

    • Rescales the values of numerical features to lie within a specific range, typically [0, 1].
    • Useful for algorithms sensitive to the scale of the data, such as gradient-based optimization algorithms.
  • How it works:

    • For a value x, the formula is: x_scaled = (x − min(x)) / (max(x) − min(x))
    • This transformation ensures all values are within the range [0, 1].
  • Applied to:

    • Numerical columns after log transformation (see the sketch below).
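A sketch of Min-Max scaling with scikit-learn, applied to the log-transformed columns from above:

from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler()                          # rescales each column to [0, 1]
df_minmax = df_log.copy()
df_minmax[num_cols] = minmax.fit_transform(df_minmax[num_cols])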

 

b. Standardization (Z-Score Normalization)

  • Purpose:

    • Centers the data around a mean of 0 and scales it to have a standard deviation of 1.
    • Suitable when the data is assumed to follow a normal distribution or when preserving relative distances is critical.
  • How it works:

    • For a value x, the formula is: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.
  • Applied to:

    • Numerical columns after log transformation (sketched below).
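A sketch of standardization with scikit-learn, again on the log-transformed columns:

from sklearn.preprocessing import StandardScaler

standard = StandardScaler()                      # mean 0, standard deviation 1 per column
df_std = df_log.copy()
df_std[num_cols] = standard.fit_transform(df_std[num_cols])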

 

6. Feature Encoding

Feature encoding is used to convert categorical variables into numerical formats that can be used by machine learning models.

a. One-Hot Encoding

  • Purpose:

    • Converts a categorical variable into multiple binary (0/1) columns, where each column represents a category.
    • Ensures no ordinal relationship is assumed between categories, making it suitable for nominal data.
  • How it works:

    • For a categorical variable species with three unique values: setosa, versicolor, and virginica:
      • Create binary columns: species_versicolor and species_virginica (dropping the first category to avoid redundancy).
      • A row with species = setosa is represented as: [species_versicolor = 0, species_virginica = 0] (see the sketch below).
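A sketch of one-hot encoding with pandas; drop_first=True drops species_setosa so that setosa is represented by all zeros:

df_onehot = pd.get_dummies(df, columns=['species'], drop_first=True)
print(df_onehot.head())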


 

b. Label Encoding

  • Purpose:

    • Converts a categorical variable into a single numerical column by assigning each category a unique integer label.
    • Suitable for ordinal categorical variables where the order of categories has meaning.
  • How it works:

    • For species:
      • Assign labels such as:
        • setosa = 0
        • versicolor = 1
        • virginica = 2
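A sketch of label encoding with scikit-learn; LabelEncoder assigns integers in alphabetical order, which here matches the mapping above:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df_label = df.copy()
df_label['species'] = le.fit_transform(df_label['species'])   # setosa=0, versicolor=1, virginica=2
print(le.classes_)                                            # order of the assigned labels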

