Monday, 24 February 2025

Implementing a Random Forest model for regression or classification.

Explanation of the Implementation Steps

  1. Dataset Creation:

    • We generate a synthetic binary classification dataset with make_classification, framed as a loan-prediction-style task. The features are a mix of informative, redundant, and pure-noise columns.
  2. Data Splitting:

    • The data is split into training (70%) and testing (30%) sets to evaluate the model on unseen data.
  3. Effect of Number of Trees:

    • We train several Random Forest classifiers by varying the number of trees (n_estimators) over a range (10, 50, 100, 200, 500).
    • For each model, we compute:
      • Train Accuracy: How well the model fits the training data.
      • Test Accuracy: Model performance on unseen test data.
      • Cross-Validation Accuracy: Average accuracy from a 5-fold CV on the training set.
    • These metrics are plotted to analyze how increasing the number of trees impacts performance.
  4. Final Model Training:

    • The number of trees with the best test accuracy is selected. (Strictly speaking, picking a hyperparameter on the test set leaks information into model selection; in practice you would choose on the cross-validation score and keep the test set for the final report.)
    • A final Random Forest model is trained with this setting, and its test accuracy is reported.
  5. Feature Importance Extraction:

    • The feature importances from the final model are extracted and visualized as a horizontal bar plot, which shows at a glance which features contribute most to the predictions.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# 1. Create a synthetic binary classification dataset (similar to a loan prediction task)
#    - 1000 samples, 10 features (7 informative, 2 redundant)
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=7, n_redundant=2,
                           n_classes=2, random_state=42)

# 2. Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 3. Analyze effect of the number of trees on model performance
n_trees = [10, 50, 100, 200, 500]
train_acc = []
test_acc = []
cv_acc = []

for n in n_trees:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    # Evaluate on training data
    train_acc.append(accuracy_score(y_train, rf.predict(X_train)))
    # Evaluate on testing data
    test_acc.append(accuracy_score(y_test, rf.predict(X_test)))
    # 5-fold cross-validation on training data
    cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy')
    cv_acc.append(np.mean(cv_scores))

# Plot accuracy versus number of trees
plt.figure(figsize=(10, 6))
plt.plot(n_trees, train_acc, marker='o', label='Train Accuracy')
plt.plot(n_trees, test_acc, marker='o', label='Test Accuracy')
plt.plot(n_trees, cv_acc, marker='o', label='CV Accuracy')
plt.xlabel('Number of Trees (n_estimators)')
plt.ylabel('Accuracy')
plt.title('Effect of Number of Trees on Random Forest Performance')
plt.legend()
plt.grid(True)
plt.show()

# Identify best number of trees based on test accuracy
best_n = n_trees[np.argmax(test_acc)]
print("Best n_estimators based on test accuracy:", best_n)

# 4. Train final Random Forest model with the chosen number of trees
final_rf = RandomForestClassifier(n_estimators=best_n, random_state=42)
final_rf.fit(X_train, y_train)
final_test_accuracy = accuracy_score(y_test, final_rf.predict(X_test))
print("Final Test Accuracy: {:.2f}%".format(final_test_accuracy * 100))

# 5. Extract feature importances and visualize them
importances = final_rf.feature_importances_
feature_names = [f'Feature {i}' for i in range(X.shape[1])]
indices = np.argsort(importances)

plt.figure(figsize=(10, 6))
plt.title('Feature Importances')
plt.barh(range(len(importances)), importances[indices], align='center')
plt.yticks(range(len(importances)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
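
Before we look at the output, a quick aside: Random Forests also ship with a built-in validation estimate that avoids the separate cross-validation loop, the out-of-bag (OOB) score. Each tree only sees a bootstrap sample of the training data, so the rows it never saw act as a free validation set. The sketch below is illustrative and not part of the main script; it assumes the X_train and y_train variables defined above, and the rf_oob name is ours.

from sklearn.ensemble import RandomForestClassifier

# Sweep the same tree counts, but score each forest on its out-of-bag samples.
# (Very small forests may leave some rows with no OOB prediction, so we start at 50.)
for n in [50, 100, 200, 500]:
    rf_oob = RandomForestClassifier(n_estimators=n, oob_score=True,
                                    random_state=42)
    rf_oob.fit(X_train, y_train)
    # oob_score_ is the accuracy measured on the out-of-bag samples
    print("n_estimators={:4d}  OOB accuracy={:.4f}".format(n, rf_oob.oob_score_))

The OOB accuracy typically tracks the cross-validation curve closely at a fraction of the cost, since no extra refits are needed.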


 

Output of the main script:

Best n_estimators based on test accuracy: 50
Final Test Accuracy: 90.67%
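
The title mentions regression as well; the workflow is identical, with RandomForestRegressor in place of the classifier and R²/RMSE in place of accuracy. Here is a minimal self-contained sketch on synthetic data from make_regression (the 100-tree setting is illustrative, not tuned):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic regression data: 1000 samples, 10 features (7 informative)
Xr, yr = make_regression(n_samples=1000, n_features=10,
                         n_informative=7, noise=10.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.3, random_state=42)

reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(Xr_train, yr_train)
yr_pred = reg.predict(Xr_test)

# For regression, report R^2 and RMSE instead of accuracy
print("Test R^2:  {:.3f}".format(r2_score(yr_test, yr_pred)))
print("Test RMSE: {:.3f}".format(np.sqrt(mean_squared_error(yr_test, yr_pred))))

The n_estimators sweep and the feature_importances_ bar plot from the classification script carry over unchanged; only the scoring changes.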

