Explanation of the Implementation Steps
Dataset Creation:
- We generate a synthetic binary classification dataset using make_classification. This simulates a loan prediction scenario with a mix of informative and redundant features.
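A quick sanity check of the generated data is to inspect its shape and class balance. A minimal sketch, using the same generator settings as the full script below:

import numpy as np
from sklearn.datasets import make_classification

# Same parameters as the full script; bincount shows the class balance.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=7, n_redundant=2,
                           n_classes=2, random_state=42)
print(X.shape)          # (1000, 10)
print(np.bincount(y))   # roughly 500 samples per class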
Data Splitting:
- The data is split into training (70%) and testing (30%) sets to evaluate the model on unseen data.
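For classification tasks, a stratified split is a common refinement. The sketch below assumes X and y from the dataset step and adds stratify=y, which preserves the class ratio in both subsets; note that the full script further down uses a plain random split:

from sklearn.model_selection import train_test_split

# stratify=y keeps the train/test class proportions equal to those in y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)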
Effect of Number of Trees:
- We train several Random Forest classifiers by varying the number of trees (n_estimators) over a range (10, 50, 100, 200, 500).
- For each model, we compute:
- Train Accuracy: How well the model fits the training data.
- Test Accuracy: Model performance on unseen test data.
- Cross-Validation Accuracy: Average accuracy from a 5-fold CV on the training set (see the sketch after this list).
- These metrics are plotted to analyze how increasing the number of trees impacts performance.
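As a side note on the cross-validation metric, the loop in the full script keeps only the mean of the fold scores. Here is a minimal single-model sketch (assuming X_train and y_train from the split step) that also reports the spread across folds:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Mean +/- standard deviation over the 5 folds for one model.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy')
print("CV accuracy: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))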
Final Model Training:
- The number of trees with the highest test accuracy is selected. (Selecting on test accuracy is a simplification; cross-validation accuracy would be the stricter criterion, as sketched after this step.)
- A final Random Forest model is trained using this setting, and its test accuracy is reported.
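A hypothetical variant of the selection step, using CV accuracy instead of test accuracy so the test set stays untouched until the final evaluation (assumes n_trees and cv_acc from the loop in the full script):

import numpy as np

# Pick the n_estimators value with the highest mean CV accuracy.
best_n_cv = n_trees[int(np.argmax(cv_acc))]
print("Best n_estimators based on CV accuracy:", best_n_cv)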
Feature Importance Extraction:
- The feature importances from the final model are extracted and visualized with a horizontal bar plot, which helps show which features contribute most to the prediction. A complementary textual ranking is sketched after the full script.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
# 1. Create a synthetic binary classification dataset (similar to a loan prediction task)
# - 1000 samples, 10 features (7 informative, 2 redundant)
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=7, n_redundant=2,
                           n_classes=2, random_state=42)
# 2. Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
# 3. Analyze effect of the number of trees on model performance
n_trees = [10, 50, 100, 200, 500]
train_acc = []
test_acc = []
cv_acc = []
for n in n_trees:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    # Evaluate on training data
    train_acc.append(accuracy_score(y_train, rf.predict(X_train)))
    # Evaluate on testing data
    test_acc.append(accuracy_score(y_test, rf.predict(X_test)))
    # 5-fold cross-validation on training data
    cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy')
    cv_acc.append(np.mean(cv_scores))
# Plot accuracy versus number of trees
plt.figure(figsize=(10, 6))
plt.plot(n_trees, train_acc, marker='o', label='Train Accuracy')
plt.plot(n_trees, test_acc, marker='o', label='Test Accuracy')
plt.plot(n_trees, cv_acc, marker='o', label='CV Accuracy')
plt.xlabel('Number of Trees (n_estimators)')
plt.ylabel('Accuracy')
plt.title('Effect of Number of Trees on Random Forest Performance')
plt.legend()
plt.grid(True)
plt.show()
# Identify best number of trees based on test accuracy
best_n = n_trees[np.argmax(test_acc)]
print("Best n_estimators based on test accuracy:", best_n)
# 4. Train final Random Forest model with the chosen number of trees
final_rf = RandomForestClassifier(n_estimators=best_n, random_state=42)
final_rf.fit(X_train, y_train)
final_test_accuracy = accuracy_score(y_test, final_rf.predict(X_test))
print("Final Test Accuracy: {:.2f}%".format(final_test_accuracy * 100))
# 5. Extract feature importances and visualize them
importances = final_rf.feature_importances_
feature_names = [f'Feature {i}' for i in range(X.shape[1])]
indices = np.argsort(importances)
plt.figure(figsize=(10, 6))
plt.title('Feature Importances')
plt.barh(range(len(importances)), importances[indices], align='center')
plt.yticks(range(len(importances)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Best n_estimators based on test accuracy: 50
Final Test Accuracy: 90.67%
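As referenced in the feature-importance step, a textual ranking can complement the bar plot. A minimal sketch, assuming final_rf and feature_names from the script above:

import numpy as np

# Print features from most to least important.
order = np.argsort(final_rf.feature_importances_)[::-1]
for i in order:
    print("{}: {:.3f}".format(feature_names[i], final_rf.feature_importances_[i]))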