Implementation Steps Explained
Classification Task:
- Dataset Loading:
- The Iris dataset is loaded, containing 150 samples from 3 classes.
- Train-Test Split:
- The dataset is split (70% training, 30% testing) using stratified sampling to preserve class proportions.
- A k-NN classifier (k = 5) is trained on the training set and predictions are made on the test set.
- Metrics: Accuracy and weighted F1 Score are computed.
- k-Fold Cross-Validation:
- A 5-Fold Stratified CV is set up to ensure each fold has a representative distribution of classes.
- The classifier is evaluated across folds, and average accuracy and F1 Score (with standard deviation) are reported.
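The effect of stratification described above can be checked directly. The following is a minimal sketch, not part of the original walkthrough, that counts the class labels in each subset; the variable names (train_counts, test_counts, fold_counts) are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold

# Iris has 150 samples, 50 per class.
X, y = load_iris(return_X_y=True)

# Stratified 70/30 split: each class keeps its share in both subsets.
_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Train class counts:", train_counts)
print("Test class counts: ", test_counts)

# Stratified 5-fold: count the classes in the test part of every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = [np.bincount(y[test_idx]) for _, test_idx in skf.split(X, y)]
for i, counts in enumerate(fold_counts, start=1):
    print("Fold {} test class counts: {}".format(i, counts))
```

Because the three Iris classes are perfectly balanced, every subset ends up with equal counts per class. On an imbalanced dataset the counts would instead mirror the original class ratios as closely as the subset sizes allow.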
Regression Task:
- Dataset Loading:
- The California Housing dataset is used: a regression dataset whose target is the median house value of each California district.
- Train-Test Split:
- The dataset is split (70% training, 30% testing).
- A Decision Tree Regressor is trained and predictions are made on the test set.
- Metrics: Mean Squared Error (MSE) and R² Score are computed.
- k-Fold Cross-Validation:
- A 5-Fold CV is performed using the KFold splitter.
- Using cross_validate, both MSE (noting that scikit-learn returns negative MSE) and R² Score are calculated for each fold.
- The average metrics along with their standard deviations are reported.
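The sign convention for MSE mentioned in the last step can be verified with a small experiment. This is a sketch under stated assumptions: it uses synthetic data from make_regression rather than California Housing (so it runs without a download), and the manual loop is only there to cross-check the values returned by cross_validate.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic regression data stands in for the real dataset.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = DecisionTreeRegressor(random_state=42)

# scikit-learn scorers follow "higher is better", so MSE comes back negated.
results = cross_validate(
    model, X, y, cv=kf,
    scoring={'mse': 'neg_mean_squared_error', 'r2': 'r2'})
neg_mse = results['test_mse']
mse_per_fold = -neg_mse  # flip the sign to recover ordinary MSE

# Cross-check: recompute MSE by hand on the same folds.
manual_mse = []
for train_idx, test_idx in kf.split(X):
    fold_model = DecisionTreeRegressor(random_state=42)
    fold_model.fit(X[train_idx], y[train_idx])
    manual_mse.append(
        mean_squared_error(y[test_idx], fold_model.predict(X[test_idx])))

print("Raw scores from cross_validate:", neg_mse)
print("MSE after sign flip:           ", mse_per_fold)
print("MSE computed by hand:          ", np.array(manual_mse))
```

The raw scores are all non-positive, and negating them should match the hand-computed MSE fold by fold.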
This approach demonstrates how to evaluate a model with both a train-test split and k-fold cross-validation. Comparing the two, using metrics suited to classification and to regression, gives a more robust understanding of model performance.
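The weighted F1 score used in the classification task averages the per-class F1 scores, weighting each class by its number of true samples. The sketch below reproduces that by hand; the label arrays are a made-up toy example, not output from the Iris model.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels for a 3-class problem (illustration only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])

# F1 for each class separately, in label order 0, 1, 2.
per_class_f1 = f1_score(y_true, y_pred, average=None)

# Support = number of true samples per class.
support = np.bincount(y_true)

# Weighted F1 = support-weighted mean of the per-class scores.
manual_weighted = np.sum(per_class_f1 * support) / support.sum()
library_weighted = f1_score(y_true, y_pred, average='weighted')

print("Per-class F1:       ", per_class_f1)
print("Manual weighted F1: ", manual_weighted)
print("Library weighted F1:", library_weighted)
```

With balanced classes, as in Iris, the weighted and macro averages coincide; the weighting only matters when class sizes differ.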
import numpy as np
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, make_scorer, mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor
###############################
# Classification Example
# Dataset: Iris
# Model: k-NN Classifier
# Metrics: Accuracy, F1 Score
###############################
print("### Classification Task: Iris Dataset with k-NN ###")
# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# 2. Train-Test Split approach
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
# Evaluate using Train-Test Split metrics
acc_test = accuracy_score(y_test, y_pred)
f1_test = f1_score(y_test, y_pred, average='weighted')
print("\nTrain-Test Split Results:")
print("Accuracy: {:.2f}%".format(acc_test * 100))
print("F1 Score: {:.2f}".format(f1_test))
# 3. k-Fold Cross-Validation approach (Stratified for classification)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {'accuracy': 'accuracy', 'f1': make_scorer(f1_score, average='weighted')}
cv_results = cross_validate(knn, X, y, cv=cv, scoring=scoring)
# Report cross-validation metrics
cv_acc = np.mean(cv_results['test_accuracy'])
cv_acc_std = np.std(cv_results['test_accuracy'])
cv_f1 = np.mean(cv_results['test_f1'])
cv_f1_std = np.std(cv_results['test_f1'])
print("\nk-Fold Cross-Validation Results:")
print("Accuracy: {:.2f}% (std: {:.2f}%)".format(cv_acc * 100, cv_acc_std * 100))
print("F1 Score: {:.2f} (std: {:.2f})".format(cv_f1, cv_f1_std))
Output:
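In this case the train-test split gives better scores than k-fold cross-validation. A single split is scored on one particular test set, so its result can be optimistic or pessimistic by chance; the cross-validated average is usually the more reliable estimate.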
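How much a single train-test score depends on the chosen split can be seen by repeating the split with different seeds. This is a rough sketch, assuming 20 seeds is enough to show the spread; the seed range and variable names are illustrative choices, not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, train_test_split)
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# Repeat the 70/30 split with 20 different seeds and record test accuracy.
split_scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    split_scores.append(knn.fit(X_tr, y_tr).score(X_te, y_te))
split_scores = np.array(split_scores)

# One 5-fold stratified cross-validation run for comparison.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')

print("Single-split accuracy: min {:.3f}, max {:.3f}".format(
    split_scores.min(), split_scores.max()))
print("5-fold CV accuracy: mean {:.3f} (std {:.3f})".format(
    cv_scores.mean(), cv_scores.std()))
```

If the minimum and maximum single-split accuracies differ noticeably, a result from any one seed (such as random_state=42) should be read as one draw from that range rather than as the model's true accuracy.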
###############################
# Regression Example
# Dataset: California Housing
# Model: Decision Tree Regressor
# Metrics: Mean Squared Error (MSE), R² Score
###############################
print("\n### Regression Task: California Housing Dataset with Decision Tree Regressor ###")
# 1. Load the California Housing dataset
housing = fetch_california_housing()
X_reg, y_reg = housing.data, housing.target
# 2. Train-Test Split approach
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42)
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train_reg, y_train_reg)
y_pred_reg = tree_reg.predict(X_test_reg)
# Evaluate using Train-Test Split metrics
mse_test = mean_squared_error(y_test_reg, y_pred_reg)
r2_test = r2_score(y_test_reg, y_pred_reg)
print("\nTrain-Test Split Results:")
print("MSE: {:.2f}".format(mse_test))
print("R² Score: {:.2f}".format(r2_test))
# 3. k-Fold Cross-Validation approach for regression
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scoring_reg = {'mse': 'neg_mean_squared_error', 'r2': 'r2'}
cv_results_reg = cross_validate(tree_reg, X_reg, y_reg, cv=kf, scoring=scoring_reg)
# Note: For MSE, cross_validate returns negative values, so we convert them back.
mse_cv = -np.mean(cv_results_reg['test_mse'])
r2_cv = np.mean(cv_results_reg['test_r2'])
mse_cv_std = np.std(-cv_results_reg['test_mse'])
r2_cv_std = np.std(cv_results_reg['test_r2'])
print("\nk-Fold Cross-Validation Results:")
print("MSE: {:.2f} (std: {:.2f})".format(mse_cv, mse_cv_std))
print("R² Score: {:.2f} (std: {:.2f})".format(r2_cv, r2_cv_std))
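Output:
In this case k-fold cross-validation gives better scores than the single train-test split. Because each of the five folds serves once as the test set, the averaged result also depends less on any one partition of the data.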