Q&A 5 How do you train a decision tree for prediction?
5.1 Explanation
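Decision trees are flexible models that recursively split the dataset on feature values to form decision rules. At each node, training picks the feature and threshold that most reduce impurity in the resulting subsets, then repeats on each subset until a stopping rule such as a maximum depth is reached. For classification, the default impurity measure in both scikit-learn and rpart is the Gini index. A classification tree predicts a class label; a regression tree predicts a continuous value. Trees are easy to interpret and capture non-linear patterns, while linear relationships are approximated by piecewise-constant steps.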
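To make the splitting step concrete, here is a minimal sketch of how one split could be chosen on a single numeric feature using Gini impurity. The helper names `gini` and `best_split` and the toy measurements are hypothetical, and real libraries use faster, more general implementations.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) of the best split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    for i in range(1, n):
        left_value, right_value = pairs[i - 1][0], pairs[i][0]
        if left_value == right_value:
            continue  # identical feature values cannot be separated
        threshold = (left_value + right_value) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical toy data: petal lengths (cm) for two classes
petal_length = [1.4, 1.3, 1.5, 1.7, 4.7, 4.5, 4.9, 4.0]
species = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_split(petal_length, species)
print(f"parent impurity: {gini(species):.2f}")
print(f"best threshold: {threshold:.2f}, impurity after split: {impurity:.2f}")
```

A full tree applies this search to every feature, keeps the best one, and then recurses into the left and right subsets.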
5.3 Python Code
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
# Load and subset iris dataset for binary classification
df = pd.read_csv("data/iris.csv")
# Use .copy() so the label mapping below modifies an independent DataFrame (avoids SettingWithCopyWarning)
binary_df = df[df["species"].isin(["setosa", "versicolor"])].copy()
# Convert labels to binary (0 = setosa, 1 = versicolor)
binary_df["species"] = binary_df["species"].map({"setosa": 0, "versicolor": 1})
# Split into train/test
X = binary_df.drop("species", axis=1)
y = binary_df["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train decision tree classifier (random_state makes tie-breaking between equally good splits reproducible)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
# Predict and evaluate
y_pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 1.0
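A perfect score is less surprising once the learned rules are visible: setosa and versicolor can be separated on a single petal measurement, so the tree needs only one split. The sketch below prints the rules with `export_text`. It assumes scikit-learn's bundled iris data (`load_iris`) as a stand-in for `data/iris.csv`, so column names differ from the listing above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: use scikit-learn's bundled iris data instead of data/iris.csv
iris = load_iris()
mask = iris.target < 2  # keep setosa (0) and versicolor (1) only
X, y = iris.data[mask], iris.target[mask]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Print the learned decision rules as indented text
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
print("Depth actually used:", tree.get_depth())
print("Test accuracy:", tree.score(X_test, y_test))
```

Note that `max_depth=3` is only an upper bound; the tree stops earlier once every leaf is pure.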
5.4 R Code
library(readr)
library(caret)
library(rpart)
# Load and subset iris dataset for binary classification
df <- read_csv("data/iris.csv")
df_bin <- subset(df, species %in% c("setosa", "versicolor"))
df_bin$species <- factor(df_bin$species, levels = c("setosa", "versicolor"))
# Split into train/test
set.seed(42)
index <- createDataPartition(df_bin$species, p = 0.7, list = FALSE)
train <- df_bin[index, ]
test <- df_bin[-index, ]
# Train decision tree
tree_model <- rpart(species ~ ., data = train, method = "class", control = rpart.control(maxdepth = 3))
# Predict and evaluate
predicted <- predict(tree_model, newdata = test, type = "class")
confusionMatrix(predicted, test$species)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor
  setosa         15          0
  versicolor      0         15

Accuracy : 1
95% CI : (0.8843, 1)
No Information Rate : 0.5
P-Value [Acc > NIR] : 9.313e-10
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0
Specificity : 1.0
Pos Pred Value : 1.0
Neg Pred Value : 1.0
Prevalence : 0.5
Detection Rate : 0.5
Detection Prevalence : 0.5
Balanced Accuracy : 1.0
'Positive' Class : setosa
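The headline statistics in this output can be checked by hand from the four cell counts. The sketch below recomputes them in Python. The closed forms for the interval's lower bound and the p-value are shortcuts that hold only when every prediction is correct, and they assume an exact (Clopper-Pearson) binomial interval and a one-sided binomial test, which is what the printed numbers appear consistent with.

```python
# Counts from the confusion matrix above (rows = prediction, columns = reference)
tp = 15  # predicted setosa, actually setosa (the 'positive' class)
fp = 0   # predicted setosa, actually versicolor
fn = 0   # predicted versicolor, actually setosa
tn = 15  # predicted versicolor, actually versicolor

n = tp + fp + fn + tn
accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
no_information_rate = max(tp + fn, tn + fp) / n  # share of the largest class

# Shortcuts valid only because all n predictions are correct:
alpha = 0.05
ci_lower = (alpha / 2) ** (1 / n)   # exact-interval lower bound for n out of n
p_value = no_information_rate ** n  # P(all n correct | true accuracy = NIR)

print(f"Accuracy: {accuracy}")
print(f"Sensitivity: {sensitivity}, Specificity: {specificity}")
print(f"No Information Rate: {no_information_rate}")
print(f"95% CI lower bound: {ci_lower:.4f}")
print(f"P-Value [Acc > NIR]: {p_value:.3e}")
```

With only 30 test rows, the interval still allows a true accuracy as low as about 0.88, so the perfect score should be read with that uncertainty in mind.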