Q&A 5 How do you train a decision tree for prediction?

5.1 Explanation

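A decision tree recursively splits the dataset on feature values, producing a hierarchy of if/else decision rules. At each node, training picks the feature and threshold that best separate the target, typically by minimizing an impurity measure such as Gini impurity (classification) or variance (regression). A classification tree predicts the majority class of the leaf an observation falls into; a regression tree predicts the mean target value of that leaf. Trees are easy to interpret and can capture both linear and non-linear patterns, but they overfit easily, which is why the examples here cap the depth at 3.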
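
To make the splitting step concrete, here is a minimal sketch of how one candidate split could be scored with Gini impurity on a single numeric feature. The helper names `gini` and `best_split` and the toy values are made up for illustration; real libraries such as scikit-learn run a more optimized search over all features.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a threshold
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Toy data (hypothetical): one feature, two classes
petal_length = [1.4, 1.3, 1.5, 1.7, 4.7, 4.5, 4.9, 4.0]
species = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_split(petal_length, species)
print("Best threshold:", threshold, "weighted Gini:", impurity)
```

A full tree applies this search to every feature, keeps the best split, and then repeats the procedure on each resulting subset until a stopping rule (such as a maximum depth) is reached.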

5.2 Python Code

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load and subset iris dataset for binary classification
df = pd.read_csv("data/iris.csv")

# ✅ Use .copy() to avoid SettingWithCopyWarning
binary_df = df[df["species"].isin(["setosa", "versicolor"])].copy()

# Convert labels to binary (0 = setosa, 1 = versicolor)
binary_df["species"] = binary_df["species"].map({"setosa": 0, "versicolor": 1})

# Split into train/test
X = binary_df.drop("species", axis=1)
y = binary_df["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train decision tree classifier (depth capped at 3; fixed seed for reproducibility)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Predict and evaluate
y_pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 1.0
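
Since interpretability is one of the main reasons to use a tree, it can help to look at the rules that were learned. The sketch below is a hedged illustration rather than part of the original workflow: it assumes scikit-learn's bundled iris data (`load_iris`) as a stand-in for `data/iris.csv`, fits on all setosa and versicolor rows without a train/test split, and prints the rules with `export_text`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the bundled iris data stands in for data/iris.csv
iris = load_iris()
mask = iris.target < 2  # keep setosa (0) and versicolor (1) only
X, y = iris.data[mask], iris.target[mask]

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Print the learned decision rules as indented if/else text
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
print("Depth:", tree.get_depth(), "Leaves:", tree.get_n_leaves())
```

Setosa and versicolor are cleanly separated by a single petal measurement, so the fitted tree needs only one split even though `max_depth=3` allows more. That separability is also why a perfect accuracy score on this subset is plausible rather than suspicious.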

5.4 R Code

library(readr)
library(caret)
library(rpart)

# Load and subset iris dataset for binary classification
df <- read_csv("data/iris.csv")
df_bin <- subset(df, species %in% c("setosa", "versicolor"))
df_bin$species <- factor(df_bin$species, levels = c("setosa", "versicolor"))

# Split into train/test
set.seed(42)
index <- createDataPartition(df_bin$species, p = 0.7, list = FALSE)
train <- df_bin[index, ]
test <- df_bin[-index, ]

# Train decision tree
tree_model <- rpart(species ~ ., data = train, method = "class", control = rpart.control(maxdepth = 3))

# Predict and evaluate
predicted <- predict(tree_model, newdata = test, type = "class")
confusionMatrix(predicted, test$species)

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor
  setosa         15          0
  versicolor      0         15
                                     
               Accuracy : 1          
                 95% CI : (0.8843, 1)
    No Information Rate : 0.5        
    P-Value [Acc > NIR] : 9.313e-10  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0        
            Specificity : 1.0        
         Pos Pred Value : 1.0        
         Neg Pred Value : 1.0        
             Prevalence : 0.5        
         Detection Rate : 0.5        
   Detection Prevalence : 0.5        
      Balanced Accuracy : 1.0        
                                     
       'Positive' Class : setosa