Q&A 3 How do you train a classification model using logistic regression?

3.1 Explanation

Logistic regression is a linear model for binary and multi-class classification. It passes a linear combination of the features through the logistic (sigmoid) function to estimate the probability of the positive class, then assigns a label by thresholding that probability, commonly at 0.5. Binary logistic regression requires a target with exactly two classes. For more than two classes, extensions such as multinomial (softmax) logistic regression are used.
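As a minimal sketch of how the logistic function turns a linear score into a probability, the snippet below uses only the standard library. The weights, bias, and feature values are hypothetical and chosen purely for illustration; they are not fitted to the iris data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1 for one sample under a linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients (assumed for illustration, not learned from data)
weights = [0.8, -1.2]
bias = 0.5

# Linear score: z = 0.5 + 0.8*2.0 - 1.2*1.0 = 0.9
p = predict_proba([2.0, 1.0], weights, bias)

# Threshold the probability at 0.5 to get a class label
label = 1 if p > 0.5 else 0
print(f"p = {p:.3f}, label = {label}")  # p = 0.711, label = 1
```

A score of 0 maps to a probability of exactly 0.5, so thresholding the probability at 0.5 is the same as checking the sign of the linear score.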

3.2 Python Code

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load dataset from local path
df = pd.read_csv("data/iris.csv")

# Subset for binary classification (e.g., setosa vs versicolor)
binary_df = df[df["species"].isin(["setosa", "versicolor"])].copy()
binary_df["species"] = binary_df["species"].map({"setosa": 0, "versicolor": 1})

# Split into train/test
X = binary_df.drop("species", axis=1)
y = binary_df["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate model: score() returns mean accuracy on the held-out test set
print("Accuracy:", model.score(X_test, y_test))

Accuracy: 1.0
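To give a rough idea of what fitting involves, here is a minimal sketch that trains a one-feature logistic regression by plain batch gradient descent on the log loss. It is an illustration under stated assumptions, not scikit-learn's procedure: LogisticRegression defaults to an L2-regularized objective solved with lbfgs. The function name train_logistic, the learning rate, the epoch count, and the four-point dataset are all made up for this example.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.3, epochs=5000):
    """Fit weights and bias by batch gradient descent on the mean log loss.

    Hypothetical helper for illustration; no regularization is applied.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample: predicted probability minus label
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the averaged gradient
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Tiny made-up dataset: one feature, classes split around x = 2.5
X = [[1.0], [2.0], [3.0], [4.0]]
y = [0, 0, 1, 1]

w, b = train_logistic(X, y)
preds = [1 if sigmoid(b + w[0] * x[0]) > 0.5 else 0 for x in X]
print("Predictions:", preds)
```

Because this toy dataset is perfectly separable, the unregularized weights keep growing the longer training runs. That is one reason library implementations typically add a penalty term.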

3.3 R Code

library(readr)
library(caret)

# Load dataset
df <- read_csv("data/iris.csv")

# Subset for binary classification (setosa vs versicolor)
df_bin <- subset(df, species %in% c("setosa", "versicolor"))

# Convert species to binary factor with correct level order
df_bin$species <- factor(df_bin$species, levels = c("setosa", "versicolor"))

# Split into train/test
set.seed(42)
index <- createDataPartition(df_bin$species, p = 0.7, list = FALSE)
train <- df_bin[index, ]
test <- df_bin[-index, ]

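# Fit logistic regression model (the binomial family expects a two-level factor response).
# setosa and versicolor are perfectly separable, so glm() may warn about fitted probabilities of 0 or 1.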
model <- glm(species ~ sepal_length + sepal_width + petal_length + petal_width,
             data = train, family = "binomial")

# Predict probabilities
pred_probs <- predict(model, newdata = test, type = "response")

# Classify with a 0.5 threshold; glm models the probability of the second factor level (versicolor)
predicted <- factor(ifelse(pred_probs > 0.5, "versicolor", "setosa"),
                    levels = levels(test$species))

# Evaluate
confusionMatrix(predicted, test$species)

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor
  setosa         15          0
  versicolor      0         15
                                     
               Accuracy : 1          
                 95% CI : (0.8843, 1)
    No Information Rate : 0.5        
    P-Value [Acc > NIR] : 9.313e-10  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0        
            Specificity : 1.0        
         Pos Pred Value : 1.0        
         Neg Pred Value : 1.0        
             Prevalence : 0.5        
         Detection Rate : 0.5        
   Detection Prevalence : 0.5        
      Balanced Accuracy : 1.0        
                                     
       'Positive' Class : setosa