Q&A 2 How do you split a dataset into training and test sets?

2.1 Explanation

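Splitting a dataset lets you train a model on one portion and evaluate it on a separate portion the model has never seen. Performance on that held-out test set estimates how well the model will generalize to new data, whereas performance on the training data alone tends to be overly optimistic.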
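To make the idea concrete, here is a minimal sketch of a random hold-out split using only the Python standard library. The helper name `holdout_split` and its parameters are hypothetical, chosen for illustration; the sketch is not how any particular library implements splitting.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of `rows` and cut it into (train, test).

    Hypothetical helper for illustration only.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(150))  # stand-in for 150 row indices
train, test = holdout_split(rows)
print(len(train), len(test))  # 105 45
```

Every row lands in exactly one of the two sets, and fixing the seed means the same rows are held out on every run.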

## Python Code
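import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data and separate the feature columns from the target column
df = pd.read_csv("data/iris.csv")
X = df.drop(columns="species")
y = df["species"]

# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Training set size:", len(X_train))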

Training set size: 105
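The split above shuffles all rows together, so the class proportions in the training and test sets can drift apart. A minimal sketch of one way to keep them equal, assuming scikit-learn's `stratify` argument to `train_test_split`; the label list and placeholder features are toy stand-ins (an assumption, not the contents of `data/iris.csv`):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the species column: three classes with 50 rows each (assumed)
y = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
X = [[i] for i in range(len(y))]  # placeholder single-column features

# stratify=y asks for the same class proportions in both parts of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(sorted(train_counts.values()))  # [35, 35, 35]
print(sorted(test_counts.values()))   # [15, 15, 15]
```

Stratifying matters most when classes are imbalanced or the dataset is small, where a purely random split could leave a class underrepresented in the test set.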

2.2 R Code

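library(caret)

# Load the data
data <- readr::read_csv("data/iris.csv")

# Fix the seed so the partition is reproducible
set.seed(42)

# Draw row indices for a 70% training set, sampled within each species
train_index <- createDataPartition(data$species, p = 0.7, list = FALSE)
train <- data[train_index, ]   # rows selected for training
test <- data[-train_index, ]   # all remaining rows
nrow(train)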

[1] 105