Train/test/validation splits are foundational — and routinely misunderstood. The most common mistake in applied ML is using the test set to make decisions, which invalidates the entire evaluation. This post covers what each split is for, how to size them, and the subtle mistakes that corrupt your results without being obvious.
What the Interviewer Is Testing
Do you actually understand what the test set is for, and why “I got 94% on the test set” can be a meaningless number? Can you explain the three-way split, when to use cross-validation instead, and the specific failure modes for time series?
The Three Splits and Why You Need All Three
Training set: The data the model sees during training. Used to fit parameters (weights, tree splits, coefficients). The model is allowed to memorize this data.
Validation set: Data the model does not train on, used to make decisions: which hyperparameters? Which model architecture? When to stop training? Every time you look at the validation score and make a choice, you’re implicitly fitting to the validation set.
Test set: Data the model has never influenced, used exactly once — to report final performance. You do not make any decisions based on the test set score. If you look at the test score and decide to tune further, the test set is no longer a valid holdout.
The critical insight: the test set estimates how well your model performs on truly new data. Every time you use the test set to make a decision, you leak information about it into your model selection process, and the test score becomes an optimistic estimate.
Why Two Splits Aren’t Enough
With only train and test:
- You train on training data
- You check test score → decide to tune hyperparameters
- You tune, check test score again → pick the best configuration
- You report the test score
This test score is now biased. You've effectively trained your hyperparameters on the test set, even though the model weights didn't see it. In practice, after repeated rounds of "tune → check test → tune again," the reported score can end up meaningfully more optimistic than true generalization performance.
Adding a validation set: tune hyperparameters using validation score, keep test set truly held out, report test score only at the very end. The test score is now an unbiased estimate of true generalization.
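That workflow can be sketched end to end. Everything below (the synthetic dataset, the model, the hyperparameter grid) is illustrative, not a prescribed setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# 60/20/20 split: test set is carved off first and never touched during tuning
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

# All decisions (here: which regularization strength) use the validation set only
best_C, best_val = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_score = accuracy_score(y_val, model.predict(X_val))
    if val_score > best_val:
        best_C, best_val = C, val_score

# The test set is consulted exactly once, after every decision is final
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_score = accuracy_score(y_test, final.predict(X_test))
print(f"chose C={best_C}, test accuracy={test_score:.3f}")
```

The key structural point is that the loop never reads `y_test`; swapping in a different model or metric doesn't change that shape.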
Split Sizing
The right split sizes depend on dataset size:
| Dataset size | Train | Validation | Test | Notes |
|---|---|---|---|---|
| Small (<10K rows) | 60% | 20% | 20% | Consider cross-validation instead — holdout too noisy |
| Medium (10K–1M) | 70–80% | 10–15% | 10–15% | Standard split |
| Large (>1M) | 98% | 1% | 1% | 1% of 10M rows = 100K — plenty for reliable estimates |
For large datasets, you don’t need 20% held out — even 1% gives very precise estimates. Putting 98% or more in training is almost always better than holding out extra data “just to be safe.”
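The "plenty for reliable estimates" claim follows from the binomial standard error: the 95% confidence half-width on accuracy shrinks as 1/√n. A quick sketch, assuming roughly independent test samples and accuracy near 0.5 (the worst case for variance):

```python
import math

def ci_half_width(n, acc=0.5, z=1.96):
    """Approximate 95% CI half-width for accuracy measured on n test samples."""
    return z * math.sqrt(acc * (1 - acc) / n)

print(f"{ci_half_width(100):.3f}")      # 0.098: a 100-sample test set is roughly +/-10%
print(f"{ci_half_width(10_000):.3f}")   # 0.010
print(f"{ci_half_width(100_000):.3f}")  # 0.003: 1% of 10M rows
```

This is why the table above shrinks the held-out fractions as the dataset grows: the estimate's precision depends on the absolute number of test rows, not the percentage.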
How to Split: Code
```python
from sklearn.model_selection import train_test_split

# Step 1: Split off test set first — never touch it again until final evaluation
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

# Step 2: Split remaining into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176,  # 0.176 of 0.85 ≈ 0.15 of total
    stratify=y_temp, random_state=42
)

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
print(f"Positive rate — Train: {y_train.mean():.3f}, Val: {y_val.mean():.3f}, Test: {y_test.mean():.3f}")
```
`stratify=y` is critical for classification — it ensures each split has the same positive class rate. Without it, an imbalanced dataset can leave you with a validation set containing few or no positive examples by chance, making validation metrics meaningless.
Cross-Validation vs Hold-Out Validation
For small datasets (<10K rows), a fixed validation set gives noisy estimates. K-fold cross-validation is more reliable: train k times, validate on k different folds, and average the scores. Averaging reduces the variance of the estimate — though not by a full factor of k, since the folds share training data and their scores are correlated.
```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# For small datasets: use CV for model selection
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train_full, y_train_full, cv=skf, scoring='roc_auc')
print(f"CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Still keep a test set for final evaluation.
# cross_val_score does not fit `model` in place, so refit on the full
# training data first — then call this once, after all decisions are made:
model.fit(X_train_full, y_train_full)
final_score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {final_score:.3f}")
```
Time Series: The Shuffle Trap
Never randomly shuffle time series data before splitting. With a random 20% held out from January-through-December data, the model trains on points scattered across the whole year — so for many test points, it has already seen data from their future. This is leakage: at training time, the model gets information that would not exist at prediction time.
The chronological split rule: training data must always be strictly older than validation, which must be strictly older than test.
```python
import pandas as pd

df = df.sort_values('timestamp').reset_index(drop=True)
n = len(df)
train = df.iloc[:int(0.7 * n)]
val = df.iloc[int(0.7 * n):int(0.85 * n)]
test = df.iloc[int(0.85 * n):]

# Or by date
train = df[df.timestamp < '2025-07-01']
val = df[(df.timestamp >= '2025-07-01') & (df.timestamp < '2025-10-01')]
test = df[df.timestamp >= '2025-10-01']
```
Also consider adding a gap between train and val (and val and test) equal to the prediction horizon. If you’re predicting 7-day churn, samples within 7 days of the train cutoff may have labels that “bleed” future information. Drop them.
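One way to sketch that gap, assuming a daily `timestamp` column and the 7-day horizon from the churn example (both hypothetical):

```python
import pandas as pd

# Hypothetical daily data; the 'timestamp' column name is assumed
df = pd.DataFrame({'timestamp': pd.date_range('2025-01-01', periods=365, freq='D')})

horizon = pd.Timedelta(days=7)        # prediction horizon (e.g. 7-day churn)
cutoff = pd.Timestamp('2025-10-01')   # train/validation boundary

# Training ends a full horizon before the cutoff, so no training label
# can be computed from events inside the validation window
train = df[df['timestamp'] < cutoff - horizon]
val = df[df['timestamp'] >= cutoff]
```

The rows between `cutoff - horizon` and `cutoff` are simply dropped — their labels depend on events that fall on the validation side of the boundary.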
Data Leakage in Preprocessing
The most common source of inflated validation scores: fitting preprocessing on the full dataset, then splitting. The scaler, imputer, or encoder fitted on the full data “knows” statistics from the validation and test sets, contaminating training.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# WRONG — scaler sees validation data
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_train, X_val = train_test_split(X_scaled)

# RIGHT — scaler fitted only on training data
X_train, X_val = train_test_split(X)
scaler = StandardScaler().fit(X_train)  # fit on train only
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)  # transform with train statistics

# BEST — use Pipeline so CV handles this automatically
pipe = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
cross_val_score(pipe, X, y, cv=5)  # scaler refitted on each train fold
```
The Test Set Is Sacred
Practical rules for the test set:
- Split it off first, before any exploration or feature engineering
- Do not look at test set labels during development — not even for summary statistics
- Report the test score once. If you do re-evaluate, disclose it — each additional look increases the chance the number is optimistic
- In a competition (Kaggle): the public leaderboard is a validation set. The private leaderboard is the test set. “Overfitting to the public leaderboard” means treating a validation set as a test set
Group Splits for Non-Independent Data
If your data has groups (multiple rows per user, patient, or session), a random split will place the same user in both training and validation. The model learns user-specific patterns and validation scores are inflated.
```python
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=user_ids))
X_train, X_test = X[train_idx], X[test_idx]
# Guarantees no user appears in both train and test
```
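If you cross-validate instead of taking a single split, `GroupKFold` gives the same guarantee fold by fold. A small sketch with made-up user IDs:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 5 users, 2 rows each (IDs are illustrative)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
user_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

gkf = GroupKFold(n_splits=5)
overlaps = []
for train_idx, val_idx in gkf.split(X, y, groups=user_ids):
    # Record any user that appears on both sides of a fold
    overlaps.append(set(user_ids[train_idx]) & set(user_ids[val_idx]))

print(overlaps)  # every set is empty: no user straddles a fold
```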
Common Interview Mistakes
- Confusing validation set and test set — using test set for hyperparameter decisions
- Randomly splitting time series data
- Fitting preprocessors on full data before splitting (leakage)
- Not stratifying for imbalanced classification
- Making the test set too small (100 test samples gives ±10% confidence interval on accuracy)
- Not accounting for group structure (multiple rows per entity) in splits
Related ML Topics
- Cross-Validation Strategies — when to use CV instead of a fixed validation split, and how to avoid leakage inside CV
- Handling Imbalanced Datasets — always stratify splits for imbalanced classes; apply resampling inside CV folds
- Feature Selection and Dimensionality Reduction — feature selection must happen inside each CV fold, not before splitting
- How to Choose Between Models — model selection uses validation scores; the test set confirms the winner