Train/test/validation splits are foundational — and routinely misunderstood. The most common mistake in applied ML is using the test set to make decisions, which invalidates the entire evaluation. This post covers what each split is for, how to size them, and the subtle mistakes that corrupt your results without being obvious.
What the Interviewer Is Testing
Do you actually understand what the test set is for, and why “I got 94% on the test set” can be a meaningless number? Can you explain the three-way split, when to use cross-validation instead, and the specific failure modes for time series?
The Three Splits and Why You Need All Three
Training set: The data the model sees during training. Used to fit parameters (weights, tree splits, coefficients). The model is allowed to memorize this data.
Validation set: Data the model does not train on, used to make decisions: which hyperparameters? Which model architecture? When to stop training? Every time you look at the validation score and make a choice, you’re implicitly fitting to the validation set.
Test set: Data the model has never influenced, used exactly once — to report final performance. You do not make any decisions based on the test set score. If you look at the test score and decide to tune further, the test set is no longer a valid holdout.
The critical insight: the test set estimates how well your model performs on truly new data. Every time you use the test set to make a decision, you leak information about it into your model selection process, and the test score becomes an optimistic estimate.
Why Two Splits Aren’t Enough
With only train and test:
- You train on training data
- You check test score → decide to tune hyperparameters
- You tune, check test score again → pick the best configuration
- You report the test score
This test score is now biased. You've effectively trained your hyperparameters on the test set, even though the model weights didn't see it. In practice, after repeated rounds of "tune → check test → tune again," the reported score can end up meaningfully more optimistic than true generalization performance.
Adding a validation set: tune hyperparameters using validation score, keep test set truly held out, report test score only at the very end. The test score is now an unbiased estimate of true generalization.
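That workflow can be sketched end to end. Everything below (the synthetic dataset, the model, the hyperparameter grid) is illustrative, not a prescribed setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# 60/20/20 split: test set is carved off first and never touched during tuning
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

# All decisions (here: which regularization strength) use the validation set only
best_C, best_val = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_score = accuracy_score(y_val, model.predict(X_val))
    if val_score > best_val:
        best_C, best_val = C, val_score

# The test set is consulted exactly once, after every decision is final
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_score = accuracy_score(y_test, final.predict(X_test))
print(f"chose C={best_C}, test accuracy={test_score:.3f}")
```

The key structural point is that the loop never reads `y_test`; swapping in a different model or metric doesn't change that shape.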
Split Sizing
The right split sizes depend on dataset size:
| Dataset size | Train | Validation | Test | Notes |
|---|---|---|---|---|
| Small (<10K rows) | 60% | 20% | 20% | Consider cross-validation instead — holdout too noisy |
| Medium (10K–1M) | 70–80% | 10–15% | 10–15% | Standard split |
| Large (>1M) | 98% | 1% | 1% | 1% of 10M rows = 100K — plenty for reliable estimates |
For large datasets, you don’t need 20% held out — even 1% gives very precise estimates. Putting 98% or more in training is almost always better than holding out extra data “just to be safe.”
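The "plenty for reliable estimates" claim follows from the binomial standard error: the 95% confidence half-width on accuracy shrinks as 1/√n. A quick sketch, assuming roughly independent test samples and accuracy near 0.5 (the worst case for variance):

```python
import math

def ci_half_width(n, acc=0.5, z=1.96):
    """Approximate 95% CI half-width for accuracy measured on n test samples."""
    return z * math.sqrt(acc * (1 - acc) / n)

print(f"{ci_half_width(100):.3f}")      # 0.098: a 100-sample test set is roughly +/-10%
print(f"{ci_half_width(10_000):.3f}")   # 0.010
print(f"{ci_half_width(100_000):.3f}")  # 0.003: 1% of 10M rows
```

This is why the table above shrinks the held-out fractions as the dataset grows: the estimate's precision depends on the absolute number of test rows, not the percentage.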
How to Split: Code
```python
from sklearn.model_selection import train_test_split

# Step 1: Split off test set first — never touch it again until final evaluation
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

# Step 2: Split remaining into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176,  # 0.176 of 0.85 ≈ 0.15 of total
    stratify=y_temp, random_state=42
)

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
print(f"Positive rate — Train: {y_train.mean():.3f}, Val: {y_val.mean():.3f}, Test: {y_test.mean():.3f}")
```
`stratify=y` is critical for classification — it ensures each split has the same positive class rate. Without it, an imbalanced dataset can leave you with a validation set containing few or no positive examples by chance, making validation metrics meaningless.
Cross-Validation vs Hold-Out Validation
For small datasets (<10K rows), a fixed validation set gives noisy estimates. K-fold cross-validation is more reliable: train k times, validate on k different folds, and average the scores. Averaging reduces the variance of the estimate — though not by a full factor of k, since the folds share training data and their scores are correlated.
```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# For small datasets: use CV for model selection
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train_full, y_train_full, cv=skf, scoring='roc_auc')
print(f"CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Still keep a test set for final evaluation.
# cross_val_score does not fit `model` in place, so refit on the full
# training data first — then call this once, after all decisions are made:
model.fit(X_train_full, y_train_full)
final_score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {final_score:.3f}")
```
Time Series: The Shuffle Trap
Never randomly shuffle time series data before splitting. With a random 20% held out from January-through-December data, the model trains on points scattered across the whole year — so for many test points, it has already seen data from their future. This is leakage: at training time, the model gets information that would not exist at prediction time.
The chronological split rule: training data must always be strictly older than validation, which must be strictly older than test.
```python
import pandas as pd

df = df.sort_values('timestamp').reset_index(drop=True)
n = len(df)
train = df.iloc[:int(0.7 * n)]
val = df.iloc[int(0.7 * n):int(0.85 * n)]
test = df.iloc[int(0.85 * n):]

# Or by date
train = df[df.timestamp < '2025-07-01']
val = df[(df.timestamp >= '2025-07-01') & (df.timestamp < '2025-10-01')]
test = df[df.timestamp >= '2025-10-01']
```
Also consider adding a gap between train and val (and val and test) equal to the prediction horizon. If you’re predicting 7-day churn, samples within 7 days of the train cutoff may have labels that “bleed” future information. Drop them.
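One way to sketch that gap, assuming a daily `timestamp` column and the 7-day horizon from the churn example (both hypothetical):

```python
import pandas as pd

# Hypothetical daily data; the 'timestamp' column name is assumed
df = pd.DataFrame({'timestamp': pd.date_range('2025-01-01', periods=365, freq='D')})

horizon = pd.Timedelta(days=7)        # prediction horizon (e.g. 7-day churn)
cutoff = pd.Timestamp('2025-10-01')   # train/validation boundary

# Training ends a full horizon before the cutoff, so no training label
# can be computed from events inside the validation window
train = df[df['timestamp'] < cutoff - horizon]
val = df[df['timestamp'] >= cutoff]
```

The rows between `cutoff - horizon` and `cutoff` are simply dropped — their labels depend on events that fall on the validation side of the boundary.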
Data Leakage in Preprocessing
The most common source of inflated validation scores: fitting preprocessing on the full dataset, then splitting. The scaler, imputer, or encoder fitted on the full data “knows” statistics from the validation and test sets, contaminating training.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# WRONG — scaler sees validation data
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_train, X_val = train_test_split(X_scaled)

# RIGHT — scaler fitted only on training data
X_train, X_val = train_test_split(X)
scaler = StandardScaler().fit(X_train)  # fit on train only
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)  # transform with train statistics

# BEST — use Pipeline so CV handles this automatically
pipe = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
cross_val_score(pipe, X, y, cv=5)  # scaler refitted on each train fold
```
The Test Set Is Sacred
Practical rules for the test set:
- Split it off first, before any exploration or feature engineering
- Do not look at test set labels during development — not even for summary statistics
- Report the test score once. If you do re-evaluate, disclose it — each additional look increases the chance the number is optimistic
- In a competition (Kaggle): the public leaderboard is a validation set. The private leaderboard is the test set. “Overfitting to the public leaderboard” means treating a validation set as a test set
Group Splits for Non-Independent Data
If your data has groups (multiple rows per user, patient, or session), a random split will place the same user in both training and validation. The model learns user-specific patterns and validation scores are inflated.
```python
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=user_ids))
X_train, X_test = X[train_idx], X[test_idx]
# Guarantees no user appears in both train and test
```
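If you cross-validate instead of taking a single split, `GroupKFold` gives the same guarantee fold by fold. A small sketch with made-up user IDs:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 5 users, 2 rows each (IDs are illustrative)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
user_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

gkf = GroupKFold(n_splits=5)
overlaps = []
for train_idx, val_idx in gkf.split(X, y, groups=user_ids):
    # Record any user that appears on both sides of a fold
    overlaps.append(set(user_ids[train_idx]) & set(user_ids[val_idx]))

print(overlaps)  # every set is empty: no user straddles a fold
```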
Common Interview Mistakes
- Confusing validation set and test set — using test set for hyperparameter decisions
- Randomly splitting time series data
- Fitting preprocessors on full data before splitting (leakage)
- Not stratifying for imbalanced classification
- Making the test set too small (100 test samples gives ±10% confidence interval on accuracy)
- Not accounting for group structure (multiple rows per entity) in splits
Related ML Topics
- Cross-Validation Strategies — when to use CV instead of a fixed validation split, and how to avoid leakage inside CV
- Handling Imbalanced Datasets — always stratify splits for imbalanced classes; apply resampling inside CV folds
- Feature Selection and Dimensionality Reduction — feature selection must happen inside each CV fold, not before splitting
- How to Choose Between Models — model selection uses validation scores; the test set confirms the winner