How to Choose Between Machine Learning Models

One of the most common ML interview questions isn’t about a specific algorithm — it’s “how do you decide which model to use?” The wrong answer lists every algorithm you know. The right answer is a decision framework driven by data characteristics, business constraints, and what you’d lose by being wrong.

What the Interviewer Is Testing

Can you reason about trade-offs rather than cargo-culting “use XGBoost for everything”? Do you understand the practical considerations that dominate in production — interpretability requirements, inference latency, training data size, feature types — not just academic performance?

The Framework: Five Questions First

Before touching an algorithm, answer these:

  1. How much labeled data do you have? Hundreds of rows → simple models. Millions → complex models. Billions of sequences → transformers.
  2. What does the input look like? Tabular (structured features), text, images, time series, or graphs — this heavily constrains the choice.
  3. Does the model need to be interpretable? A credit score model used to deny a loan must be explainable to the applicant (regulatory requirement in US and EU). A recommendation model doesn’t.
  4. What are the inference constraints? Latency <1ms (ad ranking on mobile), throughput of 10K/sec (fraud scoring), or batch overnight are very different requirements.
  5. What does a mistake cost? False negatives vs false positives trade-off drives both metric choice and model complexity tolerance.
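The five-question triage above can be sketched as code. This is a toy illustration, not a rule from any library; the thresholds and return strings are my own assumptions:

```python
# Toy sketch of the five-question triage. Thresholds are illustrative
# assumptions, not hard rules.
def suggest_model_family(n_rows, data_type, needs_interpretability, latency_ms):
    """Return a rough starting-point model family for a supervised problem."""
    if data_type != "tabular":
        return "deep learning (fine-tune a pretrained model)"
    if needs_interpretability or n_rows < 1_000:
        return "logistic/linear regression"
    if latency_ms is not None and latency_ms < 1:
        return "logistic/linear regression or a small tree ensemble"
    return "random forest, then gradient boosting if tuning budget allows"

print(suggest_model_family(500, "tabular", False, None))
print(suggest_model_family(5_000_000, "text", False, None))
```

The point is not the exact thresholds but the order of the checks: data type first, interpretability and data size next, latency after that.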

Tabular Data: The Most Common Case

Most production ML is tabular: rows of mixed numeric, categorical, and boolean features. The landscape from simple to complex:

Logistic/Linear Regression — Start Here

Always fit a linear model first. Reasons:

  • Training is fast (seconds on millions of rows)
  • Coefficients are interpretable: “being over 30 days delinquent increases default probability by 3.2×”
  • Easily debuggable — you can inspect exactly which features drove a prediction
  • Required by regulation in many industries (ECOA, GDPR Article 22)

When it fails: when the decision boundary is non-linear. You can extend it with interaction terms and polynomial features, but this quickly becomes unwieldy.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(C=1.0, max_iter=1000))
])
pipe.fit(X_train, y_train)

# Interpretability: inspect coefficients
coef_df = pd.DataFrame({'feature': X_train.columns, 'coef': pipe['model'].coef_[0]})
print(coef_df.sort_values('coef', ascending=False))

Decision Trees and Random Forests

Use when:

  • Features are mixed types (numeric + categorical) — trees handle this natively without encoding or scaling
  • Non-linear relationships between features and target
  • You need some interpretability (single decision tree is fully interpretable; random forest less so but feature importance is useful)
  • You have outliers or don’t want to worry about feature scaling

Random forests are robust defaults. When in doubt about which tree model to use, start here.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_leaf=5,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
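The feature-importance interpretability mentioned above can be inspected directly. This sketch uses synthetic data with a known answer (2 informative features, 3 noise features) so the ranking can be sanity-checked:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 2 informative features and 3 noise features, so the
# importance ranking has a known right answer.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2,
                           n_redundant=0, random_state=42)
rf = RandomForestClassifier(n_estimators=300, min_samples_leaf=5,
                            random_state=42, n_jobs=-1).fit(X, y)

# Importances are normalized to sum to 1; higher = more impurity reduction.
importances = pd.Series(rf.feature_importances_,
                        index=[f"f{i}" for i in range(5)])
print(importances.sort_values(ascending=False))
```

Impurity-based importances are a coarser tool than linear coefficients (they show which features matter, not in which direction), which is why the article calls random forests only "somewhat" interpretable.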

Gradient Boosting (XGBoost / LightGBM / CatBoost)

The go-to for tabular data competitions and production systems where maximum predictive accuracy matters. Typically outperforms random forest on tabular data when properly tuned.

When to prefer over random forest:

  • You have time to tune (learning rate, depth, min_child_weight — 3–5 parameters matter)
  • You want the best possible accuracy on tabular data
  • You have sparse features (LightGBM handles sparse input efficiently)

When to stick with random forest:

  • You need a quick baseline with minimal tuning
  • Training speed matters — random forest trees train independently and in parallel, while boosting builds each tree on the errors of the previous one, so training is inherently sequential
  • The extra accuracy from boosting doesn’t justify the tuning effort

import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,
    min_child_samples=20,
    class_weight='balanced',
    random_state=42
)
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)])

Unstructured Data: Images, Text, Audio

For unstructured data, classical ML (logistic regression, trees) rarely works well. Deep learning is the default:

  • Images: CNNs for classification/detection. Fine-tune a pretrained model (ResNet, EfficientNet, Vision Transformer). Rarely train from scratch unless you have millions of images.
  • Text: Transformer models (BERT, RoBERTa) for classification. Fine-tune on your labeled data. For generation, GPT-style models. For semantic search, embedding models + cosine similarity.
  • Time series: Feature engineering + gradient boosting is often competitive with LSTM/Transformer for tabular time series. Pure sequence models win for raw signals (ECG, audio waveforms).

Rule of thumb: if a domain expert with Excel could manually engineer the features, start with tabular ML. If the signal is in the raw structure (pixel arrangement, word order), use deep learning.
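The time-series point above — feature engineering feeding a tabular model — usually means lag and rolling-window features. A minimal sketch; the series, windows, and column names are illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative daily series; in practice this is your raw signal.
rng = np.random.default_rng(0)
ts = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=60, freq="D"),
    "value": rng.normal(100, 10, size=60).cumsum() / 10,
})

# Lag and rolling features turn the sequence into a tabular problem
# that gradient boosting can consume directly. Note the shift(1) inside
# the rolling mean: it keeps the current value out of its own features.
ts["lag_1"] = ts["value"].shift(1)
ts["lag_7"] = ts["value"].shift(7)
ts["roll_mean_7"] = ts["value"].shift(1).rolling(7).mean()
ts["day_of_week"] = ts["date"].dt.dayofweek

features = ts.dropna()  # drop rows without a full feature window
print(features.columns.tolist())
```

The `shift(1)` before the rolling mean is the detail interviewers probe: without it, each row's feature leaks the value you are trying to predict.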

Bias-Variance Perspective on Model Choice

Every model choice is a bias-variance trade-off:

| Model | Bias | Variance | Best when |
|---|---|---|---|
| Linear/Logistic | High | Low | Small data, linear relationships |
| Decision Tree (deep) | Low | High | Avoid alone — use ensemble |
| Random Forest | Low | Medium | Moderate data, robust default |
| Gradient Boosting | Low | Low–Medium | Large tabular data, tuning available |
| Deep Neural Network | Very low | Very high (without regularization) | Very large data, unstructured input |

Small dataset + complex model = high variance = overfitting. The answer to “which model should I use with 500 rows?” is almost always “logistic regression or regularized linear model.”
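That small-data claim can be checked empirically. A minimal sketch with 500 synthetic rows and 10% label noise, comparing a regularized linear model against an unpruned tree:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 500 rows with 10% label noise: the regime where the article
# recommends a regularized linear model.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

simple = LogisticRegression(max_iter=1000)
complex_ = DecisionTreeClassifier(max_depth=None, random_state=0)

# Cross-validation exposes the variance: the deep tree fits the noise
# in each training fold and pays for it on the held-out fold.
simple_cv = cross_val_score(simple, X, y, cv=5).mean()
complex_cv = cross_val_score(complex_, X, y, cv=5).mean()
print(f"logistic CV accuracy:  {simple_cv:.3f}")
print(f"deep tree CV accuracy: {complex_cv:.3f}")
```

On this kind of data the high-variance tree typically scores worse out of fold despite fitting the training set perfectly.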

The Process: Start Simple, Add Complexity Only When Needed

1. Establish a baseline (most frequent class, mean, or a single rule)
2. Fit logistic/linear regression → measure gap from baseline
3. Fit random forest → measure gap from linear model
4. If gap is significant and you have tuning budget: try gradient boosting
5. If still unsatisfied AND you have large data + unstructured input: try deep learning
6. At each step: ask "is this improvement worth the added complexity and maintenance cost?"

A 0.5% AUC improvement from moving from random forest to XGBoost is worth it for fraud detection at Visa scale, where billions of dollars are at stake. It’s not worth it for an internal churn model with 50K users.
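Steps 1 through 3 of the ladder can be run in a few lines. The data is synthetic and the metric choice is illustrative; the point is that every model is measured against the rung below it:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative imbalanced binary problem (80/20 split of classes).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)

# Steps 1-3 of the ladder: trivial baseline, linear model, random forest.
ladder = {
    "baseline (majority class)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
scores = {}
for name, model in ladder.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:28s} AUC = {scores[name]:.3f}")
```

The majority-class baseline scores AUC 0.5 by construction; each gap above it is the number you weigh against added complexity at steps 4 and 6.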

Practical Interview Answer Template

When asked “how would you approach model selection for problem X”:

  1. State the data type and size: “We have 5M rows of tabular data with mixed numeric and categorical features.”
  2. State the constraint: “The model will run in real-time at 10K/sec, so inference must be <1ms. That rules out deep models.”
  3. State the interpretability requirement: “This is a credit decision model — it needs to be explainable. I’d use logistic regression or a shallow gradient boosted tree.”
  4. Propose an experiment: “I’d start with a regularized logistic regression as baseline, then test LightGBM, and compare both on F1 and calibration using 5-fold stratified CV.”
  5. Define stopping criteria: “If the uplift from LightGBM over logistic regression is >3% F1 and we can add SHAP explanations to meet interpretability requirements, we go with boosting.”
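The experiment in step 4 — comparing two models on F1 and calibration under 5-fold stratified CV — might look like this sketch. A random forest stands in for LightGBM so the example stays inside scikit-learn, and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Illustrative imbalanced data standing in for a credit portfolio.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.85, 0.15], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# neg_brier_score measures calibration: lower Brier loss means
# better-calibrated probabilities, which matters as much as F1
# for a credit decision model.
scoring = {"f1": "f1", "brier": "neg_brier_score"}

for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("forest", RandomForestClassifier(n_estimators=200,
                                                      random_state=0))]:
    res = cross_validate(model, X, y, cv=cv, scoring=scoring)
    print(f"{name}: F1 = {res['test_f1'].mean():.3f}, "
          f"Brier = {-res['test_brier'].mean():.3f}")
```

Reporting both metrics side by side is what makes the stopping criterion in step 5 decidable rather than a judgment call after the fact.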

Common Interview Mistakes

  • Jumping straight to “I’d use XGBoost” without asking about data size, input type, or constraints
  • Not establishing a baseline — you can’t know if your model is good without one
  • Choosing deep learning for tabular data with 10K rows
  • Ignoring inference latency — a model that takes 200ms per prediction can’t run in an ad auction
  • Treating model selection as a pure accuracy optimization problem, ignoring business constraints
