One of the most common LLM interview questions in 2026: “Would you fine-tune a model or train from scratch?” Almost always the answer is fine-tune — but the nuance is in how to fine-tune, and when fine-tuning itself is overkill compared to prompt engineering or RAG.
What the Interviewer Is Testing
Do you understand the economics and tradeoffs of the LLM adaptation stack? Can you reason about when each approach (prompt engineering → RAG → fine-tuning → pretraining) is appropriate? Do you know LoRA vs full fine-tuning and why it matters practically?
Training from Scratch: When and Why (Almost Never)
Training a frontier LLM from scratch:
- GPT-4: estimated $50–100M in compute alone
- LLaMA-3 70B: millions of H100 GPU-hours of training compute, i.e. millions of dollars
- Data: trillions of tokens of carefully curated text
- Team: dozens of researchers, months of iteration
You train from scratch when:
- Your domain is so specialized that existing pretraining corpora are inadequate (genomics sequences, specialized code, proprietary legal language)
- You need a model too small for general LLMs to be competitive at your latency target (on-device models, <1B parameters)
- You’re a foundation model company (OpenAI, Anthropic, Google DeepMind)
In every other case — which means 99%+ of enterprise and product ML work — you start from a pretrained checkpoint.
The Adaptation Stack
From least to most expensive, in the order you should try them:
1. Prompt Engineering / System Prompts (free, minutes)
2. Few-Shot Prompting (free, hours)
3. RAG (Retrieval-Augmented Generation) (infrastructure cost, days)
4. Fine-Tuning (LoRA / QLoRA) (GPU cost, days to weeks)
5. Full Fine-Tuning (significant GPU cost, weeks)
6. Continued Pretraining (very large GPU cost, months)
Start at the top, descend only when you’ve exhausted the current level. Many teams jump straight to fine-tuning when prompt engineering would have solved the problem in a day.
When Fine-Tuning Beats Prompt Engineering / RAG
Fine-tuning is the right answer when:
- Style/format consistency: You need every output to follow a precise structure (JSON schema, specific tone, domain vocabulary) and few-shot examples aren’t reliable enough.
- Latency: You can’t afford long system prompts or retrieved context at inference time. A fine-tuned model encodes knowledge in weights, not context.
- Knowledge that doesn’t fit in context: If your knowledge base is 10M documents, you can’t stuff them all into a RAG retrieval window. Fine-tuning on distilled summaries can embed the knowledge.
- Cost: A fine-tuned smaller model (7B params) serving at $0.001/call beats a prompted GPT-4 at $0.06/call when you have millions of daily requests.
- Data privacy: You can’t send proprietary data to an external API. Fine-tune a local model.
RAG is better when: your knowledge changes frequently (fine-tuning requires retraining on updates), or when attribution (“here’s where this answer came from”) is required.
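The cost bullet above works out to a simple break-even calculation. A quick sketch (the per-call prices are the illustrative figures from this section; the one-off fine-tuning cost is an assumed number, not a quote):

```python
# Break-even point for serving a fine-tuned small model vs. calling a large API model.
def break_even_calls(api_cost_per_call: float,
                     ft_cost_per_call: float,
                     one_off_finetune_cost: float) -> float:
    """Number of calls after which fine-tuning is cheaper overall."""
    savings_per_call = api_cost_per_call - ft_cost_per_call
    return one_off_finetune_cost / savings_per_call

calls = break_even_calls(api_cost_per_call=0.06,      # prompted GPT-4 (figure from above)
                         ft_cost_per_call=0.001,      # fine-tuned 7B (figure from above)
                         one_off_finetune_cost=5000)  # assumed GPU + engineering cost
print(f"Break-even after ~{calls:,.0f} calls")  # ~84,746 calls
```

At millions of daily requests, the break-even point arrives on day one; at a few thousand requests a day, the prompted API model may stay cheaper for months.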
Full Fine-Tuning
Update all model parameters on your labeled dataset. Highest potential accuracy. Also most expensive: besides the weights themselves, you must store gradients plus optimizer state for every parameter, roughly 3–4× the weight memory with Adam. A 7B model that runs inference in ~14GB (fp16) can need on the order of 60–110GB of GPU RAM for full fine-tuning, depending on precision and optimizer.
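Those numbers follow from per-parameter bookkeeping. One common accounting, as a rough sketch (it ignores activations and framework overhead, and the optimizer-state byte counts are typical values, not exact):

```python
def finetune_gb(n_params: int, opt_bytes_per_param: int) -> float:
    """Approximate fine-tuning memory: fp16 weights (2 B) + fp16 gradients (2 B)
    + optimizer state. opt_bytes_per_param is ~4 for lean 8-bit optimizer states,
    up to ~12 for fp32 Adam moments plus fp32 master weights."""
    return n_params * (2 + 2 + opt_bytes_per_param) / 1e9

def inference_gb(n_params: int) -> float:
    """fp16 weights only."""
    return n_params * 2 / 1e9

print(finetune_gb(7_000_000_000, 4))   # 56.0  (lean optimizer state)
print(finetune_gb(7_000_000_000, 12))  # 112.0 (fp32 Adam + master weights)
print(inference_gb(7_000_000_000))     # 14.0
```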
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")
# Standard fine-tuning — all 8B parameters get gradients
training_args = TrainingArguments(
    output_dir="./ft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    fp16=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumes a tokenized Dataset prepared beforehand
)
trainer.train()
Risk: catastrophic forgetting — fine-tuning on a narrow task can degrade general capabilities. Use a low learning rate (1e-5 to 3e-5) and mix in general-purpose data.
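Mixing in general-purpose data can be as simple as sampling from two pools with fixed probabilities. A minimal sketch (the 90/10 split is an assumed starting ratio, not a prescription):

```python
import random

def mixed_stream(domain_examples, general_examples, p_domain=0.9, n=1000, seed=0):
    """Yield a training stream that is mostly domain data with a slice of
    general-purpose data, to reduce catastrophic forgetting."""
    rng = random.Random(seed)
    for _ in range(n):
        pool = domain_examples if rng.random() < p_domain else general_examples
        yield rng.choice(pool)

stream = list(mixed_stream(["domain_ex"] * 5, ["general_ex"] * 5, n=1000))
print(stream.count("domain_ex") / len(stream))  # close to 0.9
```

Libraries like HuggingFace `datasets` offer equivalent interleaving utilities; the point is that the mixing ratio is an explicit hyperparameter you should tune and log.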
LoRA: Parameter-Efficient Fine-Tuning
LoRA (Low-Rank Adaptation, Hu et al. 2022) is the dominant fine-tuning technique in production. Instead of updating all parameters, it freezes the pretrained weights and adds small low-rank matrices to the attention layers:
W_updated = W_pretrained + B × A
# W_pretrained: d × d (frozen)
# A: r × d (trainable, random init)
# B: d × r (trainable, zero init)
# r (rank): typically 4–64, much smaller than d (768–4096)
Trainable parameters: 2 × d × r per layer vs d² for full fine-tuning. At rank 16, LoRA uses ~0.1–1% of the parameters of full fine-tuning — 100–1000× fewer parameters to train, store, and serve.
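The parameter arithmetic above is easy to check directly (d = 4096 is a typical hidden size; the rank is the r = 16 used below):

```python
d, r = 4096, 16
full = d * d        # parameters updated per weight matrix in full fine-tuning
lora = 2 * d * r    # B (d x r) plus A (r x d)
print(full // lora)  # 128 -> LoRA trains 128x fewer parameters per matrix
print(lora / full)   # 0.0078125 -> under 1% of the matrix's parameters
```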
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank — higher = more expressive, more parameters
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# e.g. trainable params: ~4M || all params: ~8B || trainable%: well under 0.1%
# (exact counts depend on the model's hidden sizes and layer count)
At inference time: merge the LoRA matrices into the base weights (B×A added to W) — zero inference overhead. Or keep them separate to hot-swap different LoRA adapters (one per customer or use case) on the same base model.
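The zero-overhead merge works because matrix multiplication distributes over addition: x·(W + BA) = x·W + x·(BA). A tiny pure-Python check on toy matrices (no ML library assumed):

```python
def matmul(X, Y):
    """Naive matrix product of nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    """Element-wise matrix sum."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(X, Y)]

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen pretrained weight (d x d, d = 2)
B = [[0.5], [1.0]]            # d x r, r = 1
A = [[2.0, 0.0]]              # r x d
x = [[1.0, 1.0]]              # one input row

merged = matmul(x, add(W, matmul(B, A)))               # inference with merged weights
separate = add(matmul(x, W), matmul(matmul(x, B), A))  # base path + adapter path
print(merged == separate)  # True
```

Hot-swapping adapters keeps the second (unmerged) form: the base `x·W` pass is shared, and only the cheap low-rank path changes per customer.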
QLoRA: Fine-Tuning on Consumer Hardware
QLoRA (Dettmers et al. 2023) combines LoRA with 4-bit quantization of the base model. A 7B model whose full fine-tuning footprint runs to tens of gigabytes fits in 6–8GB. This opened up fine-tuning to single consumer GPUs (RTX 3090, 4090).
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4 — better than int4 for neural weights
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare the quantized model for training, then apply LoRA (lora_config from above)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
QLoRA quality is close to full fine-tuning in most benchmarks. For production fine-tuning on a single GPU, QLoRA is the default starting point.
Instruction Fine-Tuning and RLHF
A base LLM (raw pretraining) completes text — it doesn’t follow instructions. To make it assistant-like:
- Instruction fine-tuning (SFT — Supervised Fine-Tuning): Fine-tune on (instruction, response) pairs. The model learns to respond to user requests rather than just continuing text. LLaMA → LLaMA-Instruct via SFT on high-quality instruction datasets.
- RLHF (Reinforcement Learning from Human Feedback): Collect human preference data (which response is better?). Train a reward model on preferences. Use PPO to optimize the LLM against the reward model. Aligns the model to human preferences. Used by ChatGPT, Claude, Gemini.
- DPO (Direct Preference Optimization, 2023): Achieves RLHF-like alignment without training a separate reward model or running RL. Simpler, more stable. Increasingly preferred over PPO-based RLHF for production alignment.
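DPO's objective fits in a few lines. This sketch implements the standard DPO loss given per-sequence log-probabilities under the policy and the frozen reference model (the numeric inputs are made up for illustration):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).
    Low loss when the policy prefers the chosen response more
    strongly than the reference model does."""
    margin = (policy_chosen_logp - ref_chosen_logp) \
             - (policy_rejected_logp - ref_rejected_logp)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy has shifted toward the chosen response relative to the reference
print(round(dpo_loss(-10.0, -20.0, -12.0, -15.0), 4))
```

Note there is no reward model and no sampling loop: the preference pair and the two log-probabilities are enough, which is where DPO's simplicity over PPO-based RLHF comes from.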
Dataset Size Requirements
| Task | Minimum examples | Notes |
|---|---|---|
| Prompt engineering | 0–5 | In-context examples |
| LoRA fine-tuning (style/format) | 100–1,000 | Quality matters more than quantity |
| LoRA fine-tuning (domain knowledge) | 1,000–50,000 | Depends on domain complexity |
| Full fine-tuning | 10,000–100,000+ | Less risk of overfitting at scale |
| Continued pretraining | Billions of tokens | Full corpus of domain documents |
Evaluation
Fine-tuning without evaluation is reckless. Measure:
- Task-specific metrics: ROUGE for summarization, BLEU for translation, F1 for extraction, accuracy for classification
- General capability regression: Run standard benchmarks (MMLU, HellaSwag) before and after — ensure fine-tuning didn’t degrade general reasoning
- Human evaluation: For open-ended generation, have annotators rate quality; automated metrics miss quality in ways humans catch
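Of the task metrics above, F1 for extraction is the one teams most often implement by hand. A minimal set-based version (the example spans are made up):

```python
def extraction_f1(predicted: set, gold: set) -> float:
    """Micro F1 over extracted items (e.g., entity spans treated as a set)."""
    if not predicted and not gold:
        return 1.0
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = {"Acme Corp", "2024-01-03", "London"}
gold = {"Acme Corp", "2024-01-03", "Paris"}
print(round(extraction_f1(pred, gold), 3))  # 0.667
```

Run the same metric on the model before and after fine-tuning, alongside the general benchmarks, so you can see both the task gain and any capability regression in one table.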
Common Interview Mistakes
- Jumping to fine-tuning without considering if prompt engineering or RAG would solve the problem
- Not mentioning LoRA — saying “fine-tune the model” without specifying the technique signals unfamiliarity with production practice
- Ignoring catastrophic forgetting risk in full fine-tuning
- Treating RLHF and SFT as interchangeable — they serve different purposes
- Not considering inference cost: a fine-tuned 7B model at $0.001/call vs GPT-4 at $0.06/call matters at scale
Related ML Topics
- How Transformer Models Work — the architecture being fine-tuned; understanding attention heads helps choose which modules to apply LoRA to
- Overfitting and Regularization — catastrophic forgetting is a form of overfitting; low learning rate and mixed training data are the remedies
- Cross-Validation Strategies — evaluating fine-tuned LLMs requires held-out test sets and benchmark regressions, not just validation loss
- How to Choose Between ML Models — the decision of fine-tune vs RAG vs prompt engineering is the LLM-specific version of the model selection framework