One of the most common LLM interview questions in 2026: “Would you fine-tune a model or train from scratch?” Almost always the answer is fine-tune — but the nuance is in how to fine-tune, and when fine-tuning itself is overkill compared to prompt engineering or RAG.
What the Interviewer Is Testing
Do you understand the economics and tradeoffs of the LLM adaptation stack? Can you reason about when each approach (prompt engineering → RAG → fine-tuning → pretraining) is appropriate? Do you know LoRA vs full fine-tuning and why it matters practically?
Training from Scratch: When and Why (Almost Never)
Training a frontier LLM from scratch:
- GPT-4: estimated $50–100M in compute alone
- LLaMA-3 70B: millions of H100 GPU-hours of training compute, i.e. millions of dollars
- Data: trillions of tokens of carefully curated text
- Team: dozens of researchers, months of iteration
You train from scratch when:
- Your domain is so specialized that existing pretraining corpora are inadequate (genomics sequences, specialized code, proprietary legal language)
- You need a model too small for general LLMs to be competitive at your latency target (on-device models, <1B parameters)
- You’re a foundation model company (OpenAI, Anthropic, Google DeepMind)
In every other case — which means 99%+ of enterprise and product ML work — you start from a pretrained checkpoint.
The Adaptation Stack
From least to most expensive, in the order you should try them:
1. Prompt Engineering / System Prompts (free, minutes)
2. Few-Shot Prompting (free, hours)
3. RAG (Retrieval-Augmented Generation) (infrastructure cost, days)
4. Fine-Tuning (LoRA / QLoRA) (GPU cost, days to weeks)
5. Full Fine-Tuning (significant GPU cost, weeks)
6. Continued Pretraining (very large GPU cost, months)
Start at the top, descend only when you’ve exhausted the current level. Many teams jump straight to fine-tuning when prompt engineering would have solved the problem in a day.
When Fine-Tuning Beats Prompt Engineering / RAG
Fine-tuning is the right answer when:
- Style/format consistency: You need every output to follow a precise structure (JSON schema, specific tone, domain vocabulary) and few-shot examples aren’t reliable enough.
- Latency: You can’t afford long system prompts or retrieved context at inference time. A fine-tuned model encodes knowledge in weights, not context.
- Knowledge that doesn’t fit in context: If your knowledge base is 10M documents, you can’t stuff them all into a RAG retrieval window. Fine-tuning on distilled summaries can embed the knowledge.
- Cost: A fine-tuned smaller model (7B params) serving at $0.001/call beats a prompted GPT-4 at $0.06/call when you have millions of daily requests.
- Data privacy: You can’t send proprietary data to an external API. Fine-tune a local model.
RAG is better when: your knowledge changes frequently (fine-tuning requires retraining on updates), or when attribution (“here’s where this answer came from”) is required.
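The cost bullet above works out to a simple break-even calculation. A quick sketch (the per-call prices are the illustrative figures from this section; the one-off fine-tuning cost is an assumed number, not a quote):

```python
# Break-even point for serving a fine-tuned small model vs. calling a large API model.
def break_even_calls(api_cost_per_call: float,
                     ft_cost_per_call: float,
                     one_off_finetune_cost: float) -> float:
    """Number of calls after which fine-tuning is cheaper overall."""
    savings_per_call = api_cost_per_call - ft_cost_per_call
    return one_off_finetune_cost / savings_per_call

calls = break_even_calls(api_cost_per_call=0.06,      # prompted GPT-4 (figure from above)
                         ft_cost_per_call=0.001,      # fine-tuned 7B (figure from above)
                         one_off_finetune_cost=5000)  # assumed GPU + engineering cost
print(f"Break-even after ~{calls:,.0f} calls")  # ~84,746 calls
```

At millions of daily requests, the break-even point arrives on day one; at a few thousand requests a day, the prompted API model may stay cheaper for months.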
Full Fine-Tuning
Update all model parameters on your labeled dataset. Highest potential accuracy. Also most expensive: besides the weights themselves, you must store gradients plus optimizer state for every parameter, roughly 3–4× the weight memory with Adam. A 7B model that runs inference in ~14GB (fp16) can need on the order of 60–110GB of GPU RAM for full fine-tuning, depending on precision and optimizer.
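Those numbers follow from per-parameter bookkeeping. One common accounting, as a rough sketch (it ignores activations and framework overhead, and the optimizer-state byte counts are typical values, not exact):

```python
def finetune_gb(n_params: int, opt_bytes_per_param: int) -> float:
    """Approximate fine-tuning memory: fp16 weights (2 B) + fp16 gradients (2 B)
    + optimizer state. opt_bytes_per_param is ~4 for lean 8-bit optimizer states,
    up to ~12 for fp32 Adam moments plus fp32 master weights."""
    return n_params * (2 + 2 + opt_bytes_per_param) / 1e9

def inference_gb(n_params: int) -> float:
    """fp16 weights only."""
    return n_params * 2 / 1e9

print(finetune_gb(7_000_000_000, 4))   # 56.0  (lean optimizer state)
print(finetune_gb(7_000_000_000, 12))  # 112.0 (fp32 Adam + master weights)
print(inference_gb(7_000_000_000))     # 14.0
```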
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")
# Standard fine-tuning — all 8B parameters get gradients
training_args = TrainingArguments(
    output_dir="./ft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    fp16=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumes a tokenized Dataset prepared beforehand
)
trainer.train()
Risk: catastrophic forgetting — fine-tuning on a narrow task can degrade general capabilities. Use a low learning rate (1e-5 to 3e-5) and mix in general-purpose data.
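Mixing in general-purpose data can be as simple as sampling from two pools with fixed probabilities. A minimal sketch (the 90/10 split is an assumed starting ratio, not a prescription):

```python
import random

def mixed_stream(domain_examples, general_examples, p_domain=0.9, n=1000, seed=0):
    """Yield a training stream that is mostly domain data with a slice of
    general-purpose data, to reduce catastrophic forgetting."""
    rng = random.Random(seed)
    for _ in range(n):
        pool = domain_examples if rng.random() < p_domain else general_examples
        yield rng.choice(pool)

stream = list(mixed_stream(["domain_ex"] * 5, ["general_ex"] * 5, n=1000))
print(stream.count("domain_ex") / len(stream))  # close to 0.9
```

Libraries like HuggingFace `datasets` offer equivalent interleaving utilities; the point is that the mixing ratio is an explicit hyperparameter you should tune and log.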
LoRA: Parameter-Efficient Fine-Tuning
LoRA (Low-Rank Adaptation, Hu et al. 2022) is the dominant fine-tuning technique in production. Instead of updating all parameters, it freezes the pretrained weights and adds small low-rank matrices to the attention layers:
W_updated = W_pretrained + B × A
# W_pretrained: d × d (frozen)
# A: r × d (trainable, random init)
# B: d × r (trainable, zero init)
# r (rank): typically 4–64, much smaller than d (768–4096)
Trainable parameters: 2 × d × r per layer vs d² for full fine-tuning. At rank 16, LoRA uses ~0.1–1% of the parameters of full fine-tuning — 100–1000× fewer parameters to train, store, and serve.
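The parameter arithmetic above is easy to check directly (d = 4096 is a typical hidden size; the rank is the r = 16 used below):

```python
d, r = 4096, 16
full = d * d        # parameters updated per weight matrix in full fine-tuning
lora = 2 * d * r    # B (d x r) plus A (r x d)
print(full // lora)  # 128 -> LoRA trains 128x fewer parameters per matrix
print(lora / full)   # 0.0078125 -> under 1% of the matrix's parameters
```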
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank — higher = more expressive, more parameters
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# e.g. trainable params: ~4M || all params: ~8B || trainable%: well under 0.1%
# (exact counts depend on the model's hidden sizes and layer count)
At inference time: merge the LoRA matrices into the base weights (B×A added to W) — zero inference overhead. Or keep them separate to hot-swap different LoRA adapters (one per customer or use case) on the same base model.
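The zero-overhead merge works because matrix multiplication distributes over addition: x·(W + BA) = x·W + x·(BA). A tiny pure-Python check on toy matrices (no ML library assumed):

```python
def matmul(X, Y):
    """Naive matrix product of nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    """Element-wise matrix sum."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(X, Y)]

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen pretrained weight (d x d, d = 2)
B = [[0.5], [1.0]]            # d x r, r = 1
A = [[2.0, 0.0]]              # r x d
x = [[1.0, 1.0]]              # one input row

merged = matmul(x, add(W, matmul(B, A)))               # inference with merged weights
separate = add(matmul(x, W), matmul(matmul(x, B), A))  # base path + adapter path
print(merged == separate)  # True
```

Hot-swapping adapters keeps the second (unmerged) form: the base `x·W` pass is shared, and only the cheap low-rank path changes per customer.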
QLoRA: Fine-Tuning on Consumer Hardware
QLoRA (Dettmers et al. 2023) combines LoRA with 4-bit quantization of the base model. A 7B model whose full fine-tuning footprint runs to tens of gigabytes fits in 6–8GB. This opened up fine-tuning to single consumer GPUs (RTX 3090, 4090).
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4 — better than int4 for neural weights
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare the quantized model for training, then apply LoRA (lora_config from above)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
QLoRA quality is close to full fine-tuning in most benchmarks. For production fine-tuning on a single GPU, QLoRA is the default starting point.
Instruction Fine-Tuning and RLHF
A base LLM (raw pretraining) completes text — it doesn’t follow instructions. To make it assistant-like:
- Instruction fine-tuning (SFT — Supervised Fine-Tuning): Fine-tune on (instruction, response) pairs. The model learns to respond to user requests rather than just continuing text. LLaMA → LLaMA-Instruct via SFT on high-quality instruction datasets.
- RLHF (Reinforcement Learning from Human Feedback): Collect human preference data (which response is better?). Train a reward model on preferences. Use PPO to optimize the LLM against the reward model. Aligns the model to human preferences. Used by ChatGPT, Claude, Gemini.
- DPO (Direct Preference Optimization, 2023): Achieves RLHF-like alignment without training a separate reward model or running RL. Simpler, more stable. Increasingly preferred over PPO-based RLHF for production alignment.
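DPO's objective fits in a few lines. This sketch implements the standard DPO loss given per-sequence log-probabilities under the policy and the frozen reference model (the numeric inputs are made up for illustration):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).
    Low loss when the policy prefers the chosen response more
    strongly than the reference model does."""
    margin = (policy_chosen_logp - ref_chosen_logp) \
             - (policy_rejected_logp - ref_rejected_logp)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy has shifted toward the chosen response relative to the reference
print(round(dpo_loss(-10.0, -20.0, -12.0, -15.0), 4))
```

Note there is no reward model and no sampling loop: the preference pair and the two log-probabilities are enough, which is where DPO's simplicity over PPO-based RLHF comes from.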
Dataset Size Requirements
| Task | Minimum examples | Notes |
|---|---|---|
| Prompt engineering | 0–5 | In-context examples |
| LoRA fine-tuning (style/format) | 100–1,000 | Quality matters more than quantity |
| LoRA fine-tuning (domain knowledge) | 1,000–50,000 | Depends on domain complexity |
| Full fine-tuning | 10,000–100,000+ | Less risk of overfitting at scale |
| Continued pretraining | Billions of tokens | Full corpus of domain documents |
Evaluation
Fine-tuning without evaluation is reckless. Measure:
- Task-specific metrics: ROUGE for summarization, BLEU for translation, F1 for extraction, accuracy for classification
- General capability regression: Run standard benchmarks (MMLU, HellaSwag) before and after — ensure fine-tuning didn’t degrade general reasoning
- Human evaluation: For open-ended generation, have annotators rate quality; automated metrics miss quality in ways humans catch
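Of the task metrics above, F1 for extraction is the one teams most often implement by hand. A minimal set-based version (the example spans are made up):

```python
def extraction_f1(predicted: set, gold: set) -> float:
    """Micro F1 over extracted items (e.g., entity spans treated as a set)."""
    if not predicted and not gold:
        return 1.0
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = {"Acme Corp", "2024-01-03", "London"}
gold = {"Acme Corp", "2024-01-03", "Paris"}
print(round(extraction_f1(pred, gold), 3))  # 0.667
```

Run the same metric on the model before and after fine-tuning, alongside the general benchmarks, so you can see both the task gain and any capability regression in one table.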
Common Interview Mistakes
- Jumping to fine-tuning without considering if prompt engineering or RAG would solve the problem
- Not mentioning LoRA — saying “fine-tune the model” without specifying the technique signals unfamiliarity with production practice
- Ignoring catastrophic forgetting risk in full fine-tuning
- Treating RLHF and SFT as interchangeable — they serve different purposes
- Not considering inference cost: a fine-tuned 7B model at $0.001/call vs GPT-4 at $0.06/call matters at scale
Related ML Topics
- How Transformer Models Work — the architecture being fine-tuned; understanding attention heads helps choose which modules to apply LoRA to
- Overfitting and Regularization — catastrophic forgetting is a form of overfitting; low learning rate and mixed training data are the remedies
- Cross-Validation Strategies — evaluating fine-tuned LLMs requires held-out test sets and benchmark regressions, not just validation loss
- How to Choose Between ML Models — the decision of fine-tune vs RAG vs prompt engineering is the LLM-specific version of the model selection framework