How Does RLHF Work? Reinforcement Learning from Human Feedback Explained

RLHF (Reinforcement Learning from Human Feedback) is the technique that transforms a raw language model into an assistant — the difference between a model that completes text and one that helpfully answers questions, refuses harmful requests, and matches human preferences. Understanding RLHF is now expected in AI/ML interviews at companies building or integrating LLMs.

What the Interviewer Is Testing

Can you explain the three-phase RLHF pipeline clearly? Do you understand why PPO is used, what reward hacking means, and how DPO simplifies the process? Can you distinguish between RLHF and related techniques like Constitutional AI and RLAIF?

Why Raw LLMs Need Alignment

A base LLM (pretrained only) is trained to predict the next token in text from the internet. Given “How do I make a bomb?”, the most likely completion from internet text is a detailed answer — not a refusal. The model optimizes for text prediction, not for being safe, helpful, or honest.

Alignment techniques fine-tune the model on human preferences — responses that humans rate as better. RLHF is the dominant approach used by OpenAI (ChatGPT), Anthropic (Claude), and Google (Gemini).

Phase 1: Supervised Fine-Tuning (SFT)

Start with a pretrained base model. Fine-tune it on a curated dataset of (prompt, ideal response) pairs written by human contractors. This teaches the model the format of being an assistant — answering questions rather than continuing text, being concise, following instructions.

Dataset format:
{
    "prompt": "Explain gradient descent to a 10-year-old.",
    "response": "Imagine you're hiking and you want to reach the lowest valley. ..."
}

# Standard supervised fine-tuning on this dataset
# Cross-entropy loss: predict each token of the ideal response
# 10,000–100,000 high-quality examples typical
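The cross-entropy loss above is usually computed only over the response tokens, with prompt tokens masked out. A minimal sketch of that masking, assuming token-level log-probabilities are already available (function and variable names are illustrative):

```python
def sft_loss(token_logprobs, loss_mask):
    """Cross-entropy over response tokens only.

    token_logprobs: log P(token_t | tokens_<t) for every position
    loss_mask:      1 for response tokens, 0 for prompt tokens
    """
    masked = [lp * m for lp, m in zip(token_logprobs, loss_mask)]
    n_response = sum(loss_mask)
    return -sum(masked) / n_response  # mean negative log-likelihood

# Toy example: 2 prompt tokens (masked out), 3 response tokens
logprobs = [-0.1, -0.2, -0.5, -0.4, -0.3]
mask     = [0,    0,    1,    1,    1]
loss = sft_loss(logprobs, mask)  # (0.5 + 0.4 + 0.3) / 3 = 0.4
```

Masking the prompt matters: the model should learn to produce good responses, not to reproduce the prompts.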

SFT alone produces a good assistant, but it’s limited by the quality and coverage of the demonstration data. The model can’t generalize beyond what it was shown.

Phase 2: Reward Model Training

A reward model (RM) learns to predict human preference scores for model responses. Human raters compare pairs of responses to the same prompt: “Which response is better, A or B?” The reward model trains on these comparisons.

Comparison data:
{
    "prompt": "Write a Python function to reverse a string.",
    "chosen":   "def reverse(s): return s[::-1]",        # human preferred this
    "rejected": "def reverse(s):n    result = ''n    for c in s:n        result = c + resultn    return result"
}

# Reward model: same architecture as the SFT model but with a scalar head
# Loss: -log(sigmoid(reward(chosen) - reward(rejected)))
# Bradley-Terry model: human preference as a probability over the score gap
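The Bradley-Terry loss is small when the chosen response outscores the rejected one and large when the ranking is inverted. A minimal sketch in plain Python:

```python
import math

def pairwise_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    gap = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

# Chosen response scores higher -> small loss:
pairwise_loss(2.0, -1.0)   # ≈ 0.049
# Ranking inverted -> large loss:
pairwise_loss(-1.0, 2.0)   # ≈ 3.049
```

Only the score *gap* matters, so reward model scores have no absolute scale — a point that comes up when comparing rewards across checkpoints.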

The reward model learns a scalar score representing “how much would a human prefer this response?” Higher scores = better. This score will be used to guide RL training.

from transformers import AutoModelForSequenceClassification

reward_model = AutoModelForSequenceClassification.from_pretrained(
    "sft_checkpoint",
    num_labels=1  # single scalar reward output
)
# Fine-tune on (prompt, chosen, rejected) triplets

Phase 3: RL Fine-Tuning with PPO

Now use the reward model as a training signal to further improve the SFT model using reinforcement learning. The policy (the SFT model) generates responses, the reward model scores them, and PPO updates the policy to generate higher-scoring responses.

RLHF training loop:
1. Sample a prompt from the dataset
2. Policy (SFT model) generates a response
3. Reward model scores the response → r
4. KL penalty: subtract β × KL(policy || reference_policy)
   (prevents the model from drifting too far from the SFT checkpoint)
5. Total reward = r - β × KL
6. PPO updates the policy to maximize expected total reward

# KL penalty is critical — without it, the model "reward hacks"
# (generates high-scoring gibberish that tricks the reward model)

The KL (Kullback-Leibler divergence) penalty keeps the RLHF model close to the SFT model. Without it, the policy learns to exploit weaknesses in the reward model — outputting confident-sounding nonsense that scores high but is actually wrong. β controls the trade-off: too low → reward hacking, too high → RL has no effect.
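In practice the KL term is estimated per token from the policy's and reference model's log-probabilities of the sampled response. A minimal sketch of the penalized reward under that common estimator (names are illustrative):

```python
def penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Total reward = RM score minus beta * estimated KL(policy || reference).

    Per-token KL is estimated as log pi(token) - log pi_ref(token),
    summed over the sampled response tokens.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# Policy has drifted toward tokens the reference model finds less likely:
policy_lp = [-0.2, -0.1, -0.3]
ref_lp    = [-0.5, -0.6, -0.9]
penalized_reward(1.5, policy_lp, ref_lp, beta=0.1)  # 1.5 - 0.1 * 1.4 = 1.36
```

The more the policy's token probabilities diverge from the reference, the larger the KL estimate and the bigger the reward deduction — which is exactly the drift-limiting behavior described above.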

Reward Hacking and Overoptimization

A key empirical finding (Gao et al., 2022): as RLHF training progresses, the reward model score keeps improving but gold-standard human preference scores eventually decline. The model has learned to game the reward model specifically, not to actually be more helpful.

Mitigations:

  • KL penalty (as above)
  • Periodically refresh the reward model with new human comparisons
  • Use an ensemble of reward models — harder to simultaneously game all of them
  • Constitutional AI (see below) reduces reliance on a single learned reward signal
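The ensemble mitigation can be as simple as scoring each response with several reward models and taking a conservative aggregate. A sketch — the min-aggregation choice is an assumption; labs differ in how they combine ensemble scores:

```python
def ensemble_reward(scores, how="min"):
    """Aggregate scores from several reward models.

    Taking the min is conservative: a response only scores high if
    every reward model rates it highly, so exploiting one model's
    blind spot no longer pays off.
    """
    if how == "min":
        return min(scores)
    return sum(scores) / len(scores)  # mean

# A response that games one RM but not the others is penalized under min:
ensemble_reward([4.8, 1.1, 1.3], how="min")   # 1.1
ensemble_reward([4.8, 1.1, 1.3], how="mean")  # 2.4
```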

DPO: The Simpler Alternative

Direct Preference Optimization (Rafailov et al., 2023) achieves the same goals as RLHF without training a separate reward model or running RL. DPO derives the optimal policy directly from the preference data using a clever reparametrization.
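Under the hood, the DPO loss is a pairwise logistic loss on implicit rewards of the form β·(log π − log π_ref). A minimal sketch computing it from per-sequence log-probabilities (names are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * (margin_chosen - margin_rejected)),
    where margin = log pi(y|x) - log pi_ref(y|x)."""
    margin_chosen = pi_chosen - ref_chosen
    margin_rejected = pi_rejected - ref_rejected
    logits = beta * (margin_chosen - margin_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy assigns more probability (relative to the reference) to the
# chosen response than to the rejected one -> loss below log(2):
dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
         ref_chosen=-12.0, ref_rejected=-12.0, beta=0.1)  # ≈ 0.513
```

Note that the reference model appears directly in the loss — it plays the same drift-limiting role as the KL penalty in PPO-based RLHF, which is why no separate KL term is needed.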

from trl import DPOTrainer, DPOConfig

config = DPOConfig(
    beta=0.1,          # KL penalty coefficient (same role as in RLHF)
    learning_rate=1e-6,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=sft_model,
    ref_model=sft_model_reference,   # frozen reference; computes KL
    args=config,
    train_dataset=preference_dataset, # {prompt, chosen, rejected}
    tokenizer=tokenizer,
)
trainer.train()

DPO advantages over PPO-based RLHF:

  • No reward model to train and maintain
  • No online generation during training (more stable)
  • Simpler hyperparameter space
  • Faster to iterate

Most open-source fine-tuning (Llama fine-tunes, Mistral instruct variants) uses DPO rather than PPO. PPO is still used at scale by frontier model labs, where the extra complexity buys a higher performance ceiling.

Constitutional AI (Anthropic’s Approach)

RLHF requires expensive human labelers for every comparison. Anthropic’s Constitutional AI (CAI) reduces the human labor requirement:

  1. Define a “constitution” — a set of principles (“be helpful, harmless, and honest; avoid deception; do not assist with bioweapons”)
  2. Use the model itself to evaluate its own responses against the constitution (RLAIF — RL from AI Feedback instead of human feedback)
  3. The AI-generated preference labels supplement or replace human labels for many categories

This scales better than pure RLHF and reduces dependence on human raters for difficult edge cases. Human feedback is focused on the hardest cases where AI self-evaluation is unreliable.
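The RLAIF labeling step can be sketched as prompting a judge model to pick between two responses under the constitution. Here `judge` is a hypothetical stand-in for a real model call, and the constitution text is illustrative:

```python
CONSTITUTION = (
    "Choose the response that is more helpful and honest, "
    "and that avoids assisting with harmful activities."
)

def ai_preference_label(prompt, response_a, response_b, judge):
    """Ask a judge model which response better follows the constitution.

    `judge` is any callable str -> str returning "A" or "B"
    (a hypothetical stand-in for an actual LLM call).
    """
    query = (
        f"{CONSTITUTION}\n\nPrompt: {prompt}\n\n"
        f"Response A: {response_a}\n\nResponse B: {response_b}\n\n"
        "Answer with A or B."
    )
    verdict = judge(query).strip().upper()
    if verdict == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    # Same {prompt, chosen, rejected} format as human comparison data,
    # so it plugs into the same reward-model or DPO training pipeline
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# With a stub judge that always prefers B:
label = ai_preference_label("Explain photosynthesis.", "idk",
                            "Plants convert light into energy...",
                            judge=lambda q: "B")
```

Because the output matches the human comparison format from Phase 2, AI-generated labels can be mixed freely with human labels.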

Common Interview Mistakes

  • Confusing SFT and RLHF — they’re sequential phases, not alternatives
  • Not explaining why the KL penalty is necessary (reward hacking is the key concept)
  • Treating DPO and RLHF as unrelated — DPO optimizes the same KL-constrained preference objective in closed form, not a completely different alignment approach
  • Forgetting that RLHF requires a pretrained model — you can’t do RLHF from scratch
  • Not knowing Constitutional AI / RLAIF as a recent development expected in 2026 interviews

Related ML Topics

  • Fine-tuning LLMs vs Training from Scratch — RLHF builds on SFT (supervised fine-tuning); DPO is a single-stage alternative that achieves similar alignment without a separate reward model
  • How Transformer Models Work — the policy model and reward model are both transformers; the reward model typically reuses the SFT backbone with a scalar head
  • What is RAG? — RAG and RLHF are complementary techniques: RAG grounds responses in retrieved facts, while RLHF aligns tone, safety, and helpfulness
  • Overfitting and Regularization — reward model overfitting to annotator quirks is a core RLHF failure mode; KL penalty acts as a regularizer against policy drift
  • Classification Metrics — reward model evaluation uses pairwise accuracy (Bradley-Terry model); same precision/recall trade-offs apply to the preference classifier

See also: How to Evaluate an LLM — reward model quality is measured by pairwise preference accuracy; overall RLHF alignment is validated with human evaluation and safety red-teaming.
