Reinforcement learning (RL) trains agents to make sequential decisions by maximizing cumulative reward. RL powers game-playing AI (AlphaGo, OpenAI Five), robotics, recommendation systems, and — most relevantly — RLHF (Reinforcement Learning from Human Feedback), which aligns LLMs like ChatGPT and Claude. Understanding RL fundamentals is increasingly tested in ML interviews, especially at companies building AI agents and LLM products.
RL Fundamentals
The RL framework: an agent interacts with an environment. At each timestep, the agent observes the state s, takes an action a, receives a reward r, and transitions to a new state s_next. The goal is to learn a policy (a state -> action mapping) that maximizes the expected cumulative discounted reward E[sum(gamma^t * r_t)], where gamma (typically 0.9-0.99) discounts future rewards.

Key concepts:
(1) State — the current situation (board position in chess, current portfolio in trading, conversation history in a chatbot).
(2) Action — what the agent can do (move a piece, buy/sell, generate a token).
(3) Reward — the immediate feedback signal (win/lose, profit/loss, human preference score).
(4) Value function V(s) — the expected cumulative reward from state s when following a given policy. "How good is this state?"
(5) Q-function Q(s,a) — the expected cumulative reward from taking action a in state s and following the policy thereafter. "How good is this action in this state?"
(6) Policy pi(s) — the agent's decision-making strategy. Maps states to actions (or probability distributions over actions).

The exploration-exploitation tradeoff: the agent must both explore (try new actions to discover better strategies) and exploit (use known good actions to maximize reward). Too much exploration wastes time on bad actions; too much exploitation misses better strategies.
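The discounted-return objective above is easy to make concrete. A minimal sketch (the function name and toy reward sequence are illustrative, not from the original):

```python
# Discounted return for a finite episode: G = sum_t gamma^t * r_t.
# Iterating backwards lets each step reuse the already-discounted tail.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.9, three rewards of 1.0 give 1 + 0.9 + 0.81:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 2.71
```

A reward far in the future contributes gamma^t times its value, which is why gamma near 1 makes the agent more far-sighted.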
Q-Learning and Deep Q-Networks
Q-Learning: learn the Q-function (state-action value) directly from experience, without a model of the environment. Update rule: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s_next, a') - Q(s,a)). The agent nudges its estimate of Q toward the observed reward plus the best estimated future value. Over many episodes, Q converges to the optimal values. The policy: always take the action with the highest Q-value (greedy). With epsilon-greedy exploration: take a random action with probability epsilon (explore), otherwise take the best action (exploit).

Deep Q-Network (DQN, DeepMind 2015): for large state spaces (Atari games: screen pixels), a neural network approximates Q(s,a). Input: the state (an image). Output: Q-values for each possible action. Two key innovations:
(1) Experience replay — store past experiences (s, a, r, s_next) in a buffer and sample random mini-batches for training. This breaks the correlation between consecutive samples and reuses rare experiences.
(2) Target network — use a separate, slowly updated network to compute the target Q-value. This prevents the moving-target problem (the Q-value we are trying to reach keeps changing as the network updates).

DQN achieved superhuman performance on 49 Atari games from raw pixels. Limitation: DQN works only for discrete action spaces (a finite set of actions). For continuous actions (robot joint angles, steering angle), use policy gradient methods.
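The tabular update rule can be sketched end to end on a toy problem. The five-state chain environment below is a hypothetical example (not from the original); the update line is the standard Q-learning rule:

```python
import random

# Hypothetical toy environment: a 5-state chain. Action 1 moves right,
# action 0 moves left. Reaching state 4 gives reward 1 and ends the episode.
def step(s, a):
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == 4 else 0.0
    return s_next, reward, s_next == 4

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(5)]  # Q[state][action]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: explore with probability epsilon, else act greedily
            # (ties broken toward the higher-indexed action).
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda x: (Q[s][x], x))
            s_next, r, done = step(s, a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s', a')
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
# The greedy policy learns to always move right toward the reward:
print([max((0, 1), key=lambda a: (Q[s][a], a)) for s in range(4)])  # [1, 1, 1, 1]
```

In the chain, Q[3][1] converges toward 1.0 and earlier states toward gamma^k of that, so the greedy policy heads right from every state.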
Policy Gradient and PPO
Policy gradient methods directly optimize the policy instead of learning Q-values and deriving the policy from them. The policy pi_theta(a|s) is a neural network parameterized by theta that outputs a probability distribution over actions given the state.

REINFORCE: the simplest policy gradient method. Update: theta <- theta + alpha * gradient(log pi_theta(a|s)) * G, where G is the cumulative reward of the trajectory. Intuition: increase the probability of actions that led to high reward; decrease it for low reward. Problem: high variance (individual trajectories are noisy).

Baseline subtraction: subtract a baseline (typically V(s)) from G to reduce variance. The advantage A(s,a) = Q(s,a) - V(s) measures "how much better is this action compared to the average action in this state."

Actor-Critic: two networks — the actor (policy network) chooses actions; the critic (value network) estimates V(s) and provides the baseline for variance reduction.

PPO (Proximal Policy Optimization, OpenAI 2017): the de facto standard RL algorithm. It clips the policy update to prevent large, destabilizing changes. The clipped objective: min(ratio * advantage, clip(ratio, 1-epsilon, 1+epsilon) * advantage), where ratio = pi_new(a|s) / pi_old(a|s). Clipping removes the incentive to push the probability ratio outside [1-epsilon, 1+epsilon] (epsilon is typically 0.1-0.2), so no single update changes the policy drastically. PPO is stable, relatively sample-efficient, and easy to tune. It is used for RLHF (aligning LLMs), robotics, game AI, and most modern RL applications.
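The clipped objective is simple to compute for a single sample. A minimal sketch (function name and inputs are illustrative):

```python
import math

# PPO's clipped surrogate objective for one (state, action) sample.
# ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities;
# advantage comes from the critic; epsilon is the clip range.
def ppo_clip_objective(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = max(1 - epsilon, min(ratio, 1 + epsilon))
    # Taking the min means the update gets no extra credit for pushing
    # the ratio beyond [1 - epsilon, 1 + epsilon] in the favored direction.
    return min(ratio * advantage, clipped * advantage)

# A doubled probability with positive advantage is capped at (1 + epsilon) * A:
print(ppo_clip_objective(math.log(2.0), 0.0, advantage=1.0))  # 1.2
```

Note the asymmetry: with a negative advantage, min() keeps the unclipped (more negative) term, so the objective never hides how bad an over-aggressive update would be.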
RLHF: Aligning LLMs
RLHF (Reinforcement Learning from Human Feedback) fine-tunes LLMs to be helpful, harmless, and honest. Three stages:
(1) Supervised fine-tuning (SFT) — fine-tune the pre-trained LLM on human-written demonstrations of ideal assistant behavior. This creates a good starting point.
(2) Reward model training — collect human preferences: show two model responses to the same prompt and have a human choose the better one. Train a reward model R(prompt, response) to predict these preferences. The reward model learns what humans consider a "good" response.
(3) RL fine-tuning with PPO — treat the LLM as the RL agent. State = the prompt. Action = generating a response. Reward = R(prompt, response) from the reward model. Use PPO to update the LLM to maximize the reward, with a KL penalty that keeps the model from diverging too far from the SFT model (this prevents reward hacking — finding degenerate responses that fool the reward model).

RLHF is how ChatGPT, Claude, and Gemini are trained to follow instructions, refuse harmful requests, and produce helpful responses.

Alternatives to RLHF: DPO (Direct Preference Optimization) directly optimizes the language model on preference data without training a separate reward model. Simpler pipeline, competitive results, and increasingly popular as an RLHF alternative.
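The reward-plus-KL-penalty combination can be sketched for a single token. This is a simplified version of one common formulation (the function name, coefficient value, and per-token KL estimate are assumptions for illustration):

```python
import math

# Effective per-token RLHF reward: reward-model score minus a KL penalty
# that keeps the policy close to the SFT model.
def rlhf_reward(reward_model_score, logprob_policy, logprob_sft, kl_coef=0.1):
    # Per-token KL estimate: log pi_policy(token) - log pi_sft(token).
    # Positive when the policy has drifted toward this token relative to SFT.
    kl = logprob_policy - logprob_sft
    return reward_model_score - kl_coef * kl

# If the policy assigns a token probability 0.9 where the SFT model assigned 0.3,
# the KL term shaves 0.1 * ln(3) off the reward-model score:
print(rlhf_reward(1.0, math.log(0.9), math.log(0.3)))
```

This is the mechanism that blocks reward hacking: a degenerate response might score highly under R(prompt, response), but only by assigning probability mass the SFT model considers implausible, which the KL term penalizes.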
Multi-Armed Bandit
The multi-armed bandit is the simplest RL problem: there are no state transitions, just repeated action selection. Each "arm" (action) has an unknown reward distribution, and the agent must balance exploring different arms with exploiting the best-known arm.

Applications:
(1) A/B testing — each variant is an arm. Instead of a fixed 50/50 split, a bandit algorithm gradually shifts traffic to the winning variant, converging faster than fixed-sample tests.
(2) Ad selection — which ad to show a user? Each ad is an arm with an unknown click-through rate.
(3) Recommendation — which content to recommend? Balance showing known-good content (exploitation) with trying new content (exploration).

Algorithms:
(1) Epsilon-greedy — with probability epsilon, choose a random arm; otherwise, choose the arm with the highest estimated reward. Simple, but it wastes exploration on clearly bad arms.
(2) UCB (Upper Confidence Bound) — choose the arm maximizing estimated_reward + c * sqrt(log(total_pulls) / arm_pulls). The second term is large for under-explored arms, automatically balancing exploration.
(3) Thompson Sampling — maintain a Bayesian posterior over each arm's reward, sample once from each posterior, and choose the arm with the highest sample. Exploration is natural: uncertain arms have wide posteriors that sometimes produce high samples. As data accumulates, the posteriors narrow and the best arm dominates.

Thompson Sampling is the recommended approach for most practical applications (ad serving, recommendation): it is simple, has strong theoretical and empirical regret performance, and is easy to implement.
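Thompson Sampling for Bernoulli rewards (e.g. click / no click) has a particularly clean form with Beta posteriors. A minimal sketch; the true click-through rates below are hypothetical:

```python
import random

# Thompson Sampling with Beta posteriors for Bernoulli arms.
# true_rates are the (hidden) click-through rates; the agent never sees them,
# only the 0/1 rewards from its own pulls.
def thompson_sampling(true_rates, pulls=5000, seed=0):
    rng = random.Random(seed)
    k = len(true_rates)
    alpha = [1] * k  # Beta posterior: 1 + successes
    beta = [1] * k   # Beta posterior: 1 + failures
    counts = [0] * k
    for _ in range(pulls):
        # Draw one sample from each arm's posterior; play the highest draw.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_rates[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        counts[arm] += 1
    return counts

counts = thompson_sampling([0.05, 0.04, 0.10])
print(counts)  # pulls concentrate on the best arm (index 2)
```

Early on, all three posteriors are wide and every arm gets sampled; as evidence accumulates, the posteriors for the 4-5% arms narrow below the 10% arm's, and traffic shifts to the winner without any explicit exploration schedule.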