“Post-training” — the work of taking a pretrained model and aligning it for instruction-following, helpfulness, and safety — has emerged as one of the most differentiating specialties at AI labs in 2026. Frontier model quality is increasingly determined by post-training, not just by scale. Interviews for post-training roles probe a different set of topics than ML-systems or applied-AI interviews do. This guide covers what is actually asked.
The post-training pipeline at a glance
- Pretraining (next-token prediction on web-scale text)
- Supervised fine-tuning (SFT) on instruction-response pairs
- Preference optimization (RLHF, DPO, IPO, or variants)
- Iterative refinement with human feedback, AI feedback, or both
- Specialized capability training (reasoning, coding, safety)
- Constitutional / safety-specific training
Who hires for this
- Frontier AI labs: OpenAI, Anthropic, Google DeepMind, Meta — large post-training teams
- Specialty model companies: Mistral, Cohere, AI21
- Open-source orgs: HuggingFace (TRL), Allen Institute for AI
- Some enterprise teams (Glean, Harvey) doing per-customer post-training
Core concepts to know
Reinforcement learning from human feedback (RLHF)
- Three-stage pipeline: SFT → reward model → PPO with the reward model
- Reward hacking: when the policy exploits the reward model rather than the underlying preference
- KL penalty against the reference model: keeps the policy from drifting too far from the SFT reference
- PPO algorithmic details: clipping, GAE, value heads (a loss sketch follows this list)
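What interviewers usually mean by “PPO details” compresses to two terms: the clipped surrogate loss and a KL penalty toward the frozen reference policy. Below is a minimal sketch with dummy tensors, assuming per-token log-probs and advantages are already computed; in a real RLHF loop the advantages come from GAE over reward-model scores plus a value head, and the KL term is often folded into the per-token reward rather than added to the loss.

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Clipped surrogate objective; return the negative so it can be minimized.
    ratio = torch.exp(logprobs - old_logprobs)                 # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

def kl_penalty(logprobs, ref_logprobs, kl_coef=0.1):
    # Simple estimator of KL(pi_theta || pi_ref) under samples from pi_theta;
    # penalizes drift from the SFT reference model.
    return kl_coef * torch.mean(logprobs - ref_logprobs)

# Usage with dummy per-token tensors: batch of 4 sequences, 16 tokens each.
logprobs = torch.randn(4, 16, requires_grad=True)
old_logprobs = logprobs.detach() + 0.01 * torch.randn(4, 16)
ref_logprobs = logprobs.detach() + 0.05 * torch.randn(4, 16)
advantages = torch.randn(4, 16)
loss = ppo_clip_loss(logprobs, old_logprobs, advantages) + kl_penalty(logprobs, ref_logprobs)
```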
Direct preference optimization (DPO)
- Skips the explicit reward model; optimizes the policy directly on preference pairs (loss sketch below)
- Mathematically equivalent to the RLHF objective under a Bradley-Terry preference model and other assumptions; often simpler in practice
- IPO, KTO, ORPO, SimPO are variants worth knowing
- Tradeoff: DPO requires high-quality preference data; PPO can adapt to noisier signals
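The DPO loss is short enough that coding rounds often ask for it directly (see the coding section below). A minimal sketch, assuming sequence-level log-probabilities (summed over response tokens) have already been computed for the chosen and rejected responses under both the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Implicit reward = beta * (policy log-prob - reference log-prob);
    # the loss is -log sigmoid of the chosen-minus-rejected reward margin.
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()

# Usage with dummy sequence-level log-probs for a batch of 8 preference pairs.
loss = dpo_loss(torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8))
```

The beta term plays the same role as the KL coefficient in PPO-based RLHF: larger values keep the policy closer to the reference.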
Constitutional AI (Anthropic)
- AI critiques and revises its own outputs against a set of principles
- Reduces dependence on human-labeled safety data
- Iterative: revise, score, learn (the critique-and-revise step is sketched below)
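A minimal sketch of the critique-and-revise step, assuming a hypothetical generate() call for inference; the principles and prompt wording here are illustrative, not Anthropic's published constitution. Revised outputs become supervised fine-tuning targets, and AI-scored preferences over outputs feed the RL phase.

```python
# Illustrative principles; a real constitution is much larger and more specific.
PRINCIPLES = [
    "Choose the response that is most helpful while avoiding harmful content.",
    "Avoid giving instructions that facilitate dangerous activities.",
]

def generate(prompt: str) -> str:
    raise NotImplementedError("call your model / inference API here")

def critique_and_revise(user_prompt: str) -> str:
    draft = generate(user_prompt)
    for principle in PRINCIPLES:
        critique = generate(
            f"Prompt: {user_prompt}\nResponse: {draft}\n"
            f"Critique the response against this principle: {principle}"
        )
        draft = generate(
            f"Prompt: {user_prompt}\nResponse: {draft}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return draft  # revised drafts become SFT targets
```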
RLAIF (RL from AI Feedback)
- Replace human raters with a strong AI rater
- Cheaper and scalable; requires care to avoid compounding the rater model's biases (a labeling sketch follows this list)
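A sketch of the labeling step, assuming a hypothetical judge() call for the rater model; the output pairs feed reward-model or DPO training. Randomizing the A/B order is a small detail worth mentioning in interviews, since even strong raters show position bias.

```python
import random

def judge(prompt: str) -> str:
    raise NotImplementedError("call your rater model here")

def label_pair(user_prompt: str, response_a: str, response_b: str) -> dict:
    # Randomize presentation order to reduce position bias in the rater.
    first, second = (response_b, response_a) if random.random() < 0.5 else (response_a, response_b)
    verdict = judge(
        f"Prompt: {user_prompt}\n\nResponse 1:\n{first}\n\nResponse 2:\n{second}\n\n"
        "Which response is better? Answer with '1' or '2' only."
    ).strip()
    prefer_first = verdict.startswith("1")
    return {
        "prompt": user_prompt,
        "chosen": first if prefer_first else second,
        "rejected": second if prefer_first else first,
    }
```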
Reasoning model post-training
- OpenAI o1/o3, DeepSeek R1, Anthropic’s extended-thinking work
- RL on verifiable tasks (math, code) where correctness can be checked programmatically (a reward sketch follows this list)
- Test-time compute scaling — spend more compute per query in exchange for better reasoning
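A sketch of what “verifiable” means in practice for a math-style task: the reward comes from checking the final answer, not from a learned reward model. The \boxed{} extraction is an illustrative convention; code tasks would swap in a sandboxed test runner.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    # Assumes the model is prompted to put its final answer in \boxed{...}.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return match.group(1).strip() if match else None

def verifiable_reward(completion: str, reference_answer: str) -> float:
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference_answer.strip() else 0.0

# Usage:
verifiable_reward("... so the total is \\boxed{42}", "42")  # 1.0
verifiable_reward("I think the answer is 41", "42")          # 0.0
```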
Common interview questions
- “Walk me through the RLHF pipeline. Where does it usually fail?”
- “DPO vs PPO. When would you pick which?”
- “What is reward hacking? Give an example you have seen or read about.”
- “How do you collect high-quality preference data?”
- “How do you evaluate a post-trained model?”
- “Design a post-training pipeline for a coding model.”
- “Your model started refusing benign requests after a safety-training round. What do you investigate?”
Coding rounds
- Implement a simplified DPO loss
- Implement PPO’s clipping objective
- Implement a preference-data sampler with rejection or annealing
- Build a small reward-model evaluation harness
Coding rounds are typically in Python, with PyTorch fluency expected. HuggingFace TRL is the standard library for these techniques; familiarity helps.
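For the last item on that list, a minimal sketch: the standard offline check for a reward model is pairwise accuracy on held-out preference data, i.e. how often it scores the human-preferred response above the rejected one. The score() function is a hypothetical stand-in for a forward pass through your reward model.

```python
def score(prompt: str, response: str) -> float:
    raise NotImplementedError("reward-model forward pass goes here")

def pairwise_accuracy(pairs: list[dict]) -> float:
    # pairs: [{"prompt": ..., "chosen": ..., "rejected": ...}, ...]
    correct = sum(
        score(p["prompt"], p["chosen"]) > score(p["prompt"], p["rejected"])
        for p in pairs
    )
    return correct / max(len(pairs), 1)
```

A fuller harness would also slice accuracy by domain and by response length, since length bias is one of the most common reward-model failure modes.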
System design
- “Design a preference-data collection platform.”
- “Design an iterative post-training loop with human-in-the-loop review.”
- “Design an evaluation harness for safety-trained models.”
- “Design distributed PPO training over thousands of GPUs.”
Skills to brush up
- PyTorch, HuggingFace Transformers + TRL deeply
- Distributed training (FSDP, DeepSpeed, ZeRO)
- Statistics and experimental design — running offline evals correctly
- Reading papers — DPO, PPO, KTO, RLHF foundations
- Familiarity with reasoning-model training (o1/R1 family)
Compensation
Post-training engineers at frontier labs are paid in the senior-research / staff-IC band — roughly $400K–$1M+ total compensation; researchers with strong publication records can go higher. Open-source post-training maintainers often have outsized name recognition, which translates into strong leverage on offers and retention packages.
How to break in
- Train a small model end-to-end (e.g., Llama 3 8B + DPO on a public preference dataset) and write up the methodology (a starter sketch follows this list)
- Contribute to TRL or a similar open-source library
- Read and reproduce key papers (DPO, ORPO, Constitutional AI, RLAIF)
- Build a public eval set for some niche capability and demonstrate post-training improvements
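For the first item, a rough starting point loosely following the TRL DPOTrainer quickstart. Argument names (e.g. processing_class vs tokenizer) and expected dataset formats have shifted across TRL releases, and the model and dataset names below are illustrative, so treat this as a sketch and check the docs for whatever version you pin.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # or any smaller model you can fit
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Any preference dataset with prompt / chosen / rejected columns works here.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(output_dir="llama3-8b-dpo", beta=0.1)
trainer = DPOTrainer(model=model, args=args, train_dataset=dataset,
                     processing_class=tokenizer)
trainer.train()
```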
Frequently Asked Questions
Do I need a PhD?
Not strictly, but the field is research-heavy and a PhD background is common at the senior research level. For research-engineer or post-training-engineer roles (as opposed to research scientist), a strong publication or open-source track record can substitute.
How is this different from “ML engineer”?
“ML engineer” is broad — it could mean ranking, recommender systems, or classic ML at scale. Post-training is specifically about aligning generative LLMs: different math, different tooling, different evaluation methodology.
Will this specialty consolidate?
Unlikely soon. Post-training is what differentiates product quality at frontier labs, the talent pool is small, and the techniques are evolving rapidly (e.g., reasoning RL is recent). The specialty looks durable for the next few years.