“Eval engineer” went from a niche role at AI labs in 2023 to a named specialty at most AI-shipping companies by 2026. The job: build the test sets, automated graders, and continuous evaluation infrastructure that lets a team know whether their AI feature got better or worse this week. The interview process is distinct from both ML engineering and traditional SDE, and worth understanding if you want to work in this space.
What the role actually does
- Designs evaluation datasets that match real user behavior
- Implements automated graders (rule-based, model-based, human-in-the-loop); a minimal sketch follows this list
- Builds the eval infrastructure: harnesses, dashboards, regression detection
- Owns the bar for shipping new prompts, models, or features
- Partners with research, product, and applied engineering
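The grading work is the core of the job. Here is a minimal sketch of the rule-based-plus-model-graded pattern, with a hypothetical `call_llm` helper standing in for whichever model API you use; human review would sit behind whatever the model judge cannot settle confidently.

```python
import re
from typing import Callable, Optional

def _norm(text: str) -> str:
    """Collapse whitespace and case so trivial formatting differences don't fail a case."""
    return re.sub(r"\s+", " ", text.strip().lower())

def rule_based_grade(answer: str, expected: str) -> Optional[bool]:
    """Cheap deterministic checks first; return None when the rules cannot decide."""
    if _norm(answer) == _norm(expected):
        return True
    if not answer.strip():  # empty answers always fail
        return False
    return None  # ambiguous: escalate to the model grader

def model_grade(answer: str, expected: str, call_llm: Callable[[str], str]) -> bool:
    """Fall back to an LLM judge for anything the rules cannot settle."""
    prompt = (
        "Does the ANSWER convey the same meaning as the REFERENCE? Reply PASS or FAIL.\n"
        f"REFERENCE: {expected}\nANSWER: {answer}"
    )
    return call_llm(prompt).strip().upper().startswith("PASS")

def hybrid_grade(answer: str, expected: str, call_llm: Callable[[str], str]) -> bool:
    verdict = rule_based_grade(answer, expected)
    return verdict if verdict is not None else model_grade(answer, expected, call_llm)
```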
The companies hiring
- AI labs: OpenAI, Anthropic, Google DeepMind, Meta — all have dedicated eval teams
- AI-shipping product companies: Notion, Linear, Cursor, Replit, GitHub, Stripe — eval roles inside their AI teams
- Eval-tooling startups: Braintrust, Humanloop, LangSmith (LangChain), Patronus, others
- Enterprise AI: Gradient, Scale AI, Surge AI
The interview process
- Recruiter screen: standard, plus some probing of how you think about evals
- Technical phone screen: a coding problem (Python) plus eval-specific design (e.g., “design an eval for X”)
- Take-home (2–4 hours): build a small eval harness for a given task, often with model-based grading included
- Virtual onsite: 2 coding rounds, 1 eval system design, 1 eval data design, 1 behavioral
- Cycle time: 3–5 weeks
Eval system design — the unique round
Common prompts:
- “Design an eval for a customer-support chatbot.”
- “Design an eval for a code-completion model.”
- “Design an eval for a RAG system answering medical questions.”
What interviewers want to hear:
- Distinct dataset slices: typical, edge cases, adversarial, regression cases from past failures
- Metrics that match user value, not vanity metrics (BLEU on chat is meaningless)
- Grading methodology: rule-based, model-graded, human-graded, hybrid
- How you would detect regressions across model versions
- How you would handle non-determinism and statistical significance (a sketch follows this list)
- How you would catch data contamination and label leakage
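A minimal sketch of the regression and significance points, assuming each case has already been graded several times per model version so that non-determinism shows up as ordinary sampling noise; the test here is a plain two-proportion z-test.

```python
import math

def compare_pass_rates(passes_a: int, n_a: int, passes_b: int, n_b: int) -> dict:
    """Two-proportion z-test: is model B's pass rate a real change from model A's?

    Run each eval case several times per model and count every run, so that
    noise from non-deterministic outputs is folded into n_a and n_b.
    """
    p_a, p_b = passes_a / n_a, passes_b / n_b
    p_pooled = (passes_a + passes_b) / (n_a + n_b)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se if se > 0 else 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value from the normal CDF
    return {"pass_rate_a": p_a, "pass_rate_b": p_b, "z": z, "p_value": p_value}

# Example: 5 runs over a 200-case set for each model version.
result = compare_pass_rates(passes_a=840, n_a=1000, passes_b=870, n_b=1000)
if result["p_value"] < 0.05:
    print(f"Significant change: {result['pass_rate_a']:.1%} -> {result['pass_rate_b']:.1%}")
else:
    print("Difference is within noise; do not ship on this evidence alone.")
```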
Eval data design — the second unique round
Common prompts:
- “You have 100K user interactions. How do you build a 500-item eval set from them?”
- “Your eval shows 95% pass rate but users complain. What is wrong?”
- “You need to evaluate a new model release this week. How do you decide if it is better?”
Strong answers discuss stratified sampling, importance weighting (the long-tail edge cases matter most), regression cases, and the difference between aggregate metrics and segment-level metrics.
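A minimal sketch of the stratified-sampling answer to the first prompt; the slice names and quotas are hypothetical and would come from your own traffic analysis and failure taxonomy.

```python
import random
from collections import defaultdict

def build_eval_set(interactions: list[dict], total: int = 500, seed: int = 0) -> list[dict]:
    """Stratified sample that over-weights rare, high-risk slices.

    Each interaction is assumed to carry a 'slice' label (from clustering or
    heuristics). Quotas here are illustrative; set them from product priorities.
    """
    quotas = {
        "typical": 0.40,           # head traffic
        "edge_case": 0.25,         # long tail, over-weighted relative to raw frequency
        "adversarial": 0.15,
        "past_regression": 0.20,   # cases that broke a previous release
    }
    by_slice = defaultdict(list)
    for item in interactions:
        by_slice[item["slice"]].append(item)

    rng = random.Random(seed)
    eval_set = []
    for slice_name, share in quotas.items():
        pool = by_slice.get(slice_name, [])
        k = min(len(pool), round(total * share))
        eval_set.extend(rng.sample(pool, k))
    return eval_set
```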
Coding rounds
Usually Python-heavy, leaning toward:
- Data manipulation (pandas, SQL)
- API design (the eval harness)
- Statistics (confidence intervals, significance testing)
- Async patterns (running many evals in parallel; see the sketch below)
Less LeetCode-heavy than ML engineer interviews; more practical engineering with a stats and eval flavor.
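The async pattern is the one most worth rehearsing. A minimal sketch of running eval cases in parallel with bounded concurrency, where `grade_case` is a hypothetical async grader you would supply:

```python
import asyncio

async def run_evals(cases: list[dict], grade_case, max_concurrency: int = 16) -> float:
    """Grade all cases concurrently, capped by a semaphore to respect API rate limits."""
    if not cases:
        return 0.0
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(case: dict) -> bool:
        async with semaphore:
            try:
                return await grade_case(case)
            except Exception:
                return False  # count grader errors as failures rather than crashing the run

    results = await asyncio.gather(*(run_one(c) for c in cases))
    return sum(results) / len(results)  # overall pass rate

# Usage: asyncio.run(run_evals(cases, grade_case=my_async_grader))
```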
Skills you should have
- Strong Python and SQL
- Comfort with statistical reasoning (sampling, error bars, A/B testing)
- Familiarity with at least one LLM eval framework (OpenAI Evals, lm-eval-harness, Braintrust, LangSmith)
- Solid understanding of the difference between offline eval, online eval, and red-teaming
- Some ML literacy (you do not need to train models, but you need to read papers)
Compensation
Eval engineering at AI labs (Anthropic, OpenAI, DeepMind) pays in the senior-engineer band — competitive with applied research engineering, often slightly below frontier-research roles. At AI-shipping product companies the comp matches that company’s senior SDE band. At eval-tooling startups the comp is mid-tier with significant equity upside.
How to break in
- Build a public eval harness for a task you understand and publish it on GitHub with clear methodology (a minimal skeleton follows this list)
- Contribute to lm-eval-harness or HELM
- Read the Anthropic and OpenAI eval-related blog posts; the methodology is increasingly in the open
- Take the take-home seriously; eval take-homes are an excellent showcase for exactly the skills the role uses
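A minimal skeleton for that public harness, assuming a generic `generate` callable for whichever model you are evaluating and a hypothetical `cases.json` dataset; the methodology write-up in the README matters as much as the code.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    case_id: str
    prompt: str
    expected: str
    slice: str  # e.g. "typical", "edge_case", "adversarial"

def run_harness(cases: list[EvalCase], generate: Callable[[str], str]) -> dict:
    """Run every case, grade with a simple containment check, report per-slice pass rates."""
    per_slice: dict[str, list[bool]] = {}
    for case in cases:
        output = generate(case.prompt)
        passed = case.expected.lower() in output.lower()  # swap in a real grader here
        per_slice.setdefault(case.slice, []).append(passed)
    report = {s: sum(v) / len(v) for s, v in per_slice.items()}
    total = sum(len(v) for v in per_slice.values())
    report["overall"] = sum(sum(v) for v in per_slice.values()) / total
    return report

if __name__ == "__main__":
    # cases.json is a hypothetical file of {case_id, prompt, expected, slice} rows.
    cases = [EvalCase(**row) for row in json.load(open("cases.json"))]
    print(run_harness(cases, generate=lambda p: "stub output"))
```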
Frequently Asked Questions
Is this a research role or an engineering role?
Engineering with strong methodology. You ship infrastructure and datasets, not papers. Some labs blend the two; most AI-shipping companies treat it as engineering.
Do I need an ML PhD?
No. Most eval engineers come from SDE backgrounds with strong stats fluency. Reading papers is enough ML exposure; training models is not required for most roles.
Is this role going to disappear as evals get better?
Unlikely. As AI features expand, the eval surface grows, and the methodology gets more complex (multi-turn, agentic, multi-modal). Eval engineering looks like a durable specialty.