“How do you evaluate an LLM?” is now a standard interview question at companies building AI products. It tests whether you understand that LLM evaluation is fundamentally harder than classical ML evaluation — there’s no single accuracy score, and what you measure depends entirely on what you’re building.
Why LLM Evaluation Is Hard
- No ground truth for open-ended generation: There are many correct answers to “explain gradient descent” — how do you measure correctness?
- Metrics are gameable: A model that scores well on MMLU may still fail on your specific use case.
- Capability != alignment: A model might know the right answer but refuse to give it, or give it in a harmful way.
- Benchmark contamination: LLMs trained on web data may have seen benchmark test sets during pre-training.
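Contamination is often estimated by checking whether benchmark items share long n-grams with the training corpus. The sketch below is a toy version of that idea (production contamination analyses, such as the one in the GPT-3 paper, use longer n-grams and deduplication infrastructure over terabyte-scale corpora; the function names here are illustrative):

```python
# Toy contamination check: flag benchmark items sharing an n-gram with the corpus.
def ngrams(text: str, n: int = 8) -> set[str]:
    """All word-level n-grams of a text, lowercased."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(benchmark_items: list[str], corpus_ngrams: set[str], n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_ngrams)
    return flagged / len(benchmark_items)

# Example with a tiny "corpus" and two benchmark items
corpus = ngrams("gradient descent updates parameters in the direction of steepest descent", n=4)
items = [
    "gradient descent updates parameters in the negative gradient direction",  # overlaps
    "what is the capital of france",                                           # clean
]
print(contamination_rate(items, corpus, n=4))  # 0.5
```

A flagged item is not proof of memorization, only of textual overlap; flagged items are usually inspected or excluded before reporting scores.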
Evaluation Taxonomy
1. Benchmark-Based Evaluation (Automated)
General capability benchmarks:
| Benchmark | What it measures | Limitation |
|---|---|---|
| MMLU (Massive Multitask Language Understanding) | 57 subjects: science, law, history, math | Multiple choice; susceptible to contamination |
| HumanEval / MBPP | Code generation: functional correctness of generated Python | Only tests Python; limited problem diversity |
| GSM8K / MATH | Grade school to competition math | Math is a narrow slice of reasoning |
| HellaSwag / WinoGrande | Commonsense reasoning | Models have largely saturated these benchmarks |
| TruthfulQA | Whether models give truthful answers vs plausible-sounding false ones | Static benchmark; adversarial examples can be constructed |
Task-specific automated metrics:
- BLEU / ROUGE: n-gram overlap with reference text. Useful for translation and summarization where a reference exists, but penalizes creative valid answers.
- BERTScore: Semantic similarity between generated and reference text using BERT embeddings. Better than BLEU for meaning preservation.
- Pass@k (coding): Probability that at least 1 of k generated solutions passes unit tests. Standard for code generation.
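The overlap limitation of BLEU/ROUGE is easy to see in a minimal ROUGE-1 F1 sketch (real evaluations would use a library such as `rouge-score`; this is just the unigram-overlap core):

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between candidate and reference (ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# A perfectly valid paraphrase scores poorly because it shares few surface words:
print(rouge1_f("weights move opposite the gradient",
               "parameters are updated against the gradient direction"))  # ~0.33
```

Both sentences describe the same update rule, yet the score is low: overlap metrics reward surface form, not meaning.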
A reference implementation of the pass@k estimator:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """
    Calculate pass@k: probability that at least 1 of k samples passes.
    n: total number of samples generated
    c: number of correct samples
    k: number of samples to report on
    Uses the unbiased estimator from the HumanEval paper.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: generated 20 solutions, 7 pass unit tests; report pass@1 and pass@10
n, c = 20, 7
print(f"pass@1:  {pass_at_k(n, c, 1):.3f}")   # 0.350
print(f"pass@10: {pass_at_k(n, c, 10):.3f}")  # 0.998
```
2. LLM-as-Judge Evaluation
Use a stronger LLM (GPT-4, Claude Opus) to evaluate outputs from the model under test. This scales to open-ended tasks where automated metrics fail.
```python
import json
from openai import OpenAI

def evaluate_with_llm_judge(
    question: str,
    response: str,
    criteria: list[str],
    judge_model: str = "gpt-4o",
) -> dict:
    """
    Use an LLM as a judge to evaluate response quality.
    Returns scores and reasoning for each criterion.
    """
    client = OpenAI()
    criteria_text = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria))
    prompt = f"""You are an expert evaluator. Score the following response.

Question: {question}

Response: {response}

Evaluate on these criteria (score 1-5 each, where 5 is best):
{criteria_text}

Respond in JSON format:
{{
  "scores": {{"criterion_name": score, ...}},
  "reasoning": {{"criterion_name": "brief explanation", ...}},
  "overall_score": <weighted average as a number>
}}"""
    completion = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,  # deterministic for reproducibility
    )
    return json.loads(completion.choices[0].message.content)

# Usage
result = evaluate_with_llm_judge(
    question="Explain the attention mechanism in transformers.",
    response="...",  # Model output to evaluate
    criteria=[
        "Accuracy: Is the explanation technically correct?",
        "Clarity: Is it easy to understand for a senior engineer?",
        "Completeness: Does it cover query/key/value and softmax?",
        "Conciseness: Is it appropriately brief without being superficial?",
    ],
)
```
LLM-as-judge limitations:
- Position bias: judges favor the first option in pairwise comparisons
- Verbosity bias: longer answers are rated higher regardless of quality
- Self-enhancement bias: GPT-4 rates GPT-4 outputs higher
- Mitigation: swap order in pairwise eval; use calibration prompts; use multiple judges
3. Human Evaluation
For final validation and subjective quality dimensions:
Side-by-side ranking (SxS): Show annotators outputs from two models (A vs B, blinded). Ask which is better overall, and on specific dimensions. Compute win rate: model A wins X% of comparisons.
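Win rates from a few hundred comparisons carry real sampling noise, so they are usually reported with a confidence interval. A minimal sketch using the Wilson score interval (one standard choice for binomial proportions):

```python
import math

def win_rate_ci(wins: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Win rate with a Wilson score interval (z=1.96 -> ~95% confidence)."""
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, center - half, center + half

# 60 wins out of 100 comparisons: the interval still includes ~50%,
# so this alone is weak evidence that model A is better.
rate, lo, hi = win_rate_ci(60, 100)
print(f"win rate {rate:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With only 100 comparisons, a 60% win rate has a lower bound near 50%: a useful reminder that SxS studies need enough samples before declaring a winner.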
Absolute scoring: Rate on a 1-5 Likert scale. More interpretable but harder to agree on anchor points.
Red-teaming: Specialized annotators attempt to elicit harmful, biased, or incorrect outputs. Safety-critical evaluation.
Annotation quality:
- Use an inter-annotator agreement (IAA) measure such as Fleiss’ kappa to quantify label consistency across annotators
- Kappa > 0.6 is acceptable; > 0.8 is strong
- For subjective tasks, low kappa may reflect genuine disagreement, not annotation error
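Fleiss’ kappa operates on a matrix of category counts per item. A compact sketch, assuming every item is rated by the same number of annotators:

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """
    Fleiss' kappa for an (items x categories) count matrix, where
    ratings[i, j] = number of annotators who assigned item i to category j.
    Assumes an equal number of raters per item.
    """
    n_raters = ratings.sum(axis=1)[0]
    p_j = ratings.sum(axis=0) / ratings.sum()  # category marginals
    # Per-item observed agreement
    p_i = ((ratings**2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()        # mean observed agreement
    p_e = (p_j**2).sum()      # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# 3 annotators, 3 items, 2 categories, perfect agreement -> kappa = 1.0
print(fleiss_kappa(np.array([[3, 0], [0, 3], [3, 0]])))  # 1.0
```

In practice a maintained implementation such as `statsmodels.stats.inter_rater.fleiss_kappa` is preferable; the sketch just makes the observed-vs-chance-agreement structure explicit.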
4. RAGAS for RAG Systems
The RAGAS library provides reference-free metrics for each stage of a RAG pipeline:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Is the answer grounded in retrieved context?
    answer_relevancy,   # Is the answer relevant to the question?
    context_precision,  # Are retrieved chunks actually useful?
    context_recall,     # Did retrieval find all relevant information?
)

def evaluate_rag_system(questions, ground_truths, answers, contexts):
    data = Dataset.from_dict({
        "question": questions,
        "ground_truth": ground_truths,
        "answer": answers,
        "contexts": contexts,  # list of retrieved chunks per question
    })
    result = evaluate(
        dataset=data,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    return result.to_pandas()
```
Evaluation for Production LLM Applications
For production systems, automated benchmarks are not enough. Use:
- Golden test sets: 100-500 hand-curated examples with verified answers. Run before every deployment.
- Regression suites: Every reported bug gets a test case. Pass rate must not drop.
- Shadow evaluation: Log production queries; sample and evaluate offline; track quality metrics over time.
- User feedback signals: Thumbs up/down, correction behavior, abandonment rate as proxy quality metrics.
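A golden-set deployment gate can be as simple as the sketch below. Both `model_answer` and the `must_contain` check are illustrative placeholders; real suites usually combine exact checks, LLM-as-judge scoring, and per-case metadata:

```python
# Hedged sketch of a golden-set regression gate. `model_answer(question)`
# is a hypothetical callable wrapping the system under test.
def run_golden_set(model_answer, golden_set: list[dict], min_pass_rate: float = 0.95) -> dict:
    """
    golden_set: [{"question": ..., "must_contain": ...}, ...]
    Returns the pass rate and whether deployment should proceed.
    """
    passed = sum(
        1 for case in golden_set
        if case["must_contain"].lower() in model_answer(case["question"]).lower()
    )
    rate = passed / len(golden_set)
    return {"pass_rate": rate, "deploy_ok": rate >= min_pass_rate}

# Wire this into CI so a quality regression blocks the release:
fake_model = lambda q: "gradient descent minimizes the loss iteratively"
report = run_golden_set(fake_model, [
    {"question": "explain gd", "must_contain": "gradient"},
    {"question": "explain attention", "must_contain": "softmax"},
])
print(report)  # {'pass_rate': 0.5, 'deploy_ok': False}
```

The important property is the hard gate: a drop below the threshold fails the pipeline rather than silently shipping a regression.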
Red-Teaming and Safety Evaluation
- Jailbreak attempts: Does the model comply with harmful requests via prompt injection, role-playing, or indirect framing?
- Bias evaluation: Does the model exhibit demographic bias in recommendations, hiring, or risk assessments?
- Hallucination rate: On verifiable factual questions, how often does the model state false information confidently?
- Tools: Garak (LLM vulnerability scanner), PromptBench, and the NIST AI Risk Management Framework (AI RMF) for structured risk assessment
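A red-team harness ultimately reduces to running a prompt set and classifying responses. The sketch below uses a deliberately naive refusal-string heuristic; production setups use trained safety classifiers or human review, since string matching misses paraphrased compliance and polite partial refusals. `model_answer` is a hypothetical callable:

```python
# Naive refusal heuristic -- illustrative only; real evaluations use
# safety classifiers or human annotators instead of string matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able")

def jailbreak_success_rate(model_answer, harmful_prompts: list[str]) -> float:
    """Fraction of harmful prompts the model complied with (did not refuse)."""
    complied = sum(
        1 for p in harmful_prompts
        if not any(m in model_answer(p).lower() for m in REFUSAL_MARKERS)
    )
    return complied / len(harmful_prompts)

# A model that always refuses scores 0.0; lower is better on this metric.
always_refuses = lambda p: "I can't help with that."
print(jailbreak_success_rate(always_refuses, ["probe 1", "probe 2"]))  # 0.0
```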
Depth Levels
Junior: Name several benchmarks (MMLU, HumanEval), explain why BLEU is insufficient for LLM evaluation.
Senior: Design an evaluation framework for a production RAG system. Explain LLM-as-judge trade-offs. Implement pass@k.
Staff: Design a continuous evaluation system that catches quality regressions across model versions, handles benchmark contamination, measures alignment vs capability separately, and integrates human evaluation at scale.
Related ML Topics
- What is RAG? — RAGAS evaluates RAG pipelines specifically: faithfulness, context precision, context recall — all metrics covered in this guide
- Fine-tuning LLMs vs Training from Scratch — evaluation is the gating step before and after fine-tuning; MMLU and task-specific benchmarks measure pre/post capability
- How Does RLHF Work? — RLHF training requires human evaluation at scale; reward model quality directly depends on annotation quality and inter-annotator agreement
- How Transformer Models Work — understanding architecture helps interpret benchmark results: MMLU tests knowledge stored in weights, HumanEval tests reasoning in context
- AI Ethics and Fairness — red-teaming and bias evaluation are part of the LLM evaluation framework; TruthfulQA and fairness audits address safety alongside capability