How to Evaluate an LLM: Benchmarks, Human Eval, and Red-Teaming

“How do you evaluate an LLM?” is now a standard interview question at companies building AI products. It tests whether you understand that LLM evaluation is fundamentally harder than classical ML evaluation — there’s no single accuracy score, and what you measure depends entirely on what you’re building.

Why LLM Evaluation Is Hard

  • No ground truth for open-ended generation: There are many correct answers to “explain gradient descent” — how do you measure correctness?
  • Metrics are gameable: A model that scores well on MMLU may still fail on your specific use case.
  • Capability != alignment: A model might know the right answer but refuse to give it, or give it in a harmful way.
  • Benchmark contamination: LLMs trained on web data may have seen benchmark test sets during pre-training.
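One rough heuristic for spotting contamination is n-gram overlap between a candidate training corpus and the benchmark items. A minimal sketch, assuming word-level tokenization; the function name and the 8-gram window are illustrative choices, not a standard tool:

```python
def contaminated_fraction(corpus: str, test_items: list[str], n: int = 8) -> float:
    """Fraction of test items that share at least one word-level n-gram
    with the training corpus -- a crude contamination signal."""
    corpus_tokens = corpus.split()
    corpus_ngrams = {
        tuple(corpus_tokens[i:i + n])
        for i in range(len(corpus_tokens) - n + 1)
    }

    def overlaps(item: str) -> bool:
        tokens = item.split()
        return any(
            tuple(tokens[i:i + n]) in corpus_ngrams
            for i in range(len(tokens) - n + 1)
        )

    hits = sum(overlaps(item) for item in test_items)
    return hits / len(test_items) if test_items else 0.0
```

Real contamination studies use larger windows, deduplicated corpora, and fuzzy matching, but even this crude check catches verbatim test-set leakage.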

Evaluation Taxonomy

1. Benchmark-Based Evaluation (Automated)

General capability benchmarks:

| Benchmark | What it measures | Limitation |
|---|---|---|
| MMLU (Massive Multitask Language Understanding) | 57 subjects: science, law, history, math | Multiple choice; susceptible to contamination |
| HumanEval / MBPP | Code generation: functional correctness of generated Python | Only tests Python; limited problem diversity |
| GSM8K / MATH | Grade-school to competition math | Math is a narrow slice of reasoning |
| HellaSwag / WinoGrande | Commonsense reasoning | Models have largely saturated these benchmarks |
| TruthfulQA | Whether models give truthful answers vs plausible-sounding false ones | Static benchmark; adversarial examples can be constructed |

Task-specific automated metrics:

  • BLEU / ROUGE: n-gram overlap with reference text. Useful for translation and summarization where a reference exists, but penalizes creative valid answers.
  • BERTScore: Semantic similarity between generated and reference text using BERT embeddings. Better than BLEU for meaning preservation.
  • Pass@k (coding): Probability that at least 1 of k generated solutions passes unit tests. Standard for code generation.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """
    Calculate pass@k: probability that at least 1 of k samples passes.

    n: total number of samples generated
    c: number of correct samples
    k: number of samples to report on

    Uses unbiased estimator from the HumanEval paper.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(
        1.0 - k / np.arange(n - c + 1, n + 1)
    )

# Example: generated 20 solutions, 7 pass unit tests, report pass@1 and pass@10
n, c = 20, 7
print(f"pass@1: {pass_at_k(n, c, 1):.3f}")   # ~0.35
print(f"pass@10: {pass_at_k(n, c, 10):.3f}") # ~0.998

2. LLM-as-Judge Evaluation

Use a stronger LLM (GPT-4, Claude Opus) to evaluate outputs from the model under test. This scales to open-ended tasks where automated metrics fail.

import json

from openai import OpenAI

def evaluate_with_llm_judge(
    question: str,
    response: str,
    criteria: list[str],
    judge_model: str = "gpt-4o"
) -> dict:
    """
    Use an LLM as a judge to evaluate response quality.
    Returns scores and reasoning for each criterion.
    """
    client = OpenAI()

    criteria_text = "\n".join(
        f"{i+1}. {c}" for i, c in enumerate(criteria)
    )

    prompt = f"""You are an expert evaluator. Score the following response.

Question: {question}

Response: {response}

Evaluate on these criteria (score 1-5 each, where 5 is best):
{criteria_text}

Respond in JSON format:
{{
  "scores": {{"criterion_name": score, ...}},
  "reasoning": {{"criterion_name": "brief explanation", ...}},
  "overall_score": <weighted average as a float>
}}"""

    completion = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0  # Deterministic for reproducibility
    )

    return json.loads(completion.choices[0].message.content)

# Usage
result = evaluate_with_llm_judge(
    question="Explain the attention mechanism in transformers.",
    response="...",  # Model output to evaluate
    criteria=[
        "Accuracy: Is the explanation technically correct?",
        "Clarity: Is it easy to understand for a senior engineer?",
        "Completeness: Does it cover query/key/value and softmax?",
        "Conciseness: Is it appropriately brief without being superficial?"
    ]
)

LLM-as-judge limitations:

  • Position bias: judges favor the first option in pairwise comparisons
  • Verbosity bias: longer answers are rated higher regardless of quality
  • Self-enhancement bias: GPT-4 rates GPT-4 outputs higher
  • Mitigation: swap order in pairwise eval; use calibration prompts; use multiple judges
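The order-swap mitigation can be sketched in a few lines. Here `judge` is a stand-in for any pairwise comparator (e.g. an LLM call) that returns "first" or "second"; the signature is an assumption for illustration:

```python
def debiased_pairwise(judge, question: str, a: str, b: str) -> str:
    """Run a pairwise judge twice with the candidate order swapped.
    Count a win only when both orderings agree; otherwise call it a tie.
    This cancels out position bias at the cost of extra judge calls."""
    first = judge(question, a, b)    # a shown in the first position
    second = judge(question, b, a)   # b shown in the first position
    if first == "first" and second == "second":
        return "A"
    if first == "second" and second == "first":
        return "B"
    return "tie"
```

A judge that always prefers the first position now produces ties instead of inflating one model's win rate.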

3. Human Evaluation

For final validation and subjective quality dimensions:

Side-by-side ranking (SxS): Show annotators outputs from two models (A vs B, blinded). Ask which is better overall, and on specific dimensions. Compute win rate: model A wins X% of comparisons.
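Win rates from a few hundred comparisons carry sampling noise, so it helps to report an interval alongside the point estimate. A small sketch using a normal-approximation interval, with ties counted as half a win (a common but not universal convention); the z = 1.96 factor assumes 95% coverage:

```python
import math

def win_rate_ci(wins: int, ties: int, losses: int, z: float = 1.96):
    """Win rate for model A with ties counted as half a win,
    plus a normal-approximation confidence interval (clipped to [0, 1])."""
    n = wins + ties + losses
    p = (wins + 0.5 * ties) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)
```

With 60 wins out of 100 comparisons the interval is roughly 0.50 to 0.70, which is a useful reminder that "A wins 60%" on a small eval set may not separate the models at all.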

Absolute scoring: Rate on a 1-5 Likert scale. More interpretable but harder to agree on anchor points.

Red-teaming: Specialized annotators attempt to elicit harmful, biased, or incorrect outputs; essential for safety-critical applications.

Annotation quality:

  • Measure inter-annotator agreement (IAA), e.g. with Fleiss’ kappa for multiple annotators, to check label consistency
  • Kappa > 0.6 is acceptable; > 0.8 is strong
  • For subjective tasks, low kappa may reflect genuine disagreement, not annotation error
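Fleiss’ kappa can be computed directly from a matrix of annotation counts. A minimal sketch, where rows are items, columns are categories, and each entry counts the annotators who chose that category (every row must sum to the same number of annotators):

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa. counts[i, j] = number of annotators who assigned
    item i to category j; all rows must sum to the same n annotators."""
    n = counts.sum(axis=1)[0]                      # annotators per item
    # Per-item agreement: fraction of annotator pairs that agree
    p_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))
    p_bar = p_i.mean()                             # observed agreement
    p_j = counts.sum(axis=0) / counts.sum()        # category marginals
    p_e = (p_j ** 2).sum()                         # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement yields kappa = 1; agreement at chance level yields 0; values below 0 mean annotators agree less than chance would predict.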

4. RAGAS for RAG Systems

For retrieval-augmented generation, RAGAS scores the pipeline on four axes: whether the answer is faithful to and relevant for the question, and whether retrieval itself was precise and complete.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Is the answer grounded in retrieved context?
    answer_relevancy,   # Is the answer relevant to the question?
    context_precision,  # Are retrieved chunks actually useful?
    context_recall      # Did retrieval find all relevant information?
)
from datasets import Dataset

def evaluate_rag_system(questions, ground_truths, answers, contexts):
    data = Dataset.from_dict({
        "question": questions,
        "ground_truth": ground_truths,
        "answer": answers,
        "contexts": contexts  # List of retrieved chunks per question
    })

    result = evaluate(
        dataset=data,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
    )

    return result.to_pandas()

Evaluation for Production LLM Applications

For production systems, automated benchmarks are not enough. Use:

  • Golden test sets: 100-500 hand-curated examples with verified answers. Run before every deployment.
  • Regression suites: Every reported bug gets a test case. Pass rate must not drop.
  • Shadow evaluation: Log production queries; sample and evaluate offline; track quality metrics over time.
  • User feedback signals: Thumbs up/down, correction behavior, abandonment rate as proxy quality metrics.
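A golden-set gate can be as simple as running every curated example before deploy and failing on any regression. A minimal sketch; the `generate` callable and the substring-match grading are placeholders for your model and real grading logic:

```python
def run_golden_set(generate, golden: list[dict]) -> dict:
    """Run the model over hand-curated examples and report failures.
    golden: [{"prompt": ..., "expected": ...}, ...]"""
    failures = []
    for case in golden:
        output = generate(case["prompt"])
        if case["expected"] not in output:   # placeholder grading: substring match
            failures.append(case["prompt"])
    return {
        "pass_rate": 1 - len(failures) / len(golden),
        "failures": failures,
    }
```

In practice the grading step is the hard part; exact-match works for closed-form answers, while open-ended answers typically need an LLM-as-judge check against the verified reference.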

Red-Teaming and Safety Evaluation

  • Jailbreak attempts: Does the model comply with harmful requests via prompt injection, role-playing, or indirect framing?
  • Bias evaluation: Does the model exhibit demographic bias in recommendations, hiring, or risk assessments?
  • Hallucination rate: On verifiable factual questions, how often does the model state false information confidently?
  • Tools: Garak (LLM vulnerability scanner), PromptBench, AI Risk Assessment Framework (NIST AI RMF)
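A basic jailbreak regression harness just replays a library of attack prompts and flags any response that does not look like a refusal. A sketch with a deliberately naive refusal check; real scanners like Garak use far richer detectors, and the prompts and marker strings here are placeholders:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able")

def red_team(model, attack_prompts: list[str]) -> list[str]:
    """Return the attack prompts whose responses do NOT look like refusals."""
    compliant = []
    for prompt in attack_prompts:
        response = model(prompt).lower()
        if not any(marker in response for marker in REFUSAL_MARKERS):
            compliant.append(prompt)   # potential jailbreak: flag for human review
    return compliant
```

Keyword matching over-flags (a refusal phrased differently) and under-flags (a compliant answer that quotes a refusal), so flagged cases should feed into human review rather than a hard pass/fail gate.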

Depth Levels

Junior: Name several benchmarks (MMLU, HumanEval), explain why BLEU is insufficient for LLM evaluation.

Senior: Design an evaluation framework for a production RAG system. Explain LLM-as-judge trade-offs. Implement pass@k.

Staff: Design a continuous evaluation system that catches quality regressions across model versions, handles benchmark contamination, measures alignment vs capability separately, and integrates human evaluation at scale.

Related ML Topics

  • What is RAG? — RAGAS evaluates RAG pipelines specifically: faithfulness, context precision, context recall — all metrics covered in this guide
  • Fine-tuning LLMs vs Training from Scratch — evaluation is the gating step before and after fine-tuning; MMLU and task-specific benchmarks measure pre/post capability
  • How Does RLHF Work? — RLHF training requires human evaluation at scale; reward model quality directly depends on annotation quality and inter-annotator agreement
  • How Transformer Models Work — understanding architecture helps interpret benchmark results: MMLU tests knowledge stored in weights, HumanEval tests reasoning in context
  • AI Ethics and Fairness — red-teaming and bias evaluation are part of the LLM evaluation framework; TruthfulQA and fairness audits address safety alongside capability