Eval-driven development for agent behavior

Deterministic code is verified with ordinary unit tests. Agent and LLM behavior is probabilistic — the same input can yield different valid (or invalid) outputs across runs — so it requires an eval harness that scores behavior over many trials. Treat evals as the test suite for the non-deterministic parts of the system.

When evals apply (and when they do not)

  • MUST use ordinary deterministic tests for parsers, schema validation, tool-call argument construction, retries, and any logic with a fixed correct output. Do not wrap deterministic code in an LLM judge.
  • MUST use an eval harness when the artifact under test is an agent's decisions, an LLM's generated text, multi-step tool use, or RAG answer quality — outputs that vary run to run.
  • SHOULD isolate the deterministic and probabilistic layers so each is tested with the cheaper appropriate method.
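
A minimal sketch of that separation follows. It assumes a hypothetical `build_tool_call` helper for the deterministic layer and a hypothetical `fake_agent` function standing in for a real model call; neither name comes from any particular library. The argument builder gets an ordinary assertion, while the agent's tool choice is scored as a pass rate over repeated trials.

```python
import random


def build_tool_call(city: str, unit: str = "celsius") -> dict:
    """Deterministic layer: fixed input gives fixed output, so a plain assert suffices."""
    if unit not in ("celsius", "fahrenheit"):
        raise ValueError(f"unsupported unit: {unit}")
    return {"name": "get_weather", "arguments": {"city": city.strip(), "unit": unit}}


def fake_agent(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a model call; its tool choice varies run to run."""
    return "get_weather" if rng.random() < 0.9 else "search_web"


def pass_rate(prompt: str, expected_tool: str, trials: int, seed: int = 0) -> float:
    """Probabilistic layer: score the behavior over many trials, not one run."""
    rng = random.Random(seed)
    hits = sum(fake_agent(prompt, rng) == expected_tool for _ in range(trials))
    return hits / trials


# Deterministic layer: ordinary unit test, no judge involved.
assert build_tool_call(" Oslo ") == {
    "name": "get_weather",
    "arguments": {"city": "Oslo", "unit": "celsius"},
}

# Probabilistic layer: report a rate across trials.
rate = pass_rate("What's the weather in Oslo?", "get_weather", trials=200)
print(f"tool-selection pass rate: {rate:.2f}")
```

In a real harness the stand-in would be replaced by the actual agent call, and the deterministic test would stay in the regular unit-test suite where it runs cheaply on every commit.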

Choosing a grader

| Grader type | Strengths | Limits | Use when |
|---|---|---|---|
| Code-based (assertion, regex, exact/structured match) | Fast, cheap, reproducible, easy to debug | Brittle to valid phrasing variations | Output is objectively checkable |
| Model-based (LLM-as-judge) | Flexible, scalable, captures nuance, handles open-ended tasks | Non-deterministic, costs tokens, can be biased | Output is open-ended or subjective |

  • MUST prefer a code-based grader whenever the success criterion can be expressed in code.
  • SHOULD grade each quality dimension (e.g., groundedness, coverage, tone) with a separate, isolated LLM-judge rather than one judge scoring everything at once.
  • SHOULD use a judge model deliberately different from the model under test to reduce self-preference bias (the bias is documented, but its magnitude is still an open research question).
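
The two grader styles can be sketched as below, under stated assumptions. `grade_structured` is an illustrative code-based grader that compares parsed JSON rather than raw text, which avoids some of the brittleness of exact string matching. `grade_dimensions` shows one way to keep each quality dimension behind its own judge; the two stub judges are hypothetical placeholders for separate LLM-judge calls, not real judges.

```python
import json
from typing import Callable, Dict

Judge = Callable[[str], float]  # takes an answer, returns a score in [0, 1]


def grade_structured(output: str, expected: dict) -> bool:
    """Code-based grader: parse, then compare structure (key order and spacing ignored)."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return parsed == expected


def grade_dimensions(answer: str, judges: Dict[str, Judge]) -> Dict[str, float]:
    """Run one isolated judge per dimension and keep the scores separate."""
    return {name: judge(answer) for name, judge in judges.items()}


# Hypothetical stubs; in practice each would be its own LLM-judge prompt.
def stub_groundedness(answer: str) -> float:
    return 1.0 if "according to" in answer.lower() else 0.0


def stub_tone(answer: str) -> float:
    return 0.0 if "!" in answer else 1.0


expected = {"intent": "refund", "order_id": 1042}
assert grade_structured('{"order_id": 1042, "intent": "refund"}', expected)
assert not grade_structured("Sure! The intent is refund.", expected)

scores = grade_dimensions(
    "According to the policy, refunds take 5 days.",
    {"groundedness": stub_groundedness, "tone": stub_tone},
)
print(scores)
```

Keeping the per-dimension scores in a dictionary, rather than averaging them immediately, makes it easier to see which dimension regressed.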

Calibrating an LLM judge

A model-based grader MUST NOT be allowed to gate (block a merge, fail a build, accept a release) until it has been calibrated against a human-labeled gold set.

  • MUST assemble a human-labeled gold set, run the judge over it, and measure agreement (e.g., divergence rate or correlation) before trusting the judge.
  • MUST re-calibrate when the judge prompt, the judge model, or the rubric changes.
  • SHOULD instruct the judge to return "Unknown" or abstain when it lacks evidence, rather than guessing.
  • SHOULD periodically read raw transcripts and grades; a passing aggregate score can hide a judge that is rejecting valid answers or rubber-stamping bad ones.
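
The calibration check above can be sketched roughly as follows. The function names, the label strings, and the thresholds (`min_agreement=0.9`, `max_abstain=0.2`) are illustrative assumptions, not values from this guideline; a judge abstention ("Unknown") is represented as `None` and counted separately rather than as a disagreement.

```python
from typing import Dict, List, Optional


def calibration_report(human: List[str], judge: List[Optional[str]]) -> Dict[str, float]:
    """Compare judge labels with human gold labels; None means the judge abstained."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    decided = [(h, j) for h, j in zip(human, judge) if j is not None]
    agreement = sum(h == j for h, j in decided) / len(decided) if decided else 0.0
    return {
        "n": float(len(human)),
        "abstain_rate": (len(human) - len(decided)) / len(human),
        "agreement": agreement,
        "divergence_rate": 1.0 - agreement,
    }


def may_gate(report: Dict[str, float], min_agreement: float = 0.9,
             max_abstain: float = 0.2) -> bool:
    """Assumed policy: the judge may gate only if it agrees with humans often enough."""
    return report["agreement"] >= min_agreement and report["abstain_rate"] <= max_abstain


human_labels = ["pass", "fail", "pass", "pass", "fail"]
judge_labels = ["pass", "fail", "fail", None, "fail"]

report = calibration_report(human_labels, judge_labels)
print(report)
print("judge may gate:", may_gate(report))
```

Simple agreement is only one option; a chance-corrected statistic or a correlation may suit imbalanced or graded labels better. Whichever measure is used, it needs to be recomputed whenever the judge prompt, judge model, or rubric changes.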

Reliability over multiple runs

A single passing run proves nothing about probabilistic behavior.

  • MUST report results across multiple runs of the same input, not one pass.
  • SHOULD distinguish "succeeds at least once" from "succeeds every time" — these are different product guarantees. Anthropic's eval guidance frames these as pass@k (success in at least one of k attempts) and pass^k (all k trials succeed); pick the metric that matches your reliability requirement.
  • SHOULD track the consistency metric over time so regressions in flakiness are visible, not just regressions in best-case quality.
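
As a rough illustration of the two metrics, the sketch below computes empirical pass@k and pass^k from recorded trial outcomes. The task names and outcomes are made-up sample data. The two `expected_*` helpers additionally assume that trials are independent with a fixed per-trial success rate `p`, which real agent runs may not satisfy.

```python
from typing import Dict, List


def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of tasks passing at least once (pass@k) and every time (pass^k)."""
    if not results or any(len(trials) == 0 for trials in results.values()):
        raise ValueError("every task needs at least one trial")
    n = len(results)
    return {
        "pass@k": sum(any(trials) for trials in results.values()) / n,
        "pass^k": sum(all(trials) for trials in results.values()) / n,
    }


def expected_pass_at_k(p: float, k: int) -> float:
    """Chance of at least one success in k independent trials (assumption: i.i.d.)."""
    return 1.0 - (1.0 - p) ** k


def expected_pass_hat_k(p: float, k: int) -> float:
    """Chance that all k independent trials succeed (assumption: i.i.d.)."""
    return p ** k


# Made-up outcomes: three tasks, three trials each.
results = {
    "task-a": [True, True, True],
    "task-b": [True, False, True],
    "task-c": [False, False, False],
}
print(summarize(results))

# A 90% per-trial success rate looks very different under the two guarantees.
print(round(expected_pass_at_k(0.9, 5), 5), round(expected_pass_hat_k(0.9, 5), 5))
```

The gap between the two numbers is the point: a behavior that almost always succeeds within five attempts can still fail to succeed five times in a row about 40% of the time under these assumptions.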

Holding out a test set

  • MUST keep a held-out test set that is not used while iterating on prompts, rubrics, or the judge — otherwise you are tuning to the eval and overstating quality.
  • SHOULD treat the gold set used for judge calibration and the held-out behavior test set as distinct; do not let one leak into the other.
  • SHOULD version eval datasets alongside the code so a result is reproducible against a known dataset revision.
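
One possible way to implement these rules is sketched below. It assumes each example is a dictionary with a stable `id` field; the function names, the 20% hold-out fraction, and the 12-character revision length are illustrative choices. The split is derived from a hash of the id so it stays stable as examples are added, and the revision string is a content hash that can be recorded next to each eval result.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def split_of(example_id: str, holdout_fraction: float = 0.2) -> str:
    """Assign an example to 'heldout' or 'dev' deterministically from its id."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "heldout" if bucket < holdout_fraction else "dev"


def dataset_revision(examples: List[Dict]) -> str:
    """Content hash of the dataset, independent of example order."""
    ordered = sorted(examples, key=lambda e: e["id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def assert_disjoint(gold_ids: Iterable[str], heldout_ids: Iterable[str]) -> None:
    """Fail loudly if the judge-calibration gold set leaks into the held-out set."""
    overlap = set(gold_ids) & set(heldout_ids)
    if overlap:
        raise ValueError(f"gold and held-out sets overlap: {sorted(overlap)}")


examples = [
    {"id": "ex-001", "input": "Cancel my order", "expected": "cancel"},
    {"id": "ex-002", "input": "Where is my parcel?", "expected": "track"},
]
print({e["id"]: split_of(e["id"]) for e in examples})
print("dataset revision:", dataset_revision(examples))
assert_disjoint(["gold-01", "gold-02"], ["ex-001", "ex-002"])
```

Recording the revision string with every reported score makes it possible to tell whether a change in results came from the system or from the dataset.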

version: 1.0.0
tags: evals, ai, testing
author: Mike Fullerton
modified: 2026-06-09

Change History

| Version | Date | Author | Summary |
|---|---|---|---|
| 1.0.0 | 2026-06-09 | Mike Fullerton | Initial creation |