Tool-call evaluation

When an agent has tools (function calling, MCP servers), correctness covers not just the final answer but also how the agent got there. You MUST evaluate whether the agent selected the right tool, supplied correct arguments, called tools in the right order, and stopped when the task was done.

What to score

A tool-call eval SHOULD report each dimension separately so failures are diagnosable:

Dimension	Question	Signal
Tool selection	Did it call the correct tool (and avoid wrong/no-op calls)?	accuracy, false-call rate
Argument correctness	Are argument names, types, and values right?	exact/semantic match
Ordering	Were dependent calls made in a valid sequence?	trajectory match
Stopping	Did it stop instead of looping or over-calling?	extra-call count
Task completion	Did the end-to-end task succeed?	pass/fail
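
As a rough illustration of reporting the dimensions above separately, the sketch below scores one case and returns one value per dimension. The `ToolCall` type, the `score_dimensions` function, and the exact scoring definitions (for example, pairing calls by tool name) are assumptions made for this example, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    # Hypothetical record of one tool invocation.
    name: str
    args: dict


def score_dimensions(expected, actual, task_succeeded):
    """Score one case on each dimension separately (illustrative definitions)."""
    expected_names = [c.name for c in expected]
    actual_names = [c.name for c in actual]
    wanted = set(expected_names)

    # Tool selection: share of expected tools that were called at all,
    # plus a count of calls to tools the case did not expect.
    selection = len(wanted & set(actual_names)) / len(wanted) if wanted else 1.0
    false_calls = sum(1 for name in actual_names if name not in wanted)

    # Argument correctness: pair each expected call with the first unused
    # actual call of the same name, then compare arguments exactly.
    unused = list(actual)
    matched = correct = 0
    for exp in expected:
        for i, act in enumerate(unused):
            if act.name == exp.name:
                matched += 1
                if act.args == exp.args:
                    correct += 1
                del unused[i]
                break
    arguments = correct / matched if matched else None

    # Ordering: expected tool names must appear in the actual calls
    # in the same relative order (subsequence check).
    remaining = iter(actual_names)
    ordering = all(name in remaining for name in expected_names)

    # Stopping: calls made beyond the expected trajectory length.
    extra_calls = max(0, len(actual) - len(expected))

    return {
        "tool_selection": selection,
        "false_calls": false_calls,
        "argument_correctness": arguments,
        "ordering": ordering,
        "extra_calls": extra_calls,
        "task_completion": bool(task_succeeded),
    }


# Made-up case: the agent books the wrong flight, then repeats the call.
expected = [
    ToolCall("search_flights", {"dest": "LIS"}),
    ToolCall("book_flight", {"flight_id": "TP123"}),
]
actual = [
    ToolCall("search_flights", {"dest": "LIS"}),
    ToolCall("book_flight", {"flight_id": "TP999"}),
    ToolCall("book_flight", {"flight_id": "TP999"}),
]
scores = score_dimensions(expected, actual, task_succeeded=False)
```

In this made-up case the selection and ordering scores look healthy while argument correctness and the extra-call count expose the failure, which is the reason for keeping the dimensions apart.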

Build a trajectory suite

Each test case MUST pin an input and the expected tool-call trajectory (tool names plus expected arguments). Treat this suite as a fixed, versioned regression set.
Argument checks SHOULD allow semantic equivalence where exact match is too strict (e.g., equivalent date formats), but MUST stay strict on identifiers, units, and destructive parameters.
Score against the trajectory, not just the final output: an agent can reach a correct answer through an unsafe or wasteful path.
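
One possible shape for such a test case and its comparison is sketched below. The `TrajectoryCase` type, the `semantic_date_args` field, the accepted date formats, and the tool and argument names are all hypothetical. Only arguments explicitly listed as dates get the looser comparison; everything else, including identifiers, is compared exactly.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Assumed set of date spellings treated as equivalent in this sketch.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")


def parse_date(value):
    """Return a date if value matches one of DATE_FORMATS, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None


@dataclass
class TrajectoryCase:
    case_id: str
    user_input: str
    expected_calls: list  # list of (tool_name, args_dict) pairs
    semantic_date_args: set = field(default_factory=set)


def args_match(expected, actual, semantic_date_args):
    """Exact match on every argument except those listed as dates."""
    if expected.keys() != actual.keys():
        return False
    for key, want in expected.items():
        got = actual[key]
        if key in semantic_date_args:
            want_date = parse_date(str(want))
            if want_date is None or want_date != parse_date(str(got)):
                return False
        elif want != got:
            return False
    return True


def trajectory_matches(case, actual_calls):
    """True if the actual calls equal the pinned trajectory, call by call."""
    if len(actual_calls) != len(case.expected_calls):
        return False
    return all(
        exp_name == act_name
        and args_match(exp_args, act_args, case.semantic_date_args)
        for (exp_name, exp_args), (act_name, act_args)
        in zip(case.expected_calls, actual_calls)
    )


case = TrajectoryCase(
    case_id="hotel-001",
    user_input="Book hotel H-42 for 9 June 2026",
    expected_calls=[("book_hotel", {"hotel_id": "H-42", "check_in": "2026-06-09"})],
    semantic_date_args={"check_in"},
)

same_date = [("book_hotel", {"hotel_id": "H-42", "check_in": "2026/06/09"})]
wrong_id = [("book_hotel", {"hotel_id": "h-42", "check_in": "2026-06-09"})]
```

Here `trajectory_matches(case, same_date)` accepts the alternative date spelling, while `trajectory_matches(case, wrong_id)` rejects the call because the identifier differs in case.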

Error-recovery cases

The suite MUST include cases where a tool returns an error, an empty result, or a timeout, and assert the agent recovers (retries sensibly, picks a fallback, or reports failure) rather than hallucinating success.
Include cases where the correct action is to call no tool, to confirm the agent does not invoke tools spuriously.
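
A minimal sketch of an error-recovery check follows, assuming the agent under test can be called with a tool and returns a report containing a `status` field. The `FlakyTool` stub, the report shape, the outcome labels, and both toy agents are invented for illustration and stand in for a real agent harness.

```python
class FlakyTool:
    """Stub tool that times out a fixed number of times, then succeeds."""

    def __init__(self, failures, result):
        self.failures = failures
        self.result = result
        self.calls = 0

    def __call__(self, **kwargs):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("stubbed timeout")
        return self.result


def classify_recovery(agent, tool, max_calls=3):
    """Run the agent against the stub and label how it handled the errors."""
    report = agent(tool)  # assumed shape: {"status": "success" | "failure", ...}
    tool_succeeded = tool.calls > tool.failures
    if tool.calls > max_calls:
        return "over_called"
    if report["status"] == "success" and not tool_succeeded:
        # The agent claimed success although no tool call ever returned.
        return "hallucinated_success"
    if report["status"] == "success":
        return "recovered"
    return "reported_failure"


def retrying_agent(tool, attempts=3):
    """Toy agent: retries a bounded number of times, then reports failure."""
    for _ in range(attempts):
        try:
            return {"status": "success", "answer": tool(query="status")}
        except TimeoutError:
            continue
    return {"status": "failure", "answer": None}


def overconfident_agent(tool):
    """Toy agent: invents an answer when the tool fails."""
    try:
        answer = tool(query="status")
    except TimeoutError:
        answer = "all systems normal"  # fabricated result
    return {"status": "success", "answer": answer}


recovered = classify_recovery(retrying_agent, FlakyTool(1, "ok"))
gave_up = classify_recovery(retrying_agent, FlakyTool(5, "ok"))
invented = classify_recovery(overconfident_agent, FlakyTool(1, "ok"))
```

Both `recovered` and `reported_failure` would count as acceptable outcomes under the rule above; `hallucinated_success` and `over_called` would fail the case. The stub's `calls` counter can also back a no-tool case by asserting that it stays at zero.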

Test long compositions explicitly

Tool-calling accuracy degrades as chain length and the number of available tools grow (a pattern visible across function-calling benchmarks such as the Berkeley Function-Calling Leaderboard). You MUST NOT assume short-chain pass rates generalize to long chains.
Add multi-step cases that exercise the realistic tool count the agent ships with, and report accuracy as a function of chain length.
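
Reporting accuracy as a function of chain length can be as simple as bucketing case results by the number of expected calls. The sketch below assumes each result is a `(chain_length, passed)` pair; the function name and the sample results are made up and do not reflect any measured data.

```python
from collections import defaultdict


def accuracy_by_chain_length(results):
    """Map chain length -> (pass rate, case count) from (length, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # length -> [passes, cases]
    for length, passed in results:
        totals[length][0] += int(passed)
        totals[length][1] += 1
    return {
        length: (passes / cases, cases)
        for length, (passes, cases) in sorted(totals.items())
    }


# Made-up results for illustration only.
sample_results = [
    (1, True), (1, True),
    (3, True), (3, False),
    (5, False), (5, False),
]
report = accuracy_by_chain_length(sample_results)
```

Keeping the case count next to each rate makes it clear when a long-chain bucket is too small to support a conclusion.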

Practices

Run each case multiple times. Tool selection is non-deterministic, so you SHOULD report pass rates with trial counts, not a single run.
Gate releases on the suite and track per-dimension scores over time so a regression in argument correctness is not masked by stable task-completion numbers.
Keep cases hermetic: stub or record tool responses so the eval measures the agent, not live backend flakiness.
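
The three practices above might be sketched as follows: a pass rate reported with its trial count, a per-dimension regression check against a baseline, and a tool stub that replays recorded responses. All names (`pass_rate`, `regressed_dimensions`, `RecordedTool`), the tolerance value, and the sample numbers are assumptions for this example.

```python
def pass_rate(run_case, trials):
    """Run one case repeatedly; run_case(trial_index) returns True on a pass."""
    passes = sum(1 for i in range(trials) if run_case(i))
    return {"passes": passes, "trials": trials, "rate": passes / trials}


def regressed_dimensions(baseline, current, tolerance=0.02):
    """Return dimensions whose score dropped by more than tolerance."""
    return sorted(
        dim for dim, old in baseline.items()
        if old - current.get(dim, 0.0) > tolerance
    )


class RecordedTool:
    """Replays recorded responses keyed by call arguments; never hits a backend."""

    def __init__(self, recordings):
        self.recordings = recordings

    def __call__(self, **kwargs):
        key = tuple(sorted(kwargs.items()))
        if key not in self.recordings:
            raise KeyError(f"no recording for arguments {kwargs}")
        return self.recordings[key]


# Stand-in for a non-deterministic agent: fails on every fourth trial.
summary = pass_rate(lambda i: i % 4 != 3, trials=8)

# Made-up scores: task completion is stable while argument correctness drops.
baseline = {"argument_correctness": 0.95, "task_completion": 0.90}
current = {"argument_correctness": 0.80, "task_completion": 0.90}
failing = regressed_dimensions(baseline, current)

weather = RecordedTool({(("city", "Lisbon"),): {"temp_c": 21}})
reading = weather(city="Lisbon")
```

A release gate built this way would block on `failing` being non-empty even though the task-completion number has not moved.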

Change History

Version	Date	Author	Summary
1.0.0	2026-06-09	Mike Fullerton	Initial creation