Skip to content

Lesson 9: evaluating and testing agents - how to know if your agent works

You have built an agent. It has tools, memory, and a planning loop. It can retrieve information, take actions, and converse with users. But how do you know if it actually works well?

Testing an agent is fundamentally different from testing traditional software. With a regular function, you pass in inputs and check that the outputs match your expectations. With an agent, the same question can produce different valid answers, the agent might take different paths to reach the same goal, and “correct” is often subjective. The output is non-deterministic, the process is multi-step, and the quality dimensions are more nuanced than pass/fail.

This lesson covers how to evaluate agents rigorously - what to measure, how to measure it, and how to build evaluation into your development workflow so that quality improves continuously.

Ask a traditional function to add 2 + 2 and it always returns 4. Ask an agent “What should I do about our declining user retention?” and you might get a different answer every time - and multiple different answers could all be good. Temperature settings, model updates, and slight prompt variations all change the output.

An agent that books a flight might search by price first, then by time. Or it might search by time first, then filter by price. Both paths reach the same goal, but the sequence of tool calls is different. Testing that the agent took exact step X at position Y would be too rigid and would break with any reasonable change.

An agent that makes five decisions in sequence has five chances to make a mistake, and errors compound. A small mistake in step 2 might lead to a completely wrong outcome by step 5. Testing only the final output misses where things went wrong.

Is a summary “good”? Is a customer service response “helpful”? These are judgment calls that depend on context, user expectations, and organizational standards. Binary pass/fail testing is not sufficient.

Testing an agent is like evaluating a new employee during their probation period. You would not just check whether they produced the right final deliverable. You would also observe:

  • Did they achieve the goal? (Effectiveness) Did the report answer the question they were asked?
  • How efficiently did they work? (Efficiency) Did they spend three days on something that should take three hours?
  • Can they handle curveballs? (Robustness) What happens when they get unclear instructions or missing data?
  • Do they stay within appropriate boundaries? (Safety) Did they access only the systems they should? Did they follow company policy?

You would also look at their process, not just their output. If they got the right answer by sheer luck after a chaotic process, that is different from getting the right answer through a methodical, reliable approach.

Agent evaluation works the same way. You check the final output, the process, the efficiency, and the safety - and you do it systematically.

Every agent evaluation should assess four dimensions:

1. effectiveness - does it achieve the goal?

Section titled “1. effectiveness - does it achieve the goal?”

This is the most basic question: did the agent do what it was supposed to do? If you asked it to book a flight to Tokyo, did it book a flight to Tokyo?

What to measure:

  • Task completion rate (what percentage of tasks does the agent finish successfully?)
  • Correctness of the final output (is the answer right? is the action correct?)
  • User satisfaction (did the user get what they needed?)

Example metrics:

MetricHow to MeasureTarget
Task completion rateAutomated checks against expected outcomes> 90%
Answer correctnessHuman evaluation or automated fact-checking> 85%
User satisfactionPost-interaction surveys or thumbs up/down> 4.0/5.0

An agent that achieves the goal but uses 50 API calls, takes 3 minutes, and costs $2 per query might not be viable. Efficiency measures how much it costs - in time, money, and compute - to get the job done.

What to measure:

  • Latency (how long does the user wait?)
  • Token usage (how many tokens consumed per task?)
  • Number of LLM calls (how many reasoning steps?)
  • Number of tool calls (how many external API calls?)
  • Dollar cost per task

Example metrics:

MetricHow to MeasureTarget
End-to-end latencyTime from user query to final response< 10 seconds
Tokens per taskSum of input + output tokens across all LLM calls< 10,000
LLM calls per taskCount of model invocations< 5
Cost per taskToken cost + API call costs< $0.10

3. robustness - does it handle edge cases?

Section titled “3. robustness - does it handle edge cases?”

Real users do not send perfectly formatted, clear, unambiguous requests. They send typos, vague questions, contradictory instructions, and requests in unexpected formats. A robust agent handles these gracefully.

What to measure:

  • Performance on adversarial or ambiguous inputs
  • Graceful degradation (does it fail safely or crash?)
  • Recovery from errors (can it retry after a tool failure?)
  • Consistency across similar inputs (does a slight rephrasing produce a wildly different result?)

Example test cases:

Test CaseExpected Behavior
Misspelled queryAgent still understands intent
Ambiguous requestAgent asks for clarification
Tool returns an errorAgent retries or uses a fallback
Contradictory instructionsAgent flags the contradiction
Empty or null inputAgent responds gracefully, does not crash
Very long inputAgent handles within context limits

An agent with access to tools can do real damage. It might send emails it should not, delete data, or reveal sensitive information. Safety evaluation checks that the agent respects its boundaries.

What to measure:

  • Policy compliance (does the agent follow the rules defined in its system prompt?)
  • Permission boundaries (does it only use tools and access data it is authorized for?)
  • Refusal of out-of-scope requests (does it appropriately decline tasks it should not do?)
  • Data privacy (does it avoid leaking sensitive information?)

Example test cases:

Test CaseExpected Behavior
User asks agent to perform unauthorized actionAgent refuses and explains why
User tries to get agent to reveal system promptAgent declines
Agent encounters sensitive data during retrievalAgent does not include it in the response
User asks agent to take an action outside its domainAgent redirects to appropriate resource

Agent evaluation requires two categories of metrics that serve different purposes:

System metrics tell you whether the agent is running well from an infrastructure perspective. These are the metrics your SRE team cares about.

MetricWhat It Tells YouHow to Collect
Latency (p50, p95, p99)How long users waitRequest timing in your application
Error rateHow often the agent fails entirelyError counting in logs
Tokens per taskHow much compute each task requiresLLM API response metadata
Cost per taskHow much money each task costsToken counts multiplied by pricing
Tool call success rateHow reliable external integrations areTool wrapper instrumentation
ThroughputHow many requests the system handlesRequest counting

Quality metrics tell you whether the agent is doing a good job from the user’s perspective. These are harder to measure but more important.

MetricWhat It Tells YouHow to Measure
CorrectnessIs the answer right?Ground truth comparison, human evaluation
Trajectory qualityDid the agent take a reasonable path?Trajectory evaluation (see below)
HelpfulnessDid the user get what they needed?User feedback, LLM-as-a-Judge
Safety complianceDid the agent stay within bounds?Red-team testing, policy checkers
GroundednessIs the answer supported by evidence?Source attribution checking
CoherenceDoes the response make sense?LLM-as-a-Judge, human evaluation

When evaluating an agent, start from the outside and work your way in. This mirrors how users experience the agent and ensures you catch the most impactful issues first.

Start by treating the agent as a black box. Give it inputs and check the outputs. Do not look at how it arrived at the answer - just check whether the answer is correct.

How to do it:

  1. Create a test set of input-output pairs (questions and expected answers)
  2. Run each input through the agent
  3. Compare the agent’s output to the expected output
  4. Track pass/fail rates

Example test set:

InputExpected OutputPass Criteria
”What is our refund policy?”Includes 30-day window and receipt requirementContains key policy elements
”Cancel order #12345”Order is cancelled, confirmation providedOrder status changed + confirmation message
”What time does the store close?”Correct closing time for todayMatches actual hours

When this is sufficient: For agents with clear, verifiable outputs where there is one right answer. If the agent’s job is to look up facts or execute well-defined actions, end-to-end testing covers most of what you need.

When end-to-end testing is not sufficient - when you need to understand why the agent succeeded or failed - you open the box and inspect the trajectory.

A trajectory is the full sequence of the agent’s actions: every thought, tool call, observation, and decision, from receiving the input to producing the output.

Example trajectory:

1. User: "What was our revenue last quarter?"
2. Agent thinks: "I need to look up revenue data for Q2 2025"
3. Agent calls: search_financial_reports(query="Q2 2025 revenue")
4. Tool returns: [Q2 2025 Financial Summary document]
5. Agent thinks: "Found the report. Revenue was $12.4M"
6. Agent responds: "Our revenue last quarter (Q2 2025) was $12.4M,
up 8% from Q1 2025."

What to check in the trajectory:

  • Did the agent call the right tools?
  • Did it call them with the right parameters?
  • Did it interpret tool results correctly?
  • Did it avoid unnecessary steps?
  • Did it handle errors appropriately?
  • Did it stay within its authorized actions?

Why trajectory evaluation matters: Two agents might produce the same final answer, but one took a clean, efficient path while the other made several wrong turns, called irrelevant tools, and got lucky. The first agent is more reliable. Trajectory evaluation reveals this difference.

Trajectory evaluation checks the full execution path of the agent. This is one of the most powerful evaluation techniques because it catches problems that end-to-end testing misses.

ComponentDescriptionExample
User inputThe original request”Book me a flight to Tokyo next Tuesday”
Agent reasoningThe agent’s internal thoughts”I need to search for flights on March 25”
Tool callsActions the agent tooksearch_flights(destination=“Tokyo”, date=“2025-03-25”)
Tool resultsWhat the tools returnedList of 5 available flights
Agent decisionsChoices the agent madeSelected the cheapest direct flight
Final outputThe response to the user”I booked flight JL001, departing at 10:30 AM…”

You can evaluate trajectories along several dimensions:

Tool Selection Accuracy: Did the agent pick the right tool for each step?

Good: Agent needs weather data -> calls get_weather()
Bad: Agent needs weather data -> calls search_web("weather forecast")
when get_weather() is available

Parameter Correctness: Did the agent pass the right arguments to each tool?

Good: search_flights(destination="NRT", date="2025-03-25")
Bad: search_flights(destination="Tokyo", date="next Tuesday")
(did not resolve "next Tuesday" to an actual date)

Step Efficiency: Did the agent achieve the goal without unnecessary steps?

Efficient (3 steps):
1. Search flights
2. Select best option
3. Book flight
Inefficient (7 steps):
1. Search flights to Tokyo
2. Search flights to Osaka (not requested)
3. Compare Tokyo and Osaka options (not requested)
4. Search Tokyo hotels (not requested)
5. Go back to flights
6. Select a flight
7. Book flight

Error Handling: When a tool failed, did the agent recover appropriately?

Good: Tool returns error -> Agent retries with modified parameters
Good: Tool returns error -> Agent informs user and suggests alternatives
Bad: Tool returns error -> Agent hallucinates a result
Bad: Tool returns error -> Agent crashes

Human evaluation is the gold standard for quality assessment, but it is slow and expensive. LLM-as-a-Judge uses one model to evaluate another model’s output, giving you automated quality assessment that approximates human judgment.

You give a judge LLM the following:

  1. The original question or task
  2. The agent’s response (or trajectory)
  3. Evaluation criteria
  4. Optionally, a reference answer

The judge LLM then scores the response based on the criteria.

Single scoring asks the judge to rate one response on a scale (e.g., 1-5 for helpfulness). This is simple but tends to suffer from position bias and inconsistent calibration.

Pairwise comparison shows the judge two responses and asks which is better. This is more reliable because relative comparisons are easier than absolute ratings.

Recommendation: Prefer pairwise comparison when possible. It produces more consistent and actionable results.

Judge prompt:
"You are evaluating two responses to a customer question.
The customer asked: 'How do I reset my password?'
Response A:
'Click on Forgot Password on the login page, enter your email,
and follow the link in the reset email.'
Response B:
'You can reset your password through the account settings page
or by contacting our support team at support@example.com.'
Which response is more helpful and complete? Explain your reasoning
and declare a winner."
PracticeWhy It Matters
Use a strong model as judgeWeaker models make worse judgments
Provide clear evaluation criteriaVague criteria lead to inconsistent scoring
Use pairwise comparison over single scoringMore reliable and consistent
Randomize response orderPrevents position bias (models tend to prefer the first response)
Include reference answers when availableGives the judge a baseline for comparison
Validate judge scores against human scoresEnsure the judge correlates with human judgment
Run multiple judge evaluationsReduce variance by averaging across evaluations
  • Judges can have biases (verbosity bias - preferring longer responses, position bias - preferring the first option)
  • Judges may not catch factual errors if they lack domain knowledge
  • Judge quality depends on the judge model’s capabilities
  • Not a full replacement for human evaluation on high-stakes decisions

Despite the power of automated evaluation, human evaluation remains essential for certain aspects of agent quality.

  • Subjective quality: Is the tone appropriate? Is the response empathetic? Does it match the brand voice?
  • Novel situations: When the agent encounters a scenario not covered by automated tests
  • Safety-critical decisions: When the agent is about to take an action with significant consequences
  • Ground truth creation: Building the test sets that automated evaluation relies on
  • Calibrating LLM judges: Validating that your automated judges agree with human judgment

Rating scales: Have evaluators rate responses on specific dimensions (correctness, helpfulness, safety) using a defined scale (1-5 with clear descriptions for each level).

Annotation guidelines: Provide detailed guidelines with examples of what constitutes a 1, 3, and 5 on each dimension. Without this, evaluators will interpret the scale differently.

Inter-annotator agreement: Have multiple evaluators score the same responses. If they disagree significantly, your guidelines need improvement.

Example rating rubric:

ScoreCorrectness Criteria
1Answer is factually wrong or completely off-topic
2Answer has significant errors but shows some understanding
3Answer is mostly correct with minor errors or omissions
4Answer is correct and complete
5Answer is correct, complete, and includes helpful additional context

Evaluation is not a one-time activity. It is a continuous cycle that drives quality improvement over time.

Before you can measure quality, you need to define it. What does “good” mean for your agent? This is specific to your use case.

Questions to answer:

  • What are the must-have behaviors? (e.g., never reveal customer PII)
  • What are the nice-to-have behaviors? (e.g., proactively suggest related actions)
  • What are the unacceptable behaviors? (e.g., making up data, taking unauthorized actions)
  • What are the efficiency targets? (e.g., under 5 seconds, under $0.05 per query)

You cannot improve what you cannot see. Add instrumentation to capture everything the agent does.

What to instrument:

  • Every LLM call (input, output, latency, tokens)
  • Every tool call (input, output, success/failure, latency)
  • The full trajectory for each user interaction
  • User feedback (explicit ratings, implicit signals like retry behavior)
  • System metrics (error rates, throughput, costs)

Run your evaluation framework regularly - not just once at launch.

Cadence:

  • On every code change: Run automated end-to-end tests (like CI/CD tests)
  • Weekly: Review a sample of trajectories for quality
  • Monthly: Run a full evaluation suite including LLM-as-a-Judge and human evaluation
  • Quarterly: Review and update your evaluation criteria and test sets

Use evaluation results to improve the agent. This is where the flywheel spins.

Feedback loop types:

  • Prompt iteration: Evaluation reveals that the agent is too verbose -> Adjust the system prompt to be more concise
  • Tool refinement: Trajectory analysis shows the agent calling a tool with wrong parameters -> Improve the tool description
  • Test set expansion: A production failure reveals an edge case not covered by tests -> Add it to the test set
  • Model selection: Evaluation shows quality drops after a model update -> Roll back or switch models
Define Quality --> Instrument --> Evaluate --> Improve --> Define Quality
^ |
| |
+----------------------------------------------------------+
(Continuous Improvement)

To evaluate and debug agents in production, you need observability. The three pillars of observability apply to agent systems just as they do to any distributed system.

Logs are records of discrete events. For agents, every significant event should be logged.

What to log:

  • User inputs and agent outputs
  • Each step of the agent’s reasoning
  • Tool calls and their results
  • Errors and exceptions
  • Decision points (why did the agent choose path A over path B?)

Log structure example:

{
"timestamp": "2025-03-18T10:30:00Z",
"session_id": "sess_abc123",
"step": 3,
"type": "tool_call",
"tool": "search_documents",
"input": {"query": "refund policy"},
"output": {"documents": ["doc_456"], "count": 1},
"latency_ms": 230,
"status": "success"
}

Traces follow a single request through the entire system, connecting all the steps into a coherent story. This is critical for multi-step agents where a single user query might trigger dozens of LLM calls and tool invocations.

What a trace looks like:

Trace: user_query_789
|-- LLM Call 1: Parse user intent (120ms)
|-- Tool Call 1: search_orders (250ms)
|-- LLM Call 2: Evaluate results (90ms)
|-- Tool Call 2: get_order_details (180ms)
|-- LLM Call 3: Generate response (150ms)
Total: 790ms, 3 LLM calls, 2 tool calls, 4,200 tokens

Traces let you answer questions like:

  • Where did the agent spend most of its time?
  • Which tool call failed?
  • At which step did the agent’s reasoning go wrong?

Metrics are numerical measurements aggregated over time. They tell you about trends and patterns rather than individual events.

Key metrics to track:

MetricAggregationAlert Threshold
Task completion rateDaily averageBelow 85%
Average latencyp50, p95, p99 per hourp95 above 15 seconds
Error ratePer hourAbove 5%
Token costDaily totalAbove daily budget
Tool call failure ratePer tool per hourAbove 10%
User satisfactionWeekly averageBelow 3.5/5.0

Here is a practical approach to building an evaluation suite for your agent:

Build a set of 50-100 test cases that cover your agent’s core use cases, edge cases, and failure modes.

Structure each test case:

{
"id": "test_001",
"input": "What is the status of order #12345?",
"expected_output": "Order #12345 was shipped on March 15
and is expected to arrive by March 18.",
"expected_tool_calls": ["lookup_order"],
"category": "order_status",
"difficulty": "easy",
"tags": ["happy_path", "single_tool"]
}

Categories to include:

  • Happy path (common, straightforward requests)
  • Edge cases (unusual inputs, boundary conditions)
  • Error handling (tool failures, invalid inputs)
  • Safety (out-of-scope requests, attempts to bypass policies)
  • Multi-step (tasks requiring multiple tool calls)

For each test case, define automated pass/fail criteria:

  • Exact match: The output must contain specific strings (good for factual lookups)
  • Semantic similarity: The output must be semantically similar to the expected answer (good for open-ended responses)
  • Tool call verification: The agent must call specific tools with specific parameters
  • Negative checks: The output must NOT contain certain strings (good for safety testing)

For test cases where automated checks are insufficient, add LLM-as-a-Judge evaluation:

Judge prompt template:
"Evaluate the following agent response on a scale of 1-5
for each dimension:
User question: {question}
Agent response: {response}
Reference answer: {reference}
Dimensions:
1. Correctness (1-5): Is the information accurate?
2. Completeness (1-5): Does it address all parts of the question?
3. Helpfulness (1-5): Would the user find this useful?
4. Safety (1-5): Does it stay within appropriate bounds?
Provide a score and one-sentence justification for each dimension."

Integrate your evaluation suite into your continuous integration pipeline:

  • On every pull request: Run the golden test set with automated checks
  • Nightly: Run the full evaluation suite including LLM-as-a-Judge
  • On model changes: Run the complete suite and compare against the previous model’s scores

Store evaluation results in a database or spreadsheet and track trends:

  • Is task completion rate improving or declining?
  • Are there categories where quality is dropping?
  • How do scores change when you update the prompt or model?
  • Are there new failure patterns emerging?

Google Cloud provides tools for evaluating agents at scale:

Vertex AI’s evaluation capabilities let you run structured evaluations on agent outputs. You can define evaluation criteria, run evaluations at scale, and track results over time.

The Google Agent Development Kit evaluation framework provides built-in support for testing agents during development. It integrates with the ADK’s agent definition format and supports both automated checks and LLM-as-a-Judge evaluation.

If your test set only contains well-formatted, clear, unambiguous inputs, you are not testing what happens in the real world. At least 30% of your test cases should cover edge cases, error conditions, and adversarial inputs.

An agent that produces the right answer through a wrong process is unreliable. It might have gotten lucky. Always evaluate trajectories alongside final outputs.

If your test set never changes, it becomes stale. New features, new failure modes, and new user patterns all require new test cases. Review and update your test set monthly.

An agent that is 95% correct but costs $5 per query and takes 30 seconds is not production-ready for most use cases. Always include efficiency metrics in your evaluation.

Agent behavior can change in production due to different data, higher load, model updates, and real user inputs that differ from your test cases. Monitor agent quality continuously in production, not just during development.

Without a baseline, you cannot tell if your agent is improving. Before making changes, always measure current performance so you have a point of comparison.

Build an evaluation suite for an agent of your choice:

  1. Define quality dimensions: List 3-5 specific quality criteria for your agent (e.g., correctness, tone, efficiency).

  2. Create a golden test set: Write 20 test cases covering happy path (10), edge cases (5), and safety (5). Include expected outputs and pass/fail criteria.

  3. Implement automated evaluation: Build a script that:

    • Runs each test case through the agent
    • Checks automated pass/fail criteria
    • Uses LLM-as-a-Judge for subjective dimensions
    • Produces a summary report
  4. Evaluate trajectory quality: For 5 test cases, capture the full trajectory and evaluate:

    • Were the right tools called?
    • Were parameters correct?
    • Was the path efficient?
  5. Document findings: Write up what you learned about your agent’s strengths and weaknesses.

  • Agent evaluation is harder than testing traditional software because outputs are non-deterministic, many paths can be valid, and quality is often subjective.
  • Evaluate four pillars: effectiveness (does it work?), efficiency (at what cost?), robustness (does it handle edge cases?), and safety (does it stay in bounds?).
  • Track both system metrics (latency, error rate, cost) and quality metrics (correctness, helpfulness, safety compliance).
  • Use the outside-in approach: start with black-box end-to-end tests, then open up to inspect trajectories when you need to understand why.
  • Trajectory evaluation checks the full execution path - the tools called, parameters used, and decisions made - not just the final answer.
  • LLM-as-a-Judge automates quality assessment. Prefer pairwise comparison over single scoring for more reliable results.
  • Human evaluation remains essential for subjective quality, novel situations, and calibrating automated judges.
  • Build the quality flywheel: define quality, instrument for visibility, evaluate the process, architect feedback loops, and repeat.
  • Observability through logs, traces, and metrics gives you the data needed to evaluate and debug agents in production.