Lesson 9: evaluating and testing agents - how to know if your agent works

Introduction

You have built an agent. It has tools, memory, and a planning loop. It can retrieve information, take actions, and converse with users. But how do you know if it actually works well?

Testing an agent is fundamentally different from testing traditional software. With a regular function, you pass in inputs and check that the outputs match your expectations. With an agent, the same question can produce different valid answers, the agent might take different paths to reach the same goal, and “correct” is often subjective. The output is non-deterministic, the process is multi-step, and the quality dimensions are more nuanced than pass/fail.

This lesson covers how to evaluate agents rigorously - what to measure, how to measure it, and how to build evaluation into your development workflow so that quality improves continuously.

Why agent evaluation is hard

Non-Deterministic Outputs

Ask a traditional function to add 2 + 2 and it always returns 4. Ask an agent “What should I do about our declining user retention?” and you might get a different answer every time - and multiple different answers could all be good. Temperature settings, model updates, and slight prompt variations all change the output.

Many valid paths

An agent that books a flight might search by price first, then by time. Or it might search by time first, then filter by price. Both paths reach the same goal, but the sequence of tool calls is different. Testing that the agent took exact step X at position Y would be too rigid and would break with any reasonable change.

Compound errors

An agent that makes five decisions in sequence has five chances to make a mistake, and errors compound. A small mistake in step 2 might lead to a completely wrong outcome by step 5. Testing only the final output misses where things went wrong.

Subjective quality

Is a summary “good”? Is a customer service response “helpful”? These are judgment calls that depend on context, user expectations, and organizational standards. Binary pass/fail testing is not sufficient.

ELI5: the new employee analogy

Testing an agent is like evaluating a new employee during their probation period. You would not just check whether they produced the right final deliverable. You would also observe:

Did they achieve the goal? (Effectiveness) Did the report answer the question they were asked?
How efficiently did they work? (Efficiency) Did they spend three days on something that should take three hours?
Can they handle curveballs? (Robustness) What happens when they get unclear instructions or missing data?
Do they stay within appropriate boundaries? (Safety) Did they access only the systems they should? Did they follow company policy?

You would also look at their process, not just their output. If they got the right answer by sheer luck after a chaotic process, that is different from getting the right answer through a methodical, reliable approach.

Agent evaluation works the same way. You check the final output, the process, the efficiency, and the safety - and you do it systematically.

Four pillars of agent quality

Every agent evaluation should assess four dimensions:

1. effectiveness - does it achieve the goal?

This is the most basic question: did the agent do what it was supposed to do? If you asked it to book a flight to Tokyo, did it book a flight to Tokyo?

What to measure:

Task completion rate (what percentage of tasks does the agent finish successfully?)
Correctness of the final output (is the answer right? is the action correct?)
User satisfaction (did the user get what they needed?)

Example metrics:

Metric	How to Measure	Target
Task completion rate	Automated checks against expected outcomes	> 90%
Answer correctness	Human evaluation or automated fact-checking	> 85%
User satisfaction	Post-interaction surveys or thumbs up/down	> 4.0/5.0

2. efficiency - at what cost?

An agent that achieves the goal but uses 50 API calls, takes 3 minutes, and costs $2 per query might not be viable. Efficiency measures how much it costs - in time, money, and compute - to get the job done.

What to measure:

Latency (how long does the user wait?)
Token usage (how many tokens consumed per task?)
Number of LLM calls (how many reasoning steps?)
Number of tool calls (how many external API calls?)
Dollar cost per task

Example metrics:

Metric	How to Measure	Target
End-to-end latency	Time from user query to final response	< 10 seconds
Tokens per task	Sum of input + output tokens across all LLM calls	< 10,000
LLM calls per task	Count of model invocations	< 5
Cost per task	Token cost + API call costs	< $0.10

3. robustness - does it handle edge cases?

Real users do not send perfectly formatted, clear, unambiguous requests. They send typos, vague questions, contradictory instructions, and requests in unexpected formats. A robust agent handles these gracefully.

What to measure:

Performance on adversarial or ambiguous inputs
Graceful degradation (does it fail safely or crash?)
Recovery from errors (can it retry after a tool failure?)
Consistency across similar inputs (does a slight rephrasing produce a wildly different result?)

Example test cases:

Test Case	Expected Behavior
Misspelled query	Agent still understands intent
Ambiguous request	Agent asks for clarification
Tool returns an error	Agent retries or uses a fallback
Contradictory instructions	Agent flags the contradiction
Empty or null input	Agent responds gracefully, does not crash
Very long input	Agent handles within context limits

4. safety - does it stay within bounds?

An agent with access to tools can do real damage. It might send emails it should not, delete data, or reveal sensitive information. Safety evaluation checks that the agent respects its boundaries.

What to measure:

Policy compliance (does the agent follow the rules defined in its system prompt?)
Permission boundaries (does it only use tools and access data it is authorized for?)
Refusal of out-of-scope requests (does it appropriately decline tasks it should not do?)
Data privacy (does it avoid leaking sensitive information?)

Example test cases:

Test Case	Expected Behavior
User asks agent to perform unauthorized action	Agent refuses and explains why
User tries to get agent to reveal system prompt	Agent declines
Agent encounters sensitive data during retrieval	Agent does not include it in the response
User asks agent to take an action outside its domain	Agent redirects to appropriate resource

System metrics vs. quality metrics

Agent evaluation requires two categories of metrics that serve different purposes:

System metrics (operational health)

System metrics tell you whether the agent is running well from an infrastructure perspective. These are the metrics your SRE team cares about.

Metric	What It Tells You	How to Collect
Latency (p50, p95, p99)	How long users wait	Request timing in your application
Error rate	How often the agent fails entirely	Error counting in logs
Tokens per task	How much compute each task requires	LLM API response metadata
Cost per task	How much money each task costs	Token counts multiplied by pricing
Tool call success rate	How reliable external integrations are	Tool wrapper instrumentation
Throughput	How many requests the system handles	Request counting

Quality metrics (output goodness)

Quality metrics tell you whether the agent is doing a good job from the user’s perspective. These are harder to measure but more important.

Metric	What It Tells You	How to Measure
Correctness	Is the answer right?	Ground truth comparison, human evaluation
Trajectory quality	Did the agent take a reasonable path?	Trajectory evaluation (see below)
Helpfulness	Did the user get what they needed?	User feedback, LLM-as-a-Judge
Safety compliance	Did the agent stay within bounds?	Red-team testing, policy checkers
Groundedness	Is the answer supported by evidence?	Source attribution checking
Coherence	Does the response make sense?	LLM-as-a-Judge, human evaluation

The Outside-In Approach

When evaluating an agent, start from the outside and work your way in. This mirrors how users experience the agent and ensures you catch the most impactful issues first.

Level 1: Black-Box End-to-End Testing

Start by treating the agent as a black box. Give it inputs and check the outputs. Do not look at how it arrived at the answer - just check whether the answer is correct.

How to do it:

Create a test set of input-output pairs (questions and expected answers)
Run each input through the agent
Compare the agent’s output to the expected output
Track pass/fail rates

Example test set:

Input	Expected Output	Pass Criteria
”What is our refund policy?”	Includes 30-day window and receipt requirement	Contains key policy elements
”Cancel order #12345”	Order is cancelled, confirmation provided	Order status changed + confirmation message
”What time does the store close?”	Correct closing time for today	Matches actual hours

When this is sufficient: For agents with clear, verifiable outputs where there is one right answer. If the agent’s job is to look up facts or execute well-defined actions, end-to-end testing covers most of what you need.

Level 2: glass-box trajectory evaluation

When end-to-end testing is not sufficient - when you need to understand why the agent succeeded or failed - you open the box and inspect the trajectory.

A trajectory is the full sequence of the agent’s actions: every thought, tool call, observation, and decision, from receiving the input to producing the output.

Example trajectory:

1. User: "What was our revenue last quarter?"
2. Agent thinks: "I need to look up revenue data for Q2 2025"
3. Agent calls: search_financial_reports(query="Q2 2025 revenue")
4. Tool returns: [Q2 2025 Financial Summary document]
5. Agent thinks: "Found the report. Revenue was $12.4M"
6. Agent responds: "Our revenue last quarter (Q2 2025) was $12.4M,
   up 8% from Q1 2025."

What to check in the trajectory:

Did the agent call the right tools?
Did it call them with the right parameters?
Did it interpret tool results correctly?
Did it avoid unnecessary steps?
Did it handle errors appropriately?
Did it stay within its authorized actions?

Why trajectory evaluation matters: Two agents might produce the same final answer, but one took a clean, efficient path while the other made several wrong turns, called irrelevant tools, and got lucky. The first agent is more reliable. Trajectory evaluation reveals this difference.

Trajectory evaluation in detail

Trajectory evaluation checks the full execution path of the agent. This is one of the most powerful evaluation techniques because it catches problems that end-to-end testing misses.

What a trajectory contains

Component	Description	Example
User input	The original request	”Book me a flight to Tokyo next Tuesday”
Agent reasoning	The agent’s internal thoughts	”I need to search for flights on March 25”
Tool calls	Actions the agent took	search_flights(destination=“Tokyo”, date=“2025-03-25”)
Tool results	What the tools returned	List of 5 available flights
Agent decisions	Choices the agent made	Selected the cheapest direct flight
Final output	The response to the user	”I booked flight JL001, departing at 10:30 AM…”

Evaluating trajectories

You can evaluate trajectories along several dimensions:

Tool Selection Accuracy: Did the agent pick the right tool for each step?

Good: Agent needs weather data -> calls get_weather()
Bad:  Agent needs weather data -> calls search_web("weather forecast")
      when get_weather() is available

Parameter Correctness: Did the agent pass the right arguments to each tool?

Good: search_flights(destination="NRT", date="2025-03-25")
Bad:  search_flights(destination="Tokyo", date="next Tuesday")
      (did not resolve "next Tuesday" to an actual date)

Step Efficiency: Did the agent achieve the goal without unnecessary steps?

Efficient (3 steps):
  1. Search flights
  2. Select best option
  3. Book flight

Inefficient (7 steps):
  1. Search flights to Tokyo
  2. Search flights to Osaka (not requested)
  3. Compare Tokyo and Osaka options (not requested)
  4. Search Tokyo hotels (not requested)
  5. Go back to flights
  6. Select a flight
  7. Book flight

Error Handling: When a tool failed, did the agent recover appropriately?

Good: Tool returns error -> Agent retries with modified parameters
Good: Tool returns error -> Agent informs user and suggests alternatives
Bad:  Tool returns error -> Agent hallucinates a result
Bad:  Tool returns error -> Agent crashes

LLM-as-a-Judge

Human evaluation is the gold standard for quality assessment, but it is slow and expensive. LLM-as-a-Judge uses one model to evaluate another model’s output, giving you automated quality assessment that approximates human judgment.

How it works

You give a judge LLM the following:

The original question or task
The agent’s response (or trajectory)
Evaluation criteria
Optionally, a reference answer

The judge LLM then scores the response based on the criteria.

Single scoring vs. pairwise comparison

Single scoring asks the judge to rate one response on a scale (e.g., 1-5 for helpfulness). This is simple but tends to suffer from position bias and inconsistent calibration.

Pairwise comparison shows the judge two responses and asks which is better. This is more reliable because relative comparisons are easier than absolute ratings.

Recommendation: Prefer pairwise comparison when possible. It produces more consistent and actionable results.

Pairwise comparison example

Judge prompt:
"You are evaluating two responses to a customer question.
The customer asked: 'How do I reset my password?'

Response A:
'Click on Forgot Password on the login page, enter your email,
and follow the link in the reset email.'

Response B:
'You can reset your password through the account settings page
or by contacting our support team at support@example.com.'

Which response is more helpful and complete? Explain your reasoning
and declare a winner."

Best practices for llm-as-a-judge

Practice	Why It Matters
Use a strong model as judge	Weaker models make worse judgments
Provide clear evaluation criteria	Vague criteria lead to inconsistent scoring
Use pairwise comparison over single scoring	More reliable and consistent
Randomize response order	Prevents position bias (models tend to prefer the first response)
Include reference answers when available	Gives the judge a baseline for comparison
Validate judge scores against human scores	Ensure the judge correlates with human judgment
Run multiple judge evaluations	Reduce variance by averaging across evaluations

Limitations of LLM-as-a-Judge

Judges can have biases (verbosity bias - preferring longer responses, position bias - preferring the first option)
Judges may not catch factual errors if they lack domain knowledge
Judge quality depends on the judge model’s capabilities
Not a full replacement for human evaluation on high-stakes decisions

Human evaluation

Despite the power of automated evaluation, human evaluation remains essential for certain aspects of agent quality.

When you need humans

Subjective quality: Is the tone appropriate? Is the response empathetic? Does it match the brand voice?
Novel situations: When the agent encounters a scenario not covered by automated tests
Safety-critical decisions: When the agent is about to take an action with significant consequences
Ground truth creation: Building the test sets that automated evaluation relies on
Calibrating LLM judges: Validating that your automated judges agree with human judgment

Structuring human evaluation

Rating scales: Have evaluators rate responses on specific dimensions (correctness, helpfulness, safety) using a defined scale (1-5 with clear descriptions for each level).

Annotation guidelines: Provide detailed guidelines with examples of what constitutes a 1, 3, and 5 on each dimension. Without this, evaluators will interpret the scale differently.

Inter-annotator agreement: Have multiple evaluators score the same responses. If they disagree significantly, your guidelines need improvement.

Example rating rubric:

Score	Correctness Criteria
1	Answer is factually wrong or completely off-topic
2	Answer has significant errors but shows some understanding
3	Answer is mostly correct with minor errors or omissions
4	Answer is correct and complete
5	Answer is correct, complete, and includes helpful additional context

The agent quality flywheel

Evaluation is not a one-time activity. It is a continuous cycle that drives quality improvement over time.

Step 1: define quality

Before you can measure quality, you need to define it. What does “good” mean for your agent? This is specific to your use case.

Questions to answer:

What are the must-have behaviors? (e.g., never reveal customer PII)
What are the nice-to-have behaviors? (e.g., proactively suggest related actions)
What are the unacceptable behaviors? (e.g., making up data, taking unauthorized actions)
What are the efficiency targets? (e.g., under 5 seconds, under $0.05 per query)

Step 2: instrument for visibility

You cannot improve what you cannot see. Add instrumentation to capture everything the agent does.

What to instrument:

Every LLM call (input, output, latency, tokens)
Every tool call (input, output, success/failure, latency)
The full trajectory for each user interaction
User feedback (explicit ratings, implicit signals like retry behavior)
System metrics (error rates, throughput, costs)

Step 3: evaluate the process

Run your evaluation framework regularly - not just once at launch.

Cadence:

On every code change: Run automated end-to-end tests (like CI/CD tests)
Weekly: Review a sample of trajectories for quality
Monthly: Run a full evaluation suite including LLM-as-a-Judge and human evaluation
Quarterly: Review and update your evaluation criteria and test sets

Step 4: architect feedback loops

Use evaluation results to improve the agent. This is where the flywheel spins.

Feedback loop types:

Prompt iteration: Evaluation reveals that the agent is too verbose -> Adjust the system prompt to be more concise
Tool refinement: Trajectory analysis shows the agent calling a tool with wrong parameters -> Improve the tool description
Test set expansion: A production failure reveals an edge case not covered by tests -> Add it to the test set
Model selection: Evaluation shows quality drops after a model update -> Roll back or switch models

Define Quality --> Instrument --> Evaluate --> Improve --> Define Quality
     ^                                                          |
     |                                                          |
     +----------------------------------------------------------+
                     (Continuous Improvement)

Observability: the three pillars

To evaluate and debug agents in production, you need observability. The three pillars of observability apply to agent systems just as they do to any distributed system.

Logs

Logs are records of discrete events. For agents, every significant event should be logged.

What to log:

User inputs and agent outputs
Each step of the agent’s reasoning
Tool calls and their results
Errors and exceptions
Decision points (why did the agent choose path A over path B?)

Log structure example:

{
  "timestamp": "2025-03-18T10:30:00Z",
  "session_id": "sess_abc123",
  "step": 3,
  "type": "tool_call",
  "tool": "search_documents",
  "input": {"query": "refund policy"},
  "output": {"documents": ["doc_456"], "count": 1},
  "latency_ms": 230,
  "status": "success"
}

Traces

Traces follow a single request through the entire system, connecting all the steps into a coherent story. This is critical for multi-step agents where a single user query might trigger dozens of LLM calls and tool invocations.

What a trace looks like:

Trace: user_query_789
  |-- LLM Call 1: Parse user intent (120ms)
  |-- Tool Call 1: search_orders (250ms)
  |-- LLM Call 2: Evaluate results (90ms)
  |-- Tool Call 2: get_order_details (180ms)
  |-- LLM Call 3: Generate response (150ms)
  Total: 790ms, 3 LLM calls, 2 tool calls, 4,200 tokens

Traces let you answer questions like:

Where did the agent spend most of its time?
Which tool call failed?
At which step did the agent’s reasoning go wrong?

Metrics

Metrics are numerical measurements aggregated over time. They tell you about trends and patterns rather than individual events.

Key metrics to track:

Metric	Aggregation	Alert Threshold
Task completion rate	Daily average	Below 85%
Average latency	p50, p95, p99 per hour	p95 above 15 seconds
Error rate	Per hour	Above 5%
Token cost	Daily total	Above daily budget
Tool call failure rate	Per tool per hour	Above 10%
User satisfaction	Weekly average	Below 3.5/5.0

Building an evaluation suite

Here is a practical approach to building an evaluation suite for your agent:

1. create a golden test set

Build a set of 50-100 test cases that cover your agent’s core use cases, edge cases, and failure modes.

Structure each test case:

{
  "id": "test_001",
  "input": "What is the status of order #12345?",
  "expected_output": "Order #12345 was shipped on March 15
                      and is expected to arrive by March 18.",
  "expected_tool_calls": ["lookup_order"],
  "category": "order_status",
  "difficulty": "easy",
  "tags": ["happy_path", "single_tool"]
}

Categories to include:

Happy path (common, straightforward requests)
Edge cases (unusual inputs, boundary conditions)
Error handling (tool failures, invalid inputs)
Safety (out-of-scope requests, attempts to bypass policies)
Multi-step (tasks requiring multiple tool calls)

2. implement automated checks

For each test case, define automated pass/fail criteria:

Exact match: The output must contain specific strings (good for factual lookups)
Semantic similarity: The output must be semantically similar to the expected answer (good for open-ended responses)
Tool call verification: The agent must call specific tools with specific parameters
Negative checks: The output must NOT contain certain strings (good for safety testing)

3. add llm-as-a-judge scoring

For test cases where automated checks are insufficient, add LLM-as-a-Judge evaluation:

Judge prompt template:
"Evaluate the following agent response on a scale of 1-5
for each dimension:

User question: {question}
Agent response: {response}
Reference answer: {reference}

Dimensions:
1. Correctness (1-5): Is the information accurate?
2. Completeness (1-5): Does it address all parts of the question?
3. Helpfulness (1-5): Would the user find this useful?
4. Safety (1-5): Does it stay within appropriate bounds?

Provide a score and one-sentence justification for each dimension."

4. run evaluations in CI/CD

Integrate your evaluation suite into your continuous integration pipeline:

On every pull request: Run the golden test set with automated checks
Nightly: Run the full evaluation suite including LLM-as-a-Judge
On model changes: Run the complete suite and compare against the previous model’s scores

5. track results over time

Store evaluation results in a database or spreadsheet and track trends:

Is task completion rate improving or declining?
Are there categories where quality is dropping?
How do scores change when you update the prompt or model?
Are there new failure patterns emerging?

Evaluation with Google Cloud

Google Cloud provides tools for evaluating agents at scale:

Vertex AI Evaluation

Vertex AI’s evaluation capabilities let you run structured evaluations on agent outputs. You can define evaluation criteria, run evaluations at scale, and track results over time.

Google ADK Evaluation

The Google Agent Development Kit evaluation framework provides built-in support for testing agents during development. It integrates with the ADK’s agent definition format and supports both automated checks and LLM-as-a-Judge evaluation.

Common evaluation mistakes

Mistake 1: only testing the happy path

If your test set only contains well-formatted, clear, unambiguous inputs, you are not testing what happens in the real world. At least 30% of your test cases should cover edge cases, error conditions, and adversarial inputs.

Mistake 2: only checking the final output

An agent that produces the right answer through a wrong process is unreliable. It might have gotten lucky. Always evaluate trajectories alongside final outputs.

Mistake 3: static test sets

If your test set never changes, it becomes stale. New features, new failure modes, and new user patterns all require new test cases. Review and update your test set monthly.

Mistake 4: ignoring cost and latency

An agent that is 95% correct but costs $5 per query and takes 30 seconds is not production-ready for most use cases. Always include efficiency metrics in your evaluation.

Mistake 5: testing only in development

Agent behavior can change in production due to different data, higher load, model updates, and real user inputs that differ from your test cases. Monitor agent quality continuously in production, not just during development.

Mistake 6: no baseline comparison

Without a baseline, you cannot tell if your agent is improving. Before making changes, always measure current performance so you have a point of comparison.

Hands-On Exercise

Build an evaluation suite for an agent of your choice:

Define quality dimensions: List 3-5 specific quality criteria for your agent (e.g., correctness, tone, efficiency).
Create a golden test set: Write 20 test cases covering happy path (10), edge cases (5), and safety (5). Include expected outputs and pass/fail criteria.
Implement automated evaluation: Build a script that:
- Runs each test case through the agent
- Checks automated pass/fail criteria
- Uses LLM-as-a-Judge for subjective dimensions
- Produces a summary report
Evaluate trajectory quality: For 5 test cases, capture the full trajectory and evaluate:
- Were the right tools called?
- Were parameters correct?
- Was the path efficient?
Document findings: Write up what you learned about your agent’s strengths and weaknesses.

Key takeaways

Agent evaluation is harder than testing traditional software because outputs are non-deterministic, many paths can be valid, and quality is often subjective.
Evaluate four pillars: effectiveness (does it work?), efficiency (at what cost?), robustness (does it handle edge cases?), and safety (does it stay in bounds?).
Track both system metrics (latency, error rate, cost) and quality metrics (correctness, helpfulness, safety compliance).
Use the outside-in approach: start with black-box end-to-end tests, then open up to inspect trajectories when you need to understand why.
Trajectory evaluation checks the full execution path - the tools called, parameters used, and decisions made - not just the final answer.
LLM-as-a-Judge automates quality assessment. Prefer pairwise comparison over single scoring for more reliable results.
Human evaluation remains essential for subjective quality, novel situations, and calibrating automated judges.
Build the quality flywheel: define quality, instrument for visibility, evaluate the process, architect feedback loops, and repeat.
Observability through logs, traces, and metrics gives you the data needed to evaluate and debug agents in production.

Lesson 9: evaluating and testing agents - how to know if your agent works

Introduction

Why agent evaluation is hard

Non-Deterministic Outputs

Many valid paths

Compound errors

Subjective quality

ELI5: the new employee analogy

Four pillars of agent quality

1. effectiveness - does it achieve the goal?

2. efficiency - at what cost?

3. robustness - does it handle edge cases?

4. safety - does it stay within bounds?

System metrics vs. quality metrics

System metrics (operational health)

Quality metrics (output goodness)

The Outside-In Approach

Level 1: Black-Box End-to-End Testing

Level 2: glass-box trajectory evaluation

Trajectory evaluation in detail

What a trajectory contains

Evaluating trajectories

LLM-as-a-Judge

How it works

Single scoring vs. pairwise comparison

Pairwise comparison example

Best practices for llm-as-a-judge

Limitations of LLM-as-a-Judge

Human evaluation

When you need humans

Structuring human evaluation

The agent quality flywheel

Step 1: define quality

Step 2: instrument for visibility

Step 3: evaluate the process

Step 4: architect feedback loops

Observability: the three pillars

Logs

Traces

Metrics

Building an evaluation suite

1. create a golden test set

2. implement automated checks

3. add llm-as-a-judge scoring

4. run evaluations in CI/CD

5. track results over time

Evaluation with Google Cloud

Vertex AI Evaluation

Google ADK Evaluation

Common evaluation mistakes

Mistake 1: only testing the happy path

Mistake 2: only checking the final output

Mistake 3: static test sets

Mistake 4: ignoring cost and latency

Mistake 5: testing only in development

Mistake 6: no baseline comparison

Hands-On Exercise

Key takeaways

Further reading