
Lesson 11: From prototype to production - shipping your agent

You have built a working agent. It handles your test cases, impresses your team in demos, and feels like magic. Now you need to ship it to real users. This is where things get hard.

The gap between “works on my laptop” and “works reliably for thousands of users” is enormous. In traditional software, production readiness mostly means handling edge cases, scaling, and monitoring. With agents, you have all of that plus the fundamental challenge that your system’s behavior is non-deterministic and hard to fully predict.

ELI5: Taking an agent to production is like opening a restaurant


You have been cooking great meals at home for your friends. Everyone loves your food. Now you want to open a restaurant. Suddenly you need to think about things that never mattered at home: health inspections, supply chains, consistent recipes so every dish tastes the same, training staff, handling complaints, managing costs, and making sure the kitchen does not catch fire on a busy Saturday night. The cooking skill is the same, but everything around it changes completely.

That is the prototype-to-production gap. Your agent’s core logic might not change much, but everything around it - evaluation, deployment, monitoring, cost management, team processes - needs to be built from scratch.

Key takeaway: The “last mile” from prototype to production is often 80% of the total effort. Plan for it from the start.


Here is what changes when you move from prototype to production:

| Dimension | Prototype | Production |
| --- | --- | --- |
| Users | You and your team | Hundreds or thousands of real users |
| Inputs | Curated test cases | Anything anyone types, including adversarial input |
| Uptime | Restart when it breaks | Must be available 24/7 with graceful degradation |
| Latency | "It takes a few seconds" is fine | Users expect sub-second responses for simple queries |
| Cost | Burn rate does not matter | Every token costs money at scale |
| Quality | "Usually works" is acceptable | Consistent quality is required; bad responses erode trust |
| Safety | Informal testing | Systematic guardrails, monitoring, and incident response |
| Debugging | Print statements | Structured logs, traces, and metrics |
| Updates | Edit and restart | CI/CD pipeline with evaluation gates |

Demos work because the person giving the demo knows what inputs work well. They avoid edge cases. They retry failures off-screen. They pick the best example from multiple runs.

Production is the opposite. Real users will:

  • Misspell things, use slang, write in unexpected languages
  • Ask questions your agent was never designed to handle
  • Provide extremely long or extremely short inputs
  • Try the exact same query many times if they are unhappy with the result
  • Discover failure modes you never imagined

This is why evaluation-gated deployment is so important. You should not ship an agent version that has not been tested against a comprehensive set of real-world scenarios.


Productionizing an agent is not a solo effort. It typically involves several roles working together:

| Role | Responsibility |
| --- | --- |
| AI Engineer | Agent logic, prompt design, tool integration, eval creation |
| Platform Engineer | Infrastructure, deployment pipelines, service mesh, scaling |
| Data Engineer | Data pipelines for RAG, knowledge bases, training data management |
| ML Ops / AI Ops | Model serving, versioning, A/B testing, monitoring dashboards |
| DevOps / SRE | Reliability, incident response, alerting, cost tracking |
| Product Manager | User requirements, success metrics, prioritization |
| Security / Trust & Safety | Guardrails, red teaming, compliance, safety reviews |

In smaller teams, one person might wear multiple hats. But the responsibilities still exist regardless of how many people share them.


This is the single most important practice for shipping agents safely. The principle is simple: no agent version ships without passing evals.

In Lesson 9, we covered how to build evals. Here is how they fit into the deployment process:

Code change --> Evals pass? --No--> Fix and retry
                    |
                   Yes
                    |
                    v
            Deploy to staging
                    |
                    v
       Staging evals pass? --No--> Fix and retry
                    |
                   Yes
                    |
                    v
    Deploy to production (canary)
                    |
                    v
    Production metrics OK? --No--> Rollback
                    |
                   Yes
                    |
                    v
       Full production rollout

Your eval suite for deployment should cover:

| Category | What to Test | Pass Criteria |
| --- | --- | --- |
| Functional correctness | Does the agent produce correct answers? | >= threshold on accuracy metrics |
| Tool usage | Does the agent call the right tools with correct arguments? | Tools called correctly in >= X% of cases |
| Safety | Does the agent resist prompt injection and follow policies? | 100% pass rate on safety-critical cases |
| Latency | Does the agent respond within acceptable time? | P95 latency < target |
| Cost | Does the agent stay within token budgets? | Average cost per interaction < budget |
| Regression | Do previously passing cases still pass? | No regressions on known-good cases |

Safety evals should have a hard gate - any failure blocks deployment. Other categories might have softer thresholds where you accept small regressions if overall quality improves.
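The hard-gate-versus-soft-threshold distinction can be sketched in a few lines. This is an illustrative gate function with made-up category names and thresholds, not the API of any real eval framework:

```python
# Illustrative eval gate: safety is a hard gate, other categories have
# soft thresholds. Names and numbers are examples, not real pipeline config.
THRESHOLDS = {
    "accuracy": 0.90,    # soft: tune per product
    "tool_usage": 0.95,  # soft
    "safety": 1.00,      # hard: any failure blocks deployment
}

def deployment_allowed(scores: dict) -> bool:
    """Return True only if every eval category meets its threshold.

    Safety must be exactly 100%; a single failing safety case blocks the ship.
    """
    if scores.get("safety", 0.0) < 1.0:
        return False
    return all(scores.get(name, 0.0) >= min_score
               for name, min_score in THRESHOLDS.items())
```

In practice this check would run in CI after the eval suite completes, with the scores dict produced by your eval harness.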


Continuous integration and continuous deployment for agents follows the same principles as traditional CI/CD but with agent-specific steps. Think of it in three phases.

Phase 1: Pre-merge (on every pull request)


These checks run quickly and catch obvious problems before code is merged.

# Example: Pre-merge checks
pre_merge:
  - lint:
      - Check prompt formatting and syntax
      - Validate tool definitions match schemas
      - Static analysis of agent configuration
  - unit_tests:
      - Test individual tool functions
      - Test guardrail logic
      - Test input/output parsers
  - basic_evals:
      - Run a small, fast eval set (50-100 cases)
      - Focus on regression detection
      - "Target: completes in < 5 minutes"

Phase 2: Post-merge (on every merge to main)


After code is merged, run more comprehensive checks before promoting to staging.

# Example: Post-merge validation
post_merge:
  - staging_deployment:
      - Deploy to staging environment
      - Verify health checks pass
  - broad_evals:
      - Run full eval suite (500-1000+ cases)
      - Include safety evals
      - Include latency and cost benchmarks
      - "Target: completes in < 30 minutes"
  - integration_tests:
      - Test end-to-end flows with real tool connections
      - Verify external service integrations

Phase 3: Production gate (before production deployment)


The final check before real users see the new version.

# Example: Production gate
production_gate:
  - full_evals:
      - Complete eval suite including edge cases
      - Adversarial test cases
      - Cross-model consistency checks (if using multiple models)
  - safety_review:
      - Automated safety evals must pass 100%
      - Human review for significant prompt changes
      - Red team sign-off for major feature changes
  - approval:
      - Automated approval if all checks pass
      - Manual approval required if any check is marginal

Prompts deserve the same version control discipline as code:

  • Store prompts in version control (not in a database or config service that is hard to diff)
  • Review prompt changes in pull requests just like code changes
  • Track which prompt version is deployed to which environment
  • Make it easy to roll back to a previous prompt version
prompts/
  customer_support/
    system_prompt.txt       # The main system instructions
    tool_descriptions.txt   # Tool descriptions and schemas
    safety_rules.txt        # Safety-specific instructions
    version.txt             # Current version identifier
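With prompts stored as plain files like this, loading a versioned prompt bundle is straightforward. A minimal sketch, assuming the directory layout above (the function name and return shape are illustrative):

```python
from pathlib import Path

def load_prompt_bundle(base_dir: str) -> dict:
    """Load all prompt files for one agent from its version-controlled directory.

    Returns a dict keyed by file stem, e.g. {"system_prompt": "...", "version": "v1.3"}.
    """
    return {path.stem: path.read_text() for path in Path(base_dir).glob("*.txt")}
```

Because the files live in git, diffing a prompt change is `git diff prompts/customer_support/`, and rollback is a revert.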

Even with comprehensive evals, production can surprise you. Safe rollout strategies limit the blast radius when something goes wrong.

Route a small percentage of traffic to the new version. Monitor for problems before increasing the percentage.

Traffic ---> [Load Balancer]
               |        |
              95%       5%
               |        |
               v        v
         [Version 1]  [Version 2 - Canary]
          (current)    (new)

How it works:

  1. Deploy the new version alongside the current one
  2. Route 5% of traffic to the new version
  3. Monitor key metrics (error rate, latency, user satisfaction, safety incidents)
  4. If metrics are healthy after a set period, increase to 25%, then 50%, then 100%
  5. If any metric degrades, route all traffic back to the current version
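Step 2 has one subtlety: a session should stay on the same version across requests, so users do not flip-flop between behaviors mid-conversation. Deterministic hash-based bucketing is a common way to get that; here is a minimal sketch (function name and version labels are illustrative):

```python
import hashlib

def route_version(session_id: str, canary_percent: int) -> str:
    """Deterministically assign a session to the canary or stable version.

    Hashing the session ID into 100 buckets makes the assignment sticky:
    the same session always lands on the same version for a given percentage.
    """
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"
```

Ramping from 5% to 25% to 50% then only changes `canary_percent`; sessions already in the canary buckets stay there.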

Maintain two identical production environments. Switch all traffic from one to the other.

Before: Traffic --> [Blue - v1.2 ACTIVE] [Green - idle]
During: Traffic --> [Blue - v1.2] [Green - v1.3 ACTIVE]

The advantage is a clean cutover and instant rollback (just switch back to Blue). The downside is you need double the infrastructure during the transition.

Route traffic to different agent versions and compare their performance on real interactions.

| Version A | Version B | Metric | Winner |
| --- | --- | --- | --- |
| GPT-based, verbose prompts | Gemini-based, concise prompts | Task completion rate | Compare after N interactions |
| ReAct pattern | Plan-then-execute | User satisfaction | Compare after N interactions |
| Model A, 3 tool retries | Model A, 1 tool retry | Cost per interaction | Compare after N interactions |

A/B testing is especially valuable for agents because it lets you compare different models, prompts, and architectures on real traffic.

Control agent capabilities with runtime flags that can be toggled without redeployment.

# Example: Feature flags for agent capabilities
if feature_flags.is_enabled("new_refund_flow", user_id=user.id):
    agent.enable_tool("process_refund_v2")
else:
    agent.enable_tool("process_refund_v1")

if feature_flags.is_enabled("extended_context_window"):
    agent.set_max_context(128000)
else:
    agent.set_max_context(32000)

Feature flags let you gradually roll out new capabilities, quickly disable problematic features, and run experiments on subsets of users.


Once your agent is running in production, you need to see what it is doing. Observability for agents has three pillars, just like traditional systems - but the specifics are different.

Structured logs that capture every significant event in the agent’s lifecycle:

{
  "timestamp": "2025-06-15T10:23:45Z",
  "session_id": "sess_abc123",
  "event": "tool_call",
  "tool": "search_knowledge_base",
  "arguments": {"query": "return policy for electronics"},
  "result_status": "success",
  "latency_ms": 234,
  "tokens_used": {"input": 1250, "output": 380}
}

What to log:

  • Every LLM call (model, input tokens, output tokens, latency)
  • Every tool call (tool name, arguments, result status, latency)
  • Agent decisions (which path was chosen and why)
  • Guardrail activations (what was blocked and why)
  • Escalation events
  • Session start/end with summary metrics
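Emitting events in that shape takes only a thin helper over the standard `logging` and `json` modules. A minimal sketch (the helper name and field set are illustrative; real systems would also attach trace IDs):

```python
import json
import logging
import time

logger = logging.getLogger("agent")

def log_event(event: str, session_id: str, **fields) -> str:
    """Emit one JSON log line per agent event.

    Structured (one JSON object per line) so log pipelines can filter
    and aggregate by event type, tool, latency, and token counts.
    """
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "session_id": session_id,
        "event": event,
        **fields,  # e.g. tool, latency_ms, tokens_used
    }
    line = json.dumps(record)
    logger.info(line)
    return line
```

Keeping the schema consistent across LLM calls, tool calls, and guardrail events is what makes the logs queryable later.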

Traces show the full journey of a single request through your agent, including all the steps, tool calls, and decisions along the way.

[User Request] "Help me return my order"
  |
  +-- [LLM Call 1] Understand intent (150ms)
  |     Model: gemini-2.0-flash, Tokens: 800 in / 120 out
  |
  +-- [Tool Call] lookup_order(order_id="12345") (340ms)
  |     Status: success
  |
  +-- [Tool Call] check_return_eligibility(order_id="12345") (180ms)
  |     Status: success, eligible=true
  |
  +-- [LLM Call 2] Generate response (200ms)
  |     Model: gemini-2.0-flash, Tokens: 1200 in / 250 out
  |
  +-- [Output Guardrail] PII check (15ms)
  |     Status: pass
  |
[Response] "Your order #12345 is eligible for return..."
Total: 885ms, Cost: $0.003

OpenTelemetry is the industry standard for distributed tracing. Many agent frameworks support OpenTelemetry out of the box, and Google Cloud’s operations suite (Cloud Logging, Cloud Trace, Cloud Monitoring) integrates with OpenTelemetry natively.

Aggregate metrics that tell you how your agent is performing overall:

| Metric | What It Tells You | Alert Threshold Example |
| --- | --- | --- |
| Task completion rate | How often the agent successfully completes user requests | Drops below 85% |
| Average latency | How long users wait for responses | P95 exceeds 5 seconds |
| Cost per interaction | How much each conversation costs | Average exceeds $0.10 |
| Escalation rate | How often the agent hands off to humans | Exceeds 20% |
| Safety incident rate | How often guardrails are triggered | Any increase above baseline |
| Tool error rate | How often tool calls fail | Exceeds 5% |
| User satisfaction | Thumbs up/down or CSAT scores | Drops below 4.0/5.0 |
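The threshold column maps directly to alert rules. A hedged sketch of that mapping, using the example thresholds from the table (metric names and values are illustrative; a real deployment would define these in its monitoring system, not application code):

```python
# Illustrative alert rules mirroring the table's example thresholds.
ALERT_RULES = {
    "task_completion_rate": lambda v: v < 0.85,
    "p95_latency_s":        lambda v: v > 5.0,
    "cost_per_interaction": lambda v: v > 0.10,
    "escalation_rate":      lambda v: v > 0.20,
    "tool_error_rate":      lambda v: v > 0.05,
}

def firing_alerts(metrics: dict) -> list:
    """Return the names of metrics that currently breach their threshold."""
    return [name for name, breached in ALERT_RULES.items()
            if name in metrics and breached(metrics[name])]
```

In production these rules would live in Cloud Monitoring (or equivalent) so they fire even when the agent itself is down.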

A production agent dashboard should show at a glance:

+-------------------------------------------------------+
|                Agent Health Dashboard                 |
+-------------------------------------------------------+
|                                                       |
|  Status: HEALTHY             Active Sessions: 1,247   |
|                                                       |
|  +-------------------+   +-------------------+        |
|  | Completion Rate   |   | Avg Latency       |        |
|  | 92.3%  (+0.5%)    |   | 1.2s  (-0.1s)     |        |
|  +-------------------+   +-------------------+        |
|                                                       |
|  +-------------------+   +-------------------+        |
|  | Cost / Session    |   | Escalation Rate   |        |
|  | $0.042 (-$0.003)  |   | 8.1%  (+0.2%)     |        |
|  +-------------------+   +-------------------+        |
|                                                       |
|  Recent Safety Incidents: 0  (last 24h)               |
|  Recent Errors: 12  (last 24h, 0.04% of sessions)     |
|                                                       |
+-------------------------------------------------------+

Production is not a destination. It is the beginning of a continuous improvement cycle.

+----------+
| Observe  | <-- Collect metrics, logs, traces, user feedback
+----+-----+
     |
     v
+----+-----+
|   Act    | <-- Identify issues, prioritize improvements
+----+-----+
     |
     v
+----+-----+
|  Evolve  | <-- Update prompts, tools, evals, guardrails
+----+-----+
     |
     +-------> Back to Observe

Collect data about how your agent performs in production:

  • Quantitative: Metrics dashboards, automated eval results on production traffic
  • Qualitative: User feedback, support tickets, conversation reviews
  • Adversarial: Ongoing red teaming, new attack pattern detection

Turn observations into concrete actions:

  • Failing on a specific type of query? Add it to your eval set and improve the prompt.
  • Tool errors spiking? Investigate the root cause and add better error handling.
  • Users consistently confused by a response pattern? Revise the agent’s instructions.
  • New attack vector discovered? Add a guardrail and a safety eval.

Deploy improvements through your evaluation-gated CI/CD pipeline:

  • Update prompts and re-run evals
  • Add new tools or modify existing ones
  • Expand the eval suite to cover newly discovered edge cases
  • Adjust guardrails based on observed threats
  • Retrain or swap models if better options become available

The key insight is that your eval suite grows over time. Every production incident, every user complaint, and every edge case becomes a new eval. This means your agent gets harder to break with each iteration.
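Capturing a production failure as a new eval case can be as simple as appending it to a JSONL file in your eval repo. A minimal sketch with duplicate detection (the file format and field names are illustrative, not those of any specific eval framework):

```python
import json
from pathlib import Path

def add_regression_case(eval_file: str, user_input: str, expected_behavior: str) -> bool:
    """Append a production failure to the eval set, skipping exact duplicates.

    Stores one JSON object per line (JSONL). Returns True if a new case
    was added, False if it was already present.
    """
    path = Path(eval_file)
    case = {"input": user_input, "expected": expected_behavior}
    existing = []
    if path.exists():
        existing = [json.loads(line) for line in path.read_text().splitlines()]
    if case in existing:
        return False
    with path.open("a") as f:
        f.write(json.dumps(case) + "\n")
    return True
```

Because the eval file is version-controlled alongside the agent, the regression case ships through the same review and CI process as any other change.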


LLM-based agents can be expensive at scale. A single conversation might involve multiple LLM calls, each consuming thousands of tokens. Multiply that by thousands of users and costs add up fast.

Use the cheapest model that can handle each task. Not every step requires your most powerful model.

User Query
    |
    v
[Router] --Simple query-------> Gemini Flash-Lite ($)
    |
    +-----Medium complexity---> Gemini Flash ($$)
    |
    +-----Complex reasoning---> Gemini Pro ($$$)
| Task Type | Recommended Model Tier | Rationale |
| --- | --- | --- |
| Intent classification | Small / Flash-Lite | Simple classification task |
| Information retrieval | Medium / Flash | Needs good comprehension, moderate generation |
| Complex reasoning | Large / Pro | Multi-step reasoning, nuanced judgment |
| Summarization | Medium / Flash | Good balance of quality and cost |
| Safety checks | Small / Flash-Lite | Pattern matching, classification |
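The router itself can be very simple. Here is a hedged sketch where `classify_complexity` is a toy heuristic standing in for a real classifier (which would typically itself be a small, cheap model); the tier names follow the diagram, and the model identifiers are illustrative:

```python
# Illustrative tiered routing: cheap heuristic picks a model tier.
MODEL_TIERS = {
    "simple":  "gemini-flash-lite",  # classification, safety checks
    "medium":  "gemini-flash",       # retrieval, summarization
    "complex": "gemini-pro",         # multi-step reasoning
}

def classify_complexity(query: str) -> str:
    """Toy stand-in for a real complexity classifier."""
    if len(query.split()) > 40 or "step by step" in query.lower():
        return "complex"
    if "?" in query and len(query.split()) > 10:
        return "medium"
    return "simple"

def pick_model(query: str) -> str:
    """Route each query to the cheapest model tier that can handle it."""
    return MODEL_TIERS[classify_complexity(query)]
```

The savings compound: if 70% of traffic is simple queries answered by the cheapest tier, average cost per interaction drops sharply without touching quality on hard queries.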

Cache responses for repeated or similar queries to avoid redundant LLM calls.

| Caching Strategy | When to Use |
| --- | --- |
| Exact match cache | FAQ-style queries where many users ask the same thing |
| Semantic cache | Queries that are different in wording but identical in meaning |
| Tool result cache | Tool outputs that do not change frequently (e.g., product catalog lookups) |
| Prompt cache | Reuse cached prefixes for system prompts across calls (Vertex AI supports context caching) |
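An exact match cache is the simplest of these to build. A minimal in-memory sketch keyed on normalized query text (a real deployment would add TTLs and a shared store such as Redis; the class and method names are illustrative):

```python
import hashlib

class ExactMatchCache:
    """In-memory exact-match response cache keyed on normalized query text.

    Lowercasing and collapsing whitespace lets trivially different inputs
    ("What is your return policy?" vs "what is  your return policy?")
    hit the same entry. Semantic caching would compare embeddings instead.
    """
    def __init__(self):
        self._store = {}

    def _key(self, query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))

    def put(self, query: str, response: str) -> None:
        self._store[self._key(query)] = response
```

Every cache hit is an LLM call you did not pay for, which is why FAQ-heavy workloads see the largest savings.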

Set hard limits on how many tokens an agent can consume per session.

# Example: Token budget enforcement
class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used_tokens = 0

    def can_proceed(self, estimated_tokens: int) -> bool:
        return (self.used_tokens + estimated_tokens) <= self.max_tokens

    def record_usage(self, actual_tokens: int):
        self.used_tokens += actual_tokens

# Usage
budget = TokenBudget(max_tokens=50000)  # per session
while agent.has_next_step():
    estimated = agent.estimate_next_step_tokens()
    if not budget.can_proceed(estimated):
        agent.respond("I have reached my processing limit for this session. "
                      "Let me summarize what I have found so far.")
        break
    result = agent.execute_next_step()
    budget.record_usage(result.tokens_used)

Track costs at multiple levels:

| Level | What to Track | Why |
| --- | --- | --- |
| Per request | Tokens used, model tier, tool calls | Debug expensive individual requests |
| Per session | Total cost of a conversation | Set and enforce per-session budgets |
| Per user | Aggregate cost per user over time | Identify usage patterns and outliers |
| Per feature | Cost of specific agent capabilities | Decide which features are cost-effective |
| Overall | Daily/weekly/monthly spend | Budget planning and forecasting |
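Tracking at several of these levels at once only requires recording each request against multiple aggregates. A minimal sketch (the per-million-token prices here are made-up placeholders; real pricing varies by model and changes over time):

```python
from collections import defaultdict

# Placeholder prices per million tokens; substitute your model's real rates.
PRICE_PER_M_TOKENS = {"input": 0.10, "output": 0.40}

class CostTracker:
    """Aggregate LLM cost at the request, session, and user level."""
    def __init__(self):
        self.by_session = defaultdict(float)
        self.by_user = defaultdict(float)
        self.total = 0.0

    def record(self, session_id: str, user_id: str,
               input_tokens: int, output_tokens: int) -> float:
        """Record one request's usage; returns its cost in dollars."""
        cost = (input_tokens * PRICE_PER_M_TOKENS["input"]
                + output_tokens * PRICE_PER_M_TOKENS["output"]) / 1_000_000
        self.by_session[session_id] += cost
        self.by_user[user_id] += cost
        self.total += cost
        return cost
```

The per-session aggregate is what a `TokenBudget`-style enforcement mechanism would consult; the per-user and overall totals feed dashboards and forecasting.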

Before launching your agent to production users, walk through this checklist:

Reliability

  • Health checks and liveness probes configured
  • Graceful degradation when dependencies fail (model API down, tool unavailable)
  • Retry logic with exponential backoff for transient failures
  • Circuit breakers for external service calls
  • Timeout limits on all LLM and tool calls

Deployment

  • CI/CD pipeline with eval gates at each stage
  • Rollback procedure tested and documented
  • Canary or blue-green deployment configured
  • Feature flags for new capabilities
  • Prompt versioning and change tracking

Observability

  • Structured logging for all agent events
  • Distributed tracing with OpenTelemetry
  • Dashboards for key metrics (completion rate, latency, cost, safety)
  • Alerting configured for critical thresholds
  • On-call rotation and incident response runbook

Cost

  • Model routing configured (right model for each task)
  • Caching strategy implemented
  • Token budgets per session
  • Cost monitoring and alerting
  • Regular cost reviews and optimization

Safety

  • Guardrails from Lesson 10 implemented and tested
  • Safety evals passing at 100%
  • Red team review completed
  • Incident response plan for safety failures
  • User feedback channel for reporting problems
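One checklist item worth spelling out is retry logic with exponential backoff, since naive retries can amplify outages. A minimal sketch (real clients should also cap total elapsed time and honor any retry-after hints from the provider):

```python
import random
import time

def call_with_retries(fn, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry a flaky zero-argument callable with exponential backoff and jitter.

    fn might wrap an LLM or tool call. Delays grow as base_delay * 2^attempt,
    with random jitter so many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

This pairs naturally with a circuit breaker: retries handle transient blips, while the breaker stops hammering a dependency that is genuinely down.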

  1. The prototype-to-production gap is real and large. Plan for production concerns from the beginning. The “last mile” is the majority of the work.

  2. Evaluation-gated deployment is non-negotiable. No agent version should reach production without passing a comprehensive eval suite. Your eval suite is your quality guarantee.

  3. CI/CD for agents has three phases. Pre-merge checks catch obvious issues fast. Post-merge validation runs broader evals. Production gates ensure safety and quality before real users are affected.

  4. Safe rollout strategies limit blast radius. Canary deployments, feature flags, and A/B testing let you catch problems before they affect all users.

  5. Observability is essential. You cannot improve what you cannot see. Invest in logs, traces, and metrics from day one.

  6. Cost management requires active attention. Model routing, caching, and token budgets can reduce costs dramatically without sacrificing quality.

  7. Production is the beginning, not the end. The Observe-Act-Evolve loop means your agent continuously improves based on real-world usage.



Next lesson: Getting Started with Vertex AI and ADK