Agent Reliability Is a Loop, Not a Dashboard
Observability shows the $18,400 cloud bill. It doesn't stop the agent. Why AI agent reliability in production is a loop, not a dashboard.
- agent-reliability
- runtime-firewall
- observability
- tool-calls
2 a.m. The pager goes off. A deployment agent ran terraform apply against the prod workspace instead of staging and spun up a 32-node GPU cluster. The trace is captured. The eval suite from last week's deploy is green. The dashboard renders the failure beautifully — span tree, token counts, latency histogram, the lot.
The machines are running. The bill is already accruing — it will land at $18,400 by morning.
Observability told us that it happened. It did not stop it from happening. The trace is a really nice autopsy. The on-call engineer doesn't care about the autopsy at 2 a.m.
This is not an isolated scene. Mike Lanham gave the pattern a name and called it a discipline: agent reliability engineering, SRE for systems that act instead of just respond. The pattern lands hardest where the agent has the most blast radius — the deploy autopilots taking over GenAI infra management, where a single tool call can spin up a cluster, drop a database, or push to main.
Every popular reliability strategy in 2026 has the same architectural property as that dashboard: it sits somewhere other than on the wire between the model deciding to act and the action actually happening. That property is the bug. The rest of this post is about where the bug lives, what the missing primitive is, and why a single primitive isn't enough — reliability is a loop, and the loop has to close.
Five strategies, one architectural bug: they all fire after the action
Maxim's recent piece on agent reliability lays out the five strategies teams reach for: observability, eval frameworks, simulation environments, error handling, and governance. They cite the now-familiar number — 70 to 85 percent of AI agent deployments fail to deliver on production goals.
The framing has hardened into industry consensus. Galileo's reliability writeup converges on the same five-strategy shape — observability, evals, governance, drift detection, simulation. Requesty arrives at it from a different direction and argues the missing piece is routing and provider failover layered on top of the same telemetry primitives. Different vendors, same family of answers. The diagnosis is right. The five strategies are real practice. We use most of them ourselves.
What none of these lists name is where on the request path each strategy lives — which is the question the rest of this post is about.
Look at where each one runs, though.
- Observability runs after the tool call. It tells you the cluster is up. It cannot un-spin it.
- Eval frameworks run before the agent ever shipped, on a curated dataset. The prompt that wandered into the prod workspace at 2 a.m. wasn't in your eval set.
- Simulation environments run pre-prod against synthetic inputs. Production has real workspaces with real credentials.
- Error handling is retry logic. You can't retry a terraform apply.
- Governance is policy on paper. Policy doesn't sit on the wire.
There is one moment that none of these touch: the moment between the LLM emitting tool_use and the tool actually executing. Post-action telemetry tells you what happened. Pre-deployment evals tell you what would happen on a frozen dataset. Neither stops a bad call in production. The missing strategy is at-the-boundary enforcement, and it's missing because nobody had a synchronous primitive that fit there.
That's the gap. Everything else is downstream of it.
The missing primitive: enforcement at the tool-call boundary
The boundary is a specific place in your code. The LLM has emitted a tool_use block. Your runtime has parsed it. The tool function — shell exec, file write, git push, cloud API call, deploy — has not yet been called. That window is the last point at which "would do" is still distinguishable from "did do."
What lives there, in shape, is a synchronous evaluator. It receives (tool_name, tool_input, trace_context). It returns one of two answers: allow the call, or block it. (A third answer — modify the input — exists in the type system but no production runner emits it today.) The evaluator runs on the same code path as the tool call itself. If it doesn't return in time, you choose: fail open and let the call go, or fail closed and refuse. Both are real, defensible choices. The point is that the choice exists at all. Without the boundary primitive, there is no choice — the action just happens.
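Here is a minimal sketch of that shape. The names (GuardDecision, evaluate_tool_call, run_tool) and the toy string rules are illustrative assumptions, not any particular SDK's API; the point is where the check runs, not how clever it is.

```python
# Minimal sketch of a synchronous evaluator at the tool-call boundary.
# All names and the toy string rules are illustrative, not a real SDK's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class GuardDecision:
    allow: bool
    reason: str = ""

# Stand-in rules: block obviously destructive shell patterns.
BLOCKED_PATTERNS = ("rm -rf", "terraform apply", "push --force")

def evaluate_tool_call(tool_name: str, tool_input: dict, trace_context: dict) -> GuardDecision:
    """Runs synchronously, on the same code path as the tool call itself."""
    if tool_name == "shell":
        command = tool_input.get("command", "")
        for pattern in BLOCKED_PATTERNS:
            if pattern in command:
                return GuardDecision(allow=False, reason=f"matched {pattern!r}")
    return GuardDecision(allow=True)

def run_tool(tool_name: str, tool_input: dict, trace_context: dict,
             tools: dict[str, Callable]):
    # The boundary: the model has emitted tool_use, the tool has not yet executed.
    decision = evaluate_tool_call(tool_name, tool_input, trace_context)
    if not decision.allow:
        # Failing closed here; failing open would mean executing anyway.
        raise PermissionError(f"blocked {tool_name}: {decision.reason}")
    return tools[tool_name](**tool_input)  # the side effect happens only after allow
```

With a stand-in shell tool wired in, a terraform apply call raises PermissionError before the tool function is ever invoked; swap the raise for a fail-open branch if that is the trade-off you want.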
What you get from not having this: every reliability investment becomes archeology. Observability tells you the agent ran terraform apply and the bill is $18,400. Heuristics in your prompt tell the model to be careful. Retries fire on transient errors. None of them prevent the side effect, because the side effect is already in the world by the time any of them run.
A runaway terraform apply is the load-bearing example, but the shape generalizes: any tool call that mutates the world — executes a shell command, writes a file, pushes to a branch, hits a cloud API, triggers a deploy — has the same boundary, and the same gap.
A loop only works if it closes: enforce → trace → diagnose → re-enforce
OK. So put a wall at the boundary. Done?
No. A static wall goes stale in a week. Your agent grew a new tool. A prompt change shifted its behavior on a class of inputs you'd never seen. A user started driving the agent through a path your rules don't cover. The wall keeps holding the line on yesterday's failures and lets today's failures through. Enforcement without a feedback loop is a snapshot, and snapshots rot.
Equally: tracing without enforcement is a graveyard. You get gorgeous spans of every incident you couldn't prevent. Engineers spend the morning explaining the postmortem in beautiful detail. Nothing changes by Friday.
The shape that actually holds is a loop with four arrows:
- Enforce at the boundary. Block the obviously-bad call before it happens.
- Trace what the agent did, what it tried to do, and what got blocked. Every blocked call is a labeled failure example.
- Diagnose patterns in the trace and violation stream. Cluster the misses. Find the new failure modes. Identify where the wall is wrong: too tight, too loose, wrong scope.
- Re-enforce with updated rules. Push them back to the boundary. The wall now reflects what production actually does, not what you thought it would do at design time.
Drop any one arrow and the system regresses to one of the broken states above. No enforce, you have a graveyard. No trace, you're enforcing on yesterday's intuition. No diagnose, your trace pile grows but no rule ever changes. No re-enforce, your diagnosis lives on a Notion page.
Firstsource lands at the same conclusion from the opposite direction. They walk through the past two years of agent benchmark gains and ask why production teams aren't feeling it — the answer they reach is that AI agents keep getting smarter without getting more reliable, because the failure modes that hurt in production live outside the model. That's the same gap, named from the model side. Better models don't close it because the gap doesn't live inside the model — it lives in the four arrows. The four arrows aren't a process diagram. They're an architecture. Each one needs a real component on the wire, owned together so the loop actually closes instead of fragmenting across four vendors that don't talk.
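To make the architecture claim concrete, here is a toy sketch of the four arrows sharing one piece of state. Everything in it is an assumption for illustration; a real system splits these across a runtime firewall, a trace store, and a diagnosis engine, but they have to share the rules and the trace or the loop never closes.

```python
# Toy sketch: the four arrows sharing one piece of state (rules and trace).
# Illustrative only; rule matching and "diagnosis" here are deliberately naive.
from collections import Counter

class ReliabilityLoop:
    def __init__(self, rules: list[str]):
        self.rules = list(rules)        # what the boundary enforces today
        self.trace: list[dict] = []     # every call, allowed or blocked

    def enforce(self, tool_name: str, command: str) -> bool:
        """Arrow 1: runs before the tool executes; returns False to block."""
        blocked = any(rule in command for rule in self.rules)
        # Arrow 2: trace everything; every block is a labeled failure example.
        self.trace.append({"tool": tool_name, "command": command, "blocked": blocked})
        return not blocked

    def diagnose(self, incident_commands: list[str]) -> list[str]:
        """Arrow 3: find what slipped past today's rules and cluster it.
        In practice the incidents come out of the trace and violation stream."""
        misses = [c for c in incident_commands
                  if not any(rule in c for rule in self.rules)]
        # Naive clustering: the most common leading token of a missed command.
        common = Counter(c.split()[0] for c in misses if c.split())
        return [token for token, _ in common.most_common(3)]

    def re_enforce(self, proposed: list[str]) -> None:
        """Arrow 4: push accepted rules back to the boundary for the next call."""
        self.rules.extend(r for r in proposed if r not in self.rules)
```

Drop any method and you land back in one of the broken states above: enforce without re_enforce is a snapshot, trace without enforce is a graveyard.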
Where Staso is on the loop today, and what's shipping next
Concretely, what we ship and what's coming.
Live today. Three of the four arrows.
Trace. Python SDK on PyPI (pip install staso). Drop-in integrations for Anthropic, OpenAI, LiteLLM, Claude Code, and Codex — three lines and your existing agent emits traces. Monitoring dashboards, conversations, cost tracking. Org RBAC with custom roles. Free tier at 10K traces per month.
Enforce. The runtime firewall is on the wire today. The SDK calls a synchronous evaluator (POST /v1/guard/evaluate) before each tool runs; in enforce mode it raises GuardBlocked and emits a guard:blocked:<tool> span on the trace. Zero-config rules ship in the box — destructive shell ops (rm -rf, terraform apply, force-push), secret-bearing paths (.aws/credentials, .ssh/id_rsa, .env.production), network egress patterns, prompt injection, jailbreak. They are opt-in: you create a policy in the dashboard, link the rules you want, set mode to enforce or audit. Custom rules in plain English live in the same surface.
Diagnose. Self-heal runs on every failure surface — manual ("What happened here?" on any trace), automatic on guard violations, or scheduled. A deterministic engine catches the easy patterns first (regression after commit, model drift, token bloat); below 0.85 confidence it escalates to an agent loop with read_file / bash / clone_repo in a sandbox. Output is a root-cause summary, explanation, suggested fix, confidence, tags — streamed live to the trace view. With a GitHub connection, Fix turns the suggested fix into an actual PR you review and merge.
Shipping next (~2-3 weeks). Re-enforce. The fourth arrow. Today the diagnosis lands in the dashboard and a human turns it into a new rule. Re-enforce closes that gap: a recurring diagnosis pattern auto-proposes a guard rule, you accept or reject in one click, and it lands on the wire for the next tool call. Every blocked action and every diagnosis feeds a per-team failure pattern database, so the rules you re-enforce are the ones that actually matched your production.
We are deliberate about this order. Trace first, because you cannot enforce what you cannot see. Enforce second, because rules without a wall are documentation. Diagnose last, because diagnosis only works on top of a corpus of real, labeled failures — which the first two arrows generate.
Most reliability tools sit on one arrow. We are building the loop.
Three lines to close the loop on your agent
Show, don't sell. If you have an agent calling Anthropic in production, three lines gets you trace, enforce, and the dashboard today:
```bash
pip install staso
```

```python
import staso as st
st.init(api_key="...", agent_name="my-agent")

from staso.integrations import patch_anthropic
patch_anthropic()
```

That's it. Every Anthropic call from your process now emits a trace. The dashboard renders it. Cost, latency, tool calls, conversation grouping all light up. Open the Guard tab, attach a policy with the rules you want, and the runtime firewall starts blocking on the next call. When something does fail, "What happened here?" on the trace runs self-heal end to end — root cause, suggested fix, optional PR.
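What a block looks like from inside your agent loop, sketched with the assumptions called out: the import path for GuardBlocked (shown here as the top-level staso package) and whether your tool calls run through the SDK's guard hook should both be checked against the docs.

```python
# Hedged sketch: handling an enforce-mode block inside an agent loop.
# Assumptions: the GuardBlocked import path, and that the tool call below
# goes through the SDK's guard hook. Verify both against the docs.
from staso import GuardBlocked  # assumed import path

def execute_tool(tool_name: str, tool_input: dict, tools: dict) -> dict:
    try:
        return {"ok": True, "result": tools[tool_name](**tool_input)}
    except GuardBlocked as blocked:
        # The tool never ran; the guard raised before the side effect.
        # Hand the block back to the model as the tool result so the agent
        # can re-plan instead of the whole run dying at 2 a.m.
        return {"ok": False, "error": f"blocked by policy: {blocked}"}
```

The useful part is the pattern: the block arrives as an exception before the side effect, and returning it as a tool result lets the agent re-plan rather than crash.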
Re-enforce ships in 2-3 weeks on the same SDK on the same install. No re-integration. The runtime firewall, the dashboard, and self-heal are already on the wire — re-enforce is the arrow that closes the loop, turning recurring diagnoses into one-click guard rules.
Free tier covers 10K traces per month. Docs at staso.ai/docs.
If you've ever read a beautiful trace of an $18,400 mistake, a force-pushed branch, or an rm -rf in the wrong directory, you know what the autopsy is worth at 2 a.m. The wall is in today, and self-heal will tell you why when something does slip through. Three lines and a policy. The re-enforce arrow is what we're building next.
Get the next post in your inbox.
One email when we publish. Postmortems, failure classes, and the instrumentation we wish more agent teams shipped with.