Four products. One loop. Compounding from day one.
One click turns a guard violation into an eval rule that scores every future run. A failing verdict opens self-heal at the failing commit. Same schema, same authoring surface, same lineage: single-product competitors can't match the compounding.
Trace → guard → eval → self-heal.
Four arrows, one canonical direction. Every arrow is one click on the dashboard, one method on the SDK. Nothing custom to wire, nothing to remember.
Run lands
Spans, tool calls, and conversation, captured automatically by the SDK or CLI.
Block in real time
Static and LLM judges enforce at the tool boundary, in under 50ms, before a bad action ships.
Score after the fact
Async judges read each run and return a pass or fail verdict, a score, and the reason.
Diagnose root cause
A sandbox replays the failing run and returns a root-cause diagnosis.
The receipts.
The arrows aren't marketing. Every product produces data another product reads — same schema, same auth, same UI patterns.
Every run, every span, every tool call.
Feeds: guard inputs, eval scope, heal evidence.
Spans, tool calls, conversation, artifacts. One SDK call per agent, or one CLI install for Claude Code and Codex. Nothing else to wire.
Block at the tool boundary.
Feeds: eval rule candidates, heal triggers.
Static judges, LLM judges, and your custom rules run in under 50ms, before the tool fires. A blocked call writes a guard_violation, and the rule that fired is one click from being ported to eval.
Score after the fact.
Feeds: guard rule promotions, drift detection, heal targets.
Async judges read the trace, the artifacts, and the conversation. A failing verdict is one click from a guard rule (now blocking in real time) and one click from a heal run (now diagnosing root cause).
Diagnose root cause. Open the PR.
Feeds: learned patterns, future rule candidates.
Sandbox runs read your repo at the failing commit, plus pre-loaded eval and guard history. The output is a root-cause diagnosis and a fix PR; your CI stays in charge of the merge.
Three structural reasons. Not promises.
Same data plane.
Spans, rules, verdicts, violations, artifacts — one schema, one auth boundary. Nothing to ETL. Nothing to keep in sync.
Same authoring surface.
Write a rule once. Run it as guard (real-time block), as eval (offline judge), or as both. Promote either direction without re-implementing the logic.
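To make "write once, run as guard or eval" concrete, here is a minimal Python sketch. Every name in it (Rule, ToolCall, Verdict, as_guard, as_eval, no_destructive_shell) is hypothetical and is not the product's SDK; it only illustrates how one check function can back both a real-time block and an offline verdict.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical shapes for illustration only; not the product's actual SDK.
@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class Verdict:
    passed: bool
    score: float
    reason: str

@dataclass
class Rule:
    rule_id: str
    # The single piece of logic: return None if the call is fine, else a reason.
    check: Callable[[ToolCall], Optional[str]]

    def as_guard(self, call: ToolCall, violations: List[dict]) -> bool:
        """Real-time mode: True allows the tool call, False blocks it and records a violation."""
        reason = self.check(call)
        if reason is not None:
            violations.append({"rule_id": self.rule_id, "tool": call.name, "reason": reason})
            return False
        return True

    def as_eval(self, trace: List[ToolCall]) -> Verdict:
        """Offline mode: score a finished run's tool calls with the same check."""
        if not trace:
            return Verdict(True, 1.0, "no tool calls")
        failures = []
        for call in trace:
            reason = self.check(call)
            if reason is not None:
                failures.append(reason)
        score = 1 - len(failures) / len(trace)
        return Verdict(not failures, score, failures[0] if failures else "all calls passed")

# Hypothetical example rule: flag destructive shell commands.
def no_destructive_shell(call: ToolCall) -> Optional[str]:
    if call.name == "shell" and "rm -rf" in call.args.get("cmd", ""):
        return "destructive shell command"
    return None

rule = Rule("no-rm-rf", no_destructive_shell)

# Guard mode: the same rule blocks a single call before it runs.
violations: List[dict] = []
allowed = rule.as_guard(ToolCall("shell", {"cmd": "rm -rf /tmp/x"}), violations)

# Eval mode: the same rule scores a completed run after the fact.
trace = [ToolCall("shell", {"cmd": "ls"}), ToolCall("shell", {"cmd": "rm -rf build"})]
verdict = rule.as_eval(trace)
```

Here the blocked call leaves one violation record, and the two-call trace scores 0.5 with the failing reason attached. Keeping the logic in a single `check` function is what makes promotion in either direction a matter of changing the mode, not rewriting the rule.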
Same lineage everywhere.
Every rule shows where it came from (eval verdict, guard violation, manual). Every verdict links back to the failing trace. Every fix PR cites the violation it closed.
Your first three weeks.
The loop card on your dashboard tracks the same four arrows in live numbers — traces, verdicts, diagnoses, blocks — for your own org. No demo data, no canned screenshots.
Trace + first verdicts
Install the SDK, attach a zero-config eval, see verdicts arrive within minutes of your first agent run.
First promoted rule
Pick an eval that consistently catches a real failure, promote the rule to guard, watch it block in real time.
Compounding
Every blocked tool call writes a violation, every violation is one click from a sharper rule, and the loop tightens itself.
One install. The whole loop.
Wire the SDK in a minute. Attach a rule. Verdicts arrive. Promote one to guard. Watch it block bad calls in real time. The compounding starts on day one, not after a quarter of integration.