Eval

Eval scores traces after the fact. Pick a rule (built-in or custom), point it at a trace, a session, a dataset, an agent, or a time range, and get a verdict per trace.

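import time

import staso as st

st.init(api_key="ak_...", agent_name="checkout-agent")

# Score every trace in one session with two built-in rules.
run = st.evals.run(
    scope={"session_id": "run-2026-05-10-001"},
    zero_configs=["task_completion_rate", "hallucination"],
    name="run-001 eval",
)

# Poll until the run reaches a terminal state; sleep between checks
# so the loop does not spin and hammer the API.
while st.evals.get(run.id).status not in ("completed", "failed", "cancelled"):
    time.sleep(2)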

summary = st.evals.summary(run.id)
print(summary.passed, "/", summary.total, "—", summary.pass_rate)
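
The polling loop above waits indefinitely. For longer runs it may help to bound the wait and back off between checks. The following is a minimal sketch, not part of the staso SDK: `wait_for_terminal` is a hypothetical helper that accepts any zero-argument callable returning a status string. The terminal status names come from the example above; the timeout and delay values are illustrative assumptions.

```python
import time
from typing import Callable

# Terminal statuses as they appear in the polling example above.
TERMINAL = ("completed", "failed", "cancelled")


def wait_for_terminal(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal status or the timeout elapses.

    Hypothetical helper, not part of the staso SDK.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_s} seconds")
        sleep(delay)
        # Double the delay after each check, capped at max_delay_s.
        delay = min(delay * 2, max_delay_s)
```

With the SDK initialised, a call might look like `wait_for_terminal(lambda: st.evals.get(run.id).status)`. The returned status lets you handle "failed" or "cancelled" before asking for the summary.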

Why Eval

Guard catches problems before a tool fires. Eval catches problems after a run ends — the kind that need the full output to judge: did the agent finish the task, was the answer right, are the produced files complete, did the new prompt regress.

Eval and Guard share rule storage. A custom rule you wrote for Guard can also score eval runs. Promote an eval-only rule to Guard the moment a pattern is worth blocking in real time.

What ships

- 9 zero-config rules
  - Programmatic: task_completion_rate, error_rate, latency_p95, cost_per_task, tool_sequence_anomaly.
  - LLM judge: hallucination, false_completion, sentiment, escalation.
- Custom rules, authored in the dashboard
  - prompt: an LLM judge that follows your instruction.
  - programmatic: your own Python.
  - agentic: Agent Judge, in which a fresh agent inspects the files of each agent-run dataset row against your rubric, gated by deterministic checks.
- Five scopes: trace_id, session_id, dataset_id, agent_id (with sample_pct), time_range (with sample_pct).
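
To make the five scope shapes concrete, here is a rough sketch of what each might look like as a `scope` dict, with a small client-side check. Only the key names and the pairing of `sample_pct` with `agent_id` and `time_range` come from the list above. Everything else is an assumption for illustration: `check_scope` is a hypothetical helper (not part of the staso SDK), `sample_pct` is assumed to be a sibling key holding a percentage between 0 and 100, and the `time_range` layout and all IDs are made up.

```python
from typing import Any, Dict

# Scope keys from the list above; the payload layout is an assumption.
SCOPE_KEYS = {"trace_id", "session_id", "dataset_id", "agent_id", "time_range"}
# Scopes that the list pairs with sample_pct.
SAMPLED_KEYS = {"agent_id", "time_range"}


def check_scope(scope: Dict[str, Any]) -> str:
    """Return the scope kind, or raise ValueError for a malformed scope.

    Hypothetical client-side check, not part of the staso SDK.
    """
    kinds = SCOPE_KEYS.intersection(scope)
    if len(kinds) != 1:
        raise ValueError(f"expected exactly one scope key, got {sorted(kinds)}")
    kind = next(iter(kinds))
    extra = set(scope) - {kind, "sample_pct"}
    if extra:
        raise ValueError(f"unknown keys: {sorted(extra)}")
    if "sample_pct" in scope:
        if kind not in SAMPLED_KEYS:
            raise ValueError(f"sample_pct is not listed for {kind}")
        # Assumed range: a percentage greater than 0 and at most 100.
        if not 0 < scope["sample_pct"] <= 100:
            raise ValueError("sample_pct must be greater than 0 and at most 100")
    return kind


# Illustrative values only; the IDs and the time_range layout are made up.
examples = [
    {"trace_id": "trace-123"},
    {"session_id": "run-2026-05-10-001"},
    {"dataset_id": "dataset-abc"},
    {"agent_id": "checkout-agent", "sample_pct": 10},
    {
        "time_range": {"start": "2026-05-01T00:00:00Z", "end": "2026-05-08T00:00:00Z"},
        "sample_pct": 5,
    },
]
for scope in examples:
    print(check_scope(scope))
```

Sampling is attached only to the two scopes that can match an open-ended number of traces, which is presumably why the list pairs `sample_pct` with `agent_id` and `time_range` alone.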

Next

Runs: create and poll runs, all scope shapes.
Verdicts: per-trace verdicts and per-run summaries.
Agent Judge: score multi-file agent runs with a fresh agent and a deterministic gate.
Compare: A/B two runs and see which verdicts flipped.
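
As a rough illustration of what a per-run summary and a "flip" between two runs mean, the sketch below works on plain dicts rather than the SDK. It assumes each run's verdicts can be reduced to a mapping from a shared item ID to pass or fail. The function names, the data layout, and the sample IDs are all hypothetical and do not reflect the actual Compare API.

```python
from typing import Dict, List


def pass_rate(verdicts: Dict[str, bool]) -> float:
    """Fraction of items that passed; 0.0 for an empty run."""
    return sum(verdicts.values()) / len(verdicts) if verdicts else 0.0


def find_flips(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Dict[str, List[str]]:
    """Group items scored in both runs by how their verdict changed."""
    flips: Dict[str, List[str]] = {"regressed": [], "fixed": []}
    # Only items present in both runs can flip.
    for item_id in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[item_id], candidate[item_id]
        if before and not after:
            flips["regressed"].append(item_id)
        elif after and not before:
            flips["fixed"].append(item_id)
    return flips


# Made-up verdicts for two runs over overlapping items.
run_a = {"t1": True, "t2": True, "t3": False}
run_b = {"t1": True, "t2": False, "t3": True, "t4": True}
print(find_flips(run_a, run_b))
```

Looking at flips rather than only the two pass rates matters because a candidate run can keep the same pass rate while trading one regression for one fix.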
