Eval
Eval scores traces after the fact. Pick a rule (built-in or custom), point it at a trace, a session, a dataset, an agent, or a time range, and get a verdict per trace.
import time

import staso as st

st.init(api_key="ak_...", agent_name="checkout-agent")

run = st.evals.run(
    scope={"session_id": "run-2026-05-10-001"},
    zero_configs=["task_completion_rate", "hallucination"],
    name="run-001 eval",
)

# Poll until the run reaches a terminal state; sleep between checks
# instead of busy-waiting on the API.
while st.evals.get(run.id).status not in ("completed", "failed", "cancelled"):
    time.sleep(2)

summary = st.evals.summary(run.id)
print(summary.passed, "/", summary.total, "—", summary.pass_rate)

Why Eval
Guard catches problems before a tool fires. Eval catches problems after a run ends — the kind that need the full output to judge: did the agent finish the task, was the answer right, are the produced files complete, did the new prompt regress.
Eval and Guard share rule storage. A custom rule you wrote for Guard can also score eval runs. Promote an eval-only rule to Guard the moment a pattern is worth blocking in real time.
What ships
- 9 zero-config rules
  - Programmatic: task_completion_rate, error_rate, latency_p95, cost_per_task, tool_sequence_anomaly.
  - LLM judge: hallucination, false_completion, sentiment, escalation.
- Custom rules: author in the dashboard. Three types: prompt (LLM judge with your instruction), programmatic (your Python), agentic (Agent Judge: a fresh agent inspects the files of each agent-run dataset row against your rubric, gated by deterministic checks).
- Five scopes: trace_id, session_id, dataset_id, agent_id (with sample_pct), time_range (with sample_pct).
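To make the five scope shapes concrete, here is a minimal sketch that builds one scope dict per shape. It does not call Staso. The helper make_scope is hypothetical, and placing sample_pct inside the scope dict and passing time_range as a start/end pair are assumptions, not confirmed API; check the Runs page for the real shapes.

```python
from typing import Any, Optional

# Scope keys named on this page. Only agent_id and time_range are listed
# with sample_pct; everything else below is an assumption for illustration.
SCOPE_KEYS = ("trace_id", "session_id", "dataset_id", "agent_id", "time_range")
SAMPLED_KEYS = ("agent_id", "time_range")


def make_scope(key: str, value: Any, sample_pct: Optional[float] = None) -> dict:
    """Build a scope dict with exactly one scope key (hypothetical helper)."""
    if key not in SCOPE_KEYS:
        raise ValueError(f"unknown scope key: {key!r}")
    scope = {key: value}
    if sample_pct is not None:
        if key not in SAMPLED_KEYS:
            raise ValueError(f"sample_pct is not listed for {key!r}")
        if not 0 < sample_pct <= 100:
            raise ValueError("sample_pct must be in (0, 100]")
        # Assumption: sample_pct sits next to the scope key in the same dict.
        scope["sample_pct"] = sample_pct
    return scope


examples = [
    make_scope("trace_id", "trace-123"),
    make_scope("session_id", "run-2026-05-10-001"),
    make_scope("dataset_id", "dataset-abc"),
    make_scope("agent_id", "checkout-agent", sample_pct=10),
    # Assumption: time_range is a start/end pair of ISO 8601 timestamps.
    make_scope(
        "time_range",
        {"start": "2026-05-01T00:00:00Z", "end": "2026-05-10T00:00:00Z"},
        sample_pct=5,
    ),
]

for scope in examples:
    print(scope)
```

Under these assumptions, each dict would be passed as the scope argument of st.evals.run, in the same position as the session_id scope in the example at the top of the page.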
Next
- Runs — create + poll runs, all scope shapes.
- Verdicts — per-trace verdicts and per-run summaries.
- Agent Judge — score multi-file agent runs with a fresh agent + deterministic gate.
- Compare — A/B two runs, see flips.
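
A "flip" in Compare can be read as an item whose verdict differs between two runs. The sketch below illustrates the idea on plain dictionaries. The function find_flips and the key-to-bool verdict shape are hypothetical and are not the Staso API; the keys stand for whatever identifier the two runs share, for example a dataset row.

```python
from typing import Dict, List, Tuple


def find_flips(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Tuple[List[str], List[str]]:
    """Return (regressions, fixes) for items scored in both runs."""
    # Only items present in both runs can flip.
    shared = sorted(baseline.keys() & candidate.keys())
    # Regression: passed in the baseline run, fails in the candidate run.
    regressions = [k for k in shared if baseline[k] and not candidate[k]]
    # Fix: failed in the baseline run, passes in the candidate run.
    fixes = [k for k in shared if not baseline[k] and candidate[k]]
    return regressions, fixes


# Hypothetical verdicts: True means the item passed the rule.
baseline = {"row-1": True, "row-2": True, "row-3": False}
candidate = {"row-1": True, "row-2": False, "row-3": True, "row-4": True}

regressions, fixes = find_flips(baseline, candidate)
print("regressions:", regressions)  # ['row-2']
print("fixes:", fixes)              # ['row-3']
```

Items that appear in only one run (row-4 here) are ignored, since there is no earlier verdict to flip from.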