Verdicts
Once a run completes, every trace × rule combination produces one verdict.
verdicts = st.evals.verdicts(run.id)
for v in verdicts:
    if not v.passed:
        print(v.rule_name, v.trace_id, v.score, "—", v.reason)
Filter
passed=True and passed=False filter server-side. Use this to triage:
fails = st.evals.verdicts(run.id, passed=False, limit=200)
Pagination uses limit (max 1000) and offset.
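To walk a result set larger than one page, limit and offset can be combined in a loop. The following is a minimal sketch, not part of the SDK: iter_pages and fetch_page are hypothetical names, and the loop assumes that a page shorter than limit means the last page has been reached.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_pages(
    fetch_page: Callable[[int, int], Sequence[T]],
    page_size: int = 1000,  # the documented maximum for limit
) -> Iterator[T]:
    """Yield items page by page until a page comes back short.

    fetch_page(limit, offset) is a hypothetical callable wrapping the real
    request; a short page is assumed to mean nothing is left to fetch.
    """
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        if len(page) < page_size:
            return
        offset += len(page)


# Hypothetical usage with the client on this page (assumes verdicts() returns a list):
# fails = list(iter_pages(
#     lambda limit, offset: st.evals.verdicts(
#         run.id, passed=False, limit=limit, offset=offset
#     )
# ))
```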
Verdict shape
@dataclass(frozen=True)
class EvalVerdict:
    run_id: str
    trace_id: str
    guard_rule_id: str  # empty for zero-config evals
    rule_name: str  # e.g. "task_completion_rate" or your custom rule's name
    runtime: str  # "programmatic" | "llm_judge" | "agentic" | "prompt"
    passed: bool
    score: float  # 0.0–1.0 for passes, often >1 when the rule emits a magnitude
    reason: str
    agent_version: str
    latency_ms: int
    timestamp: datetime
    inspected: tuple[InspectedStep, ...]  # Agent Judge tool-call trace; empty otherwise
reason is the rule's own explanation. For LLM judge and agentic rules, this is the model's free text, useful for triage but not for parsing. For programmatic rules, it's whatever string the rule's evaluate(payload) returned.
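Because reason is not meant to be parsed, triage is easier to build on the structured fields. The sketch below counts failed verdicts per rule and runtime. VerdictStub is a stand-in that carries only the EvalVerdict fields the sketch reads, failures_by_rule is a hypothetical helper, and the sample data is made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class VerdictStub:
    """Stand-in with only the EvalVerdict fields this sketch reads."""
    rule_name: str
    runtime: str
    passed: bool


def failures_by_rule(verdicts: Iterable[VerdictStub]) -> Counter:
    """Count failed verdicts per (rule_name, runtime) pair."""
    return Counter((v.rule_name, v.runtime) for v in verdicts if not v.passed)


# Made-up sample; real verdicts would come from st.evals.verdicts(run.id).
sample = [
    VerdictStub("task_completion_rate", "programmatic", False),
    VerdictStub("hallucination", "llm_judge", False),
    VerdictStub("hallucination", "llm_judge", False),
    VerdictStub("hallucination", "llm_judge", True),
]

# Most frequent failures first.
for (rule, runtime), count in failures_by_rule(sample).most_common():
    print(f"{rule} ({runtime}): {count} failed")
```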
What the Agent Judge inspected
For agentic rules, every verdict carries inspected: the ordered tool calls the judge made while reading the agent-run files, so a pass/fail is auditable instead of a black box. The field is empty for every other runtime, and also when a deterministic precheck failed before the agent ran.
@dataclass(frozen=True)
class InspectedStep:
    turn: int  # 1-based turn in the judge's inspection loop
    tool: str  # "read_file" | "bash" | "get_files" | ...
    target: str  # the salient arg: file path, command, or glob
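One way to audit a verdict is to list which files the judge actually opened. The sketch below is an illustration only: InspectedStepStub mirrors the InspectedStep fields shown above, files_read is a hypothetical helper, and the sample steps are made up.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class InspectedStepStub:
    """Stand-in mirroring the InspectedStep fields shown above."""
    turn: int
    tool: str
    target: str


def files_read(steps: Iterable[InspectedStepStub]) -> list[str]:
    """Return distinct read_file targets in the order the judge first read them."""
    seen: dict[str, None] = {}  # dict preserves insertion order and drops repeats
    for step in sorted(steps, key=lambda s: s.turn):
        if step.tool == "read_file":
            seen.setdefault(step.target, None)
    return list(seen)


# Made-up sample; real steps would come from v.inspected on an agentic verdict.
steps = (
    InspectedStepStub(1, "get_files", "src/**/*.py"),
    InspectedStepStub(2, "read_file", "src/app.py"),
    InspectedStepStub(3, "bash", "pytest -q"),
    InspectedStepStub(4, "read_file", "src/app.py"),
    InspectedStepStub(5, "read_file", "tests/test_app.py"),
)
print(files_read(steps))
```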
for v in st.evals.verdicts(run.id):
    for step in v.inspected:
        print(f" turn {step.turn}: {step.tool} {step.target}")
Summary
st.evals.summary(run.id) aggregates the verdicts for the whole run and per-rule:
summary = st.evals.summary(run.id)
print("overall:")
print(f" {summary.passed}/{summary.total} ({summary.pass_rate:.1%})")
print(f" avg score {summary.avg_score:.2f}, avg latency {summary.avg_latency_ms:.0f}ms")
print("by rule:")
for r in summary.by_rule:
    print(f" {r.rule_name:<30} {r.passed}/{r.total} ({r.pass_rate:.1%}) avg {r.avg_score:.2f}")
avg_score and avg_latency_ms are averaged over all verdicts, passed and failed alike. Pass rate is passed / total.
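To make the aggregation concrete, the same figures can be recomputed from a list of verdicts. This is a sketch of the arithmetic described above, not the service's implementation: ScoredVerdictStub and summarize are hypothetical names, the sample values are made up, and returning zeros for an empty run is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScoredVerdictStub:
    """Stand-in with only the EvalVerdict fields the aggregation needs."""
    passed: bool
    score: float
    latency_ms: int


def summarize(verdicts: Sequence[ScoredVerdictStub]) -> dict:
    """Aggregate over all verdicts, passed and failed alike."""
    total = len(verdicts)
    if total == 0:
        # Assumption: an empty run reports zeros rather than raising.
        return {"passed": 0, "total": 0, "pass_rate": 0.0,
                "avg_score": 0.0, "avg_latency_ms": 0.0}
    passed = sum(1 for v in verdicts if v.passed)
    return {
        "passed": passed,
        "total": total,
        "pass_rate": passed / total,
        "avg_score": sum(v.score for v in verdicts) / total,
        "avg_latency_ms": sum(v.latency_ms for v in verdicts) / total,
    }


# Made-up sample: two passes and two fails, one fail with a score above 1.
sample = [
    ScoredVerdictStub(True, 1.0, 100),
    ScoredVerdictStub(True, 0.5, 200),
    ScoredVerdictStub(False, 0.0, 300),
    ScoredVerdictStub(False, 1.5, 400),
]
print(summarize(sample))
```

Failed verdicts are included in both averages, so one slow or high-magnitude failure moves avg_latency_ms and avg_score for the whole run.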
What "passed" means
Every rule produces a binary passed and a score. The threshold for passed is rule-defined, not a global cutoff. Examples from the zero-config catalog:
- task_completion_rate: passed=True iff trace.status == "ok".
- latency_p95: passed=True when duration_ms <= 30000. Score is duration_ms / 30000.
- cost_per_task: passed=True when total_tokens <= 50000. Score escalates with cost.
- hallucination (LLM judge): the judge model decides; score is the judge's confidence.
For your own rules, the verdict envelope you return decides whether a verdict passes; see Custom rules.
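As a worked example of a rule-defined threshold, the latency_p95 description above can be written out as a small function. This is a sketch based only on the numbers quoted in the list, not the catalog's actual code; latency_verdict and LATENCY_BUDGET_MS are hypothetical names.

```python
LATENCY_BUDGET_MS = 30_000  # threshold quoted for latency_p95 in the list above


def latency_verdict(duration_ms: float) -> tuple[bool, float]:
    """Return (passed, score) for one trace measured against the latency budget."""
    passed = duration_ms <= LATENCY_BUDGET_MS
    score = duration_ms / LATENCY_BUDGET_MS  # exceeds 1.0 once the budget is blown
    return passed, score


for ms in (15_000, 30_000, 45_000):
    print(ms, latency_verdict(ms))
```

Under these assumptions a trace at half the budget scores 0.5 and passes, while one at 45000 ms scores 1.5 and fails, which is one way a score above 1 can arise when a rule emits a magnitude.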
Drill into a verdict
Verdicts on the dashboard link directly to the source trace and the heal sandbox. To do the same from code:
trace_url = f"https://staso.ai/observability/traces/{v.trace_id}"
heal_url = f"https://staso.ai/heal?trace_id={v.trace_id}"