Verdicts

Once a run completes, every trace × rule combination produces one verdict.

verdicts = st.evals.verdicts(run.id)
for v in verdicts:
    if not v.passed:
        print(v.rule_name, v.trace_id, v.score, "—", v.reason)

Filter

Pass passed=True or passed=False to filter on the server. Use passed=False to triage failures:

fails = st.evals.verdicts(run.id, passed=False, limit=200)

Pagination uses limit (max 1000) and offset.
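To collect more than one page, the limit/offset scheme above can be wrapped in a small generator. This is a sketch rather than part of the SDK: iter_pages and its fetch callable are hypothetical names, and it assumes a page shorter than the requested limit means there are no more results.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_pages(
    fetch: Callable[[int, int], Sequence[T]],
    page_size: int = 1000,
) -> Iterator[T]:
    """Yield every item by walking limit/offset pages.

    `fetch(limit, offset)` is a hypothetical callable that returns one page.
    Iteration stops at the first page shorter than `page_size` (an assumption
    about how the end of the result set shows up).
    """
    if not 1 <= page_size <= 1000:  # the documented maximum limit is 1000
        raise ValueError("page_size must be between 1 and 1000")
    offset = 0
    while True:
        page = fetch(page_size, offset)
        yield from page
        if len(page) < page_size:
            return
        offset += page_size
```

Assuming st.evals.verdicts accepts offset as a keyword argument, usage might look like iter_pages(lambda limit, offset: st.evals.verdicts(run.id, passed=False, limit=limit, offset=offset)).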

Verdict shape

@dataclass(frozen=True)
class EvalVerdict:
    run_id: str
    trace_id: str
    guard_rule_id: str                  # empty for zero-config evals
    rule_name: str                      # e.g. "task_completion_rate" or your custom rule's name
    runtime: str                        # "programmatic" | "llm_judge" | "agentic" | "prompt"
    passed: bool
    score: float                        # 0.0–1.0 for passes, often >1 when the rule emits a magnitude
    reason: str
    agent_version: str
    latency_ms: int
    timestamp: datetime
    inspected: tuple[InspectedStep, ...]  # Agent Judge tool-call trace; empty otherwise

reason is the rule's own explanation. For LLM judge and agentic rules, this is the model's free text — useful for triage but not for parsing. For programmatic rules, it's whatever string the rule's evaluate(payload) returned.
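Since reason is free text for model-backed runtimes, one way to triage is to bucket failures by the structured fields and keep reason for display only. A minimal sketch, using a hypothetical Verdict stand-in that carries a subset of the EvalVerdict fields instead of the SDK class:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical stand-in with a subset of the EvalVerdict fields."""
    trace_id: str
    rule_name: str
    runtime: str
    passed: bool
    reason: str


def failures_by_rule(verdicts):
    """Group failed verdicts by rule_name without parsing `reason`."""
    grouped = defaultdict(list)
    for v in verdicts:
        if not v.passed:
            grouped[v.rule_name].append(v)
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        Verdict("t1", "hallucination", "llm_judge", False, "cites a missing file"),
        Verdict("t2", "latency_p95", "programmatic", False, "too slow"),
        Verdict("t3", "hallucination", "llm_judge", True, "grounded"),
    ]
    for rule, items in failures_by_rule(sample).items():
        # `reason` is shown to a human, never matched against patterns.
        print(rule, [(v.trace_id, v.reason) for v in items])
```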

What the Agent Judge inspected

For agentic rules, every verdict carries inspected — the ordered tool calls the judge made while reading the agent-run files, so a pass/fail is auditable instead of a black box. Empty for every other runtime, and empty when a deterministic precheck failed before the agent ran.

@dataclass(frozen=True)
class InspectedStep:
    turn: int      # 1-based turn in the judge's inspection loop
    tool: str      # "read_file" | "bash" | "get_files" | ...
    target: str    # the salient arg: file path, command, or glob

for v in st.evals.verdicts(run.id):
    for step in v.inspected:
        print(f"  turn {step.turn}: {step.tool} {step.target}")
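When auditing a verdict, it can help to see which targets the judge touched with each tool. The sketch below groups steps by tool in turn order. Step is a hypothetical stand-in that mirrors the InspectedStep fields, and targets_by_tool is an illustrative helper, not an SDK function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical stand-in mirroring the InspectedStep fields."""
    turn: int
    tool: str
    target: str


def targets_by_tool(steps):
    """Map each tool name to the targets it was called with, in turn order."""
    grouped = {}
    for step in sorted(steps, key=lambda s: s.turn):
        grouped.setdefault(step.tool, []).append(step.target)
    return grouped
```

Calling targets_by_tool(v.inspected).get("read_file", []) would then list the files the judge read for that verdict, assuming "read_file" is the tool name reported.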

Summary

st.evals.summary(run.id) aggregates verdicts for the whole run and for each rule:

summary = st.evals.summary(run.id)

print("overall:")
print(f"  {summary.passed}/{summary.total}  ({summary.pass_rate:.1%})")
print(f"  avg score {summary.avg_score:.2f}, avg latency {summary.avg_latency_ms:.0f}ms")

print("by rule:")
for r in summary.by_rule:
    print(f"  {r.rule_name:<30} {r.passed}/{r.total}  ({r.pass_rate:.1%})  avg {r.avg_score:.2f}")

avg_score and avg_latency_ms are averaged over all verdicts, passed and failed alike. pass_rate is passed / total.
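To make the aggregation concrete, the same numbers can be recomputed from a list of verdicts. This is a sketch of the arithmetic only, not the server's implementation: Verdict is a hypothetical stand-in, and returning zeros for an empty run is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical stand-in with only the fields the aggregation needs."""
    passed: bool
    score: float
    latency_ms: int


def summarize(verdicts):
    """Recompute pass rate and averages over all verdicts, passed or failed."""
    total = len(verdicts)
    if total == 0:
        # Assumption: an empty run reports zeros instead of raising.
        return {"passed": 0, "total": 0, "pass_rate": 0.0,
                "avg_score": 0.0, "avg_latency_ms": 0.0}
    passed = sum(1 for v in verdicts if v.passed)
    return {
        "passed": passed,
        "total": total,
        "pass_rate": passed / total,
        "avg_score": sum(v.score for v in verdicts) / total,
        "avg_latency_ms": sum(v.latency_ms for v in verdicts) / total,
    }
```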

What "passed" means

Every rule produces a binary passed and a score. The threshold for passed is rule-defined, not a global cutoff. Examples from the zero-config catalog:

  • task_completion_rate: passed=True iff trace.status == "ok".
  • latency_p95: passed=True when duration_ms <= 30000. Score is duration_ms / 30000.
  • cost_per_task: passed=True when total_tokens <= 50000. Score escalates with cost.
  • hallucination (LLM judge): the judge model decides; score is the judge's confidence.

For your own rules, the verdict envelope you return decides — see Custom rules.
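As a worked example of a rule-defined threshold, the latency_p95 description above can be written out as plain arithmetic. The sketch below only restates that description. latency_verdict is a hypothetical helper, not the catalog's actual implementation.

```python
LATENCY_BUDGET_MS = 30_000  # threshold quoted for latency_p95 above


def latency_verdict(duration_ms: float) -> tuple[bool, float]:
    """Return (passed, score) following the latency_p95 description.

    passed is True when the trace finished within the budget; score is the
    fraction of the budget used, so it goes above 1.0 on a failure.
    """
    passed = duration_ms <= LATENCY_BUDGET_MS
    score = duration_ms / LATENCY_BUDGET_MS
    return passed, score


if __name__ == "__main__":
    for ms in (15_000, 30_000, 45_000):
        print(ms, latency_verdict(ms))
```

A 15 s trace passes with score 0.5, a trace at exactly 30 s still passes with score 1.0, and a 45 s trace fails with score 1.5. This is why a score above 1 usually signals a magnitude, not a confidence.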

Drill into a verdict

Verdicts on the dashboard link directly to the source trace and the heal sandbox. To do the same from code:

trace_url = f"https://staso.ai/observability/traces/{v.trace_id}"
heal_url  = f"https://staso.ai/heal?trace_id={v.trace_id}"
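If trace IDs can ever contain characters that are unsafe in a URL (an assumption; they may always be URL-safe), the two links can be built with escaping. verdict_links is a hypothetical helper. The URL patterns are the ones shown above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://staso.ai"


def verdict_links(trace_id: str) -> dict[str, str]:
    """Build the trace and heal URLs for a verdict, escaping the trace ID."""
    return {
        # Path segment: escape everything, including "/".
        "trace": f"{BASE_URL}/observability/traces/{quote(trace_id, safe='')}",
        # Query string: let urlencode handle the escaping.
        "heal": f"{BASE_URL}/heal?{urlencode({'trace_id': trace_id})}",
    }
```

For a plain alphanumeric ID this produces the same strings as the f-strings above.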