Compare

st.evals.compare(run_a_id, run_b_id) answers the question "did the prompt change help?" It takes two eval runs and reports per-rule deltas, trace coverage, and per-trace flips.

v1 = st.evals.run(
    scope={"agent_id": "support-agent", "sample_pct": 10},
    zero_configs=["task_completion_rate", "hallucination", "sentiment"],
    name="v1 baseline",
    agent_version="v1",
)

v2 = st.evals.run(
    scope={"agent_id": "support-agent", "sample_pct": 10},
    zero_configs=["task_completion_rate", "hallucination", "sentiment"],
    name="v2 candidate",
    agent_version="v2",
    compare_with=v1.id,
)

# … wait for both to complete …

cmp = st.evals.compare(v1.id, v2.id)

for r in cmp.per_rule:
    arrow = "↑" if r.delta_pass_rate > 0 else "↓" if r.delta_pass_rate < 0 else "·"
    print(f"  {arrow} {r.rule_name:<30} {r.a_pass_rate:.1%} → {r.b_pass_rate:.1%}")

print(f"coverage: both={cmp.coverage.both} a_only={cmp.coverage.a_only} b_only={cmp.coverage.b_only}")
print(f"flips: {len(cmp.flips)}")

Per-rule deltas

cmp.per_rule is a list of EvalCompareRuleDelta. For each rule that ran in either run, you get totals, pass counts, pass rates, average scores, and the deltas, computed as B − A. A positive delta means run B scored higher than run A.

@dataclass(frozen=True)
class EvalCompareRuleDelta:
    rule_name: str
    runtime: str
    a_total: int
    b_total: int
    a_passed: int
    b_passed: int
    a_pass_rate: float
    b_pass_rate: float
    delta_pass_rate: float
    a_avg_score: float
    b_avg_score: float
    delta_avg_score: float

Sort by abs(delta_pass_rate) to surface the biggest movers.
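
A minimal sketch of that sort. It uses a stand-in dataclass (RuleDelta, a hypothetical name) carrying only the fields needed here, and made-up sample values. In real use you would pass cmp.per_rule directly. The biggest_movers helper is illustrative and is not part of the SDK.

```python
from dataclasses import dataclass


# Stand-in for EvalCompareRuleDelta with only the fields used below.
# In real use, pass cmp.per_rule instead of the sample list.
@dataclass(frozen=True)
class RuleDelta:
    rule_name: str
    a_pass_rate: float
    b_pass_rate: float
    delta_pass_rate: float  # B - A


def biggest_movers(per_rule, top_n=5):
    """Return the rules with the largest absolute pass-rate change, largest first."""
    ranked = sorted(per_rule, key=lambda r: abs(r.delta_pass_rate), reverse=True)
    return ranked[:top_n]


# Illustrative values only, not real eval output.
sample = [
    RuleDelta("task_completion_rate", 0.80, 0.85, 0.05),
    RuleDelta("hallucination", 0.90, 0.75, -0.15),
    RuleDelta("sentiment", 0.70, 0.71, 0.01),
]

movers = biggest_movers(sample, top_n=2)
for r in movers:
    print(f"{r.rule_name:<30} {r.delta_pass_rate:+.1%}")
```

Sorting on the absolute value keeps regressions and improvements in one ranking, so a large drop is not buried under small gains.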

Coverage

cmp.coverage answers "are these two runs even comparable":

Field    Meaning
both     Trace IDs that both runs evaluated. The honest comparison set.
a_only   Traces only A saw. Either A used a wider scope, or sampling diverged.
b_only   Mirror of a_only.

If both is small, the comparison is noisy. Re-run with the same scope and the same sample_pct to fix this — sampling is deterministic, so identical scopes hit identical traces.
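
One way to make "small" concrete is to check the overlap before trusting the deltas. The sketch below assumes the three coverage fields are integer counts, as the print statement in the first example suggests. The helper names and both thresholds are hypothetical and are not Staso defaults.

```python
from types import SimpleNamespace


def overlap_fraction(coverage):
    """Share of all evaluated traces that both runs saw (0.0 if nothing was evaluated)."""
    total = coverage.both + coverage.a_only + coverage.b_only
    return coverage.both / total if total else 0.0


def is_comparable(coverage, min_both=50, min_overlap=0.9):
    # Both thresholds are arbitrary illustrative values; tune them to your traffic.
    return coverage.both >= min_both and overlap_fraction(coverage) >= min_overlap


# Stand-in for cmp.coverage, assuming the fields are counts (an assumption).
coverage = SimpleNamespace(both=190, a_only=6, b_only=4)

print(f"overlap: {overlap_fraction(coverage):.1%}")
print(f"comparable: {is_comparable(coverage)}")
```

If the check fails, the re-run described above (same scope, same sample_pct) is the fix, rather than loosening the thresholds.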

Flips

cmp.flips lists traces that the same rule judged differently between the runs.

for f in cmp.flips:
    direction = "regressed" if f.a_passed and not f.b_passed else "fixed"
    print(f"  {direction:<10} {f.rule_name:<30} {f.trace_id}  {f.a_score:.2f} → {f.b_score:.2f}")

Fields:

@dataclass(frozen=True)
class EvalCompareFlip:
    trace_id: str
    rule_name: str
    a_passed: bool
    b_passed: bool
    a_score: float
    b_score: float

Flips are the highest-signal output of a compare. Open the trace, see what changed.
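
When there are many flips, a per-rule tally can help decide which traces to open first. This is a rough sketch using a stand-in dataclass (Flip, a hypothetical name) and made-up trace IDs. In real use you would pass cmp.flips. The tally_flips helper is illustrative and is not part of the SDK.

```python
from collections import Counter
from dataclasses import dataclass


# Stand-in for EvalCompareFlip with only the fields used below.
@dataclass(frozen=True)
class Flip:
    trace_id: str
    rule_name: str
    a_passed: bool
    b_passed: bool


def tally_flips(flips):
    """Count flips per (rule_name, direction), where direction is 'regressed' or 'fixed'."""
    counts = Counter()
    for f in flips:
        # Same classification as the loop above: passed in A but not in B is a regression.
        direction = "regressed" if f.a_passed and not f.b_passed else "fixed"
        counts[(f.rule_name, direction)] += 1
    return counts


# Illustrative values only; the trace IDs are invented.
sample = [
    Flip("trace-001", "hallucination", True, False),
    Flip("trace-002", "hallucination", True, False),
    Flip("trace-003", "hallucination", False, True),
    Flip("trace-004", "sentiment", False, True),
]

tally = tally_flips(sample)
for (rule, direction), n in sorted(tally.items()):
    print(f"{rule:<30} {direction:<10} {n}")
```

A rule with many regressions and few fixes is usually the first place to look.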

Pattern

For prompt or model changes, the recipe is:

  1. Run baseline against a representative scope + sample_pct.
  2. Ship the change. Set agent_version on st.init so new traces are tagged.
  3. Run the candidate against the same scope + sample_pct.
  4. Compare. Triage flips.

For continuous regression checks, attach the rules to the agent in the dashboard with trigger_mode=on_session_end — every session feeds an evaluated bucket per agent_version automatically. The drift card on the agent detail page renders the same compare view across versions.