Compare
st.evals.compare(run_a_id, run_b_id) is the answer to "did the prompt change help."
```python
v1 = st.evals.run(
    scope={"agent_id": "support-agent", "sample_pct": 10},
    zero_configs=["task_completion_rate", "hallucination", "sentiment"],
    name="v1 baseline",
    agent_version="v1",
)
v2 = st.evals.run(
    scope={"agent_id": "support-agent", "sample_pct": 10},
    zero_configs=["task_completion_rate", "hallucination", "sentiment"],
    name="v2 candidate",
    agent_version="v2",
    compare_with=v1.id,
)
# … wait for both to complete …
cmp = st.evals.compare(v1.id, v2.id)
for r in cmp.per_rule:
    arrow = "↑" if r.delta_pass_rate > 0 else "↓" if r.delta_pass_rate < 0 else "·"
    print(f" {arrow} {r.rule_name:<30} {r.a_pass_rate:.1%} → {r.b_pass_rate:.1%}")
print(f"coverage: both={cmp.coverage.both} a_only={cmp.coverage.a_only} b_only={cmp.coverage.b_only}")
print(f"flips: {len(cmp.flips)}")
```

Per-rule deltas
cmp.per_rule is a list of EvalCompareRuleDelta. For each rule that ran in either run, you get pass-rate, average score, and the deltas (B − A):
```python
@dataclass(frozen=True)
class EvalCompareRuleDelta:
    rule_name: str
    runtime: str
    a_total: int
    b_total: int
    a_passed: int
    b_passed: int
    a_pass_rate: float
    b_pass_rate: float
    delta_pass_rate: float
    a_avg_score: float
    b_avg_score: float
    delta_avg_score: float
```

Sort by abs(delta_pass_rate) to surface the biggest movers.
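Sorting by absolute delta could look roughly like the sketch below. It does not call the SDK: `RuleDelta` is a hypothetical stand-in carrying only the fields the sort needs, `biggest_movers` is an illustrative helper name, and the numbers are made up. With a real compare result, passing `cmp.per_rule` to the same helper should work as long as each item exposes `delta_pass_rate`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleDelta:
    """Hypothetical stand-in with a subset of the EvalCompareRuleDelta fields."""
    rule_name: str
    a_pass_rate: float
    b_pass_rate: float
    delta_pass_rate: float  # B - A


def biggest_movers(per_rule, top_n=None):
    """Return rules ordered by largest absolute pass-rate change first."""
    ranked = sorted(per_rule, key=lambda r: abs(r.delta_pass_rate), reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Made-up numbers, for illustration only.
sample = [
    RuleDelta("task_completion_rate", 0.80, 0.85, 0.05),
    RuleDelta("hallucination", 0.90, 0.70, -0.20),
    RuleDelta("sentiment", 0.60, 0.60, 0.0),
]

movers = biggest_movers(sample)
for r in movers:
    # The explicit sign makes regressions and improvements easy to tell apart.
    print(f"{r.rule_name:<30} {r.delta_pass_rate:+.1%}")
```

Sorting on the absolute value keeps large regressions and large improvements together at the top, which is usually what you want when deciding where to look first.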
Coverage
cmp.coverage answers "are these two runs even comparable":
| Field | Meaning |
|---|---|
| both | Trace IDs that both runs evaluated. The honest comparison set. |
| a_only | Traces only A saw. Either A used a wider scope, or sampling diverged. |
| b_only | Mirror of a_only. |
If both is small, the comparison is noisy, because only a few traces were evaluated by both runs. Re-run with the same scope and the same sample_pct to fix this. Sampling is deterministic, so identical scopes hit identical traces.
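One way to make "small" concrete is to check the shared set against the total before trusting the deltas. The sketch below is an assumption-laden illustration, not part of the SDK: `overlap_ratio` and `is_comparable` are hypothetical helpers, the thresholds are arbitrary defaults, and it assumes the coverage fields are integer counts (if they are collections of trace IDs, pass their lengths instead).

```python
def overlap_ratio(both: int, a_only: int, b_only: int) -> float:
    """Fraction of all evaluated traces that both runs saw (0.0 if none)."""
    total = both + a_only + b_only
    return both / total if total else 0.0


def is_comparable(both: int, a_only: int, b_only: int,
                  min_ratio: float = 0.8, min_both: int = 30) -> bool:
    """Heuristic check: enough shared traces, and most traces are shared.

    Both thresholds are illustrative defaults, not values from the SDK.
    """
    return both >= min_both and overlap_ratio(both, a_only, b_only) >= min_ratio


# Made-up counts, for illustration only.
print(overlap_ratio(90, 5, 5))
print(is_comparable(90, 5, 5))
print(is_comparable(10, 45, 45))
```

With a real result, the call would be something like `is_comparable(cmp.coverage.both, cmp.coverage.a_only, cmp.coverage.b_only)`, under the same assumption that those fields are counts.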
Flips
cmp.flips lists traces that the same rule judged differently between the runs.
```python
for f in cmp.flips:
    direction = "regressed" if f.a_passed and not f.b_passed else "fixed"
    print(f" {direction:<10} {f.rule_name:<30} {f.trace_id} {f.a_score:.2f} → {f.b_score:.2f}")
```

Fields:
```python
@dataclass(frozen=True)
class EvalCompareFlip:
    trace_id: str
    rule_name: str
    a_passed: bool
    b_passed: bool
    a_score: float
    b_score: float
```

Flips are the highest-signal output of a compare. Open the trace, see what changed.
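When there are many flips, a per-rule tally can help decide which traces to open first. The following is a minimal sketch that does not call the SDK: `Flip` is a hypothetical stand-in mirroring the `EvalCompareFlip` fields, `summarize_flips` is an illustrative helper name, and the sample data is made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Flip:
    """Hypothetical stand-in mirroring the EvalCompareFlip fields."""
    trace_id: str
    rule_name: str
    a_passed: bool
    b_passed: bool
    a_score: float
    b_score: float


def summarize_flips(flips):
    """Count regressed vs fixed flips per rule."""
    counts = Counter()
    for f in flips:
        # A flip means the two runs disagree, so anything that is not
        # pass -> fail must be fail -> pass.
        direction = "regressed" if f.a_passed and not f.b_passed else "fixed"
        counts[(f.rule_name, direction)] += 1
    return counts


# Made-up flips, for illustration only.
sample_flips = [
    Flip("t-1", "hallucination", True, False, 0.90, 0.30),
    Flip("t-2", "hallucination", True, False, 0.80, 0.40),
    Flip("t-3", "sentiment", False, True, 0.20, 0.70),
]

summary = summarize_flips(sample_flips)
for (rule, direction), n in sorted(summary.items()):
    print(f"{rule:<30} {direction:<10} {n}")
```

A rule with many regressed flips and few fixed ones is a reasonable place to start triage.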
Pattern
For prompt or model changes, the recipe is:
- Run baseline against a representative scope + sample_pct.
- Ship the change. Set agent_version on st.init so new traces are tagged.
- Run the candidate against the same scope + sample_pct.
- Compare. Triage flips.
For continuous regression checks, attach the rules to the agent in the dashboard with trigger_mode=on_session_end — every session feeds an evaluated bucket per agent_version automatically. The drift card on the agent detail page renders the same compare view across versions.