Runs
st.evals.run(...) creates a run and starts it server-side. The call returns immediately with the row; the actual evaluation happens in the background. Poll st.evals.get(run_id) until status is terminal.
import time

run = st.evals.run(
    scope={"trace_id": "abc-123"},
    zero_configs=["task_completion_rate"],
    name="trace abc smoke",
)
while True:
    state = st.evals.get(run.id)
    if state.status in ("completed", "failed", "cancelled"):
        break
    time.sleep(2)  # pause between polls instead of spinning
Scopes
The scope dict says which traces to evaluate. Five shapes:
# One trace.
scope={"trace_id": "abc-123"}
# Every trace in a session.
scope={"session_id": "run-2026-05-10-001"}
# Every entry in a dataset.
scope={"dataset_id": "ds-uuid"}
# A sample of an agent's recent traces.
scope={"agent_id": "support-agent", "sample_pct": 5}
# Everything in a time range.
scope={
    "time_range": {"start": "2026-05-01T00:00:00Z", "end": "2026-05-08T00:00:00Z"},
    "sample_pct": 10,
}
sample_pct is 1–100. Sampling is deterministic (cityHash64(trace_id) % 100 < pct) so re-runs and A/B compares hit the same traces.
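The sampling rule above can be illustrated locally. This is a minimal sketch, not the service's implementation: cityHash64 is not in the Python standard library, so the hypothetical helper `in_sample` substitutes the first 8 bytes of SHA-256. Bucket assignments will therefore differ from the real service, but the properties are the same: a given trace always gets the same answer, and a smaller sample is a subset of a larger one.

```python
import hashlib


def in_sample(trace_id: str, pct: int) -> bool:
    """Return True if trace_id falls inside a deterministic pct% sample."""
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    # Stand-in hash (assumption): first 8 bytes of SHA-256 as an unsigned int.
    # The service uses cityHash64, so real bucket numbers will not match.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < pct


trace_ids = [f"trace-{i}" for i in range(1000)]
sample_5 = {t for t in trace_ids if in_sample(t, 5)}
sample_10 = {t for t in trace_ids if in_sample(t, 10)}
```

Because membership depends only on the trace id and the threshold, raising sample_pct from 5 to 10 keeps every trace from the 5% sample and adds more.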
A flat shape works too — {"start": "...", "end": "...", "sample_pct": 50} is normalised to the time_range form.
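To make the mapping concrete, here is a rough sketch of what that normalisation amounts to. The function name `normalise_scope` is hypothetical and the logic is an assumption about the shape of the result, not the service's code; the service does this for you, so you never need to call anything like it.

```python
def normalise_scope(scope: dict) -> dict:
    """Fold top-level start/end keys into a nested time_range dict."""
    if "start" not in scope and "end" not in scope:
        # Already nested, or a different scope shape: return a copy unchanged.
        return dict(scope)
    rest = {k: v for k, v in scope.items() if k not in ("start", "end")}
    time_range = {k: scope[k] for k in ("start", "end") if k in scope}
    return {"time_range": time_range, **rest}


flat = {"start": "2026-05-01T00:00:00Z", "end": "2026-05-08T00:00:00Z", "sample_pct": 50}
nested = normalise_scope(flat)
```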
Rules
Pass at least one of rules (custom rule UUIDs) or zero_configs (built-in names):
run = st.evals.run(
    scope={"agent_id": "checkout-agent", "sample_pct": 20},
    zero_configs=["task_completion_rate", "error_rate", "latency_p95"],
    rules=["7c9a…", "f33b…"],
    name="weekly checkout eval",
    description="Wk 19 — sweeps last 7d, 20% sample",
)
Any custom rule can be cited in an eval run by id, regardless of how it's used elsewhere.
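If you build the arguments dynamically, a small client-side check can catch the "neither rules nor zero_configs" case before the request is made. This is a sketch under assumptions: `select_checks` is a hypothetical helper, not part of the SDK, and how the service itself rejects such a request is not specified here.

```python
def select_checks(rules=None, zero_configs=None) -> dict:
    """Return keyword arguments for the run call, or raise if both are empty."""
    rules = list(rules or [])
    zero_configs = list(zero_configs or [])
    if not rules and not zero_configs:
        raise ValueError("pass at least one of rules or zero_configs")
    kwargs = {}
    if rules:
        kwargs["rules"] = rules
    if zero_configs:
        kwargs["zero_configs"] = zero_configs
    return kwargs


checks = select_checks(zero_configs=["task_completion_rate"], rules=[])
```

The result can then be splatted into the call, for example st.evals.run(scope=..., **checks).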
Compare while running
Pass compare_with to flag the run as the B-side of an A/B. The summary view in the dashboard renders the comparison automatically.
new = st.evals.run(
    scope={"agent_id": "support-agent", "sample_pct": 10},
    zero_configs=["task_completion_rate", "hallucination"],
    name="prompt-v2 eval",
    agent_version="v2",
    compare_with=baseline_run.id,  # the v1 run
)
agent_version is the free-form label you pass on st.init(agent_version="v2"); every span gets it, so eval can group verdicts by version automatically.
Polling
st.evals.get(run_id) returns the live row. Status transitions: pending → running → completed | failed | cancelled.
import time

run = st.evals.run(scope={"session_id": sid}, zero_configs=["task_completion_rate"])
while True:
    state = st.evals.get(run.id)
    if state.status in ("completed", "failed", "cancelled"):
        print(state.status, state.error_message)
        break
    time.sleep(2)
error_message is set when the run fails (no traces matched the scope, the rule errored, the LLM judge timed out, etc.). Runs do not retry; create a new run if you need to.
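The loop above never gives up if a run stays in pending or running. One way to bound it is a helper with a deadline. This is a minimal sketch: `wait_for_run`, its parameters, and the fake getter used in the demo are all hypothetical, and the only assumption about the SDK is that the getter returns an object with a status attribute, as described above.

```python
import time
from types import SimpleNamespace

TERMINAL = {"completed", "failed", "cancelled"}


def wait_for_run(get_run, run_id, interval=2.0, timeout=600.0):
    """Poll get_run(run_id) until the status is terminal or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_run(run_id)
        if state.status in TERMINAL:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still {state.status} after {timeout}s")
        time.sleep(interval)


# Demo with a fake getter that walks through the status transitions.
_statuses = iter(["pending", "running", "completed"])


def fake_get(run_id):
    return SimpleNamespace(id=run_id, status=next(_statuses), error_message=None)


final = wait_for_run(fake_get, "run-1", interval=0.0)
```

With the real SDK the call would look like wait_for_run(st.evals.get, run.id). Passing the getter as an argument keeps the helper testable without a live service.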
Triggers
Three values are accepted on the trigger keyword:
| Value | When to use |
|---|---|
| on_demand | Default. You called st.evals.run directly. |
| sampled_async | The session-end worker fired this run from an attached rule. The dashboard surfaces these on /agents/[agentId]. |
| backfill | A POST /v1/eval/backfill request enqueued this. |
You almost never need to set this — the SDK defaults to on_demand.
Permissions
Eval routes are gated by two permissions, MANAGE_EVAL_RULES and RUN_EVAL. The dashboard enforces them on JWT callers (your team members).
API keys skip the permission check. Anyone with an API key for the org can run evals and list verdicts. Treat eval API keys with the same care as Guard keys — scope them at the workspace level when you can.
Cancel
# Works from any callsite: the run lives server-side.
import os
import requests

requests.post(
    f"{os.environ['STASO_BASE_URL']}/v1/eval/runs/{run.id}/cancel",
    headers={"X-API-Key": os.environ["STASO_API_KEY"]},
    timeout=10,  # avoid hanging forever on a stalled connection
)
There is no SDK method for cancel yet. The route exists, but the harness executor does not yet honor cancellation mid-loop: the run will finish the trace it was on before stopping.
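Until an SDK method exists, the call can be wrapped in a small helper. The sketch below uses only the standard library and separates building the request from sending it, so the URL and header can be checked without network access. The function name `build_cancel_request` and the example base URL and key are hypothetical; the route and the X-API-Key header are the ones shown above.

```python
import urllib.request


def build_cancel_request(base_url: str, api_key: str, run_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that cancels a run."""
    url = f"{base_url.rstrip('/')}/v1/eval/runs/{run_id}/cancel"
    return urllib.request.Request(
        url,
        data=b"",  # empty body; the run id travels in the path
        headers={"X-API-Key": api_key},
        method="POST",
    )


# Placeholder values for illustration only.
req = build_cancel_request("https://staso.example/", "key-123", "run-1")
```

Sending it is a single call, urllib.request.urlopen(req, timeout=10). Because the run finishes its current trace before stopping, keep polling st.evals.get(run_id) afterwards until the status is terminal.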