Agent Judge

Some agents don't produce a single answer you can grade — they produce a folder: a report, structured findings, a state file, evidence. Agent Judge evaluates that. For each run, a fresh agent opens the files, reads them against your rubric, and returns a pass/fail with the exact tool calls it made. A layer of deterministic checks runs first, so structural failures (truncated report, mismatched counts, ungrounded findings) are caught exactly — the LLM is left to judge only what it's good at: meaning.

Use it when the unit of evaluation is a multi-file agent output, not a trace.

Do it now

Agent Judge rules and datasets are authored in the dashboard (uploading file bundles is a dashboard action; the SDK can't upload them). Once a rule and dataset exist, you can launch and read runs from code.

  1. Dashboard → Datasets → New dataset → Agent runs. This creates an agent_run dataset.
  2. Upload one bundle per run — a folder or a .zip/.tar.gz. Each upload becomes one entry (one agent run). Subfolders are preserved.
  3. Dashboard → Rules → New rule → Agent Judge. Write a rubric (what "good" means) and, optionally, deterministic prechecks (see below).
  4. Run it — from the dashboard, or from code by pointing an eval run at the dataset:
import time

import staso as st

st.init(api_key="ak_...", agent_name="repo-audit")

run = st.evals.run(
    scope={"dataset_id": "ds-uuid"},   # the agent_run dataset
    rules=["agent-judge-rule-uuid"],   # your agentic rule
    name="audit quality sweep",
)

# Poll until the run reaches a terminal state, sleeping between checks
# so the loop does not hammer the API.
while st.evals.get(run.id).status not in ("completed", "failed", "cancelled"):
    time.sleep(2)

for v in st.evals.verdicts(run.id):
    print(v.passed, v.score, "—", v.reason)
    for step in v.inspected:           # what the judge actually read
        print(f"   turn {step.turn}: {step.tool} {step.target}")

One verdict is produced per entry (per agent run). One sandbox is spun up per entry.
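
For long sweeps it can help to wrap the polling step in a helper with a timeout. The following is a minimal sketch, not part of the SDK: `wait_for_status`, its parameters, and the default interval and timeout are hypothetical names and values chosen for illustration. It takes any zero-argument callable that returns the current status string.

```python
import time
from typing import Callable

# Terminal statuses as used in the polling loop above.
TERMINAL = ("completed", "failed", "cancelled")


def wait_for_status(
    get_status: Callable[[], str],
    interval: float = 2.0,    # seconds between checks (assumed default)
    timeout: float = 600.0,   # give up after this many seconds (assumed default)
) -> str:
    """Poll get_status() until it returns a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout} seconds")
        time.sleep(interval)
```

With the run from the example above, this would be called as `wait_for_status(lambda: st.evals.get(run.id).status)`. A "failed" or "cancelled" result is returned rather than raised, so the caller decides how to handle it.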

The hybrid gate (deterministic prechecks)

An LLM judge reads files in windows, so it confabulates about whole-file and exhaustive properties: it misses truncated reports, mismatched counts, and findings that cite files the agent never scanned, and it false-flags complete reports as truncated. Those checks are cheap and exact in code.

So Agent Judge runs a deterministic gate first. If any precheck fails, the verdict is an immediate FAIL with the exact reason — and no LLM/sandbox is spent. If every precheck passes, the agent runs and judges semantics against your rubric.
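
The control flow of the gate can be sketched in a few lines. This is an illustration of the short-circuit behavior described above, not Staso's implementation: `Verdict`, `Precheck`, and `judge_entry` are hypothetical names, and a precheck is modeled as a callable that returns `None` on success or a failure reason.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Verdict:
    # Hypothetical stand-in for a verdict; only the fields needed here.
    passed: bool
    reason: str
    inspected: tuple = ()


# A precheck returns None when it passes, or a failure reason string.
Precheck = Callable[[], Optional[str]]


def judge_entry(
    prechecks: Iterable[Precheck],
    run_llm_judge: Callable[[], Verdict],
) -> Verdict:
    """Run the deterministic gate first; call the LLM judge only if it passes."""
    for check in prechecks:
        failure = check()
        if failure is not None:
            # Immediate FAIL: the judge is never called, so inspected stays empty.
            return Verdict(passed=False, reason=failure)
    return run_llm_judge()
```

Because the judge callable is only invoked after every precheck passes, a structural failure costs nothing beyond the checks themselves.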

Prechecks are a generic, per-rule spec — a JSON array you paste into the rule's Deterministic prechecks field. The schema (which files, which JSON fields) lives in your rule; the check types are fixed primitives. Leave it empty for a pure-LLM judge.

| Check type | Passes when | Params |
| --- | --- | --- |
| file_present | a file matches the glob | path |
| file_complete | the file is non-empty and (optionally) ends with a marker | path, must_end_with? |
| json_valid | every matched .json parses | path (e.g. **/*.json) |
| map_values_in | every value in a JSON object is allowed | file, path, allowed |
| count_equals | two values/array-lengths across files match | left, right (each {file, field} or {file, len}) |
| field_any_non_empty | every item in a JSON array has at least one of the given fields filled | file, array, fields |

[
  {"type": "file_present",  "path": "report.md"},
  {"type": "json_valid",    "path": "**/*.json"},
  {"type": "map_values_in", "file": "state.json", "path": "phases", "allowed": ["completed", "skipped"]},
  {"type": "count_equals",
    "left":  {"file": "state.json",     "field": "summary.findings_count"},
    "right": {"file": "findings.json",  "len":   "findings"}},
  {"type": "field_any_non_empty", "file": "findings.json", "array": "findings", "fields": ["affected_file", "code_snippet"]}
]

These primitives carry no domain knowledge — the same set works on a security agent, a code-review agent, or a research agent; only the file/field names in the spec change. Paths match by basename too, so report.md matches out/report.md.
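
To dry-run a spec before uploading bundles, three of the primitives can be approximated locally. This sketch is an assumption-laden approximation, not the service's evaluator: the helper names are hypothetical, the bundle is modeled as an in-memory dict of path to file text, dotted strings such as `summary.findings_count` are assumed to be nested-key paths, and "filled" is assumed to mean truthy.

```python
import json
from pathlib import PurePosixPath


def load_json(bundle, name):
    """Parse the first file whose full path or basename equals name."""
    for path, text in bundle.items():
        if path == name or PurePosixPath(path).name == name:
            return json.loads(text)
    raise FileNotFoundError(name)


def dig(doc, dotted):
    """Follow a dotted key path (assumed semantics) into a parsed JSON object."""
    for key in dotted.split("."):
        doc = doc[key]
    return doc


def resolve(bundle, side):
    """One side of count_equals: {file, field} is a value, {file, len} an array length."""
    doc = load_json(bundle, side["file"])
    if "len" in side:
        return len(dig(doc, side["len"]))
    return dig(doc, side["field"])


def run_check(bundle, check):
    """Return None if the check passes, otherwise a failure reason string."""
    kind = check["type"]
    if kind == "count_equals":
        left, right = resolve(bundle, check["left"]), resolve(bundle, check["right"])
        return None if left == right else f"count_equals: {left} != {right}"
    if kind == "map_values_in":
        values = dig(load_json(bundle, check["file"]), check["path"]).values()
        bad = sorted(set(values) - set(check["allowed"]))
        return None if not bad else f"map_values_in: disallowed values {bad}"
    if kind == "field_any_non_empty":
        items = dig(load_json(bundle, check["file"]), check["array"])
        for i, item in enumerate(items):
            if not any(item.get(f) for f in check["fields"]):
                return f"field_any_non_empty: item {i} has none of {check['fields']}"
        return None
    raise ValueError(f"unsupported check type in this sketch: {kind}")


# Made-up sample bundle: the second finding has no populated field.
bundle = {
    "out/state.json": json.dumps({
        "phases": {"scan": "completed", "triage": "skipped"},
        "summary": {"findings_count": 2},
    }),
    "out/findings.json": json.dumps({
        "findings": [{"affected_file": "app.py"}, {"code_snippet": ""}],
    }),
}
spec = [
    {"type": "count_equals",
     "left": {"file": "state.json", "field": "summary.findings_count"},
     "right": {"file": "findings.json", "len": "findings"}},
    {"type": "map_values_in", "file": "state.json", "path": "phases",
     "allowed": ["completed", "skipped"]},
    {"type": "field_any_non_empty", "file": "findings.json", "array": "findings",
     "fields": ["affected_file", "code_snippet"]},
]
results = [run_check(bundle, check) for check in spec]
```

On this sample the first two checks pass and the third reports item 1, since its only field is an empty string. The lookup by basename mirrors the matching rule above, so `state.json` resolves to `out/state.json`.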

Reading the result

Every Agent Judge verdict carries inspected — the ordered tool calls the judge made while reading the files. A precheck failure has an empty inspected (the gate short-circuited before the agent ran); a semantic pass/fail lists every file the judge opened, so the verdict is auditable instead of a black box.

v = st.evals.verdicts(run.id)[0]
# v.passed, v.score (0.0 / 1.0), v.reason (the judge's grounded explanation)
# v.inspected -> tuple of InspectedStep(turn, tool, target)

Full verdict shape and InspectedStep fields: Verdicts.
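
Since an empty inspected marks a precheck failure, a run's verdicts can be split into gate failures and semantic failures. The sketch below uses local stand-in classes that mirror only the fields named on this page; the real SDK objects may carry more, and `summarize` and `files_read` are hypothetical helpers.

```python
from collections import Counter
from dataclasses import dataclass
from typing import NamedTuple


class InspectedStep(NamedTuple):
    # Stand-in mirroring InspectedStep(turn, tool, target).
    turn: int
    tool: str
    target: str


@dataclass
class Verdict:
    # Stand-in with the verdict fields used in this page's examples.
    passed: bool
    score: float
    reason: str
    inspected: tuple = ()


def summarize(verdicts):
    """Count passes, precheck failures (empty inspected), and semantic failures."""
    counts = Counter()
    for v in verdicts:
        if v.passed:
            counts["passed"] += 1
        elif not v.inspected:
            counts["failed_precheck"] += 1
        else:
            counts["failed_semantic"] += 1
    return dict(counts)


def files_read(verdict):
    """Distinct targets the judge touched, in first-seen order."""
    return list(dict.fromkeys(step.target for step in verdict.inspected))
```

In practice the list returned by `st.evals.verdicts(run.id)` would be passed to `summarize`. A high share of precheck failures usually points at the agent's output structure rather than at the rubric.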

When to use it

| Use Agent Judge when… | Use a trace eval when… |
| --- | --- |
| the output is a folder of files (report + data + evidence) | the output is a trace of LLM/tool calls |
| "good" means consistency across files | "good" means the run finished / the answer is right |
| you can express structural rules as prechecks | a built-in or prompt/programmatic rule already fits |

For trace-based evals (task_completion_rate, hallucination, your own prompt/programmatic rules), see Eval overview.

Gotchas

  • Authoring is dashboard-only. Creating the agent_run dataset, uploading bundles, and writing the rubric/prechecks happen in the dashboard. The SDK can launch a run against an existing dataset+rule and read the verdicts — it cannot upload bundles.
  • Eval-only. Agentic rules cannot be attached to a Guard policy (Guard is real-time, per tool-call; an agent inspecting a file bundle is not). Reference them in eval runs only.
  • One sandbox per entry. Each agent run is judged in its own sandbox, so cost scales with entries and counts toward your agentic eval quota. The deterministic gate is the cheap path — a precheck failure spends no sandbox.
  • run is overloaded. In an agent_run dataset, an entry is one agent run (a row of data). An eval run is one scored execution. The dataset UI labels its rows "entries" for exactly this reason.

Next

  • Verdicts — verdict shape, inspected, summaries.
  • Runs — scopes, polling, triggers.
  • Datasets — tabular vs agent-run datasets.