Evaluate a Dataset

Run any Python function against every entry in a dataset and score the results.

import staso as st

st.init(api_key="...", workspace_slug="...")

def exact_match(entry, output):
    return float(output == entry["expected"])

def run_agent(entry):
    return my_agent(entry["input"])  # my_agent is your own agent callable

summary = st.dataset.evaluate(
    ds.id,  # ds is a dataset you created or fetched earlier
    run_agent,
    scorers=[exact_match],
    max_concurrency=4,
)
print(summary.total, summary.passed, summary.failed)

Signature

st.dataset.evaluate(
    dataset_id: str,
    fn: Callable[[dict], Any],
    *,
    scorers: list[Callable[[dict, Any], float]] | None = None,
    split: str | SplitType | None = None,
    max_concurrency: int = 1,
    trace: bool = True,
) -> EvalSummary
  • dataset_id — the ID of the dataset to evaluate.
  • fn — your agent wrapper. Takes the entry's data dict and returns whatever your agent produces.
  • scorers — zero or more scoring functions. Each takes (entry_data, output) and returns a float between 0.0 and 1.0.
  • split — restrict the run to one SplitType (or its string value).
  • max_concurrency — number of entries to run in parallel. Default 1.
  • trace — when True (the default), every entry run is emitted as a nested trace you can open in the Staso dashboard.
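Conceptually, max_concurrency caps how many entries are in flight at once. A minimal sketch of that behavior, assuming a hypothetical run_all helper (not part of the SDK) built on a standard thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(entries, fn, max_concurrency=1):
    """Run fn over every entry, with at most max_concurrency in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map preserves entry order even when runs finish out of order
        return list(pool.map(fn, entries))

entries = [{"input": "a"}, {"input": "b"}, {"input": "c"}]
outputs = run_all(entries, lambda e: e["input"].upper(), max_concurrency=2)
# outputs == ["A", "B", "C"]
```

A thread pool works here because agent calls are typically I/O-bound; the real runner may batch differently, but the ordering guarantee is the useful property to rely on.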

Scorer contract

A scorer is a plain function:

def exact_match(entry: dict, output) -> float:
    return float(output == entry["expected"])

def contains_keyword(entry: dict, output) -> float:
    return 1.0 if "refund" in str(output).lower() else 0.0

Return 1.0 for a pass, 0.0 for a fail, or anything in between for partial credit. The runner aggregates the mean per scorer into EvalSummary.scores.
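The aggregation rule can be sketched in plain Python (aggregate is a hypothetical helper for illustration, not an SDK function): each scorer's values are averaged across entries, and an entry counts as passed only when every scorer returns at least 0.5.

```python
def aggregate(results):
    """results: one dict per entry, mapping scorer name -> score."""
    names = results[0].keys()
    means = {n: sum(r[n] for r in results) / len(results) for n in names}
    passed = sum(1 for r in results if all(v >= 0.5 for v in r.values()))
    return means, passed, len(results) - passed

per_entry = [
    {"exact_match": 1.0, "contains_keyword": 1.0},
    {"exact_match": 0.0, "contains_keyword": 1.0},
]
means, passed, failed = aggregate(per_entry)
# means == {"exact_match": 0.5, "contains_keyword": 1.0}
# passed == 1, failed == 1  (second entry fails on exact_match)
```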

EvalSummary

summary = st.dataset.evaluate(ds.id, run_agent, scorers=[exact_match])

summary.dataset_id       # str
summary.dataset_name     # str
summary.total            # int  — total entries attempted
summary.passed           # int  — entries where every scorer returned >= 0.5
summary.failed           # int  — entries where any scorer returned < 0.5
summary.error_count      # int  — entries where fn raised
summary.avg_duration_ms  # float
summary.scores           # dict[str, float] — mean per scorer
summary.results          # tuple[EvalResult, ...]

Each EvalResult in summary.results carries entry_id, input_data, expected, actual, scores, passed, error, trace_id, duration_ms. When trace=True, trace_id is a direct link target in the dashboard.

Next