Datasets
Evaluate a Dataset
Run any Python function against every entry in a dataset and score the results.
```python
import staso as st

st.init(api_key="...", workspace_slug="...")

def exact_match(entry, output):
    return float(output == entry["expected"])

def run_agent(entry):
    return my_agent(entry["input"])

summary = st.dataset.evaluate(
    ds.id,
    run_agent,
    scorers=[exact_match],
    max_concurrency=4,
)
print(summary.total, summary.passed, summary.failed)
```

Signature
```python
st.dataset.evaluate(
    dataset_id: str,
    fn: Callable[[dict], Any],
    *,
    scorers: list[Callable[[dict, Any], float]] | None = None,
    split: str | SplitType | None = None,
    max_concurrency: int = 1,
    trace: bool = True,
) -> EvalSummary
```

`fn` — your agent wrapper. Takes the entry's `data` dict and returns whatever your agent produced.

`scorers` — zero or more scoring functions. Each takes `(entry_data, output)` and returns a `float` between `0.0` and `1.0`.

`split` — restrict the run to one `SplitType` (or its string value).

`max_concurrency` — number of entries to run in parallel. Defaults to `1`.

`trace` — when `True` (the default), every entry run is emitted as a nested trace you can open in the Staso dashboard.
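To make the pass/fail semantics concrete, here is a minimal sketch of the aggregation the runner performs over per-entry scorer returns. This is illustrative only — `aggregate` and its input shape are not part of the SDK — and it assumes the `>= 0.5` pass threshold documented under `EvalSummary`:

```python
def aggregate(per_entry_scores: list[dict[str, float]]):
    """Illustrative stand-in for the runner's aggregation (not SDK code).

    Each dict maps scorer name -> that scorer's return for one entry.
    An entry passes only when every scorer returned >= 0.5.
    """
    passed = sum(1 for s in per_entry_scores if all(v >= 0.5 for v in s.values()))
    failed = len(per_entry_scores) - passed

    # Collect each scorer's values across entries, then take the mean per scorer.
    by_scorer: dict[str, list[float]] = {}
    for s in per_entry_scores:
        for name, value in s.items():
            by_scorer.setdefault(name, []).append(value)
    scores = {name: sum(vs) / len(vs) for name, vs in by_scorer.items()}

    return passed, failed, scores
```

Two entries scored by one scorer, one passing and one failing, yield `passed=1`, `failed=1`, and a mean of `0.5` for that scorer.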
Scorer contract
A scorer is a plain function:
```python
def exact_match(entry: dict, output) -> float:
    return float(output == entry["expected"])

def contains_keyword(entry: dict, output) -> float:
    return 1.0 if "refund" in str(output).lower() else 0.0
```

Return `1.0` for a pass, `0.0` for a fail, or anything in between for partial credit. The runner aggregates the mean per scorer into `EvalSummary.scores`.
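A partial-credit scorer is just another plain function returning a value between `0.0` and `1.0`. A sketch of a token-overlap scorer (hypothetical — it assumes the entry's `expected` field is a string):

```python
def token_overlap(entry: dict, output) -> float:
    """Fraction of expected tokens that appear in the output (partial credit)."""
    expected = set(str(entry["expected"]).lower().split())
    produced = set(str(output).lower().split())
    if not expected:
        return 0.0
    return len(expected & produced) / len(expected)
```

An output containing two of three expected tokens scores 2/3, which counts as a pass under the `>= 0.5` threshold.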
EvalSummary
```python
summary = st.dataset.evaluate(ds.id, run_agent, scorers=[exact_match])

summary.dataset_id       # str
summary.dataset_name     # str
summary.total            # int — total entries attempted
summary.passed           # int — entries where every scorer returned >= 0.5
summary.failed           # int — entries where any scorer returned < 0.5
summary.error_count      # int — entries where fn raised
summary.avg_duration_ms  # float
summary.scores           # dict[str, float] — mean per scorer
summary.results          # tuple[EvalResult, ...]
```

Each `EvalResult` in `summary.results` carries `entry_id`, `input_data`, `expected`, `actual`, `scores`, `passed`, `error`, `trace_id`, and `duration_ms`. When `trace=True`, `trace_id` is a direct link target in the dashboard.