
Evaluate

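# Scorer: 1.0 when the agent output equals the entry's expected value, else 0.0.
def exact_match(entry, output):
    return float(output == entry["expected"])

# Agent wrapper: receives the entry's data dict and returns the agent's output.
def run_agent(entry):
    return my_agent(entry["input"])

# `st`, `ds`, and `my_agent` are assumed to be defined earlier
# (the Staso handle, an existing dataset, and your own agent callable).
summary = st.dataset.evaluate(
    ds.id,
    run_agent,
    scorers=[exact_match],
    max_concurrency=4,
)
print(summary.total, summary.passed, summary.failed)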

Signature

st.dataset.evaluate(
    dataset_id: str,
    fn: Callable[[dict], Any],
    *,
    scorers: list[Callable[[dict, Any], float]] | None = None,
    split: str | SplitType | None = None,
    max_concurrency: int = 1,
    trace: bool = True,
) -> EvalSummary

| Parameter | Description |
| --- | --- |
| fn | Your agent wrapper. Takes the entry's data dict and returns whatever your agent produced. |
| scorers | Functions of (entry_data, output) -> float. Return 1.0 for pass, 0.0 for fail, or anything in between. |
| split | Restrict to one SplitType. |
| max_concurrency | Entries to run in parallel. Default 1. |
| trace | Emit each entry run as a nested trace. Default True. |
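
Because a scorer may return any value between 0.0 and 1.0, it can award partial credit. The sketch below is one possible scorer, not something Staso ships: `token_overlap` is a hypothetical name, and it assumes each entry has an "expected" string field, as in the exact_match example above.

```python
def token_overlap(entry: dict, output) -> float:
    """Hypothetical scorer: fraction of expected tokens found in the output."""
    expected = set(str(entry["expected"]).lower().split())
    if not expected:
        # Nothing to match against; this sketch treats that as a pass.
        return 1.0
    produced = set(str(output).lower().split())
    return len(expected & produced) / len(expected)


# Two of the three expected tokens ("refund", "issued") appear in the output.
entry = {"input": "refund status?", "expected": "refund issued today"}
print(token_overlap(entry, "Your refund was issued"))
```

A scorer like this could be passed alongside exact_match in the scorers list; whether whitespace tokenization is appropriate depends on your data.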

Scorer

def contains_keyword(entry: dict, output) -> float:
    return 1.0 if "refund" in str(output).lower() else 0.0

The runner aggregates the mean per scorer into EvalSummary.scores.
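
As a rough illustration of that aggregation, the mean per scorer can be computed from per-entry score dicts. This is a sketch and not Staso's actual code: the helper name `mean_scores` and the assumption that each entry's scores are keyed by scorer name are made up for the example.

```python
from collections import defaultdict


def mean_scores(per_entry_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average each scorer's values across all entries (illustrative only)."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for scores in per_entry_scores:
        for name, value in scores.items():
            totals[name] += value
            counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}


runs = [
    {"exact_match": 1.0, "contains_keyword": 1.0},
    {"exact_match": 0.0, "contains_keyword": 1.0},
]
print(mean_scores(runs))  # {'exact_match': 0.5, 'contains_keyword': 1.0}
```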

EvalSummary

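| Field | Type | Description |
| --- | --- | --- |
| dataset_id | str | |
| dataset_name | str | |
| total | int | |
| passed | int | Entries where every scorer is >= 0.5 |
| failed | int | Entries where any scorer is < 0.5 |
| error_count | int | Entries where fn raised |
| avg_duration_ms | float | |
| scores | dict[str, float] | Mean per scorer |
| results | tuple[EvalResult, ...] | |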

Each EvalResult carries entry_id, input_data, expected, actual, scores, passed, error, trace_id, duration_ms. With trace=True, trace_id links to the trace in the dashboard.
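
The pass rule listed for EvalSummary (every scorer >= 0.5) and the per-entry fields can be illustrated with a stand-in. `FakeResult` below is a hypothetical dataclass that mirrors a few EvalResult fields so the snippet runs without Staso; with a real summary you would presumably iterate `summary.results` instead. How errored entries count toward passed or failed is not stated here, so the sketch reports them separately.

```python
from dataclasses import dataclass, field
from typing import Optional

PASS_THRESHOLD = 0.5  # from the EvalSummary fields: passed means every scorer >= 0.5


@dataclass
class FakeResult:
    """Hypothetical stand-in mirroring a subset of EvalResult fields."""
    entry_id: str
    scores: dict[str, float] = field(default_factory=dict)
    error: Optional[str] = None


def failing_scorers(result: FakeResult) -> list[str]:
    """Names of scorers that fell below the pass threshold for this entry."""
    return sorted(name for name, v in result.scores.items() if v < PASS_THRESHOLD)


results = [
    FakeResult("e1", {"exact_match": 1.0, "contains_keyword": 1.0}),
    FakeResult("e2", {"exact_match": 0.0, "contains_keyword": 1.0}),
    FakeResult("e3", error="TimeoutError"),
]

for r in results:
    if r.error is not None:
        print(r.entry_id, "errored:", r.error)
    elif failing_scorers(r):
        print(r.entry_id, "failed:", failing_scorers(r))
```

Grouping failures by scorer name this way makes it easier to tell a weak scorer from a weak agent before opening individual traces.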
