Datasets

Turn production traces into versioned eval datasets, run your agent against them, and score the results.

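import staso as st

st.init(workspace_slug="...")

# Create a dataset and add one entry to it.
ds = st.dataset.create("refund-edge-cases", description="Tricky refund conversations")
st.dataset.add_entry(ds.id, {"input": "I want a refund for order 42", "expected": "refund_issued"})

# Run function: receives an entry and returns the agent's output.
# my_agent is your own agent callable; it is not provided by the SDK.
def run_agent(entry):
    return my_agent(entry["input"])

# Scorer: returns 1.0 when the output equals the expected value, else 0.0.
def exact_match(entry, output):
    return float(output == entry["expected"])

summary = st.dataset.evaluate(ds.id, run_agent, scorers=[exact_match])
print(summary.passed, "/", summary.total)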
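
The scorer in the quickstart takes `(entry, output)` and returns a float. As a minimal sketch under that same signature, a more forgiving scorer could ignore case and surrounding whitespace before comparing. The name `normalized_match` is hypothetical and not part of the SDK, and the block runs on its own without Staso installed.

```python
def normalized_match(entry, output):
    """Return 1.0 if output equals the expected label, ignoring case and outer whitespace."""
    expected = str(entry["expected"]).strip().lower()
    actual = str(output).strip().lower()
    return float(expected == actual)


entry = {"input": "I want a refund for order 42", "expected": "refund_issued"}
print(normalized_match(entry, "  Refund_Issued \n"))  # 1.0
print(normalized_match(entry, "refund_denied"))       # 0.0
```

Because it follows the quickstart's scorer shape, a function like this could presumably be passed in the `scorers` list alongside `exact_match`.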

Why datasets

A loose tests.csv rots. Staso datasets are versioned, org-scoped, and tied to the real failure patterns in your production traces. Curate entries from traces you already ship, freeze them as splits, and re-run them on every prompt or model change.

Two kinds

A dataset's kind decides how its rows are stored and scored:

| `kind` | A row is… | Created with | Scored by |
| --- | --- | --- | --- |
| `tabular` | a record (`input`, `expected`, …) | the SDK or CSV import | your scorers, or `prompt`/`programmatic`/`llm_judge` rules |
| `agent_run` | a file bundle: one agent run's outputs | the dashboard (folder/`.zip` upload) | Agent Judge |

Everything below covers `tabular` datasets, the kind the SDK creates. `agent_run` datasets hold the multi-file outputs of complex agents and are managed in the dashboard; the SDK can read them and run evals against them, but it cannot upload bundles.

What you can do

Curate from traces — st.dataset.from_traces(...).
Import / export CSV — upload_csv(...) / download_csv(...).
Evaluate with scorers — any Python function over every entry.
Generate synthetic data — st.dataset.generate(...) to grow an existing dataset.
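
To make "any Python function over every entry" concrete, the sketch below mimics an evaluation loop locally. It does not call the SDK and makes no claim about how `st.dataset.evaluate` is implemented. `evaluate_locally`, `LocalSummary`, and `fake_agent` are hypothetical names, and the pass rule (an entry passes only if every scorer returns 1.0) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable

Entry = dict[str, Any]
Scorer = Callable[[Entry, Any], float]


@dataclass
class LocalSummary:
    passed: int
    total: int


def evaluate_locally(
    entries: Iterable[Entry],
    run: Callable[[Entry], Any],
    scorers: list[Scorer],
) -> LocalSummary:
    """Run the agent on each entry, score the output, and count passes."""
    passed = 0
    total = 0
    for entry in entries:
        output = run(entry)
        scores = [scorer(entry, output) for scorer in scorers]
        # Assumption: an entry passes only if every scorer returns 1.0.
        if all(score >= 1.0 for score in scores):
            passed += 1
        total += 1
    return LocalSummary(passed=passed, total=total)


def fake_agent(entry: Entry) -> str:
    # Stand-in for a real agent: a keyword rule, for illustration only.
    return "refund_issued" if "refund" in entry["input"].lower() else "no_action"


def exact_match(entry: Entry, output: Any) -> float:
    return float(output == entry["expected"])


entries = [
    {"input": "I want a refund for order 42", "expected": "refund_issued"},
    {"input": "Where is my package?", "expected": "tracking_sent"},
]

summary = evaluate_locally(entries, fake_agent, [exact_match])
print(summary.passed, "/", summary.total)  # 1 / 2
```

Keeping scorers as plain functions of `(entry, output)` means they can be unit-tested this way before being run against a real dataset.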

Next

Manage datasets
Curate from traces
Evaluate
Import & generate
Agent Judge: score agent_run datasets of file bundles.
