# Datasets Overview
Turn production traces into versioned eval datasets, run your agent against them, and score the results.
```python
import staso as st

st.init(api_key="...", workspace_slug="...")

# Create a dataset and seed it with one entry.
ds = st.dataset.create("refund-edge-cases", description="Tricky refund conversations")
st.dataset.add_entry(ds.id, {"input": "I want a refund for order 42", "expected": "refund_issued"})

# The agent under test: takes an entry, returns an output.
def run_agent(entry):
    return my_agent(entry["input"])

# A scorer: takes an entry and the agent's output, returns a float score.
def exact_match(entry, output):
    return float(output == entry["expected"])

summary = st.dataset.evaluate(ds.id, run_agent, scorers=[exact_match])
print(summary.passed, "/", summary.total)
```

## Why datasets
Eval datasets stop your test harness from rotting. Instead of a loose tests.csv that nobody updates, Staso datasets are a versioned, org-scoped source of truth tied to the real failure patterns in your production traces. You curate directly from the traces you already ship, freeze them as test splits, and re-run them every time you change a prompt or a model.
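Conceptually, the `evaluate` call iterates the dataset's entries, runs your function on each, and applies every scorer to the output. The sketch below reproduces that loop locally over in-memory entries (it is a presumed model of the behavior, not the SDK's actual implementation) — handy for dry-running scorers before touching the API. The `echo_agent` stand-in is hypothetical.

```python
def local_evaluate(entries, run_fn, scorers):
    """Dry-run an agent over in-memory entries, mimicking how a
    dataset evaluation presumably scores each entry. An entry
    counts as passed only if every scorer returns 1.0."""
    passed = 0
    for entry in entries:
        output = run_fn(entry)
        scores = [scorer(entry, output) for scorer in scorers]
        if all(s >= 1.0 for s in scores):
            passed += 1
    return passed, len(entries)

entries = [
    {"input": "I want a refund for order 42", "expected": "refund_issued"},
    {"input": "Where is my package?", "expected": "tracking_sent"},
]

def echo_agent(entry):
    # Stand-in agent that always issues a refund.
    return "refund_issued"

def exact_match(entry, output):
    return float(output == entry["expected"])

print(local_evaluate(entries, echo_agent, [exact_match]))  # → (1, 2)
```

Because scorers are plain functions, the same ones you prototype locally can be passed unchanged to `st.dataset.evaluate`.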
## What you can do
- Curate from traces — build a dataset from real trace IDs with `from_traces(...)`.
- Import and export CSV — `upload_csv(...)` and `download_csv(...)` for round-tripping with spreadsheets or git.
- Evaluate with scorers — run any Python function over every entry and score the output.
- Generate synthetic data — grow an existing dataset with `generate(...)`.
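Scorers don't have to be binary. Any callable taking an entry and an output and returning a float works, so you can mix strict and partial-credit checks in one run. The two scorers below are hypothetical examples, not part of the Staso SDK:

```python
def contains_expected(entry, output):
    """Partial-credit scorer: 1.0 if the expected label appears
    anywhere in the output string, else 0.0."""
    return float(entry["expected"] in str(output))

def length_penalty(entry, output, max_chars=500):
    """Penalize overly long outputs, scaling linearly down to 0.0
    once the output is 2x max_chars."""
    overflow = max(0, len(str(output)) - max_chars)
    return max(0.0, 1.0 - overflow / max_chars)

# Scored locally, without calling the API:
entry = {"input": "I want a refund for order 42", "expected": "refund_issued"}
print(contains_expected(entry, "action: refund_issued"))  # → 1.0
print(length_penalty(entry, "ok" * 400))                  # 800 chars → 0.4
```

Passing both to `evaluate` (e.g. `scorers=[contains_expected, length_penalty]`) would score every entry against every check.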
## Plan limits
Datasets are not available on the free (`no_plan`) tier. Upgrade to unlock.
| Plan | Datasets / org | Entries / dataset | Columns / dataset |
|---|---|---|---|
| Personal | 3 | 500 | 20 |
| Team | 30 | 10,000 | 50 |
| Enterprise | Unlimited | Unlimited | Unlimited |
See /docs/pricing for full details.
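If a bulk import could overflow your plan's per-dataset entry cap, a client-side check avoids a partial upload. A hypothetical helper using the caps from the table above (the SDK may or may not enforce these server-side; the `ENTRY_LIMITS` mapping and plan keys are assumptions):

```python
# Per-dataset entry caps, taken from the plan table above.
# None means unlimited.
ENTRY_LIMITS = {"personal": 500, "team": 10_000, "enterprise": None}

def can_add_entries(plan, current_count, new_count):
    """Return True if adding new_count entries to a dataset that
    already holds current_count stays within the plan's cap."""
    limit = ENTRY_LIMITS[plan]
    return limit is None or current_count + new_count <= limit

print(can_add_entries("personal", 480, 20))         # → True (exactly at the cap)
print(can_add_entries("personal", 480, 21))         # → False
print(can_add_entries("enterprise", 10**6, 10**6))  # → True
```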