Datasets
Turn production traces into versioned eval datasets, run your agent against them, and score the results.
import staso as st

st.init(workspace_slug="...")

# Create a dataset and add a single entry to it.
ds = st.dataset.create("refund-edge-cases", description="Tricky refund conversations")
st.dataset.add_entry(ds.id, {"input": "I want a refund for order 42", "expected": "refund_issued"})

# my_agent is a placeholder for your own agent callable.
def run_agent(entry):
    return my_agent(entry["input"])

# A scorer receives the entry and the agent's output and returns a float.
def exact_match(entry, output):
    return float(output == entry["expected"])

summary = st.dataset.evaluate(ds.id, run_agent, scorers=[exact_match])
print(summary.passed, "/", summary.total)

Why datasets
A loose tests.csv rots. Staso datasets are versioned, org-scoped, and tied to the real failure patterns in your production traces. Curate from the traces you already ship, freeze them as splits, and re-run them on every prompt or model change.
What you can do
- Curate from traces: st.dataset.from_traces(...).
- Import / export CSV: upload_csv(...) / download_csv(...).
- Evaluate with scorers: any Python function over every entry.
- Generate synthetic data: st.dataset.generate(...) to grow an existing dataset.
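
Because a scorer is just a Python function, you can develop and check one locally before handing it to st.dataset.evaluate. The sketch below assumes the same (entry, output) -> float shape used by exact_match in the quickstart. The names normalized_match, fake_agent, and run_local are hypothetical helpers written for this illustration; they are not part of the staso API, and run_local only approximates what an evaluation loop might do.

```python
def exact_match(entry, output):
    # Strict comparison, as in the quickstart.
    return float(output == entry["expected"])


def normalized_match(entry, output):
    # Hypothetical scorer: ignores case and surrounding whitespace.
    return float(output.strip().lower() == entry["expected"].strip().lower())


def fake_agent(entry):
    # Hypothetical stand-in for a real agent; note the sloppy formatting
    # it returns for refund requests.
    return "Refund_Issued " if "refund" in entry["input"].lower() else "escalate"


def run_local(entries, run, scorers):
    # Hypothetical local harness (not st.dataset.evaluate):
    # returns the mean score per scorer across all entries.
    totals = {scorer.__name__: 0.0 for scorer in scorers}
    for entry in entries:
        output = run(entry)
        for scorer in scorers:
            totals[scorer.__name__] += scorer(entry, output)
    return {name: total / len(entries) for name, total in totals.items()}


entries = [
    {"input": "I want a refund for order 42", "expected": "refund_issued"},
    {"input": "Where is my package?", "expected": "escalate"},
]

scores = run_local(entries, fake_agent, [exact_match, normalized_match])
print(scores)  # {'exact_match': 0.5, 'normalized_match': 1.0}
```

Running two scorers side by side shows whether a failure is a real wrong answer or only a formatting mismatch. Here the strict scorer fails the refund entry while the normalized one passes it.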