Datasets built from what your agents actually did.
Curate a failing trace into an eval row in one click. Grow the dataset with CSV imports, named splits, or the Python SDK. Keep the link back to the trace that produced it.
A failing trace is already half a test case.
Pick the spans that matter, confirm the column mapping, and the rows land. The agent work is done — dataset curation is paperwork.
Open a failing trace
From a trace or a full conversation, click Add to dataset. No copy-paste. No dump-to-JSON.
Confirm the mapping
Pick span kinds (llm, tool, agent, chain). The smart extractor auto-maps input, output, tool calls, latency, status, and model to columns.
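The extractor's job can be pictured as a field-to-column lookup. A minimal sketch in plain Python — the field names and mapping here are illustrative, not the product's internals:

```python
# Illustrative only: how span fields could map onto dataset columns.
FIELD_TO_COLUMN = {
    "input": "input",
    "output": "output",
    "tool_calls": "tools",
    "latency_ms": "latency",
    "status": "status",
    "model": "model",
}

def extract_row(span: dict) -> dict:
    """Project one span onto dataset columns; absent fields become None."""
    return {col: span.get(field) for field, col in FIELD_TO_COLUMN.items()}
```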
Rows land with receipts
Every entry is stamped with the source trace and span id. You can always get back to where the row came from.
Feels like a notebook. Ships like a database.
Everything you need to keep a growing eval set honest. Nothing you don't.
Not just input / output.
Input, output, expected output, tools, steps, scenario, files, conversation history, variable, custom. Shape the dataset for the test you need.
Move like a notebook.
Copy-paste TSV from Excel. Arrow keys, Enter, Esc. Add, reorder, and delete columns without a dialog.
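Spreadsheet clipboards are tab-separated cells with newline-separated rows, which is why the paste round-trips cleanly. A sketch of the parse, not the product's code:

```python
def parse_tsv(clipboard: str) -> list[list[str]]:
    """Split Excel-style clipboard text into rows of cells."""
    return [line.split("\t") for line in clipboard.rstrip("\n").split("\n")]
```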
No save button.
Cell edits persist as you type. The whole sheet stays consistent across tabs on the same dataset.
Train / test / validation.
Name your splits, assign manually, or auto-split by percentage. Export any split to CSV independently.
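Auto-split by percentage is shuffle-then-slice. A minimal sketch of the idea in plain Python (the default percentages and the seeded shuffle are assumptions, not the SDK's behavior):

```python
import random

def auto_split(rows, pcts=None, seed=0):
    """Shuffle once, then slice into named splits by percentage."""
    pcts = pcts or {"train": 0.8, "test": 0.1, "validation": 0.1}
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so splits are reproducible
    names = list(pcts)
    splits, start = {}, 0
    for i, name in enumerate(names):
        # The last split absorbs rounding leftovers so every row lands somewhere.
        end = len(shuffled) if i == len(names) - 1 else start + round(len(shuffled) * pcts[name])
        splits[name] = shuffled[start:end]
        start = end
    return splits
```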
Bring your existing set.
Import a CSV and missing columns get auto-created as variables. Export the whole dataset or just one split.
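The import rule is: any header the dataset doesn't already know becomes a variable column. A sketch under assumed names — `KNOWN_COLUMNS` here is a hypothetical base schema, not the real one:

```python
import csv
import io

# Hypothetical base schema, for illustration only.
KNOWN_COLUMNS = {"input", "output", "expected_output"}

def import_csv(text: str):
    """Parse a CSV; headers outside the base schema become new variable columns."""
    reader = csv.DictReader(io.StringIO(text))
    new_variables = [h for h in (reader.fieldnames or []) if h not in KNOWN_COLUMNS]
    return list(reader), new_variables
```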
Start shaped.
Five templates — prompt testing, tool testing, agent testing, simulation, custom — seed the right columns before you add a row.
Organize at scale.
Group datasets into folders. Duplicate a dataset when you need a variant.
Every row has a home.
source_type and source_ref are persisted with curated rows. One click back to the trace that produced the entry.
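A curated row's provenance might look like this — an illustrative shape, not the exact storage schema (the `sp_...` span ID format is assumed):

```python
# Illustrative row: provenance fields travel with the curated data.
row = {
    "input": "What's the refund policy?",
    "output": "Refunds are processed within 5 business days.",
    "source_type": "trace",  # this row was curated from a captured trace
    "source_ref": {"trace_id": "t_01JB4...", "span_id": "sp_01JB4..."},
}
```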
Curate and evaluate from code.
st.dataset.from_traces pulls rows server-side. st.dataset.evaluate runs your agent against the dataset in-process and returns a summary. Everything the dashboard does, the SDK does too.
import staso as st

# Curate from traces you've already captured
st.dataset.from_traces(
    dataset_id="ds_01JB4...",
    trace_ids=["t_01JB4...", "t_01JB4..."],
    span_kinds=["llm", "agent"],
)

# Run your agent against the dataset, client-side
summary = st.dataset.evaluate(
    dataset_id="ds_01JB4...",
    fn=my_agent,
    split="test",
    max_concurrency=4,
)
Curation today. Grading and runs next.
What's not in the product today — and what we're building toward.
Server-side eval runs
Soon: Kick off a run from the dashboard, not just the SDK. Persisted history; compare results across agent versions.
LLM-judge grading
Soon: Score actual vs expected output with an LLM judge. Calibrated, not hand-rolled.
Regression mode
Soon: Replay the same dataset at two agent commits and see what moved. Tied into the Self-Heal fix flow.
Synthetic generation
Soon: Extend a sparse dataset with LLM-generated rows. Schema is in the API; the endpoint wires up next.
Your next eval starts with a trace.
Pick a failing run. Add it to a dataset. Grow from there.