Evaluations

Datasets built from what your agents actually did.

Curate a failing trace into an eval row in one click. Grow the dataset from CSV, splits, or the Python SDK. Keep the link back to the trace that produced it.

01 · CURATE FROM TRACES

A failing trace is already half a test case.

Pick the spans that matter, confirm the column mapping, and the rows land. The agent work is already done; dataset curation is just paperwork.

01 · find

Open a failing trace

From a trace or a full conversation, click Add to dataset. No copy-paste. No dump-to-JSON.

02 · map

Confirm the mapping

Pick span kinds (llm, tool, agent, chain). The smart extractor auto-maps input, output, tool calls, latency, status, and model to columns.

03 · persist

Rows land with receipts

Every entry is stamped with the source trace and span id. You can always get back to where the row came from.
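The mapping step above can be sketched as a plain function. This is a hypothetical illustration, not the product's schema: `map_span_to_row` and the span field names are assumptions; only the column names (input, output, tools, latency, status, model) and the provenance fields (source_type, source_ref) come from the descriptions above.

```python
def map_span_to_row(span: dict) -> dict:
    """Map one captured span onto dataset columns.

    Hypothetical sketch: the real extractor also dispatches on span
    kind (llm, tool, agent, chain); field names here are assumptions.
    """
    return {
        "input": span.get("input"),
        "output": span.get("output"),
        "tools": span.get("tool_calls"),
        "latency": span.get("latency_ms"),
        "status": span.get("status"),
        "model": span.get("model"),
        # Provenance: every curated row keeps a link back to its source.
        "source_type": "trace",
        "source_ref": (span.get("trace_id"), span.get("span_id")),
    }
```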

02 · THE SPREADSHEET

Feels like a notebook. Ships like a database.

Everything you need to keep a growing eval set honest. Nothing you don't.

10 column types

Not just input / output.

Input, output, expected output, tools, steps, scenario, files, conversation history, variable, custom. Shape the dataset for the test you need.

keyboard-first

Move like a notebook.

Copy-paste TSV from Excel. Arrow keys, enter, esc. Add, reorder, and delete columns without a dialog.

auto-save

No save button.

Cell edits persist as you type. The whole sheet stays consistent across tabs on the same dataset.

splits

Train / test / validation.

Name your splits, assign manually, or auto-split by percentage. Export any split to CSV independently.
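Auto-split by percentage boils down to a shuffle and a partition. A minimal sketch of the idea, as a generic helper rather than the product's SDK (`assign_splits` and its signature are assumptions):

```python
import random

def assign_splits(rows: list, ratios: dict, seed: int = 0) -> dict:
    """Assign rows to named splits by percentage.

    Illustrative sketch, not the SDK. `ratios` maps split name to a
    fraction; fractions must sum to 1.
    """
    assert abs(sum(ratios.values()) - 1.0) < 1e-9
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    out, start, names = {}, 0, list(ratios)
    for i, name in enumerate(names):
        # The last split takes the remainder so every row is assigned.
        end = len(shuffled) if i == len(names) - 1 else start + round(ratios[name] * len(shuffled))
        out[name] = shuffled[start:end]
        start = end
    return out
```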

csv in / out

Bring your existing set.

Import a CSV and missing columns get auto-created as variables. Export the whole dataset or just one split.
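The import behaviour described above amounts to a schema merge: any CSV column the dataset doesn't know yet is created as a variable column. A sketch assuming a simple name-to-type schema dict (function name and shape are illustrative, not the real importer):

```python
import csv
import io

def import_csv(text: str, schema: dict) -> tuple:
    """Merge a CSV into a dataset schema.

    Illustrative sketch, not the product's importer: any CSV column
    not already in the schema is auto-created as a 'variable' column.
    """
    reader = csv.DictReader(io.StringIO(text))
    for name in reader.fieldnames or []:
        if name not in schema:
            schema[name] = "variable"
    return list(reader), schema
```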

templates

Start shaped.

Five templates — prompt testing, tool testing, agent testing, simulation, custom — seed the right columns before you add a row.

folders

Organize at scale.

Group datasets into folders. Duplicate a dataset when you need a variant.

provenance

Every row has a home.

source_type and source_ref are persisted with curated rows. One click back to the trace that produced the entry.

03 · PYTHON SDK

Curate and evaluate from code.

st.dataset.from_traces pulls rows server-side. st.dataset.evaluate runs your agent against the dataset in-process and returns a summary. Everything the dashboard does, the SDK does too.

scripts/evaluate.py

```python
import staso as st

# Curate from traces you've already captured
st.dataset.from_traces(
    dataset_id="ds_01JB4...",
    trace_ids=["t_01JB4...", "t_01JB4..."],
    span_kinds=["llm", "agent"],
)

# Run your agent against the dataset, client-side
summary = st.dataset.evaluate(
    dataset_id="ds_01JB4...",
    fn=my_agent,
    split="test",
    max_concurrency=4,
)
```

04 · ON THE ROADMAP

Curation today. Grading and runs next.

What's not in the product today — and what we're building toward.

  • Server-side eval runs

    Soon

    Kick off a run from the dashboard, not just the SDK. Persisted history, compare results across agent versions.

  • LLM-judge grading

    Soon

    Score actual vs expected output with an LLM judge. Calibrated, not hand-rolled.

  • Regression mode

    Soon

    Replay the same dataset at two agent commits and see what moved. Tied into the Self-Heal fix flow.

  • Synthetic generation

    Soon

    Extend a sparse dataset with LLM-generated rows. Schema is in the API; the endpoint wires up next.

05 · GET STARTED

Your next eval starts with a trace.

Pick a failing run. Add it to a dataset. Grow from there.