Datasets

Turn real agent traces into test data. Run your agent against it. Score every output. Your eval harness lives here.

Datasets are scoped to your workspace. Every call uses the workspace from st.init().

Create a Dataset

import staso as st

st.init(api_key="ak_...")

dataset = st.dataset.create(
    name="Checkout flow tests",
    description="Edge cases from production traces",
    columns=[
        {"name": "input", "type": "input"},
        {"name": "expected_output", "type": "expected_output"},
    ],
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | required | Human-readable name |
| description | str | "" | Optional description |
| columns | list[dict] | None | Column schema. Each dict needs name and type. |
| folder_id | str \| None | None | Place the dataset inside a folder |

Column Types

| Type | Purpose |
| --- | --- |
| input | Agent input — prompt, user message, payload |
| output | Agent output — the actual answer |
| expected_output | Ground truth used for scoring |
| expected_tool_calls | Expected tool invocations |
| expected_steps | Expected reasoning steps |
| scenario | Scenario description for agentic tests |
| conversation_history | Prior turns for multi-turn tests |
| files | File attachments |
| variable | Arbitrary variable |
| custom | Anything else |
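
Column values are plain JSON-serializable data. As a rough sketch, a multi-turn row might look like the dict below — the nested shapes for conversation_history and expected_tool_calls are illustrative assumptions, not a documented contract:

```python
import json

# Hypothetical row for a dataset with scenario, conversation_history,
# and expected_tool_calls columns; the nested shapes are assumptions.
row = {
    "scenario": "User asks to cancel an order mid-conversation",
    "conversation_history": [
        {"role": "user", "content": "Where is order #1234?"},
        {"role": "assistant", "content": "It ships tomorrow."},
    ],
    "input": "Actually, cancel it.",
    "expected_tool_calls": [
        {"name": "cancel_order", "arguments": {"order_id": "1234"}},
    ],
}

# Values should survive JSON round-tripping.
assert json.loads(json.dumps(row)) == row

# entry = st.dataset.add_entry(dataset_id=dataset.id, data=row, split="test")
```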

Get, List, Update, Delete

dataset = st.dataset.get(dataset.id)
datasets = st.dataset.list(folder_id=None, limit=100, offset=0)
dataset = st.dataset.update(dataset.id, name="Checkout v2")
st.dataset.delete(dataset.id)

Add Entries

Single entry:

entry = st.dataset.add_entry(
    dataset_id=dataset.id,
    data={
        "input": "Cancel order #1234",
        "expected_output": "Canceled. Refund in 3-5 days.",
    },
    split="test",
)

Bulk:

entries = st.dataset.add_entries(
    dataset_id=dataset.id,
    entries=[
        {"input": "...", "expected_output": "..."},
        {"input": "...", "expected_output": "..."},
    ],
    split="train",
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset_id | str | required | Target dataset |
| data | dict[str, Any] | required | Column name → value |
| split | "train" \| "test" \| "validation" \| None | None | Split assignment |
| source_trace_id | str \| None | None | Trace this entry was curated from |
| source_span_id | str \| None | None | Span this entry was curated from |

Splits

Every entry can be tagged with a split — train, test, or validation. Pass split=... to any method that takes it. Use SplitType.TEST if you prefer the enum:

from staso.dataset import SplitType

st.dataset.add_entry(
    dataset_id=dataset.id,
    data={"input": "..."},
    split=SplitType.TEST,
)

List, Update, Delete Entries

entries = st.dataset.list_entries(dataset.id, split="test", limit=100)

st.dataset.update_entry(
    dataset.id,
    entry.id,
    data={"expected_output": "Refund in 2-3 days."},
)

st.dataset.delete_entry(dataset.id, entry.id)

update_entry merges the partial data dict into the existing row — you only send the columns that changed.
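
The merge is effectively a shallow overlay of the partial dict onto the stored row. Conceptually (a local sketch, not the server implementation):

```python
# What the server already has for this entry.
stored = {"input": "Cancel order #1234", "expected_output": "Refund in 3-5 days."}

# The partial data dict you send to update_entry.
patch = {"expected_output": "Refund in 2-3 days."}

# Untouched columns survive; patched ones win.
merged = {**stored, **patch}
assert merged["input"] == "Cancel order #1234"
assert merged["expected_output"] == "Refund in 2-3 days."
```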

CSV Import and Export

count = st.dataset.upload_csv(dataset.id, "./tests.csv")
print(f"Imported {count} entries")

path = st.dataset.download_csv(dataset.id, "./exported.csv", split="test")

Headers become column names. Columns that don't exist yet are added to the dataset automatically. Use split=... to round-trip a single split.
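
An upload file is an ordinary CSV whose header row names the columns. A minimal sketch with the stdlib csv module (the file name and row contents are arbitrary):

```python
import csv

rows = [
    {"input": "Cancel order #1234", "expected_output": "Canceled. Refund in 3-5 days."},
    {"input": "Apply coupon SAVE10 twice", "expected_output": "Coupon already applied."},
]

with open("tests.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected_output"])
    writer.writeheader()   # the header row becomes the column names
    writer.writerows(rows)

# count = st.dataset.upload_csv(dataset.id, "./tests.csv")
```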

Curate From Traces

Turn real production conversations into a dataset:

dataset = st.dataset.from_traces(
    name="Production edge cases",
    trace_ids=["trace_abc", "trace_def"],
    mapping={"Input": "input", "Output": "output"},
    description="Spans that blew up in prod last week",
)

Every span in the listed traces becomes a dataset entry. Staso extracts readable text from provider payloads — Anthropic content blocks, OpenAI messages, Claude Code prompts, tool calls — and drops it into the mapped columns. Thinking blocks are skipped.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | required | Dataset name |
| trace_ids | list[str] | required | Traces to pull spans from |
| mapping | dict[str, str] \| None | None | Column name → extractor key (see below) |
| description | str | "" | Optional description |

Supported extractor keys: input, output, tool_calls, model, kind, span_name, status, trace_id, session_id, error_message, duration_ms, input_tokens, output_tokens, total_tokens.
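
A wider mapping can pull span metadata into its own columns. The column names on the left are arbitrary; only the extractor keys on the right must come from the supported list:

```python
# Column names (keys) are yours; extractor keys (values) must be
# from the supported list above.
mapping = {
    "Input": "input",
    "Output": "output",
    "Model": "model",
    "Latency (ms)": "duration_ms",
    "Total tokens": "total_tokens",
}

supported = {
    "input", "output", "tool_calls", "model", "kind", "span_name",
    "status", "trace_id", "session_id", "error_message", "duration_ms",
    "input_tokens", "output_tokens", "total_tokens",
}
assert set(mapping.values()) <= supported

# dataset = st.dataset.from_traces(name="Prod metadata",
#                                  trace_ids=["trace_abc"], mapping=mapping)
```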

If you pass no mapping, the dataset is created with the default template and columns are auto-mapped by name.

Folders

Organize datasets into folders. Folders are scoped to your workspace.

folder = st.dataset.create_folder(name="Checkout", parent_id=None)

nested = st.dataset.create_folder(name="Refunds", parent_id=folder.id)

folders = st.dataset.list_folders(parent_id=None)

st.dataset.delete_folder(folder.id)

Pass folder_id=folder.id to st.dataset.create() to place a new dataset inside a folder.

Evaluate

Run your agent against a dataset. Each row is wrapped in a trace and scored by your scorers, and a summary of the results is returned.

def my_agent(row: dict) -> str:
    return call_claude(row["input"])

def exact_match(row: dict, output: str) -> float:
    return 1.0 if output == row["expected_output"] else 0.0

summary = st.dataset.evaluate(
    dataset_id=dataset.id,
    fn=my_agent,
    scorers=[exact_match],
    split="test",
    max_concurrency=4,
)

print(f"{summary.passed}/{summary.total} passed")
print(f"Avg exact_match: {summary.scores['exact_match']:.2%}")

for result in summary.results:
    if not result.passed:
        print(f"FAIL {result.entry_id}: {result.actual}")

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset_id | str | required | Target dataset |
| fn | Callable[[dict], Any] | required | Your agent function. Receives the row data, returns the output. |
| scorers | list[Callable[[dict, Any], float]] | None | Scoring functions. Each returns a float 0-1. |
| split | "train" \| "test" \| "validation" \| None | None | Run on a specific split |
| max_concurrency | int | 1 | Parallel workers |
| trace | bool | True | Wrap every call in st.span for tracing |

The returned EvalSummary has:

  • .total, .passed, .failed, .error_count
  • .avg_duration_ms
  • .scores — per-scorer averages
  • .results — list of EvalResult: entry_id, input_data, expected, actual, scores, passed, error, trace_id, duration_ms

Scorers are optional. Omit them if you just want to run your agent across the dataset and capture traces.
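
A scorer is just a callable taking (row, output) and returning a float in [0, 1], so you can go looser than exact match. A sketch using difflib from the stdlib (the threshold you pass/fail on is up to you):

```python
from difflib import SequenceMatcher

def fuzzy_match(row: dict, output: str) -> float:
    """Similarity ratio between actual and expected output, 0.0-1.0."""
    expected = row.get("expected_output", "")
    return SequenceMatcher(None, output, expected).ratio()

# summary = st.dataset.evaluate(dataset_id=dataset.id, fn=my_agent,
#                               scorers=[fuzzy_match])

row = {"expected_output": "Refund in 3-5 days."}
assert fuzzy_match(row, "Refund in 3-5 days.") == 1.0
assert fuzzy_match(row, "No refund available.") < 1.0
```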

Generate Synthetic Entries

Seed a dataset from a prompt or existing rows:

new_entries = st.dataset.generate(
    dataset_id=dataset.id,
    count=20,
    prompt="Generate checkout edge cases that stress test coupon logic",
)

# Or seed from existing rows
new_entries = st.dataset.generate(
    dataset_id=dataset.id,
    count=10,
    seed_entry_ids=[entry_1.id, entry_2.id],
)

Staso uses a high-reasoning model to produce entries that match your column schema. Generation counts against your monthly synthetic generation limit.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset_id | str | required | Target dataset |
| count | int | 10 | How many entries to generate |
| prompt | str \| None | None | Generation prompt |
| seed_entry_ids | list[str] \| None | None | Existing entries to use as seeds |

Errors

Every dataset method raises StasoDatasetError on failure. The exception carries status_code and response_body for debugging:

from staso.dataset import StasoDatasetError

try:
    st.dataset.add_entry(dataset_id="bad", data={"input": "hi"})
except StasoDatasetError as e:
    if e.status_code == 402:
        print("Plan limit reached")
    elif e.status_code == 404:
        print("Dataset not found")
    else:
        raise
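
Transient failures (429, 5xx) are often worth retrying, while other 4xx client errors are not. A generic backoff sketch — the retryable status set and backoff policy are assumptions, and any exception carrying a status_code attribute (such as StasoDatasetError) works:

```python
import time

def with_retries(call, *, attempts=3, base_delay=0.5,
                 retryable=(429, 500, 502, 503)):
    """Retry `call` with exponential backoff on retryable status codes."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as e:
            status = getattr(e, "status_code", None)
            if status not in retryable or attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# with_retries(lambda: st.dataset.add_entry(dataset_id=dataset.id,
#                                           data={"input": "hi"}))
```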
