v0.11.0 (latest)
Evaluations and Agent Judge: score every agent run
- NEW: Evaluations scores agent runs after they finish: point a rule at any trace, session, agent, dataset, or time range
- NEW: Nine zero-config rules grade task completion, error rate, latency, cost, hallucination, and more
- NEW: Agent Judge grades multi-file agent outputs: a fresh agent reads the files against your rubric and shows its work
- NEW: Deterministic prechecks gate every Agent Judge run, so structural failures fail fast without spending any LLM calls
- NEW: Write custom eval rules as an LLM prompt, your own Python, or an agentic judge
- NEW: Compare two eval runs side by side and see exactly which verdicts flipped
- NEW: Run evals straight from the SDK: kick off a run, poll, and pull verdicts in a few lines
- IMP: Guard and Eval rules now share one Rules page, so you can promote an eval rule to a real-time block in one click
- IMP: New agent pages surface recent verdicts, diagnoses, traces, and drift in one place
- IMP: Upload agent-run file bundles to a dataset and browse them in the dashboard
- IMP: Added a team tier to pricing
- PRF: Eval LLM judges batch per trace, cutting judge calls by an order of magnitude
- SEC: Eval routes are gated by per-action permissions and plan limits
- DOC: New eval docs cover runs, verdicts, compare, and Agent Judge
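
To illustrate the precheck gate described above, here is a minimal sketch of the general pattern: cheap deterministic checks run first, and the LLM judge is only invoked if all of them pass. This is not the product's implementation or SDK. Every name here (`Verdict`, `require_file`, `non_empty`, `run_gated`, and the shape of the `judge` callable) is hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration; not the product's actual API.
Files = Dict[str, str]                      # file name -> file contents
Precheck = Callable[[Files], Optional[str]] # returns an error message, or None if OK
Judge = Callable[[Files], Tuple[bool, str]] # returns (passed, reason)


@dataclass
class Verdict:
    passed: bool
    reason: str
    judge_called: bool


def require_file(name: str) -> Precheck:
    """Precheck that fails when a required file is absent from the bundle."""
    def check(files: Files) -> Optional[str]:
        return None if name in files else f"missing required file: {name}"
    return check


def non_empty(name: str) -> Precheck:
    """Precheck that fails when a file is missing or contains only whitespace."""
    def check(files: Files) -> Optional[str]:
        return None if files.get(name, "").strip() else f"file is empty: {name}"
    return check


def run_gated(files: Files, prechecks: List[Precheck], judge: Judge) -> Verdict:
    """Run deterministic prechecks first; call the judge only if all pass."""
    for check in prechecks:
        error = check(files)
        if error is not None:
            # Structural failure: fail fast, the judge is never invoked.
            return Verdict(passed=False, reason=error, judge_called=False)
    passed, reason = judge(files)
    return Verdict(passed=passed, reason=reason, judge_called=True)
```

In this sketch a bundle that lacks `main.py` is rejected by `require_file("main.py")` before the `judge` callable runs, so the failure costs no model call. The ordering of `prechecks` matters only for which error message is reported first.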
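
The run comparison mentioned above can be approximated offline once verdicts are exported. The sketch below assumes each run is available as a plain mapping from trace ID to a verdict string; that data shape and the function name `flipped_verdicts` are assumptions for illustration, not the product's export format or API.

```python
from typing import Dict, List, Tuple


def flipped_verdicts(
    baseline: Dict[str, str], candidate: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (trace_id, baseline_verdict, candidate_verdict) for every trace
    present in both runs whose verdict differs, sorted by trace ID."""
    flips = []
    # Only traces scored in both runs can be said to have flipped.
    for trace_id in sorted(baseline.keys() & candidate.keys()):
        if baseline[trace_id] != candidate[trace_id]:
            flips.append((trace_id, baseline[trace_id], candidate[trace_id]))
    return flips


# Hypothetical verdicts from two eval runs over overlapping traces.
run_a = {"trace-1": "pass", "trace-2": "fail", "trace-3": "pass"}
run_b = {"trace-1": "pass", "trace-2": "pass", "trace-3": "fail", "trace-4": "pass"}

flips = flipped_verdicts(run_a, run_b)
# flips == [("trace-2", "fail", "pass"), ("trace-3", "pass", "fail")]
```

Traces that appear in only one run (such as `trace-4` here) are deliberately left out, since there is no earlier verdict to compare against.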
+55 under-the-hood changes shipped