A/B Experiments (Scale+)
Run N variants of a request in parallel across models. Compare cost, latency, quality.
A/B Experiments let you benchmark models on your actual production traffic without writing glue code. You define an experiment with 2-5 candidate models, a sample rate (e.g. 5% of matching requests), and a stop condition. HiWay fans out each sampled request to every candidate in parallel, records cost and latency for each, and lets you tag outcomes for quality scoring.
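As a rough sketch of the sampling step, a request joins an experiment only if it matches the experiment's `match_filter` and then passes the sample-rate coin flip. The field names below follow the experiment config format; `should_sample` itself is illustrative, not HiWay's implementation:

```python
import random

def should_sample(request_meta: dict, experiment: dict) -> bool:
    """Decide whether a request is fanned out to all candidates.

    Illustrative only: field names ("match_filter", "sample_rate") come
    from the experiment config; the actual routing logic may differ.
    """
    # Every match_filter key must agree with the request's metadata.
    for key, expected in experiment["match_filter"].items():
        if request_meta.get(key) != expected:
            return False
    # Matching requests are then sampled at the configured rate.
    return random.random() < experiment["sample_rate"]
```

Note that non-matching traffic is never sampled, so the sample rate applies only to requests that satisfy the filter.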
Scale plan and above
A/B Experiments are available on the Scale, Business, and Enterprise plans.
Experiment config
```json
{
  "name": "haiku-vs-gpt4o-mini-on-classification",
  "candidates": ["anthropic/claude-haiku-4-5", "openai/gpt-4o-mini"],
  "sample_rate": 0.05,
  "match_filter": { "tier": "light", "has_tools": false },
  "stop_after": { "requests": 1000 },
  "primary_metric": "cost_per_request"
}
```

What you get
- Per-candidate aggregate: average cost, p50/p95 latency, error rate, sample size
- Pairwise winner with confidence interval (one-sided t-test on cost)
- Optional human quality scoring via `POST /v1/experiments/:id/score`
- Export raw per-request data to CSV