A/B Experiments (Scale+)

Run the same request against several candidate models in parallel. Compare cost, latency, and quality.

A/B Experiments let you benchmark models on your actual production traffic without writing glue code. You define an experiment with 2-5 candidate models, a sample rate (e.g. 5% of matching requests), and a stop condition. HiWay fans those requests out to every candidate in parallel, records cost and latency, and lets you tag outcomes for quality scoring.
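To make the sampling behavior concrete, here is a minimal sketch of how a proxy could decide whether a request enters an experiment. The function names and the hash-based sampling scheme are illustrative assumptions, not HiWay's actual implementation; the doc only specifies that a sample rate and a match filter gate which requests fan out.

```python
import hashlib

def matches(request_meta: dict, match_filter: dict) -> bool:
    """True if every key in the match filter equals the request's metadata value."""
    return all(request_meta.get(k) == v for k, v in match_filter.items())

def in_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: hash the request ID into [0, 1) and compare.

    Hashing (rather than random()) keeps the decision stable across retries.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

def should_experiment(request_id: str, request_meta: dict, experiment: dict) -> bool:
    """Route a request into the experiment only if it matches the filter AND is sampled."""
    return (matches(request_meta, experiment["match_filter"])
            and in_sample(request_id, experiment["sample_rate"]))
```

A request tagged `{"tier": "light", "has_tools": False}` would pass the example filter below, and roughly 5% of such requests would be fanned out.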

Scale plan and above

A/B Experiments are available on the Scale, Business, and Enterprise plans.

Experiment config

```json
{
  "name":           "haiku-vs-gpt4o-mini-on-classification",
  "candidates":     ["anthropic/claude-haiku-4-5", "openai/gpt-4o-mini"],
  "sample_rate":    0.05,
  "match_filter":   { "tier": "light", "has_tools": false },
  "stop_after":     { "requests": 1000 },
  "primary_metric": "cost_per_request"
}
```

What you get

  • Per-candidate aggregate: average cost, p50/p95 latency, error rate, sample size
  • Pairwise winner with confidence interval (one-sided t-test on cost)
  • Optional human quality scoring via POST /v1/experiments/:id/score
  • Export raw per-request data to CSV
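The pairwise winner above is decided by a one-sided t-test on cost. As a rough sketch of what that comparison looks like, here is a Welch's t-statistic with a large-sample z threshold (~95% one-sided confidence). HiWay's exact statistical procedure and thresholds are not documented here; this is only an illustration of the technique named above.

```python
from statistics import mean, variance

def welch_t(costs_a: list[float], costs_b: list[float]) -> float:
    """Welch's t statistic; positive when candidate A looks cheaper than B."""
    na, nb = len(costs_a), len(costs_b)
    se = (variance(costs_a) / na + variance(costs_b) / nb) ** 0.5
    return (mean(costs_b) - mean(costs_a)) / se

def cheaper_with_confidence(costs_a, costs_b, z=1.645):
    """Return 'a', 'b', or None if neither is cheaper at ~95% one-sided confidence."""
    t = welch_t(costs_a, costs_b)
    if t > z:
        return "a"
    if t < -z:
        return "b"
    return None
```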