A/B Experiments (Scale+)

Run the same request against several candidate models in parallel. Compare cost, latency, and quality.

A/B Experiments let you benchmark models on your actual production traffic without writing glue code. You define an experiment with 2-5 candidate models, a sample rate (e.g. 5% of matching requests), and a stop condition. HiWay fans those requests out to every candidate in parallel, records cost and latency, and lets you tag outcomes for quality scoring.
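To make the sampling behavior concrete, here is a minimal sketch of how a proxy could decide whether a request enters an experiment. The function names and the hash-based sampling scheme are illustrative assumptions, not HiWay's actual implementation; the doc only specifies that a sample rate and a match filter gate which requests fan out.

```python
import hashlib

def matches(request_meta: dict, match_filter: dict) -> bool:
    """True if every key in the match filter equals the request's metadata value."""
    return all(request_meta.get(k) == v for k, v in match_filter.items())

def in_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: hash the request ID into [0, 1) and compare.

    Hashing (rather than random()) keeps the decision stable across retries.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

def should_experiment(request_id: str, request_meta: dict, experiment: dict) -> bool:
    """Route a request into the experiment only if it matches the filter AND is sampled."""
    return (matches(request_meta, experiment["match_filter"])
            and in_sample(request_id, experiment["sample_rate"]))
```

A request tagged `{"tier": "light", "has_tools": False}` would pass the example filter below, and roughly 5% of such requests would be fanned out.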

Scale plan and above

A/B Experiments are available on the Scale, Business, and Enterprise plans.

Experiment config

```json
{
  "name":           "haiku-vs-gpt4o-mini-on-classification",
  "candidates":     ["anthropic/claude-haiku-4-5", "openai/gpt-4o-mini"],
  "sample_rate":    0.05,
  "match_filter":   { "tier": "light", "has_tools": false },
  "stop_after":     { "requests": 1000 },
  "primary_metric": "cost_per_request"
}
```

What you get

  • Per-candidate aggregate: average cost, p50/p95 latency, error rate, sample size
  • Pairwise winner with confidence interval (one-sided t-test on cost)
  • Optional human quality scoring via POST /v1/experiments/:id/score
  • Export raw per-request data to CSV
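The pairwise winner above is decided by a one-sided t-test on cost. As a rough sketch of what that comparison looks like, here is a Welch's t-statistic with a large-sample z threshold (~95% one-sided confidence). HiWay's exact statistical procedure and thresholds are not documented here; this is only an illustration of the technique named above.

```python
from statistics import mean, variance

def welch_t(costs_a: list[float], costs_b: list[float]) -> float:
    """Welch's t statistic; positive when candidate A looks cheaper than B."""
    na, nb = len(costs_a), len(costs_b)
    se = (variance(costs_a) / na + variance(costs_b) / nb) ** 0.5
    return (mean(costs_b) - mean(costs_a)) / se

def cheaper_with_confidence(costs_a, costs_b, z=1.645):
    """Return 'a', 'b', or None if neither is cheaper at ~95% one-sided confidence."""
    t = welch_t(costs_a, costs_b)
    if t > z:
        return "a"
    if t < -z:
        return "b"
    return None
```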