Documentation

Build with HiWay2LLM

Learn how smart routing works, integrate any chat-completions client in 2 minutes, and master Budget Controls, auto-recharge and the Guardian system.

Start with the quickstart

Getting started

Quickstart

From signup to your first routed request in under 2 minutes.

Open-source SDK & CLI

Python, TypeScript and a one-liner CLI. MIT-licensed. Use HiWay without touching the dashboard.

Drop-in with your existing SDK

OpenAI, Anthropic, LangChain, Vercel AI SDK, n8n - change one line.

Authentication

How to sign requests with your hw_live_ key.

Concepts

How smart routing works

Seven analyzers, sub-millisecond decision, no LLM call.

Pricing model

BYOK degressive markup. Your providers bill you directly. 9-12.5% on Scale.

Guardian - anti-loop system

Per-workspace rules to block runaway agents, duplicate traffic and cost spikes.

Budget Control

Cap your monthly upstream BYOK spend. Verdict: BLOCK, DOWNGRADE or LIGHT_ONLY.

Provider fallback

When a provider fails, HiWay retries against the cheapest same-tier model. Max 2 retries.

Semantic cache

Skip identical and near-identical requests entirely - zero tokens, instant replies.

Anthropic prompt caching (auto-injected)

We add `cache_control` breakpoints to your Anthropic requests automatically. ~10x cheaper input on cache hits, zero config.

PII masking

Opt-in. Regex on email / phone / card / IBAN / API keys before cache hashing.

Passthrough mode & grace cap

Wallet hits 0? Service keeps running BYOK-direct for 72h / 100k tokens, then a soft stop until you top up.

A/B Experiments

Run N variants of a request in parallel across models. Compare cost, latency, quality.

Response envelope

OpenAI-compatible body + _hiway metadata + X-HiWay-Routed-* headers.

Streaming responses

How HiWay forwards Server-Sent Events end-to-end.

Tool calls and function calls

HiWay is tool-calling transparent across every supported provider.

System prompts and routing

Why the system prompt affects which tier your request is routed to.

Features

Editing Guardian rules

Per-workspace thresholds, alert webhooks, rule-level on/off.

Setting a Budget Control cap

Monthly USD cap on upstream BYOK cost.

Enabling semantic cache

vector-store-backed cache - available on all packs.

Enabling PII masking

Regex masking before cache, embedding and (optionally) provider call.

Running an A/B experiment

Benchmark models on real traffic without writing glue code.

CORTEX v2 - On-board AI

7-layer AI system that monitors, learns your optimal model mix, enforces sovereignty, and lets you chat about your own data.

Security Shield

Two-tier prompt security that blocks injection, jailbreaks, PII leaks, and secrets, under 2 ms, always on.

Configuring the Security Shield

Per-workspace and per-key modes, custom thresholds, IP rules, and SIEM webhooks.

Threat reference

Detailed description of each threat type, example payloads, and tuning guidance.

Security Shield & compliance

How the Security Shield supports GDPR, SOC 2, and enterprise audit requirements.

Router bypass (per-key direct routing)

Skip CORTEX routing for explicit model requests. Forward straight to the provider. Standard markup still applies.

Multimodal

Multimodal quickstart

Generate your first image, audio, or embedding in under 5 minutes.

Quality tiers

T1 / T2 / T3 - cost vs quality tradeoffs, per modality.

Image generation

fal.ai Flux, OpenAI DALL-E 3, Stability AI SD3 - via your own keys.

Video generation

Async jobs, preview gate, fal.ai + Runway - up to 10 seconds.

Audio - TTS & STT

Text-to-speech with 7-day cache. Speech-to-text up to 250 MB.

Embeddings

BYOK embeddings with 30-day Redis cache. OpenAI, Cohere, Voyage AI.

Partners Program

Earn 5-10% of HiWay2LLM fees generated by your referred users - for life.

Partner setup tutorial

Step-by-step guide to earning your first commission in under 10 minutes.

Build with HiWay2LLM

Getting started

Quickstart

Open-source SDK & CLI

Drop-in with your existing SDK

Authentication

Concepts

How smart routing works

Pricing model

Guardian - anti-loop system

Budget Control

Provider fallback

Semantic cache

Anthropic prompt caching (auto-injected)

PII masking

Passthrough mode & grace cap

A/B Experiments

Response envelope

Streaming responses

Tool calls and function calls

System prompts and routing

Features

Editing Guardian rules

Setting a Budget Control cap

Enabling semantic cache

Enabling PII masking

Running an A/B experiment

CORTEX v2 - On-board AI

Security Shield

Configuring the Security Shield

Threat reference

Security Shield & compliance

Router bypass (per-key direct routing)

Multimodal

Multimodal quickstart

Quality tiers

Image generation

Video generation

Audio - TTS & STT

Embeddings

Providers

Provider directory

Anthropic - Claude

fal.ai - Images & Video

Integrations

OpenAI Python SDK

OpenAI Node.js SDK

LangChain

Vercel AI SDK

n8n workflows

curl and raw HTTP

Migrate

From OpenRouter

From LiteLLM

From Vercel AI Gateway

From Portkey

From direct provider APIs (OpenAI / Anthropic)

API reference

POST /v1/chat/completions

GET /v1/me

GET /v1/models

Error codes

Webhooks

Troubleshooting

402 - Quota or budget exceeded

401 - Unauthorized

429 - Rate limited

502 - Upstream unavailable

Frequently Asked Questions

Glossary

Changelog

Partners Program

Partners Program

Partner setup tutorial