Audio - TTS & STT

Text-to-speech with a 7-day response cache. Speech-to-text for uploads up to 250 MB.

TTS - Text-to-speech

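POST /v2/audio/tts returns an audio_file_id. Stream the audio with GET /v2/audio/file/{file_id}, passing that ID as {file_id}. Responses are cached in Redis for 7 days, keyed on {text, provider, model, voice, speed}, so an identical request within that window is served from the cache.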
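
A minimal sketch of the two-step flow above, using only the Python standard library. The host in BASE_URL, the JSON body shape, the default values, and the "audio_file_id" response key are assumptions for illustration; the field names simply mirror the documented cache key. Authentication is not covered here, so pass whatever headers the API requires through extra_headers.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.example.com"  # placeholder host, not from the docs


def build_tts_request(text, provider="openai", model="tts-1", voice="alloy",
                      speed=1.0, extra_headers=None):
    """Build (but do not send) the POST /v2/audio/tts request."""
    # Assumption: the body is JSON and uses the same five fields as the cache key.
    payload = {"text": text, "provider": provider, "model": model,
               "voice": voice, "speed": speed}
    headers = {"Content-Type": "application/json", **(extra_headers or {})}
    return urllib.request.Request(
        f"{BASE_URL}/v2/audio/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def audio_file_url(audio_file_id):
    """URL for GET /v2/audio/file/{file_id}, with the ID percent-encoded."""
    return f"{BASE_URL}/v2/audio/file/{quote(str(audio_file_id), safe='')}"


def synthesize(text, **kwargs):
    """Send the TTS request and return the streaming URL for the result."""
    with urllib.request.urlopen(build_tts_request(text, **kwargs)) as resp:
        body = json.load(resp)
    # Assumption: the response is JSON with the ID under "audio_file_id".
    return audio_file_url(body["audio_file_id"])
```

Because the cache is keyed on all five fields, changing only voice or speed produces a new cache entry rather than reusing an existing one.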

OpenAI - tts-1 T1 ($15/1M chars) · tts-1-hd T2 ($30/1M chars). Voices: alloy, echo, fable, onyx, nova, shimmer.
ElevenLabs - eleven_turbo_v2_5 T1 ($22/1M chars) · eleven_multilingual_v2 T3 ($99/1M chars).
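
To compare models before sending a request, the per-million rates above can be turned into a rough cost estimate. This sketch assumes every rate is per 1M characters and that characters are counted as Python string length; how the API actually counts characters, and whether cache hits are billed, is not stated here.

```python
# USD per 1M characters, copied from the two lines above.
TTS_PRICE_PER_MILLION_CHARS = {
    "tts-1": 15.0,
    "tts-1-hd": 30.0,
    "eleven_turbo_v2_5": 22.0,
    "eleven_multilingual_v2": 99.0,
}


def estimate_tts_cost(text, model):
    """Estimated USD cost of synthesizing `text` once with `model`."""
    # Assumption: one character == one Python code point.
    return len(text) * TTS_PRICE_PER_MILLION_CHARS[model] / 1_000_000
```

For a 1,000-character input this gives about $0.015 on tts-1 and about $0.099 on eleven_multilingual_v2.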

STT - Speech-to-text

POST /v2/audio/stt takes a multipart form upload. Accepts MP3, WAV, FLAC, M4A, OGG, WEBM. Maximum file size: 25 MB for Whisper, 250 MB for Deepgram.
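
A client can check the format and size limits above before uploading. The sketch below is one way to do that. The provider labels "whisper" and "deepgram" are hypothetical keys, format detection by file extension is a simplification, and treating 1 MB as 2^20 bytes is an assumption, since the unit is not defined here.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: MB could also mean 10**6 bytes

ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".webm"}

# Keys are hypothetical labels; the limits are the documented ones.
MAX_UPLOAD_BYTES = {"whisper": 25 * MB, "deepgram": 250 * MB}


def check_upload(filename, size_bytes, provider):
    """Return a list of problems; an empty list means the upload looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    limit = MAX_UPLOAD_BYTES.get(provider)
    if limit is None:
        problems.append(f"unknown provider: {provider}")
    elif size_bytes > limit:
        problems.append(
            f"{size_bytes} bytes exceeds the {limit // MB} MB limit for {provider}"
        )
    return problems
```

A 100 MB recording fails this check for Whisper but passes for Deepgram, which matches the limits stated above.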

Deepgram - nova-2 T1 ($0.0043/min) - 250 MB limit
OpenAI - whisper-1 T2 ($0.006/min) - 25 MB limit
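
The per-minute rates above translate into a simple cost estimate. This sketch assumes pro-rata billing by the second; whether the API rounds up to whole minutes is not stated here.

```python
# USD per minute of audio, copied from the list above.
STT_PRICE_PER_MINUTE = {"nova-2": 0.0043, "whisper-1": 0.006}


def estimate_stt_cost(duration_seconds, model):
    """Estimated USD cost of transcribing audio of the given length."""
    # Assumption: billing is proportional to duration, with no rounding.
    return duration_seconds / 60 * STT_PRICE_PER_MINUTE[model]
```

For a 10-minute recording this gives about $0.043 on nova-2 and about $0.06 on whisper-1.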