Three providers behind one API
ElevenLabs (legacy + new voice catalog, voice cloning support), MiniMax (descriptive voice names per language), OpenAI (six fixed voices: alloy, echo, fable, onyx, nova, shimmer). Switch via the provider field.
Convert text to natural speech in 13+ languages through three providers (ElevenLabs, MiniMax, OpenAI).
Generate high-quality speech audio from text through multiple synthesis providers, all behind a single unified API. The service exposes three providers: ElevenLabs (primary, with legacy voice catalog and full language/gender support), MiniMax (multi-language with descriptive voice names), and OpenAI (six pre-built voices, language-agnostic). Voice selection happens by language + gender from the shared catalog (also used by the Dubbing service), or by explicit voice_id for fine-grained control. Configurable voice_settings (stability, similarity_boost, etc.) and an optional hd_quality flag let you tune output. The generated audio is uploaded to storage and returned as a URL.
Pass language + gender; the service resolves the best matching voice from the provider's catalog. ElevenLabs and MiniMax support 13+ languages; OpenAI is language-agnostic (voices are fixed).
Pass clone_from_audio (URL) or clone_from_video to clone the voice from a sample on the fly. Available on ElevenLabs and MiniMax (capability flag supports_voice_cloning in the catalog).
stability (0-1), similarity_boost (0-1), speed, pitch (-12 to +12 semitones), volume (0.1-10), emotion. Exact support varies by provider — check supports_voice_settings in the catalog response.
Set hd_quality:true to switch to a higher-fidelity model on supported providers. Same per-second pricing.
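The voice setting ranges listed above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the service: the function name `validate_voice_settings` and the choice to reject out-of-range values rather than clamp them are assumptions. Only the ranges stated above are checked; speed and emotion have no documented range here, so they pass through unchanged.

```python
# Documented ranges: stability 0-1, similarity_boost 0-1,
# pitch -12..+12 semitones, volume 0.1-10.
DOCUMENTED_RANGES = {
    "stability": (0.0, 1.0),
    "similarity_boost": (0.0, 1.0),
    "pitch": (-12.0, 12.0),
    "volume": (0.1, 10.0),
}


def validate_voice_settings(settings: dict) -> dict:
    """Return a copy of settings, raising ValueError on out-of-range values."""
    for name, value in settings.items():
        bounds = DOCUMENTED_RANGES.get(name)
        if bounds is None:
            continue  # no documented range (e.g. speed, emotion): pass through
        low, high = bounds
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
    return dict(settings)
```

Since exact support varies by provider, a check like this is best paired with the supports_voice_settings flag from the catalog response.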
Generate voiceovers for tutorials, product demos, explainers without hiring a voice actor.
Convert articles, e-books, or documentation to audio for visually impaired users.
Generate dynamic voice prompts in 13+ languages.
Audio lessons and lecture narrations across languages for online courses.
Input
Text + provider + (language + gender) OR voice_id; optional voice_settings, speed, hd_quality, clone_from_audio/video
Output
Generated audio URL with duration metadata; usage.units = duration in seconds (used for billing)
Prerequisites
Inspect all providers, their supported languages, voice IDs, and capability flags.
GET /v1/proxy/core-tts-voices
Response
{
  "status": "success",
  "data": {
    "elevenlabs": {
      "legacy_voices": ["rachel", "adam", "antoni", "arnold", "bella", "domi", "elli", "josh", "nicole", "sam"],
      "legacy_voice_mapping": {
        "rachel": "21m00Tcm4TlvDq8ikWAM",
        "adam": "pNInz6obpgDQGcFmaJgB"
      },
      "languages": ["dutch", "english", "french", "german", "hindi", "indonesian", "italian", "japanese", "korean", "mandarin", "portuguese", "spanish", "turkish"],
      "voice_catalog": {
        "english": { "female": "56AoDkrOh6qfVPDXZ7Pt", "male": "UgBBYS2sOqTuMpoF3BR0" },
        "turkish": { "female": "KbaseEXyT9EE0CQLEfbB", "male": "5WzTv66bK7WWszHUzwZ5" }
      },
      "supports_custom_voice_id": true,
      "supports_voice_settings": true,
      "supports_voice_cloning": true,
      "supports_language_gender": true
    },
    "minimax": {
      "languages": ["arabic", "cantonese", "english", "french", "german", "indonesian", "italian", "japanese", "korean", "mandarin", "portuguese", "spanish", "turkish"],
      "voice_catalog": {
        "english": { "male": "English_Trustworth_Man", "female": "English_CalmWoman" },
        "turkish": { "male": "Turkish_Trustworthyman", "female": "Turkish_CalmWoman" }
      },
      "supports_custom_voice_id": true,
      "supports_voice_settings": true,
      "supports_voice_cloning": true,
      "supports_language_gender": true
    },
    "openai": {
      "voices": ["alloy", "echo", "fable", "onyx", "nova", "shimmer"],
      "voice_mapping": { "alloy": "alloy", "echo": "echo" },
      "supports_custom_voice_id": false,
      "supports_voice_settings": false,
      "supports_voice_cloning": false,
      "supports_language_gender": false,
      "note": "OpenAI voices are language-agnostic"
    }
  }
}
Three providers are always returned. ElevenLabs has both legacy_voices (named keys like 'rachel') AND voice_catalog (language/gender). OpenAI has neither, only the six fixed voices. MiniMax voice IDs are descriptive strings (e.g. 'English_Trustworth_Man').
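To illustrate how the catalog might be consumed, here is a minimal sketch that looks up a voice ID by language and gender in a catalog response shaped like the one above. The helper name `resolve_voice` is hypothetical, and the service performs its own resolution server-side; this only mirrors the lookup on the client, using an abbreviated copy of the sample data.

```python
from typing import Optional

# Abbreviated copy of the sample "data" object shown above.
CATALOG = {
    "elevenlabs": {
        "voice_catalog": {
            "english": {"female": "56AoDkrOh6qfVPDXZ7Pt", "male": "UgBBYS2sOqTuMpoF3BR0"},
        },
        "supports_language_gender": True,
    },
    "minimax": {
        "voice_catalog": {
            "english": {"male": "English_Trustworth_Man", "female": "English_CalmWoman"},
        },
        "supports_language_gender": True,
    },
    "openai": {
        "voices": ["alloy", "echo", "fable", "onyx", "nova", "shimmer"],
        "supports_language_gender": False,
    },
}


def resolve_voice(catalog: dict, provider: str, language: str, gender: str) -> Optional[str]:
    """Return the catalog voice ID for language + gender, or None if unavailable."""
    entry = catalog.get(provider, {})
    if not entry.get("supports_language_gender"):
        return None  # e.g. OpenAI: choose one of the fixed voices instead
    return entry.get("voice_catalog", {}).get(language, {}).get(gender)
```

Returning None instead of raising keeps the caller free to fall back to an explicit voice_id when a provider has no language/gender catalog.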
POST the text with either language + gender or a voice_id. The default provider is elevenlabs. The response carries the audio URL in data.url and the clip length in data.duration_seconds.
{
  "text": "Welcome to our platform.",
  "provider": "elevenlabs",
  "language": "english",
  "gender": "female",
  "speed": 1.0,
  "hd_quality": false,
  "voice_settings": { "stability": 0.85, "similarity_boost": 0.85 }
}
Response
{
  "status": "success",
  "data": {
    "url": "https://s3.example.com/audio/tts/.../abc.mp3",
    "key": "audio/tts/apk_8snm0/.../abc.mp3",
    "folder_id": "f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6",
    "provider": "elevenlabs",
    "voice_id": "56AoDkrOh6qfVPDXZ7Pt",
    "cloned_voice_id": null,
    "duration_seconds": 3.2
  },
  "usage": { "units": 3.2, "unit_type": "second", "details": { "provider": "elevenlabs", "text_length": 32, "cloned_voice_used": false } },
  "gateway": { "request_id": "req_abc", "service": "core-tts" },
  "cost": { "units": 3.2, "unit_price": 0.01, "tokens": 0.032, "balance": 99.97 }
}
data.url (NOT audio_url) is the presigned/public URL; download or stream from there. cloned_voice_id is only set when clone_from_audio/clone_from_video was used. Billing is per second of generated audio (cost.units = duration_seconds).
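The request and response shapes above could be handled on the client along these lines. This is a sketch under stated assumptions: the helper names `build_tts_payload` and `extract_audio` are hypothetical, the rule that a request needs either voice_id or both language and gender is inferred from the description above, and no HTTP call is made because the base URL and authentication scheme are not given here.

```python
from typing import Optional, Tuple


def build_tts_payload(
    text: str,
    provider: str = "elevenlabs",  # documented default provider
    language: Optional[str] = None,
    gender: Optional[str] = None,
    voice_id: Optional[str] = None,
    hd_quality: bool = False,
) -> dict:
    """Build a JSON-serializable request body for the core-tts endpoint."""
    if voice_id is None and not (language and gender):
        raise ValueError("pass voice_id, or both language and gender")
    payload = {"text": text, "provider": provider, "hd_quality": hd_quality}
    if voice_id is not None:
        payload["voice_id"] = voice_id  # explicit voice_id overrides language + gender
    else:
        payload["language"] = language
        payload["gender"] = gender
    return payload


def extract_audio(response: dict) -> Tuple[str, float]:
    """Return (url, duration_seconds) from a successful response body."""
    if response.get("status") != "success":
        raise RuntimeError(f"unexpected status: {response.get('status')!r}")
    data = response["data"]
    # The audio location is data.url; there is no audio_url field.
    return data["url"], data["duration_seconds"]
```

The payload can be serialized with json.dumps and sent with any HTTP client; extract_audio then works on the parsed JSON body.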
Convert text to speech audio. The voice is resolved by language + gender from the chosen provider's catalog, OR by explicit voice_id (overrides). Supports per-call voice cloning via clone_from_audio / clone_from_video.
POST /v1/proxy/core-tts
Browse the full voice catalog for all three providers. Returns supported languages, voice IDs (per language/gender), legacy voice aliases (ElevenLabs only), and capability flags.
GET /v1/proxy/core-tts-voices
Per-second billing on synthesized audio. Browsing voices is free.
| Service | Unit | Price |
|---|---|---|
| Text to Speech | second | $0.01/second |
| List Voices | item | Free |
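As a worked example of the per-second rate in the table, the sketch below estimates the charge for a clip of a given length. The function name `estimate_cost` is hypothetical; the rate is the $0.01/second listed above, and Decimal is used to avoid floating-point rounding noise.

```python
from decimal import Decimal

PRICE_PER_SECOND = Decimal("0.01")  # Text to Speech rate from the pricing table


def estimate_cost(duration_seconds: float) -> Decimal:
    """Estimated charge for a clip of the given duration at the listed rate."""
    return Decimal(str(duration_seconds)) * PRICE_PER_SECOND
```

For a 3.2-second clip this gives 0.032, which agrees with units multiplied by unit_price in the sample synthesis response.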
A: About 5000 characters. Split longer content at sentence boundaries.
A: Pass language + gender, OR voice_id (which overrides), OR voice_name (legacy ElevenLabs aliases). For OpenAI, only voice_id is meaningful (alloy/echo/fable/onyx/nova/shimmer).
A: Switches to a higher-fidelity model on providers that support it (ElevenLabs primarily). Same per-second pricing.
A: Yes. Either (a) pass voice_id with the cloned voice's ID returned by core-voice-clone, or (b) use clone_from_audio per-call (no persistence).
A: No. Billing is on generated duration regardless of speed.
A: OpenAI exposes only six fixed voices, language-agnostic, no settings/cloning. The API surfaces this asymmetry in the catalog (note field + supports_* all false).
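The advice above to split longer content at sentence boundaries could be implemented roughly as follows. This is a sketch, not the service's own logic: the function name `split_text`, the regex-based sentence detection (whitespace following ., !, or ?), and the hard 5000-character constant standing in for "about 5000" are all assumptions. A single sentence longer than the limit is cut at the limit.

```python
import re

MAX_CHARS = 5000  # "about 5000 characters"; treated here as a hard limit


def split_text(text: str, max_chars: int = MAX_CHARS) -> list:
    """Split text into chunks of at most max_chars, breaking after sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-cut any single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)
            current = sentence  # start a new chunk with this sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request; because billing is per second of generated audio, splitting should not change the total cost materially.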
1.1 (2026-04-27)
1.0 (2026-01-01)