Text to Speech

Convert text to natural speech in 13+ languages through three providers (ElevenLabs, MiniMax, OpenAI).

Generate high-quality speech audio from text through multiple synthesis providers, all behind a single unified API. The service exposes three providers: ElevenLabs (primary, with legacy voice catalog and full language/gender support), MiniMax (multi-language with descriptive voice names), and OpenAI (six pre-built voices, language-agnostic). Voice selection happens by language + gender from the shared catalog (also used by the Dubbing service), or by explicit voice_id for fine-grained control. Configurable voice_settings (stability, similarity_boost, etc.) and an optional hd_quality flag let you tune output. The generated audio is uploaded to storage and returned as a URL.

Tags: voice, audio, text-to-speech

Overview

Features

Three providers behind one API

ElevenLabs (legacy + new voice catalog, voice cloning support), MiniMax (descriptive voice names per language), OpenAI (six fixed voices: alloy, echo, fable, onyx, nova, shimmer). Switch via the provider field.

Language + gender voice selection

Pass language + gender; the service resolves the best matching voice from the provider's catalog. ElevenLabs and MiniMax support 13+ languages; OpenAI is language-agnostic (voices are fixed).

Voice cloning support

Pass clone_from_audio (URL) or clone_from_video to clone the voice from a sample on the fly. Available on ElevenLabs and MiniMax (capability flag supports_voice_cloning in the catalog).

Configurable voice settings

stability (0-1), similarity_boost (0-1), speed, pitch (-12 to +12 semitones), volume (0.1-10), emotion. Exact support varies by provider — check supports_voice_settings in the catalog response.

HD quality

Set hd_quality: true to switch to a higher-fidelity model on supported providers. The per-second price is the same.

Use Cases

Video narration

Generate voiceovers for tutorials, product demos, explainers without hiring a voice actor.

Accessibility

Convert articles, e-books, or documentation to audio for visually impaired users.

IVR / phone systems

Generate dynamic voice prompts in 13+ languages.

E-learning

Audio lessons and lecture narrations across languages for online courses.

Input / Output

Input

Text + provider + (language + gender) OR voice_id; optional voice_settings, speed, hd_quality, clone_from_audio/video

JSON body

Output

Generated audio URL with duration metadata; usage.units = duration in seconds (used for billing)

JSON (data.url field)

Specs

Latency: ~1-5s depending on text length and provider
Async: false
Rate Limit: 60 req/min per API key
Max Input: ~5000 characters per request

Quickstart

Prerequisites

  • A CN8 API key with core-tts in allowed_services

1. List available voices

core-tts-voices

Inspect all providers, their supported languages, voice IDs, and capability flags.

GET /v1/proxy/core-tts-voices

Response

{
  "status": "success",
  "data": {
    "elevenlabs": {
      "legacy_voices": ["rachel", "adam", "antoni", "arnold", "bella", "domi", "elli", "josh", "nicole", "sam"],
      "legacy_voice_mapping": {
        "rachel": "21m00Tcm4TlvDq8ikWAM",
        "adam": "pNInz6obpgDQGcFmaJgB"
      },
      "languages": ["dutch", "english", "french", "german", "hindi", "indonesian", "italian", "japanese", "korean", "mandarin", "portuguese", "spanish", "turkish"],
      "voice_catalog": {
        "english": { "female": "56AoDkrOh6qfVPDXZ7Pt", "male": "UgBBYS2sOqTuMpoF3BR0" },
        "turkish": { "female": "KbaseEXyT9EE0CQLEfbB", "male": "5WzTv66bK7WWszHUzwZ5" }
      },
      "supports_custom_voice_id": true,
      "supports_voice_settings": true,
      "supports_voice_cloning": true,
      "supports_language_gender": true
    },
    "minimax": {
      "languages": ["arabic", "cantonese", "english", "french", "german", "indonesian", "italian", "japanese", "korean", "mandarin", "portuguese", "spanish", "turkish"],
      "voice_catalog": {
        "english": { "male": "English_Trustworth_Man", "female": "English_CalmWoman" },
        "turkish": { "male": "Turkish_Trustworthyman", "female": "Turkish_CalmWoman" }
      },
      "supports_custom_voice_id": true,
      "supports_voice_settings": true,
      "supports_voice_cloning": true,
      "supports_language_gender": true
    },
    "openai": {
      "voices": ["alloy", "echo", "fable", "onyx", "nova", "shimmer"],
      "voice_mapping": { "alloy": "alloy", "echo": "echo" },
      "supports_custom_voice_id": false,
      "supports_voice_settings": false,
      "supports_voice_cloning": false,
      "supports_language_gender": false,
      "note": "OpenAI voices are language-agnostic"
    }
  }
}

All three providers are always returned. ElevenLabs has both legacy_voices (named keys like 'rachel') and voice_catalog (language/gender). OpenAI has neither, only the six fixed voices. MiniMax voice IDs are descriptive strings (e.g. 'English_Trustworth_Man').
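As a minimal sketch of reading this catalog in Python, the helpers below work on the parsed JSON body. The function names are hypothetical, and SAMPLE_CATALOG is a trimmed copy of the response example above rather than the result of a live call.

```python
from typing import Optional

# Trimmed copy of the example catalog response (illustrative only).
SAMPLE_CATALOG = {
    "status": "success",
    "data": {
        "elevenlabs": {
            "voice_catalog": {
                "english": {"female": "56AoDkrOh6qfVPDXZ7Pt", "male": "UgBBYS2sOqTuMpoF3BR0"},
            },
            "supports_voice_cloning": True,
            "supports_language_gender": True,
        },
        "minimax": {
            "voice_catalog": {
                "english": {"male": "English_Trustworth_Man", "female": "English_CalmWoman"},
            },
            "supports_voice_cloning": True,
            "supports_language_gender": True,
        },
        "openai": {
            "voices": ["alloy", "echo", "fable", "onyx", "nova", "shimmer"],
            "supports_voice_cloning": False,
            "supports_language_gender": False,
        },
    },
}


def providers_with(catalog: dict, flag: str) -> list[str]:
    """Return the provider names whose capability flag is true, sorted."""
    return sorted(name for name, entry in catalog["data"].items() if entry.get(flag))


def lookup_voice(catalog: dict, provider: str, language: str, gender: str) -> Optional[str]:
    """Return the catalog voice ID for (language, gender), or None if absent."""
    entry = catalog["data"].get(provider, {})
    return entry.get("voice_catalog", {}).get(language, {}).get(gender)
```

Checking a supports_* flag before building a request avoids sending fields a provider would not honor, for example language + gender to OpenAI.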

2. Generate Speech

core-tts

POST text + (language + gender) OR voice_id. The default provider is elevenlabs. The response carries the audio URL in data.url and the length in data.duration_seconds.

POST /v1/proxy/core-tts
{
  "text": "Welcome to our platform.",
  "provider": "elevenlabs",
  "language": "english",
  "gender": "female",
  "speed": 1.0,
  "hd_quality": false,
  "voice_settings": { "stability": 0.85, "similarity_boost": 0.85 }
}

Response

{
  "status": "success",
  "data": {
    "url": "https://s3.example.com/audio/tts/.../abc.mp3",
    "key": "audio/tts/apk_8snm0/.../abc.mp3",
    "folder_id": "f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6",
    "provider": "elevenlabs",
    "voice_id": "56AoDkrOh6qfVPDXZ7Pt",
    "cloned_voice_id": null,
    "duration_seconds": 3.2
  },
  "usage": { "units": 3.2, "unit_type": "second", "details": { "provider": "elevenlabs", "text_length": 32, "cloned_voice_used": false } },
  "gateway": { "request_id": "req_abc", "service": "core-tts" },
  "cost": { "units": 3.2, "unit_price": 0.01, "tokens": 0.032, "balance": 99.97 }
}

The audio URL is in data.url (not audio_url). It is a presigned or public URL, so download or stream from there. cloned_voice_id is set only when clone_from_audio or clone_from_video was used. Billing is per second of generated audio (cost.units = duration_seconds).
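The request and response above could be handled from Python roughly as follows. This is a sketch under assumptions: the host in the usage comment is a placeholder, the source does not say which authentication header the gateway expects (a bearer token is assumed here), and both function names are hypothetical.

```python
import json
import urllib.request


def build_tts_request(base_url: str, api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/proxy/core-tts."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/proxy/core-tts",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # ASSUMPTION: bearer-token auth; the source does not name the header.
            "Authorization": f"Bearer {api_key}",
        },
    )


def extract_audio(response: dict) -> tuple[str, float]:
    """Return (audio URL, duration in seconds) from a core-tts response body."""
    if response.get("status") != "success":
        raise ValueError(f"TTS request failed: {response!r}")
    data = response["data"]
    # The URL lives in data.url, not audio_url.
    return data["url"], float(data["duration_seconds"])


# Sending is left commented out because it needs a real host and key:
# req = build_tts_request("https://api.example.com", "YOUR_KEY", {
#     "text": "Welcome to our platform.",
#     "provider": "elevenlabs", "language": "english", "gender": "female",
# })
# with urllib.request.urlopen(req, timeout=30) as resp:
#     url, seconds = extract_audio(json.load(resp))
```

Separating request construction from sending keeps the body easy to inspect or log before any billable call is made.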

Text to Speech

POST (sync)

Convert text to speech audio. The voice is resolved by language + gender from the chosen provider's catalog, OR by explicit voice_id (overrides). Supports per-call voice cloning via clone_from_audio / clone_from_video.

/v1/proxy/core-tts

List TTS Voices

GET (sync)

Browse the full voice catalog for all three providers. Returns supported languages, voice IDs (per language/gender), legacy voice aliases (ElevenLabs only), and capability flags.

/v1/proxy/core-tts-voices

Pricing

Per-second billing on synthesized audio. Browsing voices is free.

Service          Unit     Price
Text to Speech   second   $0.01/second
List Voices      item     Free

  • Billing is based on generated audio duration (usage.units = duration_seconds), not input text length.
  • A typical 1-minute narration costs approximately $0.60.
  • Voice cloning via clone_from_audio/clone_from_video is included; there is no separate cloning charge.
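The per-second arithmetic can be checked with a small helper. This is an illustrative sketch: the function name is hypothetical, and the price constant is copied from the pricing table above, so it needs updating if the table changes.

```python
PRICE_PER_SECOND_USD = 0.01  # from the pricing table in this document


def estimate_cost_usd(duration_seconds: float) -> float:
    """Estimate the charge for a clip of the given generated duration."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Round to 4 decimals to hide binary floating-point noise.
    return round(duration_seconds * PRICE_PER_SECOND_USD, 4)


one_minute = estimate_cost_usd(60)    # the 1-minute narration case
sample_clip = estimate_cost_usd(3.2)  # the 3.2 s clip from the response example
```

Because billing follows generated duration, the estimate can only be made after synthesis or from a prior guess at how long the audio will run; input text length does not enter the formula.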

Guides & Tips

Choose a provider

  • ElevenLabs: best language coverage, best voice quality, full settings + cloning support. Includes legacy voice names (rachel, adam…) for backward compatibility.
  • MiniMax: alternative provider with descriptive voice names (English_Trustworth_Man, Turkish_CalmWoman). 13 languages.
  • OpenAI: six fixed voices (alloy/echo/fable/onyx/nova/shimmer), language-agnostic, no settings, no cloning.

Voice selection rules

Resolution order in core-tts:

  1. If `voice_id` is set → use it directly (overrides everything).
  2. Else if `voice_name` is set (ElevenLabs only) → resolve via legacy_voice_mapping.
  3. Else use `language` + `gender` from voice_catalog.
  4. If exact (language, gender) is unavailable → provider may fall back to opposite gender or English.
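The resolution order can be mirrored on the client side to predict which voice a request will get. The sketch below is not the service's implementation: the function name is hypothetical, and step 4 is an assumption, since the rules only say the provider "may" fall back to the opposite gender or English without fixing an order.

```python
from typing import Optional


def resolve_voice(request: dict, provider_entry: dict) -> Optional[str]:
    """Apply the documented resolution order to one provider's catalog entry."""
    # 1. An explicit voice_id overrides everything.
    if request.get("voice_id"):
        return request["voice_id"]

    # 2. Legacy alias; only the ElevenLabs entry carries legacy_voice_mapping.
    name = request.get("voice_name")
    mapping = provider_entry.get("legacy_voice_mapping", {})
    if name and name in mapping:
        return mapping[name]

    # 3. Exact language + gender match from voice_catalog.
    catalog = provider_entry.get("voice_catalog", {})
    by_gender = catalog.get(request.get("language"), {})
    gender = request.get("gender")
    if gender in by_gender:
        return by_gender[gender]

    # 4. ASSUMED fallback order: other gender in the same language, then English.
    if by_gender:
        return next(iter(by_gender.values()))
    english = catalog.get("english", {})
    return english.get(gender) or next(iter(english.values()), None)


# Hypothetical catalog entry with placeholder IDs, for illustration only.
ENTRY = {
    "legacy_voice_mapping": {"rachel": "21m00Tcm4TlvDq8ikWAM"},
    "voice_catalog": {
        "english": {"female": "F_EN", "male": "M_EN"},
        "turkish": {"female": "F_TR"},
    },
}
```

Running this against the catalog from core-tts-voices before a request makes it easy to spot cases where the returned data.voice_id differs from the predicted one, which would indicate a server-side fallback.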

Voice cloning per call

  • Pass `clone_from_audio` (URL) to clone the voice from an audio sample for the duration of the request.
  • Set `keep_source_file: true` (with optional `auto_delete_after_hours`) to retain the upload longer.
  • Cloning is a single-call feature. For persistent custom voices, use the Voice Clone service (voice-clone-tts.yaml).
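A per-call cloning request body might be assembled as below. This is a sketch under assumptions: the helper name is hypothetical, the sample URL in the usage line is a placeholder, and auto_delete_after_hours is treated as an integer number of hours, which the source implies by its name but does not state.

```python
from typing import Optional

# Providers whose catalog entry shows supports_voice_cloning: true in this document.
CLONING_PROVIDERS = {"elevenlabs", "minimax"}


def build_clone_body(
    text: str,
    sample_url: str,
    provider: str = "elevenlabs",
    keep_hours: Optional[int] = None,
) -> dict:
    """Build a core-tts body that clones the voice from an audio sample URL."""
    if provider not in CLONING_PROVIDERS:
        raise ValueError(f"{provider} does not support voice cloning")
    body = {"text": text, "provider": provider, "clone_from_audio": sample_url}
    if keep_hours is not None:
        # Retain the uploaded sample, then let it expire after keep_hours.
        body["keep_source_file"] = True
        body["auto_delete_after_hours"] = keep_hours  # ASSUMED unit: whole hours
    return body


body = build_clone_body("Hello there.", "https://example.com/sample.mp3", keep_hours=24)
```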

Voice settings reference

  • `stability` (0-1): consistency of voice across the output.
  • `similarity_boost` (0-1): how closely the output matches the target voice profile.
  • `speed`: multiplier (does not change billing).
  • `pitch` (-12 to 12): semitones up or down.
  • `volume` (0.1-10): output volume multiplier.
  • `emotion`: emotional tone hint (provider-specific).
  • Unsupported settings for a provider are silently ignored.
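Since the source does not say how the service treats out-of-range values, a client-side check against the documented ranges can catch mistakes early. The sketch below is illustrative: the function name is hypothetical, it takes a flat mapping of setting names to values, and it skips `speed` and `emotion` because no range is documented for them.

```python
# Numeric ranges copied from the settings reference above.
RANGES = {
    "stability": (0.0, 1.0),
    "similarity_boost": (0.0, 1.0),
    "pitch": (-12.0, 12.0),
    "volume": (0.1, 10.0),
}


def out_of_range(settings: dict) -> dict:
    """Return {name: value} for settings that fall outside their documented range."""
    bad = {}
    for name, value in settings.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not (low <= value <= high):
                bad[name] = value
    return bad


ok = out_of_range({"stability": 0.85, "similarity_boost": 0.85})
problems = out_of_range({"stability": 1.5, "pitch": -13, "volume": 0.1, "emotion": "calm"})
```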

FAQ

Q: What is the maximum text length per request?

A: About 5000 characters. Split longer content at sentence boundaries.
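One way to split at sentence boundaries is a greedy packer like the sketch below. It is an illustration, not part of the service: the function name is hypothetical, the 5000 limit is the approximate figure quoted above, sentence ends are detected naively by `.`, `!` or `?` followed by whitespace, and whitespace between sentences is normalized to a single space.

```python
import re

MAX_CHARS = 5000  # approximate per-request limit quoted in this document


def split_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own core-tts request and the resulting audio files concatenated in order. Keeping the same voice_id across chunks should help the narration sound consistent.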

Q: How does voice selection work?

A: Pass language + gender, OR voice_id (which overrides), OR voice_name (legacy ElevenLabs aliases). For OpenAI, only voice_id is meaningful (alloy/echo/fable/onyx/nova/shimmer).

Q: What does HD quality do?

A: Switches to a higher-fidelity model on providers that support it (ElevenLabs primarily). Same per-second pricing.

Q: Can I use a previously cloned voice?

A: Yes. Either (a) pass voice_id with the cloned voice's ID returned by core-voice-clone, or (b) use clone_from_audio per-call (no persistence).

Q: Does speed affect pricing?

A: No. Billing is on generated duration regardless of speed.

Q: Why is OpenAI's catalog entry shaped differently?

A: OpenAI exposes only six fixed, language-agnostic voices with no settings or cloning. The API surfaces this asymmetry in the catalog (the note field, plus every supports_* flag set to false).


Changelog

1.1 (2026-04-27)

  • Documented all three providers explicitly (elevenlabs / minimax / openai), replacing the generic backend_a/backend_b placeholders.
  • Added the full TTSRequest body schema from the upstream Pydantic model: provider (default elevenlabs), voice_name (legacy ElevenLabs), clone_from_audio, clone_from_video, keep_source_file, auto_delete_after_hours, cinema8_env.
  • Documented the per-call voice-cloning fields (clone_from_audio, clone_from_video) and how they relate to the persistent clone in voice-clone-tts.
  • Updated voices response example with real shape: legacy_voices + legacy_voice_mapping (ElevenLabs only), descriptive MiniMax voice names, OpenAI's note field.
  • Added cost object to the response example so users know what they were billed.

1.0 (2026-01-01)

  • Initial release with multi-backend TTS and voice listing.