Voice Clone + TTS

Clone any voice from an audio sample and generate speech with it

Two distinct workflows for voice cloning:

1. **Persistent clone** (`core-voice-clone`, async/delayed): upload an audio sample, the gateway queues a job, and you get back a reusable `voice_id` you can pass to `core-tts` later. One $5 flat charge per clone; subsequent TTS calls are billed at standard per-second rates.
2. **Instant clone + synthesize** (`core-instant-clone-tts`, sync, $2/request): one shot. Send an audio URL plus the text; the upstream profiles the voice on the fly and returns the synthesised audio as a binary `audio/wav` stream. No persistent voice profile is stored.

Audio samples are passed as URLs. Use CN8's `upload-media` to upload a local file and obtain a public URL. If the source is a video, the upstream extracts the audio track automatically (up to ~60 seconds via ffmpeg).

voice, audio, voice-clone, async

Overview

Features

Persistent voice clone (async)

core-voice-clone is a delayed service: the gateway responds with 202 and a job_id, a gateway worker processes the job, and the eventual job result includes voice_id, message, and provider. Reuse the voice_id with core-tts at standard pricing.

Instant clone + TTS (sync, binary)

core-instant-clone-tts profiles the voice and synthesises in one request. The response is binary audio/wav (NOT JSON) with custom headers (X-Used-Voice-Id, X-Provider, X-Text-Length).

Audio or video source

audioUrl can point to either an audio file (mp3/wav/ogg) or a video file — the upstream extracts up to 60s of audio with ffmpeg before profiling.

Provider choice

Defaults to elevenlabs. minimax also supports cloning. The cloned voice_id is provider-specific — pass the matching provider when using it later in core-tts.

Use Cases

Brand voice

Record a spokesperson once, clone, then generate all future ads / promos / announcements via core-tts in that voice.

Character voices

Clone game / audiobook character voices and generate dialogue at scale.

Quick one-off clip

Use instant clone when you need a single audio output without storing a voice profile.

Localization with same speaker

Clone the original speaker once, then synthesise dubbed lines in multiple languages while keeping the voice character (combine with TTS or Dubbing).

Input / Output

Input

audioUrl (audio or video) + provider; for instant clone also text + optional voice_settings

JSON body

Output

Persistent clone: 202 + job_id (poll /v2/jobs/{job_id} → result_data.voice_id). Instant clone: binary audio/wav stream + headers.

JSON (persistent), audio/wav binary (instant)

Specs

Latency
Persistent clone: ~1-3 min (async). Instant clone: ~5-15s (sync).
Async
true
Rate Limit
60 req/min per API key
Max Input
30+ seconds of audio recommended for persistent clone; 5000 characters of text for instant clone

Quickstart

Prerequisites

  • A CN8 API key with core-voice-clone and/or core-instant-clone-tts in allowed_services
  • Audio sample URL (use upload-media to upload a local file)

1. Upload the audio sample

upload-media

Use upload-media to upload a 30+ second clean recording. You'll get a public_url to use as audioUrl.

POST
{ "media_type": "audio", "filename": "speaker.wav", "content_type": "audio/wav" }

Response

{
  "status": "success",
  "data": {
    "upload_url": "https://s3.eu-central-1.wasabisys.com/.../speaker.wav?X-Amz-Algorithm=...",
    "object_key": "uploads/prof_123/audio/.../speaker.wav",
    "expires_in": 86400,
    "public_url": "https://s3.eu-central-1.wasabisys.com/.../speaker.wav?X-Amz-Algorithm=..."
  }
}

PUT your file to upload_url with the matching Content-Type, then use public_url as audioUrl below.
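
The PUT step can be sketched in Python with only the standard library. This is a minimal sketch rather than an official client: `build_upload_request` and `upload_file` are hypothetical helper names, and the code assumes the presigned upload_url accepts a plain PUT whose Content-Type equals the content_type declared in the upload-media request.

```python
import urllib.request


def build_upload_request(upload_url: str, data: bytes, content_type: str) -> urllib.request.Request:
    """Build the PUT request that sends the file bytes to the presigned upload_url."""
    return urllib.request.Request(
        upload_url,
        data=data,
        method="PUT",
        # Should match the content_type sent to upload-media (e.g. "audio/wav").
        headers={"Content-Type": content_type},
    )


def upload_file(upload_url: str, path: str, content_type: str) -> int:
    """Read a local file, PUT it to upload_url, and return the HTTP status code."""
    with open(path, "rb") as fh:
        request = build_upload_request(upload_url, fh.read(), content_type)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping request construction separate from the network call makes it easy to inspect the method and headers before anything is sent.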

2a. Persistent clone (async — 202 + job_id)

core-voice-clone

Send the audioUrl. The gateway queues a job and returns 202 immediately. Poll /v2/jobs/{job_id} until status=completed; result_data carries the voice_id.

POST /v1/proxy/core-voice-clone
{
  "audioUrl": "https://s3.eu-central-1.wasabisys.com/.../speaker.wav?...",
  "provider": "elevenlabs"
}

Response

{
  "status": "accepted",
  "message": "Job queued for processing",
  "job": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "queued",
    "service": "core-voice-clone",
    "created_at": "2026-04-27T10:30:00Z"
  }
}

Then poll `GET /v2/jobs/{job_id}` until status=completed. Final result_data:

```
{
  "status": "completed",
  "result_data": {
    "status": "success",
    "message": "Voice cloned successfully with elevenlabs",
    "voice_id": "abc123xyz...",
    "provider": "elevenlabs"
  },
  "units_consumed": 1.0,
  "token_cost": 5.0
}
```

Use voice_id with core-tts.
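
The polling loop can be sketched as follows. This is an illustration under stated assumptions, not the gateway's reference client: `wait_for_voice_id` is a hypothetical helper, the HTTP call is delegated to a caller-supplied `fetch_job` function (the base URL and authentication scheme are not covered in this section), the 5-second interval and 300-second timeout are arbitrary defaults, and the `"failed"` status is a guess at how an unsuccessful job might be reported.

```python
import time
from typing import Callable


def wait_for_voice_id(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval_s: float = 5.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job until it completes and return result_data.voice_id.

    fetch_job(job_id) should perform GET /v2/jobs/{job_id} and return the
    parsed JSON body as a dict.
    """
    waited = 0.0
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "completed":
            return job["result_data"]["voice_id"]
        # Assumption: "failed" is a guessed terminal status, not a documented one.
        if status == "failed":
            raise RuntimeError(f"clone job {job_id} failed: {job}")
        if waited >= timeout_s:
            raise TimeoutError(f"clone job {job_id} still {status!r} after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

Injecting `fetch_job` and `sleep` keeps the loop independent of any particular HTTP library and lets it be exercised without real waiting.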

2b. Instant clone + synthesize (sync, BINARY response)

core-instant-clone-tts

One request, one response — but the response is BINARY audio/wav, not JSON. Save the body to a .wav file.

POST /v1/proxy/core-instant-clone-tts
{
  "audioUrl": "https://s3.eu-central-1.wasabisys.com/.../sample.wav?...",
  "text": "Hello, this is my cloned voice speaking.",
  "provider": "elevenlabs"
}

Response

# HTTP/1.1 200 OK
# Content-Type: audio/wav
# X-Used-Voice-Id: abc123xyz
# X-Provider: elevenlabs
# X-Text-Length: 40
# X-Gateway-Request-ID: req_abc
# X-Gateway-Units: 1.0
# X-Gateway-Token-Cost: 2.000000
# X-Gateway-Credit-Balance: 99.000000

<binary WAV bytes>

  • Response is RAW audio/wav bytes with no JSON envelope.
  • Inspect the response headers for metadata (X-Used-Voice-Id, X-Provider, X-Text-Length) and gateway billing (X-Gateway-* headers).
  • In curl: `curl ... --output out.wav` to save directly.
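
In Python, handling the binary response might look like the sketch below. `save_instant_clone_audio` is a hypothetical helper name. It takes the response headers and body from whichever HTTP client you use, and it assumes that any Content-Type other than audio/wav (for example a JSON error body) should be treated as a failure rather than written to disk.

```python
from typing import Mapping


def save_instant_clone_audio(headers: Mapping[str, str], body: bytes, out_path: str) -> dict:
    """Write the WAV body to out_path and return the metadata carried in headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    content_type = lowered.get("content-type", "")
    if not content_type.startswith("audio/wav"):
        # Assumption: anything else (for example a JSON error body) is not audio.
        raise ValueError(f"expected audio/wav, got {content_type!r}")
    with open(out_path, "wb") as fh:
        fh.write(body)
    return {
        "voice_id": lowered.get("x-used-voice-id"),
        "provider": lowered.get("x-provider"),
        "text_length": lowered.get("x-text-length"),
        "token_cost": lowered.get("x-gateway-token-cost"),
        "bytes_written": len(body),
    }
```

Header values are returned as strings exactly as received; convert them to numbers only if you need to do arithmetic on them.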

3. Use a persistent cloned voice with core-tts

core-tts

Once you have a voice_id from step 2a, synthesize as many clips as you want at standard core-tts pricing.

POST
{
  "text": "This is generated with my cloned voice.",
  "voice_id": "abc123xyz...",
  "provider": "elevenlabs"
}

Response

{
  "status": "success",
  "data": {
    "url": "https://s3.example.com/audio/tts/.../abc.mp3",
    "key": "audio/tts/.../abc.mp3",
    "folder_id": "...",
    "provider": "elevenlabs",
    "voice_id": "abc123xyz...",
    "cloned_voice_id": null,
    "duration_seconds": 2.1
  },
  "usage": { "units": 2.1, "unit_type": "second" },
  "cost": { "units": 2.1, "unit_price": 0.01, "tokens": 0.021, "balance": 96.97 }
}

Standard per-second TTS pricing (no extra fee for cloned voice). voice_id must match the provider it was cloned with.
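
One way to avoid a provider mismatch is to build the core-tts body directly from the clone job's result_data. The helper below is a hypothetical sketch (`tts_body_from_clone` is not part of any SDK); it only assumes the result_data shape shown for the completed clone job.

```python
def tts_body_from_clone(result_data: dict, text: str) -> dict:
    """Build a core-tts request body from a completed clone job's result_data."""
    return {
        "text": text,
        "voice_id": result_data["voice_id"],
        # Reuse the provider reported by the clone job so the two always match.
        "provider": result_data["provider"],
    }
```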

Clone Voice

POST (async)

Persistent voice clone. Async/delayed: returns 202 + job_id, processed by the gateway worker. Poll /v2/jobs/{job_id} for the voice_id.

/v1/proxy/core-voice-clone

Instant Clone + TTS

POST (sync)

Profile the voice from audioUrl AND synthesise the text in one synchronous request. Response is binary audio/wav — NOT JSON. No stored voice profile.

/v1/proxy/core-instant-clone-tts

List TTS Voices

GET (sync)

Shared with the TTS product — see tts.yaml for full schema. Use it here to confirm provider capability flags (supports_voice_cloning) before sending a clone request.

/v1/proxy/core-tts-voices

Pricing

Persistent clone is a one-time per-clone fee; instant clone is per-request; subsequent TTS uses are billed at standard per-second rates.

| Service | Unit | Price |
| --- | --- | --- |
| Persistent Clone | item | $5.0/clone (one-time) |
| Instant Clone + TTS | item | $2.0/request |
| TTS with cloned voice | second | $0.01/second (standard core-tts rate) |
| List Voices | item | Free |

  • Break-even: persistent costs $5 once plus $0.01 per second of generated audio; instant costs $2 per request. For short clips (up to roughly 30 seconds each), persistent + TTS is cheaper from the third clip of the same voice onward. Longer clips push the break-even higher because TTS is billed per second.
  • Instant clone returns BINARY audio/wav. There is no audio_url to download separately; the response body itself is the audio.
  • Voice cloning is supported on ElevenLabs and MiniMax. OpenAI does not support cloning (supports_voice_cloning=false in the catalog).
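
The break-even arithmetic can be checked with a short calculation. This sketch uses only the list prices from the pricing table and assumes every clip has the same duration; the function names are illustrative.

```python
import math

PERSISTENT_CLONE_FEE = 5.00   # one-time, per clone
INSTANT_CLONE_FEE = 2.00      # per request
TTS_RATE_PER_SECOND = 0.01    # standard core-tts rate


def persistent_cost(clips: int, seconds_per_clip: float) -> float:
    """Total cost of one persistent clone plus `clips` core-tts calls."""
    return PERSISTENT_CLONE_FEE + clips * seconds_per_clip * TTS_RATE_PER_SECOND


def instant_cost(clips: int) -> float:
    """Total cost of `clips` instant clone + TTS requests."""
    return clips * INSTANT_CLONE_FEE


def break_even_clips(seconds_per_clip: float):
    """Smallest clip count at which persistent + TTS is strictly cheaper, or None."""
    saving_per_clip = INSTANT_CLONE_FEE - seconds_per_clip * TTS_RATE_PER_SECOND
    if saving_per_clip <= 0:
        return None  # each TTS clip costs at least as much as an instant request
    return math.floor(PERSISTENT_CLONE_FEE / saving_per_clip) + 1
```

For 2.1-second clips, `break_even_clips(2.1)` returns 3: three clips cost $5.063 with persistent + TTS versus $6.00 with instant, while two clips cost $5.042 versus $4.00.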

Guides & Tips

Audio quality for cloning

  • Length: 30+ seconds of clear speech for best results. Shorter samples may work for instant clone but quality varies.
  • Environment: quiet recording, minimal background noise, consistent volume.
  • Format: WAV or high-bitrate MP3 preferred. mp3/wav/ogg supported.
  • Content: single speaker, natural speech, no music or overlapping voices.

Persistent vs instant clone

  • Persistent ($5 once) + standard TTS per second: best when generating many clips with the same voice (brand voice, character).
  • Instant ($2/request): best for one-off clips, demos, or when you don't want to manage stored voices.
  • Output differs: instant returns binary audio in the response body, while persistent + core-tts returns a URL. Choose based on how you want to handle the output.

Persistent clone polling

  • core-voice-clone is `is_delayed: true` (async). The gateway returns 202 + job_id immediately and processes the upstream call in a worker.
  • Poll `GET /v2/jobs/{job_id}` until status = completed (typically 1-3 min). The result_data carries voice_id, message, and provider.
  • Save voice_id and pass it as `voice_id` to core-tts in subsequent calls.

Video as voice source

  • audioUrl can point to a video file. The upstream extracts the audio track automatically (~60s, 44.1kHz stereo) using ffmpeg, profiles the voice, then discards the temp audio.
  • Use upload-media with media_type=video to upload a video and get a URL.

Field naming quirks

  • `audioUrl` is camelCase (not snake_case `audio_url`). Both core-voice-clone and core-instant-clone-tts use this naming. Sending `audio_url` returns a 422.
  • `voice_settings` IS snake_case (matches core-tts).
  • `core-instant-clone-tts` returns binary `audio/wav`; `Content-Type` and `X-*` headers are the only structured info.
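
A small request builder can keep these naming quirks out of calling code. The sketch below is illustrative only: `instant_clone_body` is a hypothetical helper, the 5000-character check mirrors the "Max Input" spec for instant clone text, and `voice_settings` is passed through untouched because its fields are not described in this section.

```python
from typing import Optional

MAX_INSTANT_TEXT_CHARS = 5000  # "Max Input" for instant clone text


def instant_clone_body(
    audio_url: str,
    text: str,
    provider: str = "elevenlabs",
    voice_settings: Optional[dict] = None,
) -> dict:
    """Build the JSON body for core-instant-clone-tts with the expected key names."""
    if len(text) > MAX_INSTANT_TEXT_CHARS:
        raise ValueError(f"text is {len(text)} chars; limit is {MAX_INSTANT_TEXT_CHARS}")
    body = {
        "audioUrl": audio_url,  # camelCase on the wire; audio_url is rejected with 422
        "text": text,
        "provider": provider,
    }
    if voice_settings is not None:
        body["voice_settings"] = voice_settings  # snake_case, same as core-tts
    return body
```

Python callers keep the idiomatic snake_case argument name while the serialized body uses the camelCase key the service expects.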

FAQ

Q: How long does persistent voice cloning take?

A: Async, typically 1-3 minutes. Poll GET /v2/jobs/{job_id} until status=completed; result_data includes voice_id.

Q: Why is the field name audioUrl and not audio_url?

A: Upstream Pydantic uses camelCase here (legacy naming). Match the field name exactly or you'll get a 422.

Q: Why is core-instant-clone-tts response binary instead of JSON?

A: The upstream streams the synthesised .wav directly without uploading to storage first (saves a round-trip). Save the response body as a .wav file. Metadata is in headers.

Q: Can I clone from a video file?

A: Yes — pass the video URL as audioUrl. The upstream extracts up to ~60s of audio via ffmpeg before profiling.

Q: Is instant clone cheaper than persistent + TTS for many clips?

A: No. From about the third short clip in the same voice, persistent ($5 once) + TTS ($0.01/s) is cheaper. Instant ($2/request) wins for 1-2 short clips, and for long clips whose per-second TTS cost approaches $2 each.

Q: How do I use a cloned voice for TTS?

A: After the persistent clone job completes, pass the returned voice_id to core-tts via the voice_id field. Make sure provider matches what the clone was made with.

Q: Can I use a cloned voice for dubbing?

A: Use the Voice Clone Dubbing product (voice-clone-dubbing.yaml / core-dubbing-clone) which handles cloning + dubbing in one pipeline.

Related Products

Changelog

1.1 (2026-04-27)

  • Corrected core-voice-clone response shape to the standard delayed-service envelope (202 + job_id), with polling instructions and result_data shape.
  • Corrected core-instant-clone-tts response: it returns BINARY audio/wav (NOT JSON). Documented the headers (X-Used-Voice-Id, X-Provider, X-Text-Length) and gateway X-Gateway-* headers. Removed the false JSON example.
  • Documented the camelCase audioUrl field naming explicitly; added 422 error entry.
  • Replaced backend_a/backend_b placeholders in voices examples with concrete provider names + supports_voice_cloning flag.
  • Added voice_settings field to core-instant-clone-tts (Pydantic InstantCloneTTSRequest).
  • Added break-even guidance (persistent beats instant at ~3 clips).

1.0 (2026-01-26)

  • Initial release: persistent voice clone, instant clone + TTS, voice listing.