Persistent voice clone (async)
core-voice-clone is a delayed service: 202 + job_id, processed by the gateway worker, eventual job result includes voice_id, message, provider. Reuse the voice_id with core-tts at standard pricing.
Clone any voice from an audio sample and generate speech with it
There are two distinct workflows for voice cloning:

1. **Persistent clone** (`core-voice-clone`, async/delayed): upload an audio sample, the gateway queues a job, and you get back a reusable voice_id you can pass to `core-tts` later. One $5 flat charge per clone; subsequent TTS calls are billed at standard per-second rates.
2. **Instant clone + synthesize** (`core-instant-clone-tts`, sync, $2/request): a one-shot call. Send an audio URL plus the text; the upstream profiles the voice on the fly and returns the synthesised audio as a binary `audio/wav` stream. No persistent voice profile is stored.

Audio samples are passed as URLs. Use CN8's `upload-media` to upload a local file and obtain a public URL. If the source is a video, the upstream extracts the audio track automatically (up to ~60 seconds via ffmpeg).
core-instant-clone-tts profiles the voice and synthesises in one request. The response is binary audio/wav (NOT JSON) with custom headers (X-Used-Voice-Id, X-Provider, X-Text-Length).
audioUrl can point to either an audio file (mp3/wav/ogg) or a video file — the upstream extracts up to 60s of audio with ffmpeg before profiling.
Defaults to elevenlabs. minimax also supports cloning. The cloned voice_id is provider-specific — pass the matching provider when using it later in core-tts.
Record a spokesperson once, clone, then generate all future ads / promos / announcements via core-tts in that voice.
Clone game / audiobook character voices and generate dialogue at scale.
Use instant clone when you need a single audio output without storing a voice profile.
Clone the original speaker once, then synthesise dubbed lines in multiple languages keeping voice character (combine with TTS or Dubbing).
Input
audioUrl (audio or video) + provider; for instant clone also text + optional voice_settings
Output
Persistent clone: 202 + job_id (poll /v2/jobs/{job_id} → result_data.voice_id). Instant clone: binary audio/wav stream + headers.
Prerequisites
Use upload-media to upload a 30+ second clean recording. You'll get a public_url to use as audioUrl.
{ "media_type": "audio", "filename": "speaker.wav", "content_type": "audio/wav" }
Response
{
  "status": "success",
  "data": {
    "upload_url": "https://s3.eu-central-1.wasabisys.com/.../speaker.wav?X-Amz-Algorithm=...",
    "object_key": "uploads/prof_123/audio/.../speaker.wav",
    "expires_in": 86400,
    "public_url": "https://s3.eu-central-1.wasabisys.com/.../speaker.wav?X-Amz-Algorithm=..."
  }
}
PUT your file to upload_url with the matching Content-Type, then use public_url as audioUrl below.
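The PUT step can be sketched as follows. This is a minimal illustration that assumes the presigned upload_url accepts a plain HTTP PUT carrying only a Content-Type header, which is typical for S3-compatible storage but is not confirmed by this page. The helper names `build_put_request` and `upload_file` are hypothetical.

```python
import urllib.request


def build_put_request(upload_url: str, data: bytes, content_type: str) -> urllib.request.Request:
    """Build the PUT request for a presigned upload_url.

    content_type should match the content_type sent to upload-media.
    """
    return urllib.request.Request(
        upload_url,
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )


def upload_file(upload_url: str, path: str, content_type: str = "audio/wav") -> int:
    """Upload a local file to upload_url and return the HTTP status code."""
    with open(path, "rb") as fh:
        req = build_put_request(upload_url, fh.read(), content_type)
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

After a successful upload, the public_url from the same upload-media response is the value to send as audioUrl.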
Send the audioUrl. The gateway queues a job and returns 202 immediately. Poll /v2/jobs/{job_id} until status=completed; result_data carries the voice_id.
{
  "audioUrl": "https://s3.eu-central-1.wasabisys.com/.../speaker.wav?...",
  "provider": "elevenlabs"
}
Response
{
  "status": "accepted",
  "message": "Job queued for processing",
  "job": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "queued",
    "service": "core-voice-clone",
    "created_at": "2026-04-27T10:30:00Z"
  }
}
Then poll `GET /v2/jobs/{job_id}` until status=completed. Final result_data:
```
{
  "status": "completed",
  "result_data": {
    "status": "success",
    "message": "Voice cloned successfully with elevenlabs",
    "voice_id": "abc123xyz...",
    "provider": "elevenlabs"
  },
  "units_consumed": 1.0,
  "token_cost": 5.0
}
```
Use voice_id with core-tts.
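The polling loop described above might look like the sketch below. Several details are assumptions rather than documented behaviour: `BASE_URL` is a placeholder host, the Bearer `Authorization` header is a guessed auth scheme, and `"failed"` is a guessed name for the failure status. The fetch function is passed in so the loop can be exercised without network access.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "https://gateway.example.com"  # hypothetical; use your gateway's host


def fetch_job(job_id: str, api_key: str) -> dict:
    """GET /v2/jobs/{job_id} and decode the JSON body."""
    req = urllib.request.Request(
        f"{BASE_URL}/v2/jobs/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_voice_id(
    job_id: str,
    fetch: Callable[[str], dict],
    interval: float = 5.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until status == "completed", then return result_data.voice_id."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        status = job.get("status")
        if status == "completed":
            return job["result_data"]["voice_id"]
        if status == "failed":  # assumption: the failure status name is not documented here
            raise RuntimeError(f"clone job {job_id} failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"clone job {job_id} still {status!r} after {timeout}s")
        sleep(interval)
```

A call would then be `wait_for_voice_id(job_id, lambda j: fetch_job(j, api_key))`. The 300-second default timeout is an arbitrary choice for this sketch.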
One request, one response — but the response is BINARY audio/wav, not JSON. Save the body to a .wav file.
{
  "audioUrl": "https://s3.eu-central-1.wasabisys.com/.../sample.wav?...",
  "text": "Hello, this is my cloned voice speaking.",
  "provider": "elevenlabs"
}
Response
# HTTP/1.1 200 OK
# Content-Type: audio/wav
# X-Used-Voice-Id: abc123xyz
# X-Provider: elevenlabs
# X-Text-Length: 41
# X-Gateway-Request-ID: req_abc
# X-Gateway-Units: 1.0
# X-Gateway-Token-Cost: 2.000000
# X-Gateway-Credit-Balance: 99.000000
<binary WAV bytes>
- Response is RAW audio/wav bytes, with no JSON envelope.
- Inspect the response headers for metadata (X-Used-Voice-Id, X-Provider, X-Text-Length) and gateway billing (X-Gateway-* headers).
- In curl: `curl ... --output out.wav` to save directly.
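For clients that are not using curl, handling the binary response could be sketched as below. `BASE_URL`, the Bearer `Authorization` header, and the use of POST are assumptions not stated on this page. The endpoint path, the body fields, and the header names are taken from the examples above.

```python
import json
import urllib.request

BASE_URL = "https://gateway.example.com"  # hypothetical; use your gateway's host
METADATA_HEADERS = ("X-Used-Voice-Id", "X-Provider", "X-Text-Length")


def build_instant_clone_request(
    audio_url: str, text: str, api_key: str, provider: str = "elevenlabs"
) -> urllib.request.Request:
    """Build the JSON request; note the camelCase audioUrl field."""
    body = json.dumps({"audioUrl": audio_url, "text": text, "provider": provider})
    return urllib.request.Request(
        f"{BASE_URL}/v1/proxy/core-instant-clone-tts",
        data=body.encode("utf-8"),
        method="POST",  # assumed HTTP method
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )


def save_instant_clone(req: urllib.request.Request, out_path: str) -> dict:
    """Send the request, write the raw WAV body to out_path, return header metadata."""
    with urllib.request.urlopen(req) as resp:
        content_type = resp.headers.get("Content-Type", "")
        if not content_type.startswith("audio/"):
            # A non-audio body is not the documented success shape; do not save it as .wav.
            raise ValueError(f"expected audio/wav, got {content_type!r}")
        with open(out_path, "wb") as fh:
            fh.write(resp.read())
        return {name: resp.headers.get(name) for name in METADATA_HEADERS}
```

Checking Content-Type before writing avoids saving a non-audio body under a .wav name.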
Once you have a voice_id from the persistent clone (step 2a), synthesize as many clips as you want at standard core-tts pricing.
{
  "text": "This is generated with my cloned voice.",
  "voice_id": "abc123xyz...",
  "provider": "elevenlabs"
}
Response
{
  "status": "success",
  "data": {
    "url": "https://s3.example.com/audio/tts/.../abc.mp3",
    "key": "audio/tts/.../abc.mp3",
    "folder_id": "...",
    "provider": "elevenlabs",
    "voice_id": "abc123xyz...",
    "cloned_voice_id": null,
    "duration_seconds": 2.1
  },
  "usage": { "units": 2.1, "unit_type": "second" },
  "cost": { "units": 2.1, "unit_price": 0.01, "tokens": 0.021, "balance": 96.97 }
}
Standard per-second TTS pricing (no extra fee for cloned voice). voice_id must match the provider it was cloned with.
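Because a voice_id is only valid with the provider that created it, one way to avoid mismatches is to store the two together. The sketch below uses a hypothetical `ClonedVoice` class (not part of the API) that reads both fields from a completed clone job and reuses them when building a core-tts request body.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClonedVoice:
    """A voice_id paired with the provider it was cloned with."""

    voice_id: str
    provider: str

    @classmethod
    def from_job(cls, job: dict) -> "ClonedVoice":
        """Read voice_id and provider from a completed job's result_data."""
        result = job["result_data"]
        return cls(voice_id=result["voice_id"], provider=result["provider"])

    def tts_payload(self, text: str) -> dict:
        """Build a core-tts request body that always carries the matching provider."""
        return {"text": text, "voice_id": self.voice_id, "provider": self.provider}
```

For example, `ClonedVoice.from_job(job).tts_payload("Welcome back.")` yields a body in the same shape as the request shown above.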
Persistent voice clone. Async/delayed: returns 202 + job_id, processed by the gateway worker. Poll /v2/jobs/{job_id} for the voice_id.
/v1/proxy/core-voice-clone
Profile the voice from audioUrl AND synthesise the text in one synchronous request. Response is binary audio/wav — NOT JSON. No stored voice profile.
/v1/proxy/core-instant-clone-tts
Shared with the TTS product — see tts.yaml for full schema. Use it here to confirm provider capability flags (supports_voice_cloning) before sending a clone request.
/v1/proxy/core-tts-voices
Persistent clone is a one-time per-clone fee; instant clone is per-request; subsequent TTS uses are billed at standard per-second rates.
| Service | Unit | Price |
|---|---|---|
| Persistent Clone | item | $5.0/clone (one-time) |
| Instant Clone + TTS | item | $2.0/request |
| TTS with cloned voice | second | $0.01/second (standard core-tts rate) |
| List Voices | item | Free |
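To see where the two workflows break even, the prices in the table can be put into a small calculation. This is a sketch only: the constants are copied from the table, while the function names are hypothetical. It assumes each instant request produces one clip and ignores any limits on clip length.

```python
PERSISTENT_CLONE_FEE = 5.0    # $ per clone, one-time
INSTANT_CLONE_FEE = 2.0       # $ per request
TTS_PRICE_PER_SECOND = 0.01   # $ per second of generated audio


def persistent_cost(clips: int, seconds_per_clip: float) -> float:
    """One clone fee plus per-second TTS for every clip."""
    return PERSISTENT_CLONE_FEE + clips * seconds_per_clip * TTS_PRICE_PER_SECOND


def instant_cost(clips: int) -> float:
    """One flat fee per clip, regardless of length."""
    return clips * INSTANT_CLONE_FEE


def cheaper_workflow(clips: int, seconds_per_clip: float) -> str:
    """Return "persistent" or "instant", whichever costs less (ties go to instant)."""
    if persistent_cost(clips, seconds_per_clip) < instant_cost(clips):
        return "persistent"
    return "instant"
```

With 10-second clips, two clips cost $4.00 via instant clone versus $5.20 via persistent clone plus TTS, while three clips cost $6.00 versus $5.30. For longer clips the break-even point moves up: three 50-second clips cost $6.00 instant versus $6.50 persistent.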
A: Async, typically 1-3 minutes. Poll GET /v2/jobs/{job_id} until status=completed; result_data includes voice_id.
A: Upstream Pydantic uses camelCase here (legacy naming). Match the field name exactly or you'll get a 422.
A: The upstream streams the synthesised .wav directly without uploading to storage first (saves a round-trip). Save the response body as a .wav file. Metadata is in headers.
A: Yes — pass the video URL as audioUrl. The upstream extracts up to ~60s of audio via ffmpeg before profiling.
A: No. For more than about 3 clips in the same voice, persistent ($5 once) plus TTS ($0.01/s) is cheaper. Instant ($2/request) wins for 1-2 short clips.
A: After the persistent clone job completes, pass the returned voice_id to core-tts via the voice_id field. Make sure provider matches what the clone was made with.
A: Use the Voice Clone Dubbing product (voice-clone-dubbing.yaml / core-dubbing-clone) which handles cloning + dubbing in one pipeline.
1.1 (2026-04-27)
1.0 (2026-01-26)