Voice Cloning at Scale: ElevenLabs, Cartesia, and Sesame Compared
Three companies now define the frontier of synthetic voice: ElevenLabs, the $3.3 billion New York–based startup that cornered the creator market; Cartesia, the Berkeley-founded real-time streaming specialist quietly winning enterprise IVR contracts; and Sesame, the lab backed by Oculus co-founder Brendan Iribe that open-sourced its Conversational Speech Model (CSM) in March 2025 and set a new bar for emotional naturalness in dialogue. Together they are racing to solve three distinct problems: how fast a voice can start speaking, how indistinguishable it sounds from the original speaker, and whether cloning one minute of audio is legally and ethically defensible.
This article cites dated benchmark numbers where available. Revenue estimates and architecture inferences are labeled clearly as conjecture. No generic AI puff — just numbers, tradeoffs, and law.
The Millisecond That Matters: Latency Race
In voice AI, time-to-first-audio (TTFA), the gap between the last input character and the first played audio chunk, is the north-star metric for real-time applications. Every platform publishes its own numbers; read them skeptically, since vendors measure from their own edge infrastructure under favorable conditions.
- ElevenLabs Flash v2.5 (released November 2024): median TTFA of 75 ms on hosted English inputs, per their published launch documentation. Their prior Turbo v2 model ran at roughly 250–300 ms — a 3–4× improvement in a single model generation.
- Cartesia Sonic 2 (released October 2024): 71 ms median TTFA with a p95 of 130 ms, per Cartesia's public latency dashboard. The architectural advantage is genuine: Sonic is built on a state-space model (SSM) backbone, comparable to the Mamba family, that updates a fixed-size state in O(1) time per audio token. A transformer decoder with a key-value cache instead attends over all n prior tokens at every step, which costs O(n) per token and O(n²) over a full sequence. Per-token latency therefore stays flat regardless of how much prior audio context has been generated.
- Sesame CSM-1B (open-sourced March 13, 2025): designed for conversational quality, not sub-100 ms latency. Running the published weights on a single A100 GPU produces full-sentence audio in roughly 400–600 ms depending on sentence length. That is a full-sentence figure rather than a TTFA measurement, so it is not directly comparable to the two numbers above, but it is acceptable for a turn-taking dialogue agent and prohibitive for IVR.
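The median and p95 figures quoted above can be checked against your own network path with a small harness. The following is a minimal sketch, assuming the client exposes synthesized audio as an iterator of byte chunks. `fake_stream` and the injectable `clock` are hypothetical stand-ins for illustration, not any vendor's SDK.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return milliseconds from call time until the first non-empty chunk."""
    start = clock()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive frames
            return (clock() - start) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Median and p95 of a list of TTFA samples, in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(samples_ms, n=100, method="inclusive")[94]
    return {"median": statistics.median(samples_ms), "p95": p95}


def fake_stream(delay_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a vendor stream: waits, then yields audio."""
    time.sleep(delay_s)
    yield b"\x00\x01" * 160
    yield b"\x00\x01" * 160


if __name__ == "__main__":
    runs = [time_to_first_audio(fake_stream(0.02)) for _ in range(20)]
    print(summarize(runs))
```

In a real measurement the timer should start when the last input character is sent, and the runs should originate from the region where the application will actually be deployed.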
Quality: What the Benchmarks Actually Measure
Before diving into numbers, the fastest path to an opinion is direct listening. ElevenLabs maintains a live demo at elevenlabs.io/text-to-speech, Cartesia hosts playable streaming samples at cartesia.ai, and Sesame published video demos of CSM alongside their March 2025 open-source release. Third-party blind comparisons have circulated on Hugging Face Spaces since early 2025 — search TTS Arena on Hugging Face for community-ranked pairwise listening results.
For structured evaluation, the industry uses MOS (Mean Opinion Score, 1–5) and MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor, 0–100). Both require paid human rater panels. Companies publish their own evaluations, which should be treated as upper bounds on real-world quality, since vendors rarely invite adversarial testers.
ElevenLabs published a Comparative Mean Opinion Score (CMOS) evaluation for Eleven Multilingual v2 in mid-2024, reporting CMOS ≈ +0.3 relative to a human narrator baseline (human baseline: 4.5/5 MOS). That is a near-parity claim on prosodic naturalness for English. ElevenLabs Professional Voice Clone is widely regarded by practitioners as the quality benchmark for long-form narration: it maintains speaker-identity consistency over very long outputs and handles complex intonation contours that cheaper TTS models flatten into monotone.
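For readers unfamiliar with how these scores are derived, here is a minimal sketch of the arithmetic. It assumes MOS ratings on the 1–5 scale and comparative ratings on a -3 to +3 scale, where a positive value means the rater preferred the system under test. The function names are illustrative, and the normal-approximation interval is one common choice rather than any vendor's stated method.

```python
import math
import statistics


def mos_with_ci(ratings: list[int], z: float = 1.96) -> tuple[float, float]:
    """Mean opinion score and half-width of a normal-approximation 95% CI."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def cmos(pairwise: list[int]) -> float:
    """Comparative MOS: mean of per-trial preference scores.

    Each score is on a -3..+3 scale; positive means the rater preferred
    the system under test over the reference.
    """
    if not pairwise:
        raise ValueError("need at least one comparison")
    return statistics.fmean(pairwise)
```

The interval half-width shrinks with the square root of the panel size, which is why small rater panels rarely separate two strong systems.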
Cartesia Sonic 2 closes the gap on intelligibility and significantly outperforms transformer-based models on streaming consistency — the voice does not drift during a continuous 30-minute stream, a problem that compounds in transformer decoders as the generation context grows. The O(1) per-token architecture is not only a latency advantage; it is also a quality invariant over output length.
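The per-token cost difference can be made concrete with a toy operation count. This is a sketch under simplifying assumptions: a diagonal linear recurrence stands in for an SSM layer and scalar dot products stand in for attention. It is not Cartesia's implementation, and every name in it is illustrative.

```python
def ssm_step(state: list[float], x: float, a: float, b: float) -> tuple[list[float], int]:
    """One step of a toy diagonal linear recurrence h <- a*h + b*x.

    Returns the new state and the number of multiply-adds performed.
    The state has fixed size, so the cost does not depend on position.
    """
    new_state = [a * h + b * x for h in state]
    return new_state, len(state)


def attention_step(cache: list[float], query: float, key: float) -> tuple[float, int]:
    """One decoding step of toy single-head attention with a key cache.

    Appends the new key, then scores the query against every cached key,
    so the cost grows with the number of tokens generated so far.
    """
    cache.append(key)
    scores = [query * k for k in cache]
    return sum(scores), len(cache)


def cost_profile(n_tokens: int, state_size: int = 4) -> tuple[list[int], list[int]]:
    """Per-step operation counts for both toy decoders over n_tokens steps."""
    state = [0.0] * state_size
    cache: list[float] = []
    ssm_costs: list[int] = []
    attn_costs: list[int] = []
    for t in range(n_tokens):
        x = float(t)
        state, ssm_cost = ssm_step(state, x, a=0.9, b=0.1)
        _, attn_cost = attention_step(cache, x, x)
        ssm_costs.append(ssm_cost)
        attn_costs.append(attn_cost)
    return ssm_costs, attn_costs
```

The recurrence performs the same amount of work at every step, while the attention step's work grows linearly with position, so its total over a sequence grows quadratically.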
The informal reception among ML practitioners after CSM's open-source release was striking. The model conditions generation on prior dialogue turns, both text and audio, supplied as context, producing turn-level emotional continuity (hesitation, warmth, emphasis) that no current commercial TTS API ships as a standard feature. The tradeoff: CSM quality degrades measurably on expository monologue. It is tuned for conversation, not narration.
One-Shot Clone vs. Trained Clone: The Fidelity Tradeoff
All three platforms support one-shot voice cloning: synthesizing speech in a target speaker's style from a short reference clip, with no fine-tuning required. ElevenLabs calls this Instant Voice Clone and accepts as little as 10 seconds of audio (60–120 seconds yields meaningfully better results). Cartesia's equivalent is their voice embedding API, which converts a WAV file into a speaker vector applied at inference time. Sesame CSM uses prompt-audio conditioning: the reference clip is encoded into residual vector quantization (RVQ) tokens by the Mimi audio codec, which preserves fine-grained acoustic detail for speaker matching.
Trained clones — ElevenLabs terms this Professional Voice Clone (PVC) — require at minimum 30 minutes of clean, studio-quality audio and several hours of fine-tuning server-side. The fidelity jump is audible: PVC voices maintain speaker identity across 10,000+ output tokens, while instant-clone voices exhibit "voice blur" — gradual speaker-identity drift — on extended narration. For audiobook production, this distinction determines both the pricing tier and which consent framework the law requires.
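One way to quantify the "voice blur" described above is to compare a reference speaker vector against vectors computed from successive windows of the generated audio. The sketch below assumes some speaker-embedding model has already produced those vectors; the functions and the threshold are hypothetical and not part of any platform's API.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("zero-length embedding")
    return dot / norm


def identity_drift(reference: list[float], windows: list[list[float]]) -> list[float]:
    """Similarity of each output window's embedding to the reference speaker."""
    return [cosine_similarity(reference, w) for w in windows]


def first_drift_index(similarities: list[float], threshold: float) -> Optional[int]:
    """Index of the first window that falls below the threshold, if any."""
    for i, s in enumerate(similarities):
        if s < threshold:
            return i
    return None
```

A suitable threshold depends on the embedding model and has to be calibrated on recordings of the real speaker.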
Use Cases: Where Each Platform Wins
Audiobooks
ElevenLabs is the dominant choice. Eleven Studio supports multi-voice narration, word-level timestamps (required for audiobook-app sync), and SSML pacing controls. ElevenLabs has worked with Findaway — Spotify's audiobook distribution arm — to pilot AI-narrated titles targeting independent authors who distribute outside Amazon's ACX network. ACX still formally requires human narrators as of early 2026, creating a meaningful addressable segment for non-Audible channels. Speaker consistency over a 10-hour audiobook is the make-or-break requirement; only a trained clone reliably delivers it.
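Word-level timestamps matter because a player has to map a playback position back to a word for highlighting or resume. Here is a minimal sketch of that lookup, assuming timings arrive sorted by start time. `WordTiming` is a hypothetical structure for illustration, not the ElevenLabs response format.

```python
import bisect
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WordTiming:
    word: str
    start_s: float
    end_s: float


def word_at(timings: list[WordTiming], position_s: float) -> Optional[WordTiming]:
    """Return the word being spoken at position_s, or None during silence.

    Assumes timings are sorted by start_s and do not overlap.
    """
    starts = [t.start_s for t in timings]
    # Index of the last word whose start is at or before the position.
    i = bisect.bisect_right(starts, position_s) - 1
    if i < 0:
        return None
    candidate = timings[i]
    return candidate if position_s < candidate.end_s else None
```

Binary search keeps the lookup logarithmic in the number of words, which matters for a chapter containing tens of thousands of them.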
Podcasts
The dominant production use case is dynamic ad insertion: clone the host's voice once, generate sponsor reads on demand, and insert them into any episode without re-recording. ElevenLabs' webhook-to-audio pipeline is well-documented and widely deployed by independent podcast networks. Cartesia is an emerging alternative for teams embedding synthesis inside editing software, where its lower API latency and straightforward WebSocket streaming API reduce integration complexity for real-time preview workflows.
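The insertion step itself is simple once the sponsor read exists as audio. This is a minimal sketch, assuming both clips are mono PCM sample lists at the same sample rate. `insert_ad` is a hypothetical helper, and a production pipeline would also crossfade and match loudness.

```python
def insert_ad(
    episode: list[int],
    ad: list[int],
    at_seconds: float,
    sample_rate: int,
) -> list[int]:
    """Return a new PCM sample list with `ad` spliced in at `at_seconds`."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    cut = round(at_seconds * sample_rate)
    if not 0 <= cut <= len(episode):
        raise ValueError("insertion point is outside the episode")
    # The original episode is left untouched; a new list is returned.
    return episode[:cut] + ad + episode[cut:]
```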
IVR and Conversational Voice Agents
This is Cartesia's home turf. Human turn-taking tolerance sits around 200–300 ms total — leaving barely 100 ms for TTS model inference after accounting for network round-trips. Cartesia Sonic 2 shipped native connectors for Twilio Voice and LiveKit in late 2024, making it straightforward to drop into existing telephony stacks. ElevenLabs offers a WebSocket streaming API that reaches comparable TTFA on short phrases but shows higher variance on multi-sentence outputs. Sesame CSM is not architected for IVR throughput and does not currently offer a production streaming API.
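The budget arithmetic can be written down directly. The sketch below takes the 300 ms upper bound from the paragraph above. The split of the other 200 ms between network and buffering is an assumption, chosen only so that the remainder matches the 100 ms figure, and the stage names are illustrative.

```python
def tts_budget_ms(turn_budget_ms: float, other_stages_ms: dict[str, float]) -> float:
    """Milliseconds left for TTS after subtracting every other stage."""
    return turn_budget_ms - sum(other_stages_ms.values())


def fits(ttfa_ms: float, turn_budget_ms: float, other_stages_ms: dict[str, float]) -> bool:
    """True when the TTS time-to-first-audio fits in the remaining budget."""
    return ttfa_ms <= tts_budget_ms(turn_budget_ms, other_stages_ms)


if __name__ == "__main__":
    # Assumed stage costs; measure your own stack before relying on them.
    stages = {"network_round_trip": 120.0, "audio_buffering": 80.0}
    print(tts_budget_ms(300.0, stages))
    for name, ttfa in {"71 ms model": 71.0, "400 ms model": 400.0}.items():
        print(name, fits(ttfa, 300.0, stages))
```

Under these assumed stage costs, a model with a 71 ms TTFA fits and one with a 400 ms response time does not, which is the distinction the paragraph draws.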
The Disclosure Minefield
Voice cloning sits at the intersection of three regulatory regimes, all enacted or activated since the start of 2024:
- FCC TCPA ruling (February 2024): The FCC voted unanimously to classify AI-generated voices in robocalls as "artificial or prerecorded" under the Telephone Consumer Protection Act, effectively banning them without the prior express consent of call recipients. The ruling was a direct response to the fake-Biden robocall targeting New Hampshire Democratic primary voters in January 2024.
- Tennessee ELVIS Act (signed March 21, 2024; effective July 2024): The Ensuring Likeness Voice and Image Security Act adds "voice" to Tennessee's existing right-of-publicity law, giving performers a private right of action against unauthorized AI voice cloning. Named after Elvis Presley, it is the first U.S. statute specifically targeting synthetic voice replication rather than image deepfakes.
- EU AI Act, Article 50 (Act entered into force August 1, 2024; Article 50 obligations apply from August 2, 2026): Requires that AI-generated audio be labeled as such when presented to end-users, with narrow exceptions for clearly labeled satire and parody. C2PA audio provenance metadata, which embeds a cryptographic manifest in the audio file itself, is emerging as the leading technical compliance path for platforms operating across EU member states.
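The idea behind binding a disclosure label to the audio bytes can be sketched with the standard library. This is explicitly not the C2PA manifest format, which uses certificate-based signatures and embeds the manifest in the file. Here an HMAC with a shared key stands in for the signature, and the record fields are hypothetical.

```python
import hashlib
import hmac
import json


def make_disclosure_record(audio: bytes, generator: str, key: bytes) -> dict[str, str]:
    """Build a signed record that labels `audio` as AI-generated."""
    claim = {
        "ai_generated": "true",
        "generator": generator,
        "audio_sha256": hashlib.sha256(audio).hexdigest(),
    }
    payload = json.dumps(claim, sort_keys=True).encode("utf-8")
    # HMAC is a stand-in for the certificate-based signature a real scheme uses.
    claim["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return claim


def verify_disclosure_record(audio: bytes, record: dict[str, str], key: bytes) -> bool:
    """Check that the record matches the audio bytes and was signed with `key`."""
    claim = {k: v for k, v in record.items() if k != "signature"}
    payload = json.dumps(claim, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, record.get("signature", ""))
    hash_ok = claim.get("audio_sha256") == hashlib.sha256(audio).hexdigest()
    return signature_ok and hash_ok
```

Because the record includes a hash of the audio, editing either the audio or the label after signing causes verification to fail.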
California's AB 2602, signed September 2024, goes further in the entertainment sector: a contract provision granting the right to create a digital replica of a performer's voice is unenforceable when it lacks a reasonably specific description of the intended uses and the performer was not represented by legal counsel or a union. This directly affects audiobook producers and podcast networks that ask talent to sign broad voicing rights as standard employment conditions, a common practice as recently as 2023.
Competitive Positioning
The structural read: ElevenLabs dominates creator and mid-market segments on brand recognition and ecosystem breadth. Cartesia leads enterprise real-time workloads on architectural efficiency. Sesame sets quality benchmarks in conversational naturalness that neither rival has matched in a shipping product. All three know it. Expect aggressive feature convergence — or outright acquisition — within 18 months.
Sources & further reading
- ElevenLabs Flash v2.5 — ElevenLabs Blog
- Cartesia Sonic 2 announcement — Cartesia Blog
- Sesame AI Labs: Conversational Speech Model (CSM) — GitHub
- FCC: AI-Generated Voices in Robocalls Are Illegal Under TCPA (February 2024)
- Tennessee ELVIS Act — SB 2096 (signed March 21, 2024)
- EU AI Act — Regulation (EU) 2024/1689, Official Journal of the European Union
- California AB 2602 — Digital Replica Protections for Performing Artists (2024)
Last reviewed May 02, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.