Voice Cloning at Scale: ElevenLabs, Cartesia, and Sesame Compared

AI Innovation Published May 02, 2026 · voice cloning · text-to-speech · elevenlabs · cartesia · sesame csm

Three companies now define the frontier of synthetic voice: ElevenLabs, the $3.3 billion New York–based startup that cornered the creator market; Cartesia, the Berkeley-founded real-time streaming specialist quietly winning enterprise IVR contracts; and Sesame, the lab backed by Oculus co-founder Brendan Iribe that open-sourced its Conversational Speech Model (CSM) in March 2025 and set a new bar for emotional naturalness in dialogue. Together they are racing to solve three distinct problems: how fast a voice can start speaking, how indistinguishable it sounds from the original speaker, and whether cloning one minute of audio is legally and ethically defensible.

This article cites dated benchmark numbers where available. Revenue estimates and architecture inferences are labeled clearly as conjecture. No generic AI puff — just numbers, tradeoffs, and law.

The Millisecond That Matters: Latency Race

In voice AI, time-to-first-audio (TTFA) — the gap between the last input character and the first played audio chunk — is the north-star metric for real-time applications. Every platform publishes its own numbers; read them skeptically, since they measure from their own edge infrastructure under favorable conditions.

ElevenLabs Flash v2.5 (released November 2024): median TTFA of 75 ms on hosted English inputs, per their published launch documentation. Their prior Turbo v2 model ran at roughly 250–300 ms — a 3–4× improvement in a single model generation.
Cartesia Sonic 2 (released October 2024): 71 ms median TTFA with a p95 of 130 ms, per Cartesia's public latency dashboard. The architectural advantage is genuine: Sonic is built on a state-space model (SSM) backbone, comparable to the Mamba family, that updates a fixed-size state for each audio token. Per-token cost is therefore O(1) in context length, whereas a transformer decoder attends over all n prior tokens at every step (O(n) per token with a key-value cache, O(n²) across the whole sequence). Per-token compute stays flat regardless of how much prior audio context has been generated.
Sesame CSM-1B (open-sourced March 13, 2025): designed for conversational quality, not sub-100 ms latency. Running the published weights on a single A100 GPU produces full-sentence audio in roughly 400–600 ms depending on sentence length, which is acceptable for a turn-taking dialogue agent but prohibitive for IVR.
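The per-token cost difference described above can be illustrated with a toy comparison. This is a minimal sketch, not Cartesia's or any vendor's architecture: `ToySSM` and `ToyAttention` are hypothetical classes, the coefficients are arbitrary, and the "work" counter simply reports how many stored values each step touches.

```python
import math


class ToySSM:
    """Diagonal linear recurrence: h_t = a * h_(t-1) + b * x_t, y_t = c . h_t."""

    def __init__(self, state_size: int = 4) -> None:
        self.a = [0.9 - 0.1 * i for i in range(state_size)]  # arbitrary decay rates
        self.b = [1.0] * state_size
        self.c = [1.0 / state_size] * state_size
        self.h = [0.0] * state_size  # fixed-size state, never grows

    def step(self, x: float) -> tuple[float, int]:
        self.h = [a * h + b * x for a, h, b in zip(self.a, self.h, self.b)]
        y = sum(c * h for c, h in zip(self.c, self.h))
        return y, len(self.h)  # values touched this step


class ToyAttention:
    """Single-head attention over a growing cache of past inputs."""

    def __init__(self) -> None:
        self.cache: list[float] = []

    def step(self, x: float) -> tuple[float, int]:
        self.cache.append(x)
        scores = [x * k for k in self.cache]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        y = sum(w * v for w, v in zip(weights, self.cache)) / total
        return y, len(self.cache)  # values touched this step


ssm, attn = ToySSM(), ToyAttention()
for t in range(1, 1001):
    _, ssm_work = ssm.step(math.sin(t))
    _, attn_work = attn.step(math.sin(t))

print(f"step 1000: SSM touched {ssm_work} values, attention touched {attn_work}")
```

The recurrence touches the same four state values at every step, while the attention step touches one more cached value each time, which is the scaling behavior the comparison above relies on.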

Practical note: API-measured latency diverges sharply from self-hosted numbers. Cartesia's 71 ms is from their edge nodes; routing to a mismatched region adds 80–120 ms of network round-trip before any audio arrives at the client.
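Because published figures are measured at the vendor's edge, it is worth timing the stream from your own client. The sketch below assumes only that a streaming client can be wrapped as a callable returning an iterable of audio byte chunks; `fake_tts_stream` is a hypothetical stand-in with no network involved, and timing starts when the stream is opened rather than at the last typed character.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def measure_ttfa(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from opening the stream until the first non-empty audio chunk."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:  # skip empty keep-alive chunks
            return time.perf_counter() - start
    raise RuntimeError("stream ended before any audio arrived")


def summarize(samples_s: list[float]) -> dict[str, float]:
    """Median and nearest-rank p95, both in milliseconds."""
    ordered = sorted(samples_s)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers
    return {
        "median_ms": statistics.median(ordered) * 1000.0,
        "p95_ms": ordered[rank - 1] * 1000.0,
    }


def fake_tts_stream(delay_s: float = 0.02) -> Iterator[bytes]:
    """Hypothetical stand-in for a vendor streaming client."""
    time.sleep(delay_s)  # simulated model + network delay
    yield b"\x00" * 320
    yield b"\x00" * 320


samples = [measure_ttfa(fake_tts_stream) for _ in range(5)]
print(summarize(samples))
```

Swapping `fake_tts_stream` for a real client call, run from the region where the application is deployed, gives numbers that include the network round-trip the note above warns about.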

Quality: What the Benchmarks Actually Measure

Before diving into numbers, the fastest path to an opinion is direct listening. ElevenLabs maintains a live demo at elevenlabs.io/text-to-speech, Cartesia hosts playable streaming samples at cartesia.ai, and Sesame published video demos of CSM alongside their March 2025 open-source release. Third-party blind comparisons have circulated on Hugging Face Spaces since early 2025 — search TTS Arena on Hugging Face for community-ranked pairwise listening results.

For structured evaluation, the industry uses MOS (Mean Opinion Score, 1–5) and MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor, 0–100). Both require paid human rater panels. Companies publish their own evaluations, which should be treated as upper bounds: vendors choose the test material and rarely invite adversarial testers.
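A MOS figure is simply the mean of listener ratings, so a reported score is only as meaningful as its sample size. The sketch below computes a MOS with an approximate 95% confidence interval using a normal approximation; the ratings are made-up illustration data, not results from any vendor.

```python
import math
import statistics


def mos_with_ci(ratings: list[int], z: float = 1.96) -> tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings must be on the 1-5 scale")
    mean = statistics.fmean(ratings)
    # Standard error of the mean, scaled by z for a normal-approximation interval.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings from ten listeners for one clip.
example_ratings = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]
mos, ci = mos_with_ci(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only ten raters the interval is wide, which is one reason small differences between published MOS values deserve caution.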

ElevenLabs published a Comparative Mean Opinion Score (CMOS) evaluation for Eleven Multilingual v2 in mid-2024, reporting CMOS ≈ +0.3 relative to a human narrator baseline (human baseline: 4.5/5 MOS). That is a near-parity claim on prosodic naturalness for English. ElevenLabs Professional Voice Clone is widely regarded by practitioners as the quality benchmark for long-form narration: it maintains speaker-identity consistency over very long outputs and handles complex intonation contours that cheaper TTS models flatten into monotone.

Cartesia Sonic 2 closes the gap on intelligibility and significantly outperforms transformer-based models on streaming consistency — the voice does not drift during a continuous 30-minute stream, a problem that compounds in transformer decoders as the generation context grows. The O(1) per-token architecture is not only a latency advantage; it is also a quality invariant over output length.

Sesame CSM's informal reception after its open-source release was striking among ML practitioners. The model conditions acoustic tokens on prior dialogue history, taking the text and audio of earlier turns as context, and produces turn-level emotional continuity (hesitation, warmth, emphasis) that no current commercial TTS API ships as a standard feature. The tradeoff: CSM quality degrades measurably on expository monologue. It is tuned for conversation, not narration.

One-Shot Clone vs. Trained Clone: The Fidelity Tradeoff

All three platforms support one-shot voice cloning: synthesizing speech in a target speaker's style from a short reference clip, with no fine-tuning required. ElevenLabs calls this Instant Voice Clone and accepts as little as 10 seconds of audio (60–120 seconds yields meaningfully better results). Cartesia's equivalent is their voice embedding API, which converts a WAV file into a speaker vector applied at inference time. Sesame CSM conditions on prompt audio tokenized by Mimi, a residual-vector-quantization (RVQ) audio codec developed by Kyutai, which encodes fine-grained acoustic detail for speaker matching.
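Since clip length drives instant-clone quality, a pre-upload check can catch references that are too short. This is a minimal sketch using Python's standard `wave` module; the thresholds come from the figures quoted above, while the function names and the `too_short` / `usable` / `recommended` labels are illustrative assumptions, not any vendor's API.

```python
import os
import tempfile
import wave

MIN_SECONDS = 10.0          # lower bound quoted above for an instant clone
RECOMMENDED_SECONDS = 60.0  # 60-120 s is quoted as giving better results


def clip_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, from its frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def assess_reference_clip(path: str) -> str:
    duration = clip_duration_seconds(path)
    if duration < MIN_SECONDS:
        return "too_short"
    if duration < RECOMMENDED_SECONDS:
        return "usable"
    return "recommended"


def write_silence(path: str, seconds: float, rate: int = 8000) -> None:
    """Write a mono 16-bit silent WAV, used here only to exercise the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "reference.wav")
    write_silence(demo_path, 12.0)
    demo_verdict = assess_reference_clip(demo_path)

print(demo_verdict)
```

Duration is only a proxy: a real pipeline would also want to check for noise, clipping, and multiple speakers, none of which this sketch attempts.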

Trained clones — ElevenLabs terms this Professional Voice Clone (PVC) — require at minimum 30 minutes of clean, studio-quality audio and several hours of fine-tuning server-side. The fidelity jump is audible: PVC voices maintain speaker identity across 10,000+ output tokens, while instant-clone voices exhibit "voice blur" — gradual speaker-identity drift — on extended narration. For audiobook production, this distinction determines both the pricing tier and which consent framework the law requires.
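Speaker-identity drift can be monitored by comparing a speaker embedding of each output window against the reference voice. The sketch below assumes the embeddings already exist (how they are extracted is out of scope), and both the toy three-dimensional vectors and the 0.75 similarity threshold are arbitrary illustration values.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def first_drift_window(
    reference: Sequence[float],
    windows: Sequence[Sequence[float]],
    threshold: float = 0.75,  # assumed cutoff, tune per embedding model
) -> Optional[int]:
    """Index of the first window whose similarity falls below the threshold."""
    for index, embedding in enumerate(windows):
        if cosine_similarity(reference, embedding) < threshold:
            return index
    return None


# Hypothetical embeddings: the third window has moved away from the reference.
reference_voice = [1.0, 0.0, 0.0]
output_windows = [
    [0.99, 0.05, 0.00],
    [0.90, 0.30, 0.10],
    [0.50, 0.80, 0.20],
]
drift_at = first_drift_window(reference_voice, output_windows)
print(drift_at)
```

Running the same check on instant-clone and trained-clone output of the same script is one way to make the "voice blur" difference measurable rather than anecdotal.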

Instant Clone (10 seconds – 2 minutes of source audio): Fast setup. Good for short-form content, ad reads, and IVR persona creation. Quality degrades past roughly 10 minutes of continuous output. Speaker drift is the primary failure mode on long narration.

Trained Clone (Professional; 30+ minutes of studio audio; hours of server-side fine-tuning): High fidelity over long outputs. Required for audiobooks, podcast body narration, and dubbing. Requires explicit consent agreements under multiple U.S. state laws — see Disclosure section below.

Use Cases: Where Each Platform Wins

Audiobooks

ElevenLabs is the dominant choice. Eleven Studio supports multi-voice narration, word-level timestamps (required for audiobook-app sync), and SSML pacing controls. ElevenLabs has worked with Findaway — Spotify's audiobook distribution arm — to pilot AI-narrated titles targeting independent authors who distribute outside Amazon's ACX network. ACX still formally requires human narrators as of early 2026, creating a meaningful addressable segment for non-Audible channels. Speaker consistency over a 10-hour audiobook is the make-or-break requirement; only a trained clone reliably delivers it.
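Word-level timestamps become useful for app sync once they are grouped into cues a player can highlight. The sketch below assumes a simple `(text, start, end)` record per word, which is a hypothetical format rather than any platform's actual response schema, and it splits sentences naively on terminal punctuation.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


class Cue(NamedTuple):
    text: str
    start: float
    end: float


def _close(group: list[Word]) -> Cue:
    """Collapse a run of words into one cue spanning first start to last end."""
    return Cue(" ".join(w.text for w in group), group[0].start, group[-1].end)


def sentence_cues(words: list[Word]) -> list[Cue]:
    cues: list[Cue] = []
    current: list[Word] = []
    for word in words:
        current.append(word)
        if word.text.endswith((".", "!", "?")):  # naive sentence boundary
            cues.append(_close(current))
            current = []
    if current:  # trailing words with no terminal punctuation
        cues.append(_close(current))
    return cues


sample_words = [
    Word("Call", 0.0, 0.3), Word("me", 0.3, 0.5), Word("Ishmael.", 0.5, 1.1),
    Word("Some", 1.4, 1.7), Word("years", 1.7, 2.0), Word("ago.", 2.0, 2.4),
]
for cue in sentence_cues(sample_words):
    print(cue)
```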

Podcasts

The dominant production use case is dynamic ad insertion: clone the host's voice once, generate sponsor reads on demand at any time, and insert them into any episode without re-recording. ElevenLabs' webhook-to-audio pipeline is well-documented and widely deployed by independent podcast networks. Cartesia is an emerging alternative for teams embedding synthesis inside editing software, where its lower API latency and straightforward WebSocket streaming API reduce integration complexity for real-time preview workflows.
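At its core, dynamic ad insertion is a splice at a known offset. The sketch below works on raw PCM bytes and assumes the episode and the generated ad share the same sample rate, sample width, and channel count; `insert_ad` is a hypothetical helper, and a production pipeline would also handle crossfades and loudness matching, which are omitted here.

```python
def insert_ad(
    episode: bytes,
    ad: bytes,
    at_seconds: float,
    sample_rate: int = 44100,
    sample_width: int = 2,  # bytes per sample (16-bit PCM)
    channels: int = 1,
) -> bytes:
    """Return the episode with the ad spliced in at the given time offset."""
    frame_bytes = sample_width * channels
    # Snap to a whole frame so samples are never split mid-value.
    offset = int(at_seconds * sample_rate) * frame_bytes
    if not 0 <= offset <= len(episode):
        raise ValueError("insertion point lies outside the episode")
    return episode[:offset] + ad + episode[offset:]


# Tiny illustration at 10 frames per second: a 1 s episode, a 0.3 s ad.
episode_pcm = b"\x01\x00" * 10
ad_pcm = b"\x02\x00" * 3
spliced = insert_ad(episode_pcm, ad_pcm, at_seconds=0.5, sample_rate=10)
print(len(spliced))
```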

IVR and Conversational Voice Agents

This is Cartesia's home turf. Human turn-taking tolerance sits around 200–300 ms total — leaving barely 100 ms for TTS model inference after accounting for network round-trips. Cartesia Sonic 2 shipped native connectors for Twilio Voice and LiveKit in late 2024, making it straightforward to drop into existing telephony stacks. ElevenLabs offers a WebSocket streaming API that reaches comparable TTFA on short phrases but shows higher variance on multi-sentence outputs. Sesame CSM is not architected for IVR throughput and does not currently offer a production streaming API.
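The budget arithmetic above can be made explicit. This is a back-of-the-envelope sketch using figures quoted in this article (a 200 ms turn tolerance, a 100 ms network round-trip, and the vendors' claimed TTFA values); the function names and the optional `other_stages_ms` term for speech recognition or language-model time are assumptions added for illustration.

```python
def tts_budget_ms(
    turn_tolerance_ms: float,
    network_rtt_ms: float,
    other_stages_ms: float = 0.0,  # assumed slot for ASR / LLM time, if any
) -> float:
    """Milliseconds left for TTS inference inside one conversational turn."""
    return turn_tolerance_ms - network_rtt_ms - other_stages_ms


def fits_budget(ttfa_ms: float, budget_ms: float) -> bool:
    return ttfa_ms <= budget_ms


budget = tts_budget_ms(turn_tolerance_ms=200.0, network_rtt_ms=100.0)

# TTFA values as claimed or estimated earlier in this article.
claimed_ttfa_ms = {
    "Cartesia Sonic 2": 71.0,
    "ElevenLabs Flash v2.5": 75.0,
    "Sesame CSM-1B (self-hosted, lower estimate)": 400.0,
}
for name, ttfa in claimed_ttfa_ms.items():
    verdict = "fits" if fits_budget(ttfa, budget) else "exceeds"
    print(f"{name}: {ttfa:.0f} ms {verdict} the {budget:.0f} ms budget")
```

Any time spent on speech recognition or response generation comes out of the same budget, so the margin in a full voice-agent pipeline is tighter than this TTS-only view suggests.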

The Disclosure Minefield

Voice cloning sits at the intersection of three regulatory regimes, all enacted or activated since early 2024:

FCC TCPA ruling (February 2024): The FCC voted unanimously to classify AI-generated voices in robocalls as "artificial or prerecorded" under the Telephone Consumer Protection Act, effectively banning them without prior written consent from call recipients. The ruling was a direct response to the fake-Biden robocall targeting New Hampshire Democratic primary voters in January 2024.
Tennessee ELVIS Act (signed March 21, 2024; effective July 2024): The Ensuring Likeness Voice and Image Security Act adds "voice" to Tennessee's existing right-of-publicity law, giving performers a private right of action against unauthorized AI voice cloning. Named after Elvis Presley, it is the first U.S. statute specifically targeting synthetic voice replication rather than image deepfakes.
EU AI Act, Article 50 (Act entered into force August 1, 2024; Article 50's transparency obligations are scheduled to apply from August 2, 2026): Requires that AI-generated audio be labeled as such when presented to end-users, with narrow exceptions for clearly labeled satire and parody. C2PA audio provenance metadata, which embeds a cryptographic manifest in the audio file itself, is emerging as the leading technical compliance path for platforms operating across EU member states.
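The idea behind provenance metadata, binding a disclosure statement to the exact audio bytes, can be shown with a simplified sidecar manifest. To be clear, this is not the C2PA format: real C2PA manifests are cryptographically signed and embedded in the asset, which this sketch does not attempt. The field names and the `example-tts` generator label are invented for illustration.

```python
import hashlib
import json


def build_manifest(audio: bytes, generator: str) -> str:
    """JSON sidecar declaring the audio AI-generated and pinning its hash."""
    manifest = {
        "ai_generated": True,
        "generator": generator,  # hypothetical label, not a real product ID
        "sha256": hashlib.sha256(audio).hexdigest(),
    }
    return json.dumps(manifest, sort_keys=True)


def manifest_matches(audio: bytes, manifest_json: str) -> bool:
    """True if the audio bytes are the ones the manifest was built for."""
    manifest = json.loads(manifest_json)
    return manifest.get("sha256") == hashlib.sha256(audio).hexdigest()


clip = b"\x00\x01" * 1000
sidecar = build_manifest(clip, generator="example-tts")
print(manifest_matches(clip, sidecar))
print(manifest_matches(clip + b"\x00", sidecar))
```

An unsigned hash only detects accidental mismatch; anyone can regenerate it after editing the audio, which is why the real specification relies on signatures.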

California's AB 2602, signed September 2024, goes further in the entertainment sector: contracts granting studios the right to clone a performer's voice using AI are void unless they explicitly describe the intended use and include separate, itemized compensation terms. This directly affects audiobook producers and podcast networks that ask talent to sign broad voicing rights as standard employment conditions — a common practice as recently as 2023.

Conjecture, marked clearly: As of May 2026, no federal right-of-publicity statute covers AI voice cloning in the United States. The patchwork of state laws (Tennessee, California, New York, Georgia, and at least eight others enacted in 2024–2025) creates material compliance uncertainty for any platform serving customers across state lines. Federal consolidation legislation has been introduced in the 119th Congress but has not cleared committee as of this writing. Platforms operating in multiple states should maintain state-by-state consent matrices and treat all commercial voice clones as requiring explicit performer authorization until federal clarity arrives.
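A consent matrix can start as a small lookup table with a conservative default. The sketch below encodes only what this article states about Tennessee and California and falls back to "explicit performer authorization" everywhere else, per the recommendation above; the field names are hypothetical and the entries are a simplification, not a statement of what each statute requires in full.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentRule:
    statute: str
    performer_authorization: bool
    intended_use_described: bool = False
    separate_compensation: bool = False


# Conservative fallback for any state without an entry.
DEFAULT_RULE = ConsentRule("no entry (conservative default)", True)

# Entries reflect only the summaries given in this article.
CONSENT_MATRIX = {
    "TN": ConsentRule("ELVIS Act", performer_authorization=True),
    "CA": ConsentRule(
        "AB 2602",
        performer_authorization=True,
        intended_use_described=True,
        separate_compensation=True,
    ),
}

REQUIREMENT_NAMES = (
    "performer_authorization",
    "intended_use_described",
    "separate_compensation",
)


def missing_requirements(state: str, on_file: set[str]) -> list[str]:
    """Requirements the rule for this state demands that are not yet on file."""
    rule = CONSENT_MATRIX.get(state.upper(), DEFAULT_RULE)
    required = [name for name in REQUIREMENT_NAMES if getattr(rule, name)]
    return [name for name in required if name not in on_file]


print(missing_requirements("CA", {"performer_authorization"}))
print(missing_requirements("NY", set()))
```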

Competitive Positioning

Conjecture / Estimate (May 2026): ElevenLabs raised $180M in a Series C in January 2025 at a reported $3.3 billion valuation. Applying the 20–30× ARR multiples typical of top-decile AI infrastructure companies at that funding stage, implied ARR at Series C close was roughly $110–165M. Cartesia, having raised a $24M Series A in late 2024, is almost certainly pre-$20M ARR; their go-to-market is API-first and developer-led with limited enterprise sales motion visible publicly. Sesame's CSM-1B weights are open-source under the Apache 2.0 license; their monetization runs through a proprietary companion product, not a TTS API business.
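The implied-ARR range is just the valuation divided by each end of the multiple range. A minimal sketch of that arithmetic, using the figures quoted above (the function name is illustrative):

```python
def implied_arr_range(
    valuation_usd: float, low_multiple: float, high_multiple: float
) -> tuple[float, float]:
    """(lowest, highest) ARR implied by a valuation and a range of multiples."""
    # A higher multiple implies less revenue for the same valuation.
    return valuation_usd / high_multiple, valuation_usd / low_multiple


arr_low, arr_high = implied_arr_range(3.3e9, low_multiple=20, high_multiple=30)
print(f"${arr_low / 1e6:.0f}M to ${arr_high / 1e6:.0f}M")
```

The estimate is only as good as the assumed multiple: moving the range to 15–40× would widen the implied ARR band considerably.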

The structural read: ElevenLabs dominates creator and mid-market segments on brand recognition and ecosystem breadth. Cartesia leads enterprise real-time workloads on architectural efficiency. Sesame sets quality benchmarks in conversational naturalness that neither rival has matched in a shipping product. All three know it. Expect aggressive feature convergence — or outright acquisition — within 18 months.

Frequently asked

How much audio do I need to clone a voice with ElevenLabs?

ElevenLabs Instant Voice Clone accepts as little as 10 seconds of audio, though 60–120 seconds produces noticeably better results. For Professional Voice Clone — used for audiobooks and long-form narration — a minimum of 30 minutes of clean, studio-quality audio is required. The fine-tuning process runs server-side and typically takes several hours to complete.

What is time-to-first-audio (TTFA) and why does it matter for voice bots?

TTFA is the elapsed time between submitting the last text character and hearing the first audio chunk from the model. For conversational voice bots and IVR systems, human turn-taking tolerance is roughly 200–300 ms total — leaving barely 100 ms for TTS inference after network latency is accounted for. ElevenLabs Flash v2.5 claims 75 ms TTFA and Cartesia Sonic 2 claims 71 ms, both measured from their own edge infrastructure rather than from a customer's server.

Is voice cloning legal in the United States?

It depends on jurisdiction and use case. The FCC banned AI-generated voices in robocalls in February 2024 under TCPA. Tennessee's ELVIS Act (effective July 2024) gives performers a private right of action against unauthorized cloning. California's AB 2602 (signed September 2024) voids entertainment contracts that do not explicitly disclose AI voice-cloning use and provide separate compensation. No single federal law governs the space yet, so compliance requires a state-by-state analysis.

What makes Sesame CSM different from ElevenLabs or Cartesia?

Sesame's Conversational Speech Model conditions each spoken turn on prior conversation history (the text and audio of earlier turns), producing emotional continuity (hesitation, warmth, emphasis) across multi-turn dialogue. ElevenLabs and Cartesia treat each synthesis call independently. The tradeoff: CSM is not optimized for streaming latency or long-form narration, and it is distributed as open weights under the Apache 2.0 license rather than as a paid API.

Do I need to disclose AI-generated voices in a podcast or audiobook?

Requirements vary by platform and geography. Amazon's ACX still requires human narrators as of early 2026. The EU AI Act (Article 50) mandates disclosure for AI-generated audio in most public-facing EU contexts. California AB 2602 requires disclosure and compensation if a contracted performer's voice is cloned. Spotify and most U.S. podcast platforms do not yet mandate AI-voice labeling, though industry best-practice guidance from organizations like the RTDNA recommends disclosure regardless.

Last reviewed May 02, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.