Real $/hour of audio: the math that actually matters
Voice AI pricing is opaque on purpose. The number that matters is **$/hour of generated audio**, and once you normalize, the gaps are larger than the marketing suggests. Assume ~150 words/minute of speech (industry average for natural narration) and ~5 chars/word: 150 × 60 = 9,000 words/hour, × 5 chars/word = ~45,000 characters per hour of audio.
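The normalization above is simple enough to keep as a helper, which makes it easy to see how sensitive the figure is to the inputs. This is a minimal sketch: `chars_per_hour` is a hypothetical name, and the second call (faster delivery, 6 chars/word to approximate billed spaces) is an illustrative assumption, not a figure from any provider.

```python
def chars_per_hour(words_per_minute: float = 150, chars_per_word: float = 5) -> float:
    """Characters of input text needed for one hour of generated audio."""
    return words_per_minute * 60 * chars_per_word


# Baseline used throughout this section.
print(chars_per_hour())        # 45000

# Assumed alternative: faster speech, and spaces counted as billed characters.
print(chars_per_hour(170, 6))  # 61200
```

Every $/hour figure below scales linearly with this number, so a different speaking rate or character count shifts all providers by the same factor and leaves the ranking unchanged.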
**ElevenLabs at the Pro tier** ($99/mo for 500k chars): you get roughly 11 hours of audio/month, which works out to about **$9/hour** of audio. At Scale ($330/mo for 2M chars) you get ~44 hours/month, or about **$7.40/hour**. At Business ($1,320/mo for 11M chars) you get ~244 hours/month, or **$5.40/hour**. These figures assume you use the full monthly quota; unused characters push the effective rate up.
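For fixed-price tiers, the calculation is quota divided by 45,000 to get hours, then price divided by hours. The sketch below reproduces it for the three tiers quoted above. `tier_cost_per_hour` is a hypothetical helper, and it assumes the whole quota is consumed each month.

```python
CHARS_PER_HOUR = 45_000  # 150 words/min * 60 min * 5 chars/word


def tier_cost_per_hour(monthly_price: float, monthly_chars: int) -> tuple[float, float]:
    """Return (hours of audio per month, dollars per hour) for a fixed-price tier.

    Assumes the full character quota is used every month.
    """
    hours = monthly_chars / CHARS_PER_HOUR
    return hours, monthly_price / hours


# Prices and quotas as quoted in the text.
tiers = {
    "Pro": (99, 500_000),
    "Scale": (330, 2_000_000),
    "Business": (1_320, 11_000_000),
}

for name, (price, chars) in tiers.items():
    hours, per_hour = tier_cost_per_hour(price, chars)
    print(f"{name}: {hours:.1f} h/month, ${per_hour:.2f}/hour")
```

Computed without rounding the hours first, the per-hour figures come out a few cents off the headline numbers (roughly $8.9, $7.4, and $5.4), which does not change the comparison.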
**Cartesia at pay-as-you-go** ($0.05/1k chars): 45,000 chars/hour × $0.05/1k = **$2.25/hour** of audio. At the Pro tier ($99/mo for 1M credits = ~22 hours, assuming one credit per character) that's about **$4.50/hour**, with the predictability of a fixed monthly bill. Either way, Cartesia undercuts every ElevenLabs tier.
**OpenAI tts-1** at $15/1M chars: 45,000 chars/hour × $15/1M = **$0.68/hour** of audio. tts-1-hd at $30/1M = $1.35/hour. Both are dramatically cheaper than ElevenLabs and Cartesia — but the voice quality and emotional range don't compete with ElevenLabs at the top end.
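Metered rates are quoted per 1k or per 1M characters, so it helps to normalize them with one function. A minimal sketch, using only the rates quoted in this section; `metered_cost_per_hour` is a hypothetical name.

```python
CHARS_PER_HOUR = 45_000  # 150 words/min * 60 min * 5 chars/word


def metered_cost_per_hour(price: float, per_chars: int) -> float:
    """Dollars per hour of audio for a rate quoted as `price` per `per_chars` characters."""
    return CHARS_PER_HOUR * price / per_chars


# (price in dollars, number of characters that price buys), as quoted in the text.
rates = {
    "Cartesia pay-as-you-go": (0.05, 1_000),
    "OpenAI tts-1": (15, 1_000_000),
    "OpenAI tts-1-hd": (30, 1_000_000),
}

for name, (price, per_chars) in rates.items():
    print(f"{name}: ${metered_cost_per_hour(price, per_chars):.3f}/hour")
```

Unlike the subscription tiers, these rates have no unused-quota effect: the per-hour cost is the same at one hour a month or a thousand.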
**OpenAI Realtime API** is the outlier because it's not just TTS: it's a full conversational LLM with audio I/O. Per-hour math depends entirely on how much your users talk vs. how much the model responds, but a typical 5-minute voice agent session (~750 words of speech at 150 words/minute, split between user input and model output) costs roughly $0.12-0.20 with the full model, or $0.03-0.05 with the mini.
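To put the session figures on the same footing as the rest of this section, they can be scaled to an hour. The sketch below does this under a loud assumption: cost grows linearly with session length, which may understate long conversations if accumulated context is billed on each turn. Note also that the result is an hour of conversation (user and model speech combined), not an hour of generated audio. `session_to_hourly` is a hypothetical helper.

```python
def session_to_hourly(
    cost_low: float, cost_high: float, session_minutes: float = 5
) -> tuple[float, float]:
    """Scale a per-session cost range to one hour of session time.

    Assumes cost is linear in session length (an approximation).
    """
    factor = 60 / session_minutes
    return cost_low * factor, cost_high * factor


# Per-session ranges as quoted in the text for a 5-minute session.
full = session_to_hourly(0.12, 0.20)
mini = session_to_hourly(0.03, 0.05)

print(f"full model: ${full[0]:.2f}-${full[1]:.2f} per session-hour")  # $1.44-$2.40
print(f"mini model: ${mini[0]:.2f}-${mini[1]:.2f} per session-hour")  # $0.36-$0.60
```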
**Verdict on cost**: OpenAI tts-1 wins on pure $/hour for batch narration. Cartesia wins on $/hour for production-quality voice. ElevenLabs is the most expensive per hour, but it is the only one that meets the emotional and expressive quality bar required for audiobook narration and premium podcast production.