Voice agents bill on a completely different rate card from text chat, and the gap is wide enough that engineers used to text-token economics routinely under-budget realtime deployments by 4-6x. As of June 2026, gpt-5.5-realtime — the conversational endpoint that streams audio in and audio out over a persistent WebSocket — bills audio input at $40.00 per 1M tokens and audio output at $80.00 per 1M tokens. That is 8x the text input rate ($5.00) and ~2.7x the text output rate ($30.00) on the same underlying model. Mixed-modality sessions are billed per stream: a turn where the user speaks and the model replies with audio plus a tool-call text payload generates audio input tokens, audio output tokens, and a small text output charge in the same invoice line.
Audio tokens are not characters or seconds — they are a discrete chunked representation of the waveform. The current rule of thumb is roughly 1 audio token per 0.1 seconds of speech at the standard 24kHz sample rate, which works out to ~600 audio tokens per minute of speech in each direction. For a sanity check on input bills, take the speaker's wall-clock minutes, multiply by 600, divide by 1,000,000, and multiply by $40. A 10-minute customer-service call where the user speaks for 4 minutes and the agent speaks for 6 minutes generates ~2,400 input audio tokens and ~3,600 output audio tokens. That is (2,400/1,000,000 × $40) + (3,600/1,000,000 × $80) = $0.096 + $0.288 = $0.384 per call before any tool-use or text overhead.
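The sanity check above can be written as a small estimator. This is a minimal sketch: the 600-tokens-per-minute figure and both rates are copied from the text, and the function names are hypothetical helpers, not part of any SDK.

```python
# Figures copied from the text above; treat them as assumptions and
# re-check them against current pricing before relying on the output.
AUDIO_TOKENS_PER_MINUTE = 600      # ~1 audio token per 0.1 s of speech
AUDIO_INPUT_USD_PER_M = 40.00      # USD per 1M audio input tokens
AUDIO_OUTPUT_USD_PER_M = 80.00     # USD per 1M audio output tokens


def audio_tokens(minutes: float) -> int:
    """Estimate audio tokens for a given number of minutes of speech."""
    return round(minutes * AUDIO_TOKENS_PER_MINUTE)


def audio_cost_usd(user_minutes: float, agent_minutes: float) -> float:
    """Estimate the audio-only cost of one call in US dollars."""
    input_cost = audio_tokens(user_minutes) / 1_000_000 * AUDIO_INPUT_USD_PER_M
    output_cost = audio_tokens(agent_minutes) / 1_000_000 * AUDIO_OUTPUT_USD_PER_M
    return input_cost + output_cost


if __name__ == "__main__":
    # The 10-minute call from the text: 4 minutes of user speech, 6 of agent speech.
    print(f"${audio_cost_usd(4, 6):.3f}")  # prints $0.384
```

Because output is billed at twice the input rate, the agent's talk time dominates the estimate. Trimming verbose agent replies moves the bill more than shortening user turns.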
Worked example — a 5-minute voice agent call. Assume a realistic split: the user speaks for 2 minutes (1,200 input audio tokens), the agent speaks for 3 minutes (1,800 output audio tokens), and the agent also runs two tool calls returning ~400 text output tokens of structured arguments and ~600 text input tokens of tool results echoed back into context. Audio input: 1,200/1M × $40 = $0.048. Audio output: 1,800/1M × $80 = $0.144. Text output (tool calls + final text fragments): 400/1M × $30 = $0.012. Text input (tool results + system prompt of ~1,500 tokens): 2,100/1M × $5 = $0.0105. Total: ~$0.215 per 5-minute call, or roughly $2.58 per hour of live voice. Run 1,000 calls a day and the realtime bill alone is ~$6,450/month — before transcription, before logging, before any LLM fallback.
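The worked example can be reproduced with a per-stream breakdown. This is a sketch under the rates quoted in the text; `CallUsage` and `call_cost_breakdown` are hypothetical names introduced for illustration, and the token counts are the ones assumed in the example.

```python
from dataclasses import dataclass

# USD per 1M tokens, as quoted in the text; assumed, not fetched from any API.
RATES = {
    "audio_in": 40.00,
    "audio_out": 80.00,
    "text_in": 5.00,
    "text_out": 30.00,
}


@dataclass
class CallUsage:
    """Token counts for one call, one field per billed stream."""
    audio_in: int
    audio_out: int
    text_in: int
    text_out: int


def call_cost_breakdown(usage: CallUsage) -> dict[str, float]:
    """Return the cost of each stream in USD, plus a 'total' entry."""
    costs = {
        stream: getattr(usage, stream) / 1_000_000 * rate
        for stream, rate in RATES.items()
    }
    costs["total"] = sum(costs.values())
    return costs


if __name__ == "__main__":
    # 2 min user speech, 3 min agent speech, 2,100 text in, 400 text out.
    usage = CallUsage(audio_in=1_200, audio_out=1_800, text_in=2_100, text_out=400)
    costs = call_cost_breakdown(usage)
    for stream, usd in costs.items():
        print(f"{stream:>9}: ${usd:.4f}")
    print(f"per hour of live voice: ${costs['total'] * 12:.2f}")
    print(f"per 30-day month at 1,000 calls/day: ${costs['total'] * 1_000 * 30:,.0f}")
```

The unrounded total is $0.2145 per call, so the hourly and monthly figures computed here land slightly below the ones in the text, which are derived from the rounded $0.215.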
Whisper-3 transcription, used for asynchronous speech-to-text where you do not need a streamed model response, remains the cheapest audio entrypoint at $0.006 per minute of audio (billed in 1-second increments, minimum 1 second). A 10,000-minute transcription backlog, say a month of recorded support calls, costs $60. The newer whisper-3-large endpoint, which adds diarization and word-level timestamps, bills at $0.011 per minute. For applications that only need post-call analytics rather than live conversation, transcribing with Whisper-3 and then running the transcript through gpt-5.4-mini is far cheaper than routing the same audio through gpt-5.5-realtime. At ~600 audio tokens per minute, realtime audio input alone costs ~$0.024 per minute, about 4x the Whisper-3 rate, and the 5-minute call worked through above averages ~$0.043 per minute, about 7x.
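A backlog estimate for the transcription rates above might look like the following sketch. The per-minute rates come from the text; rounding each file up to the next whole second is an assumption about how "1-second increments" is applied, and `transcription_cost_usd` is a hypothetical helper.

```python
import math

# Per-minute rates quoted in the text (assumed; verify before relying on them).
WHISPER_USD_PER_MINUTE = {
    "whisper-3": 0.006,
    "whisper-3-large": 0.011,
}


def transcription_cost_usd(duration_seconds: float, model: str = "whisper-3") -> float:
    """Cost of transcribing one file in USD.

    Assumption: each file is rounded UP to whole seconds, with a 1-second minimum.
    """
    billed_seconds = max(1, math.ceil(duration_seconds))
    return billed_seconds / 60 * WHISPER_USD_PER_MINUTE[model]


if __name__ == "__main__":
    # A 10,000-minute backlog treated as one long recording.
    backlog_seconds = 10_000 * 60
    print(f"whisper-3:       ${transcription_cost_usd(backlog_seconds):.2f}")
    print(f"whisper-3-large: ${transcription_cost_usd(backlog_seconds, 'whisper-3-large'):.2f}")
```

For a real backlog, sum the function over individual file durations, since per-file rounding adds a small amount when there are many short clips.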
Text-to-speech sits on its own rate card and is priced per character rather than per token. The standard tts-1-2026 voice runs $15.00 per 1M characters; the higher-fidelity tts-1-hd-2026 voice runs $30.00 per 1M characters. A 200-word reply averages ~1,100 characters, so a single TTS render costs $0.0165 on standard and $0.033 on HD. The trade-off versus realtime audio output is latency and interruptibility: TTS is a poor fit for streamed back-and-forth conversation, but it is roughly 4-5x cheaper than gpt-5.5-realtime audio output for IVR, notification readouts, and pre-rendered narration. A common production pattern is to use gpt-5.4-mini ($0.75/$4.50 text rates) to draft the response, then route it to tts-1-2026. Total cost on that 200-word reply is roughly $0.002 of text tokens (assuming about 270 output tokens plus a short prompt) plus $0.0165 of TTS, versus ~$0.06-0.08 if the same content were generated as streamed audio through the realtime endpoint (assuming 80-100 seconds of speech at ~10 audio tokens per second).
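Since TTS is billed per character, a render can be priced before it is sent. The sketch below uses the two rates quoted in the text. Counting characters with Python's `len` is an assumption, since a provider may count bytes or normalize text differently, and `tts_cost_usd` is a hypothetical helper.

```python
# USD per 1M characters, as quoted in the text (assumed, not fetched from an API).
TTS_USD_PER_M_CHARS = {
    "tts-1-2026": 15.00,
    "tts-1-hd-2026": 30.00,
}


def tts_cost_usd(text: str, voice: str = "tts-1-2026") -> float:
    """Estimate the cost in USD of rendering `text` once with the given voice.

    Assumption: billable characters equal len(text), i.e. Unicode code points.
    """
    return len(text) / 1_000_000 * TTS_USD_PER_M_CHARS[voice]


if __name__ == "__main__":
    reply = "x" * 1_100  # stand-in for a ~200-word, ~1,100-character reply
    print(f"standard: ${tts_cost_usd(reply):.4f}")
    print(f"HD:       ${tts_cost_usd(reply, 'tts-1-hd-2026'):.4f}")
```

Pre-rendered prompts such as IVR menus are paid for once per render, so caching the resulting audio file removes the charge from every later playback.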
Prompt caching applies to realtime sessions, but only to the text portion of the prompt: the system message, tool schemas, and any text-form conversation history. Audio tokens themselves are not cached; each chunk of speech is unique enough that the cache cannot match it. The practical implication is to structure your realtime system prompt the same way you would for chat, with long stable instructions and tool definitions at the front and dynamic per-call context at the back, so that the 90% cached-input discount applies to that text portion across the WebSocket session. For a voice agent with a 3,000-token system prompt running 1,000 calls a day, caching the system prefix drops text input cost from $15.00/day to ~$1.50/day if every call hits the cache. It is a small slice of the realtime bill but stacks cleanly with everything else. Confirm current realtime audio rates against OpenAI's realtime API docs before locking pricing into a customer contract; voice rates have moved twice in the last 12 months.
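The caching arithmetic can be sketched with the hit rate as an explicit parameter. The $5.00 text input rate and the 90% discount come from the text; `cache_hit_rate` and `daily_prompt_cost_usd` are hypothetical, and the model assumes the system prompt is billed once per call.

```python
TEXT_INPUT_USD_PER_M = 5.00   # USD per 1M text input tokens, from the text
CACHED_DISCOUNT = 0.90        # cached input is billed at 10% of the normal rate


def daily_prompt_cost_usd(
    prompt_tokens: int,
    calls_per_day: int,
    cache_hit_rate: float = 0.0,
) -> float:
    """Daily cost in USD of sending the same system prompt on every call.

    cache_hit_rate is the fraction of prompt tokens served from cache (0.0 to 1.0).
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0.0 and 1.0")
    uncached = prompt_tokens * calls_per_day / 1_000_000 * TEXT_INPUT_USD_PER_M
    return uncached * (1 - CACHED_DISCOUNT * cache_hit_rate)


if __name__ == "__main__":
    print(f"no caching: ${daily_prompt_cost_usd(3_000, 1_000):.2f}/day")
    print(f"all hits:   ${daily_prompt_cost_usd(3_000, 1_000, 1.0):.2f}/day")
```

With every call hitting the cache, the 3,000-token prompt at 1,000 calls a day comes to $1.50/day. Lower hit rates move the figure linearly back toward $15.00.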