By The DDH Team · Digital Dashboard Hub

OpenAI Whisper API File Size Limit: Everything You Need to Know

The Whisper API enforces a hard 25 MB per-request file size cap. This is the complete guide: supported formats, real pricing, compression tricks, chunking code, and the newer gpt-4o-transcribe models that change the calculus for long-form audio in 2026.

If you have ever called the OpenAI Whisper API with a long podcast, a recorded Zoom call, or any audio over roughly 30 minutes at standard bitrates, you have seen the error: your file exceeds the 25 MB upload limit. The limit is documented in the OpenAI Speech-to-Text guide and is enforced at the API gateway level — no amount of retry logic will get a 40 MB MP3 through in a single request.

The good news is that every practical workaround is well-understood, takes less than an hour to implement, and in several cases also cuts your transcription cost. This guide covers the limit in detail, the seven supported audio formats, current pricing for whisper-1 and the newer gpt-4o-transcribe family, and the concrete code patterns — ffmpeg one-liners, pydub chunk splitting, silence-based segmentation — that production teams use to handle arbitrarily large files.

If you are also watching your API spend, check out our AI cost optimization checklist and the OpenAI API pricing guide for 2026 — transcription pricing is only one line on most teams' bills, but the same batch-API and compression principles apply across your whole OpenAI spend. To model your exact transcription cost by minute, use our AI Prompt Cost Calculator.

Whisper API — supported formats and typical file sizes

| Format | Typical bitrate | Size per 60 min | Hits 25 MB limit at |
|---|---|---|---|
| mp3 (128 kbps) | 128 kbps | ~57 MB | ~26 min |
| mp3 (64 kbps) | 64 kbps | ~28 MB | ~53 min |
| mp3 (32 kbps mono) | 32 kbps | ~14 MB | >60 min safe |
| wav (16kHz mono, 16-bit) | 256 kbps | ~110 MB | ~14 min |
| m4a (AAC 128 kbps) | 128 kbps | ~55 MB | ~27 min |
| webm (opus 32 kbps) | 32 kbps | ~14 MB | >60 min safe |
| mp4 (audio track only) | varies | varies | extract audio first |
| mpeg / mpga | same as mp3 | ~57 MB at 128k | ~26 min |

Sizes are approximate. The 25 MB limit applies to the raw file payload. Compressing to 16kHz mono MP3 at 32 kbps brings a 90-minute session to roughly 21 MB, under the 25 MB cap.
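The figures in the table follow from one piece of arithmetic: bytes = bitrate × duration ÷ 8. The sketch below reproduces them, assuming constant-bitrate audio and ignoring container overhead; the function names are illustrative and not part of any library.

```python
LIMIT_BYTES = 25 * 1024 * 1024  # 26,214,400 bytes, the cap quoted in this guide


def estimated_size_bytes(bitrate_kbps: float, minutes: float) -> float:
    """Approximate size of constant-bitrate audio (no container overhead)."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * minutes * 60


def max_minutes_under_limit(bitrate_kbps: float, limit_bytes: int = LIMIT_BYTES) -> float:
    """Longest duration, in minutes, that stays under the upload cap."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return limit_bytes / bytes_per_second / 60


if __name__ == "__main__":
    for kbps in (128, 64, 32):
        size_mb = estimated_size_bytes(kbps, 60) / 1_000_000
        print(f"{kbps} kbps: {size_mb:.1f} MB per hour, "
              f"cap reached at {max_minutes_under_limit(kbps):.0f} min")
```

The results land within a minute or two of the table, which rounds conservatively. Variable-bitrate files can deviate further, so treat the estimate as a planning aid and check the real file size before uploading.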

What exactly is the 25 MB limit and where does it apply?

The OpenAI transcription endpoint (`POST /v1/audio/transcriptions`) accepts multipart/form-data and enforces a 25 megabyte limit on the uploaded file. This applies to every model that uses this endpoint: whisper-1, gpt-4o-transcribe, and gpt-4o-mini-transcribe. The limit is on the file payload itself, not on the duration of the audio — so a 10-minute WAV file can easily exceed 25 MB while a 90-minute Opus file might not.

The error you receive when the limit is exceeded is a 413 HTTP status with a JSON body along the lines of `{"error": {"message": "Invalid file size: 40971234 bytes. Maximum allowed size is 26214400 bytes.", "type": "invalid_request_error"}}`. The threshold in bytes is exactly 25 × 1024 × 1024 = 26,214,400 bytes. Note that this is 25 mebibytes (MiB) calculated as powers of 1024, which is slightly different from 25 MB in SI units — practically speaking, files at exactly 25.0 MB displayed by macOS Finder may still pass if they are under 26,214,400 bytes.
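Since the threshold is a fixed byte count, it is cheap to check locally before spending a request. This is a minimal sketch; the strict `<` comparison is a conservative assumption on our part, not documented behavior at the exact boundary.

```python
import os

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 26,214,400


def fits_upload_limit(path: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True when the file on disk is strictly below the upload cap."""
    return os.path.getsize(path) < limit
```

Calling `fits_upload_limit("episode.mp3")` before the API call lets a pipeline route oversized files to compression or chunking instead of waiting for a 413.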

The limit has been discussed extensively in OpenAI community forums since the API launched. OpenAI has not raised it since 2023. A plausible reason is architectural: if the API gateway buffers each upload before forwarding it to the inference cluster, a hard cap keeps oversized uploads from degrading service for other users. There is no plan-based override: even Tier 5 API accounts face the same 25 MB cap.


The seven supported audio formats

OpenAI officially supports seven container and codec combinations: **mp3**, **mp4**, **mpeg**, **mpga**, **m4a**, **wav**, and **webm**. The mp3, mpeg, and mpga extensions are functionally identical (MPEG-1/2 Audio Layer III); the API accepts all three extension spellings. The m4a container typically holds AAC audio. The mp4 container can hold either audio-only or video-with-audio tracks — when you send an mp4, only the audio track is decoded, but the full file size (including any video stream) counts against the 25 MB limit. This is a common gotcha: a 10-minute 720p screen recording in mp4 might be 300 MB even though the audio content alone would be 10 MB. Always strip video tracks before uploading mp4 files.
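Stripping the video track can be done without re-encoding. The sketch below builds an ffmpeg command using `-vn` (drop video) and `-c:a copy` (copy the audio stream unchanged). It assumes the audio track is AAC so that it fits an `.m4a` container; for other codecs, re-encode instead of copying. The helper names are illustrative.

```python
import subprocess


def audio_only_command(src: str, dst: str) -> list[str]:
    """ffmpeg arguments that drop the video stream and copy the audio as-is."""
    return ["ffmpeg", "-y", "-i", src, "-vn", "-c:a", "copy", dst]


def strip_video(src: str, dst: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(audio_only_command(src, dst), check=True, capture_output=True)
```

For example, `strip_video("screen_recording.mp4", "screen_recording.m4a")` should leave a file whose size reflects only the audio content.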

WAV is the worst format for this limit. Uncompressed PCM WAV at 44.1 kHz stereo runs roughly 10 MB per minute — a 3-minute recording can already approach 30 MB. The Whisper model itself was trained primarily on 16kHz mono audio, so sending 44.1 kHz stereo gives you no quality benefit over sending 16kHz mono, and costs roughly 5× more bytes. If you are generating WAV files from a recording pipeline, re-encode to 16kHz mono before calling the API.

WebM/Opus is the best format for file-size efficiency. The Opus codec at 32 kbps delivers audio quality that is fully adequate for speech recognition at roughly 1/4 the file size of a 128 kbps MP3. Many browser-based recording apps (MediaRecorder API) already output webm/opus natively — you may be able to send that file directly without any re-encoding. Check your file sizes: a 60-minute webm/opus recording at 32 kbps is about 14 MB, comfortably under the 25 MB cap.


Current Whisper API pricing in 2026

The whisper-1 model is priced at **$0.006 per minute** of audio (rounded to the nearest second), as listed on OpenAI's pricing page. This is duration-based pricing: the file size and the format do not affect the per-minute rate. A 60-minute interview costs $0.36 to transcribe regardless of whether you send it as a 14 MB webm or a 57 MB mp3 (assuming you have a workaround for the file size limit).

The newer **gpt-4o-transcribe** and **gpt-4o-mini-transcribe** models were introduced in 2025 and are available on the same `/v1/audio/transcriptions` endpoint with the same 25 MB file limit. Pricing for gpt-4o-transcribe is higher than whisper-1 but the model provides substantially better accuracy on accented speech, technical vocabulary, and noisy audio. The gpt-4o-mini-transcribe model is positioned between whisper-1 and gpt-4o-transcribe on both price and accuracy. All three models use per-minute pricing. For cost modeling across your full usage, use our AI Prompt Cost Calculator; it includes the current transcription rates.

One important note: the LLM rate limits guide covers both token-per-minute and request-per-minute limits for the transcription endpoint. At Tier 1 you are limited to 50 requests per minute on audio endpoints. If you are chunking a long file into many small segments and sending them in parallel, you can hit this rate limit faster than you expect. For high-volume transcription pipelines, batch the requests with backoff or move to a higher API tier.


Workaround 1: Compress and downsample before uploading

The fastest workaround for files that are slightly over the limit (anything under about 90 minutes at standard podcast quality) is simply compressing the audio before upload. The Whisper model processes audio at 16kHz internally regardless of input sample rate, so downsampling to 16kHz mono costs nothing in transcription quality, and re-encoding at a low bitrate is normally adequate for clear speech while dramatically reducing file size.

The ffmpeg one-liner is: `ffmpeg -i input.mp3 -ar 16000 -ac 1 -b:a 32k output.mp3`. This sets the sample rate to 16kHz (`-ar 16000`), forces mono (`-ac 1`), and targets 32 kbps (`-b:a 32k`). A 90-minute file that was about 86 MB at 128 kbps stereo will typically come out at around 21 MB after this transform, under the 25 MB cap with room to spare.

For WAV inputs specifically, the same command applies, and the savings are even more dramatic. A 90-minute 44.1 kHz stereo WAV at 10 MB/min is 900 MB — utterly unusable without compression. After downsampling to 16kHz mono: `ffmpeg -i input.wav -ar 16000 -ac 1 output.wav`, the file is about 165 MB (WAV is still uncompressed). Pipe through MP3 or Opus encoding to get it below 25 MB: `ffmpeg -i input.wav -ar 16000 -ac 1 -b:a 32k output.mp3`.


Workaround 2: Chunk long files with ffmpeg

For files longer than about 90 minutes, even maximum compression may not get the file under 25 MB. The standard production approach is to split the audio into segments, transcribe each one, and concatenate the results. The main challenge is preserving context across segment boundaries — words at the end of one chunk may be cut mid-sentence, and the transcript of the next chunk begins without that context.

The ffmpeg segment muxer handles this cleanly: `ffmpeg -i input.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3`. This splits the input into 10-minute segments (`-segment_time 600` seconds) using stream-copy so there is no re-encoding quality loss. For a 90-minute file you get 9 chunks, each around 9 MB at 128 kbps, all well under the 25 MB limit. For a 3-hour file with the compression step first, you might do: `ffmpeg -i input.mp3 -ar 16000 -ac 1 -b:a 32k compressed.mp3` then split.
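When the split runs inside a pipeline, it helps to build the command programmatically and know how many chunks to expect. This is a small sketch using the same flags as the one-liner above; the helper names and the default output pattern are illustrative.

```python
import math


def segment_command(src: str, segment_seconds: int = 600,
                    pattern: str = "chunk_%03d.mp3") -> list[str]:
    """ffmpeg arguments for fixed-length, stream-copied segments."""
    return ["ffmpeg", "-i", src, "-f", "segment",
            "-segment_time", str(segment_seconds), "-c", "copy", pattern]


def expected_chunk_count(duration_seconds: float, segment_seconds: int = 600) -> int:
    """Number of segments a fixed-length split should produce."""
    return math.ceil(duration_seconds / segment_seconds)
```

Pass the list to `subprocess.run(..., check=True)` to execute it. Comparing the number of files written against `expected_chunk_count` is a cheap sanity check that the split completed.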

After transcribing all chunks, join the transcripts in order. If you are using the `response_format=verbose_json` option, each segment includes a `start` and `end` timestamp relative to the chunk. You need to offset those timestamps by the start time of each chunk to reconstruct a single timeline: for chunk `i`, add `i * chunk_duration_seconds` to every segment's `start` and `end`. The OpenAI Speech-to-Text guide covers timestamp handling in verbose_json output.
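The timestamp offset can be sketched as a small pure function. It assumes each chunk's segments are plain dictionaries with `start` and `end` keys (for example, parsed from the JSON response), and it takes measured chunk durations rather than a nominal `i * chunk_duration_seconds`, because stream-copied chunks are not guaranteed to be exactly the requested length.

```python
from typing import Dict, List


def merge_segments(chunk_segments: List[List[Dict]],
                   chunk_durations: List[float]) -> List[Dict]:
    """Shift each chunk's segment times onto one shared timeline."""
    merged = []
    offset = 0.0
    for segments, duration in zip(chunk_segments, chunk_durations):
        for seg in segments:
            # Copy the segment so the caller's data is left untouched.
            merged.append({**seg,
                           "start": seg["start"] + offset,
                           "end": seg["end"] + offset})
        offset += duration  # the next chunk starts where this one ended
    return merged
```

Chunk durations can be read with `ffprobe` or from `len(AudioSegment) / 1000` if the chunks were produced with pydub.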

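One important quality tip: add a 1-2 second overlap between chunks so the model has context on the second chunk for words that were cut at the boundary. ffmpeg's fixed-interval segment splitting does not produce overlapping chunks on its own, so overlap means cutting each chunk explicitly with `-ss` (start) and `-t` (duration), then removing the repeated words when you join the transcripts. The alternative is to avoid mid-word cuts altogether by splitting at silence points instead of fixed intervals (covered in the next section).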
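One way to plan overlapping cuts is to compute the start and duration of every chunk up front. The function below is an illustrative sketch, not an ffmpeg feature: it only produces the numbers that would be passed to `-ss` and `-t`.

```python
from typing import List, Tuple


def overlapping_windows(total_seconds: float, chunk_seconds: float = 600.0,
                        overlap_seconds: float = 2.0) -> List[Tuple[float, float]]:
    """Return (start, duration) pairs where each chunk re-reads the tail of the previous one."""
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap must be shorter than the chunk length")
    windows = []
    start = 0.0
    step = chunk_seconds - overlap_seconds
    while start < total_seconds:
        duration = min(chunk_seconds, total_seconds - start)
        windows.append((start, duration))
        if start + duration >= total_seconds:
            break  # this window reaches the end of the audio
        start += step
    return windows
```

Each pair maps onto a command of the form `ffmpeg -ss START -t DURATION -i input.mp3 chunk.mp3`. Because the overlap is transcribed twice, the joined transcript needs a de-duplication pass at every boundary.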


Workaround 3: Split on silence with pydub

Fixed-duration splitting occasionally cuts mid-word, which can result in a missed word at the boundary in one transcript and a repeated word in the next. A cleaner approach is splitting at silence points — natural pauses between sentences. The pydub library makes this straightforward in Python.

Install pydub and ffmpeg: `pip install pydub`. Then: `from pydub import AudioSegment; from pydub.silence import split_on_silence; audio = AudioSegment.from_file('input.mp3'); chunks = split_on_silence(audio, min_silence_len=700, silence_thresh=-40, keep_silence=300)`. This splits wherever there is at least 700ms of audio below -40 dBFS, keeping 300ms of silence on each side of the cut for natural-sounding boundaries. The `silence_thresh` value depends on your recording quality — quiet rooms may need -50 or lower; noisy environments may need -30 or higher.

After splitting, `split_on_silence` typically returns many short pieces, often only a few seconds each. Recombine consecutive pieces into chunks of roughly 8 minutes to avoid making many tiny API calls: start with `current = AudioSegment.empty()`, append each piece with `current += chunk`, and whenever `len(current)` reaches `target_size_ms = 8 * 60 * 1000`, push `current` onto the output list and start a new empty segment. After the loop, keep any non-empty remainder. Then export and upload each combined chunk. This approach avoids mid-word cuts, keeps each upload well under 25 MB, and produces cleaner transcript boundaries than fixed-time splitting.
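The recombination step is a greedy merge, sketched here as a generic function so it can be tested without audio files. It works with any type that supports `+` and `len()`; with pydub that would be `AudioSegment` objects, where `len()` is the duration in milliseconds. The function name is illustrative.

```python
def merge_to_target(chunks, target_len, empty):
    """Greedily concatenate consecutive chunks until each piece reaches target_len."""
    merged = []
    current = empty
    for chunk in chunks:
        current = current + chunk
        if len(current) >= target_len:
            merged.append(current)
            current = empty
    if len(current) > 0:
        merged.append(current)  # trailing remainder, possibly shorter than the target
    return merged
```

With pydub the call would look like `merge_to_target(chunks, 8 * 60 * 1000, AudioSegment.empty())`. A merged piece can overshoot the target by the length of the last piece added, which is harmless here: 8 minutes at 32 kbps is about 2 MB, far below the cap.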

Pydub is discussed in several OpenAI community threads as the recommended Python-native solution. The library does depend on ffmpeg being installed on the system — on macOS: `brew install ffmpeg`, on Ubuntu/Debian: `apt install ffmpeg`.


Workaround 4: Use gpt-4o-transcribe or gpt-4o-mini-transcribe

The gpt-4o-transcribe and gpt-4o-mini-transcribe models share the same endpoint and the same 25 MB limit as whisper-1, so they are not a workaround for the size cap itself. However, they are worth considering as part of a cost-efficiency strategy for chunked transcription pipelines. Both models have improved word error rate — especially on accented English, technical jargon, and low-quality audio — which means fewer post-processing corrections and cleaner output from each chunk.

For production pipelines that already implement chunking, switching from whisper-1 to gpt-4o-mini-transcribe may improve accuracy at a modest cost increase. The trade-off is the same as elsewhere in the OpenAI model family — see our OpenAI API pricing 2026 guide for the full per-minute rate comparison. For most transcription use cases where quality matters (legal, medical, content publishing), gpt-4o-transcribe is the current recommended model. For bulk, cost-sensitive transcription where some errors are acceptable, whisper-1 at $0.006/min remains the cheapest option.

One thing that carries over: like whisper-1, the gpt-4o-transcribe models accept a `prompt` parameter that lets you seed the model with domain-specific vocabulary, speaker names, or expected acronyms. This is extremely useful at chunk boundaries: pass the last 2-3 sentences of the previous chunk as the prompt for the next chunk to give the model continuity context. The parameter is not new, but it tends to be less effective on the older whisper-1 model.


Workaround 5: Use the Whisper prompt parameter for continuity

The `prompt` parameter in the transcription API is underused. According to the OpenAI Speech-to-Text documentation, you can pass a string of up to 224 tokens that the model will use as context. This serves two purposes: it nudges the spelling of domain-specific words (e.g., `prompt: "DDH, Beehiiv, Zapier, webhook"` makes those words more likely to be transcribed correctly even if the speaker mumbles them), and it provides cross-chunk continuity.

For cross-chunk continuity, the pattern is: transcribe chunk N, extract the last sentence or two from the result, and pass that text as the `prompt` when transcribing chunk N+1. This gives the model context about what was just said, reducing boundary errors substantially. Example implementation: `previous_text = transcripts[i-1]['text'][-300:]; response = client.audio.transcriptions.create(model='whisper-1', file=chunk_file, prompt=previous_text)`.

This is particularly important for proper nouns, speaker names, and technical vocabulary that appear across chunk boundaries. A speaker saying "...and that brings me to our next topic, the Kubernetes deployment" split at "Kubernetes de-" and "-ployment" across two chunks will get confused without the prompt. With the last sentence as prompt context on the second chunk, the model typically handles it correctly.
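Slicing the last 300 characters can start the prompt mid-word. A slightly more careful variant, sketched below, trims the tail so it begins at a sentence boundary when one is available. The 300-character budget is the figure used in this guide; it is a character count, not a token count, so it only approximates the documented 224-token limit.

```python
import re


def continuity_prompt(previous_text: str, max_chars: int = 300) -> str:
    """Return the tail of the previous transcript, starting at a sentence boundary if possible."""
    tail = previous_text[-max_chars:].strip()
    if len(previous_text) <= max_chars:
        return tail  # nothing was cut off, so there is no fragment to drop
    # Skip the leading sentence fragment when a later sentence start exists in the tail.
    match = re.search(r"[.!?]\s+", tail)
    if match and match.end() < len(tail):
        tail = tail[match.end():]
    return tail
```

The result is passed as `prompt=continuity_prompt(previous_transcript)` on the next chunk's request.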


Choosing the right strategy for your use case

The right workaround depends on your file lengths and throughput requirements. For files under about 90 minutes: just run the ffmpeg compression command to 16kHz mono 32 kbps and send a single request, with no chunking needed. For files between 90 minutes and about three hours: compression plus a single split at the midpoint is usually sufficient, since each half stays under 25 MB at 32 kbps. For longer files or variable-length recordings: build the pydub silence-based splitter once and reuse it across all your audio.

For high-throughput pipelines transcribing hundreds of files per day, the architecture worth building is: (1) compress all uploads to 16kHz mono MP3 on ingest regardless of length, (2) check file size after compression — if under 25 MB, send directly; if over, split on silence, (3) send chunks in parallel up to the rate limit (50 requests/min on standard tiers), (4) concatenate results with timestamp offsets. This handles any file length with no manual intervention.
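Steps (3) and (4) can be sketched with a thread pool. Here `transcribe_chunk` is a hypothetical callable that takes a chunk path and returns its transcript text; in practice it would wrap the API call. `max_workers` caps concurrency only and is not a requests-per-minute limiter.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def transcribe_in_order(chunk_paths: List[str],
                        transcribe_chunk: Callable[[str], str],
                        max_workers: int = 4) -> str:
    """Run transcribe_chunk over all paths concurrently and join results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which request finishes first.
        texts = list(pool.map(transcribe_chunk, chunk_paths))
    return " ".join(t.strip() for t in texts)
```

There is a trade-off to be aware of: chunks sent in parallel cannot use the previous chunk's transcript as their `prompt`, because that transcript does not exist yet. Pipelines that rely on prompt chaining need to process each file's chunks sequentially and parallelize across files instead.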

Cost-wise, remember that the $0.006/min rate is on audio duration, not on the number of API calls. Splitting a 60-minute file into 6 chunks and transcribing them separately costs exactly the same as transcribing a single 60-minute file — $0.36 either way. The chunking overhead is in engineering and compute time, not in API cost. For cost management across your full AI stack, see our how much does ChatGPT cost guide and the AI cost optimization checklist.
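The cost arithmetic is easy to verify. This sketch uses the whisper-1 rate quoted in this guide; the constant and function name are illustrative, and the rate should be checked against OpenAI's pricing page before budgeting.

```python
WHISPER_1_USD_PER_MINUTE = 0.006  # rate quoted in this guide


def transcription_cost_usd(duration_seconds: float,
                           usd_per_minute: float = WHISPER_1_USD_PER_MINUTE) -> float:
    """Cost for one file, with the duration rounded to the nearest second."""
    return round(duration_seconds) / 60 * usd_per_minute


if __name__ == "__main__":
    whole = transcription_cost_usd(3600)
    chunked = sum(transcription_cost_usd(600) for _ in range(6))
    print(f"single request: ${whole:.2f}, six chunks: ${chunked:.2f}")
```

Overlapping chunks are the one exception to the "same cost" rule: any audio that is transcribed twice is billed twice, though a 2-second overlap per boundary is negligible.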


Common errors and how to fix them

**413 Request Entity Too Large** — your file is over 26,214,400 bytes. Apply compression (`ffmpeg -ar 16000 -ac 1 -b:a 32k`) and/or chunking. Check the compressed size with `wc -c output.mp3` before uploading.

**400 Invalid file format** — you are sending a format not in the supported list (mp3, mp4, mpeg, mpga, m4a, wav, webm), or the file extension does not match the actual codec. A common case is an OGG/Vorbis file renamed to .mp3. Run `ffprobe input_file` to check the actual codec, then re-encode to a supported format with ffmpeg.
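A pre-flight check along these lines can catch the obvious cases before upload. The extension set is the supported list from this guide. The second helper only builds an `ffprobe` command that prints the codec name of the first audio stream; it does not run it. Both function names are illustrative.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def has_supported_extension(path: str) -> bool:
    """True when the file extension is on the supported list (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def codec_probe_command(path: str) -> list[str]:
    """ffprobe arguments that print the codec name of the first audio stream."""
    return ["ffprobe", "-v", "error", "-select_streams", "a:0",
            "-show_entries", "stream=codec_name",
            "-of", "default=noprint_wrappers=1:nokey=1", path]
```

An extension check alone will not catch a renamed file, which is why the probe matters: if the reported codec is `vorbis` for a file named `.mp3`, re-encode before uploading.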

**400 Invalid audio format: file appears to be empty** — this usually means the ffmpeg command ran but produced a zero-byte output due to an unsupported input codec or a missing ffmpeg codec library. Check `ffmpeg -i input_file` for error messages. On some minimal ffmpeg builds, certain codecs (AAC in particular) are not included — install a full build via `brew install ffmpeg` on macOS or the `ffmpeg` package from the distro's main repos on Linux.

**Transcription quality is poor on a chunked file** — most common cause is missing `prompt` continuity between chunks (see the prompt parameter section above). Second cause is chunk boundaries falling mid-word due to fixed-time splitting rather than silence splitting. Third cause is very low bitrate compression — 32 kbps mono is sufficient for clear speech but may degrade quality on heavily accented or noisy audio. Try 48 kbps if quality is marginal. Also verify the audio is actually audible — muted or very low-volume recordings will produce garbage transcripts regardless of model.

**Rate limit hit (429) when processing many chunks** — the audio transcription endpoint has a separate rate limit from the chat completions endpoint. At Tier 1 this is 50 requests per minute. If you are sending 10-minute chunks in parallel for a 3-hour file (18 chunks), you are at 18 parallel requests which is within Tier 1 limits. But if you are processing many files simultaneously, you can hit the RPM limit. Implement exponential backoff: catch 429s and retry after `wait_time = 2 ** attempt + random.uniform(0, 1)` seconds. See our LLM rate limits 2026 guide for the full rate limit table by tier.
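The backoff formula above can be wrapped in a small retry helper. This is a sketch, not the SDK's built-in retry behavior: `call` is any zero-argument function that performs the request, and `is_rate_limited` is a caller-supplied predicate (with the official Python SDK, one option would be `lambda e: isinstance(e, openai.RateLimitError)`).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T],
                 is_rate_limited: Callable[[Exception], bool],
                 max_attempts: int = 5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` with exponential backoff plus jitter while it is rate limited."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_rate_limited(exc):
                raise  # give up, or surface errors that are not rate limits
            sleep(2 ** attempt + random.uniform(0, 1))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the helper can be tested without real delays. The jitter term keeps parallel workers from retrying in lockstep.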


Python reference implementation

Here is a reference Python function that handles audio files of any size. It compresses first, checks the size, splits on silence only if needed, transcribes all chunks in sequence with prompt continuity, and returns a single concatenated transcript with offset timestamps.

```python
import os
import subprocess
import tempfile

from pydub import AudioSegment
from pydub.silence import split_on_silence
from openai import OpenAI

client = OpenAI()

MAX_BYTES = 25 * 1024 * 1024      # 25 MB upload cap
CHUNK_TARGET_MS = 8 * 60 * 1000   # merge silence-split pieces up to ~8 minutes


def compress_audio(input_path: str, output_path: str) -> str:
    """Downsample to 16kHz mono MP3 at 32kbps."""
    subprocess.run(
        ["ffmpeg", "-i", input_path, "-ar", "16000", "-ac", "1",
         "-b:a", "32k", "-y", output_path],
        check=True, capture_output=True,
    )
    return output_path


def _as_dict(seg) -> dict:
    """Segments are objects in recent openai SDKs and plain dicts in older ones."""
    return seg if isinstance(seg, dict) else seg.model_dump()


def _transcribe_file(path: str, model: str, prompt: str = ""):
    """Send one file and return the verbose_json response."""
    kwargs = {"prompt": prompt} if prompt else {}
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(
            model=model, file=f, response_format="verbose_json", **kwargs
        )


def transcribe_any(input_path: str, model: str = "whisper-1") -> dict:
    compressed_path = input_path + "_compressed.mp3"
    compress_audio(input_path, compressed_path)
    try:
        if os.path.getsize(compressed_path) <= MAX_BYTES:
            result = _transcribe_file(compressed_path, model)
            segments = [_as_dict(s) for s in (result.segments or [])]
            return {"text": result.text, "segments": segments}

        # File still too large after compression: split on silence
        audio = AudioSegment.from_file(compressed_path)
        chunks = split_on_silence(
            audio, min_silence_len=700, silence_thresh=-40, keep_silence=300
        )

        # Merge small chunks up to target duration
        merged, current = [], AudioSegment.empty()
        for chunk in chunks:
            current += chunk
            if len(current) >= CHUNK_TARGET_MS:
                merged.append(current)
                current = AudioSegment.empty()
        if len(current) > 0:
            merged.append(current)

        segments_all = []
        offset_ms = 0
        prompt = ""
        with tempfile.TemporaryDirectory() as tmp_dir:
            for i, chunk in enumerate(merged):
                chunk_path = os.path.join(tmp_dir, f"chunk_{i:04d}.mp3")
                chunk.export(chunk_path, format="mp3", bitrate="32k")
                result = _transcribe_file(chunk_path, model, prompt)

                # Offset timestamps onto one timeline. split_on_silence drops
                # long silences, so times are relative to the retained audio,
                # not to the original recording.
                offset_sec = offset_ms / 1000
                for seg in (result.segments or []):
                    d = _as_dict(seg)
                    segments_all.append({**d,
                                         "start": d["start"] + offset_sec,
                                         "end": d["end"] + offset_sec})

                # Tail of this chunk's text seeds the prompt for the next chunk
                prompt = result.text[-300:]
                offset_ms += len(chunk)

        return {"text": " ".join(s["text"].strip() for s in segments_all),
                "segments": segments_all}
    finally:
        os.remove(compressed_path)  # clean up even if a request fails
```

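This implementation handles files of any length, offsets timestamps onto a single timeline, and threads prompt context across chunk boundaries. The compression step alone will get most files under the limit; the silence-based splitting only activates for files that remain over 25 MB after compression (roughly 90+ minutes of audio). Note that on the chunked path the timestamps are relative to the audio that remains after long silences are trimmed, so they can drift from the original recording. To adapt the function for gpt-4o-transcribe, changing the default `model` parameter is not quite enough: those models return `json` or `text` rather than `verbose_json`, so the request format has to change and the segment-level timestamp logic does not apply.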


How Whisper file sizes compare to other AI API inputs

The 25 MB limit is specific to the audio transcription endpoint. Other OpenAI API endpoints have different payload constraints. The vision endpoint (chat completions with image_url or base64 image) accepts images up to 20 MB per image, with a limit on total image dimensions rather than raw file size. The Assistants API file upload endpoint accepts files up to 512 MB per file and 100 GB per organization — dramatically more permissive. The fine-tuning data upload endpoint accepts files up to 1 GB.

The transcription limit is low relative to what modern audio files look like in practice. A single one-hour podcast at typical podcast bitrates is 55-60 MB. A recorded team meeting at standard quality is often 80-100 MB for 90 minutes. The 25 MB cap means that almost any real-world audio recording of a full meeting or interview requires either compression or chunking — it is not an edge case.

For context on how these constraints fit into broader API usage patterns, see our LLM rate limits 2026 guide, which covers both file size and request rate limits across the major API endpoints. And if you are thinking about voice cloning or speech synthesis in addition to transcription, our best AI voice cloning tools guide covers the adjacent tooling landscape.

Frequently Asked Questions

What is the OpenAI Whisper API file size limit?

The limit is 25 MB (exactly 26,214,400 bytes) per request. This applies to all models on the `/v1/audio/transcriptions` endpoint: whisper-1, gpt-4o-transcribe, and gpt-4o-mini-transcribe. There is no plan or tier that raises this limit.

What audio formats does the Whisper API support?

Seven formats: mp3, mp4, mpeg, mpga, m4a, wav, and webm. The mp3/mpeg/mpga entries are effectively the same codec. For best file-size efficiency, webm/opus or low-bitrate mp3 at 32 kbps mono are the recommended choices.

How much does Whisper API transcription cost?

Whisper-1 costs $0.006 per minute of audio, billed to the nearest second. A 60-minute interview costs $0.36. Pricing is based on audio duration, not file size or number of API calls — chunking a file does not change the total cost.

What is the fastest way to get a large file under the 25 MB limit?

Run `ffmpeg -i input.mp3 -ar 16000 -ac 1 -b:a 32k output.mp3`. This downsamples to 16kHz mono at 32 kbps, which is all the Whisper model needs for speech recognition. A 90-minute 128 kbps MP3 (~86 MB) becomes ~21 MB after this transformation. If the file is still over 25 MB after compression, you need to split it into chunks.

Does chunking audio reduce transcription accuracy?

Slightly, at chunk boundaries, if you do not use the `prompt` parameter for continuity. With prompt chaining (passing the last ~300 characters of the previous chunk's transcript as the `prompt` for the next chunk), accuracy at boundaries is nearly identical to a single-file transcription. Using silence-based splitting rather than fixed-time splitting also helps avoid mid-word cuts.

Is gpt-4o-transcribe better than whisper-1 for chunked files?

For accuracy, yes — especially on accented speech, technical vocabulary, and low-quality audio. Both models have the same 25 MB limit and support the same `prompt` parameter for chunk continuity. The cost is higher for gpt-4o-transcribe; use whisper-1 for high-volume, cost-sensitive pipelines and gpt-4o-transcribe when accuracy is critical.

Can I send audio larger than 25 MB if I use the Assistants API?

The Assistants API file upload accepts files up to 512 MB, but transcription is not a built-in Assistants capability — you still need to call the transcription endpoint directly. The 25 MB limit applies to the transcription endpoint regardless of whether you orchestrate it through the Assistants API or call it directly.

Why does my WAV file hit the limit so fast?

Uncompressed PCM WAV at 44.1 kHz stereo runs roughly 10 MB per minute. A 3-minute recording can be 30 MB. Always convert WAV files before uploading: `ffmpeg -i input.wav -ar 16000 -ac 1 -b:a 32k output.mp3`. The Whisper model processes audio at 16kHz internally, so downsampling from 44.1 kHz to 16kHz loses zero transcription quality.