By The DDH Team · Digital Dashboard Hub

LLM Context Window Comparison 2026: Max Input & Output Tokens for Every Major Model

A model's context window is the maximum number of tokens it can read in a single request, and the output cap is the maximum it can return in a single response. These are two separate numbers, and vendors quote them separately. As of June 2026, the practical window range runs from 128k tokens at the small end (older Llama and Mistral builds) to 2,000,000 tokens at the large end (Gemini 3.1 Pro Preview), with Llama 4 Scout's advertised 10M-token window as an outlier. Most flagship models cluster between 200k and 1M tokens of input.

Window size is not the same as effective recall. A 1M-token window does not mean the model reliably retrieves a fact buried at token 800,000; published needle-in-haystack benchmarks show retrieval quality degrades on most models past 50-200k tokens of dense content. Below is the side-by-side table sourced from each vendor's docs, plus worked examples of what each size actually fits. Quick-estimate token counts for your own documents with our AI prompt cost calculator, or grab the free LLM 2026 context cheat-sheet PDF.

LLM context window and output cap — June 2026

| Model | Max input (tokens) | Max output (tokens) | Effective recall (per published needle benchmarks) |
|---|---|---|---|
| OpenAI gpt-5.5-pro | 1,000,000 | 32,000 | Strong to ~300k |
| OpenAI gpt-5.5 | 400,000 | 16,000 | Strong to ~200k |
| OpenAI gpt-5.4 | 400,000 | 16,000 | Strong to ~200k |
| OpenAI gpt-5.4-mini | 400,000 | 16,000 | Strong to ~150k |
| OpenAI o4-reasoning | 200,000 | 100,000 (reasoning + output) | Strong to ~100k |
| Anthropic Claude Opus 4.8 | 500,000 | 64,000 | Strong to ~300k |
| Anthropic Claude Sonnet 4.6 | 500,000 | 64,000 | Strong to ~250k |
| Anthropic Claude Haiku 4.5 | 200,000 | 16,000 | Strong to ~120k |
| Anthropic Claude Fable 5 | 1,000,000 | 128,000 | Strong to ~400k |
| Google Gemini 3.5 Flash | 1,000,000 | 65,536 | Strong to ~400k |
| Google Gemini 3.1 Pro (Preview) | 2,000,000 | 65,536 | Strong to ~600k |
| Google Gemini 2.5 Pro | 1,000,000 | 65,536 | Strong to ~350k |
| Google Gemini 2.5 Flash | 1,000,000 | 65,536 | Strong to ~250k |
| Google Gemini 2.5 Flash-Lite | 1,000,000 | 32,000 | Strong to ~150k |
| Meta Llama 4 Maverick | 1,000,000 | 8,192 | Strong to ~200k |
| Meta Llama 4 Scout | 10,000,000 | 8,192 | Mixed past ~500k |
| Mistral Large 3 | 256,000 | 16,384 | Strong to ~150k |
| DeepSeek V4 | 256,000 | 16,384 | Strong to ~120k |
| Qwen 3 Max | 1,000,000 | 32,768 | Strong to ~200k |
| xAI Grok 4 | 256,000 | 32,000 | Strong to ~150k |

Sources, as of June 2026: OpenAI model docs (https://platform.openai.com/docs/models), Anthropic model docs (https://docs.claude.com/en/docs/about-claude/models/overview), Google Gemini API docs (https://ai.google.dev/gemini-api/docs/models), Meta Llama docs (https://llama.com), Mistral docs (https://docs.mistral.ai/), DeepSeek docs (https://api-docs.deepseek.com), Qwen docs (https://qwen.readthedocs.io/), xAI docs (https://docs.x.ai/). Effective-recall figures are summarized from published needle-in-haystack and longbench results; real-world dense-content recall depends heavily on input structure.

Input window vs output cap: two numbers, both matter

Vendors quote two distinct context numbers. The input window is the total tokens the model accepts in a single request, including system prompt, conversation history, tool definitions, and the current user message. The output cap is the maximum tokens the model is willing to return in a single response — a separate, usually smaller, limit set by the vendor.

Confusing the two is the most common cost surprise we see. Claude Sonnet 4.6 has a 500k input window but caps output at 64k; if you ask it to translate a 200k-token document to another language, you cannot get the whole translation in one response — the output stops at 64k. You have to chunk the request.

Reasoning models complicate the output side further. OpenAI's o4-reasoning shares its 100k output budget between hidden reasoning tokens and visible output; a model that thinks for 80k tokens has only 20k left for the visible answer. Plan output budgets accordingly. For input-window strategy specifically, our code prompt builder helps structure long technical prompts that fit cleanly within tighter windows.
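
The arithmetic in this section can be sketched in a few lines of Python. The function names are illustrative, and the translation example assumes the output comes out at roughly the same token count as the source, which will not hold for every language pair.

```python
import math

def calls_needed(output_tokens: int, output_cap: int) -> int:
    """Number of responses needed to emit output_tokens under a per-response cap."""
    return math.ceil(output_tokens / output_cap)

def visible_budget(shared_output_budget: int, reasoning_tokens: int) -> int:
    """Tokens left for the visible answer when reasoning shares the output budget."""
    return max(shared_output_budget - reasoning_tokens, 0)

# Sonnet 4.6 example: ~200k tokens of translation against a 64k output cap.
print(calls_needed(200_000, 64_000))      # 4
# o4-reasoning example: 100k shared budget, 80k spent on hidden reasoning.
print(visible_budget(100_000, 80_000))    # 20000
```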


What does 200k, 1M, and 2M tokens actually hold?

Token-to-word conversion uses the rule of thumb 1 token ≈ 0.75 words in English. The actual ratio varies: code, non-English text, and structured data such as JSON all yield fewer words per token. For planning around English prose, though, the rule works.

200k tokens ≈ 150,000 words ≈ 500 pages of dense prose. Examples: the full text of War and Peace (1,200 pages) does not fit, but most full software books (300-450 pages) do. An average tech-company employee handbook plus its referenced policies fits comfortably.

1,000,000 tokens ≈ 750,000 words ≈ 2,500 pages of dense prose. Examples: the entire Harry Potter series (~1.1M words across 7 books, or roughly 1.5M tokens) does not fit in one request, though it does fit in a 2M-token window. A 200-page financial 10-K plus 50 supporting transcripts fits, as does a medium-sized codebase of 50-80k lines of code.

2,000,000 tokens ≈ 1,500,000 words ≈ 5,000 pages. Examples: 4-5 full novels at once, the complete works of Shakespeare with annotations, or a 300-file codebase with 200k LOC. At this size, retrieval-augmented generation (RAG) almost always beats stuffing everything in context — cheaper, faster, and usually more accurate per published benchmarks.

10,000,000 tokens (Llama 4 Scout): roughly 7.5M words, or 25,000 pages. The recall benchmarks past 500k are mixed; treat the headline number as 'we accept this much input' more than 'we will reliably reason over this much input.'
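
As a rough sketch, the conversions in this section reduce to two constants. The 300 words-per-page figure is an assumption back-derived from "150,000 words ≈ 500 pages"; real page density varies by layout.

```python
import math

WORDS_PER_TOKEN = 0.75   # English rule of thumb used in this section
WORDS_PER_PAGE = 300     # assumption: implied by 150,000 words ≈ 500 pages

def tokens_to_words(tokens: int) -> float:
    return tokens * WORDS_PER_TOKEN

def tokens_to_pages(tokens: int) -> float:
    return tokens_to_words(tokens) / WORDS_PER_PAGE

def words_to_tokens(words: int) -> int:
    return math.ceil(words / WORDS_PER_TOKEN)

def fits(words: int, window_tokens: int) -> bool:
    """True if an English text of `words` words fits in the window by the rule of thumb."""
    return words_to_tokens(words) <= window_tokens

for window in (200_000, 1_000_000, 2_000_000, 10_000_000):
    print(f"{window:>10,} tokens ≈ {tokens_to_words(window):,.0f} words "
          f"≈ {tokens_to_pages(window):,.0f} pages")
```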


Effective recall: why bigger windows do not always mean better answers

The needle-in-haystack benchmark places a single specific fact at random positions within a long document and tests whether the model can retrieve it. Most models score near 100% on shorter inputs and degrade as the input grows — typically falling off a cliff between 50k and 200k tokens of dense content.

Per published 2026 benchmarks: Gemini 3.1 Pro Preview maintains 95%+ retrieval to roughly 600k tokens before degrading. Claude Opus 4.8 holds above 90% to ~300k. gpt-5.5 stays above 90% to ~200k. Llama 4 Scout, despite its 10M-token headline, shows mixed results past 500k.

The practical takeaway: design your prompt around effective recall, not advertised window. If the model's reliable range is 300k tokens but you need to query 500k, chunk the document, score chunks for relevance, and pass only the top-k matches in context. That is RAG, and it almost always beats raw context-stuffing past a certain document size.
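
A minimal sketch of the chunk, score, and top-k steps follows. It scores chunks by plain word overlap with the query, which is only a stand-in for the embedding similarity a production RAG system would normally use; the function names and the 200-word chunk size are illustrative assumptions.

```python
def chunk_words(text: str, size: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def overlap_score(query: str, chunk: str) -> float:
    """Fraction of query words that appear in the chunk (toy relevance score)."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def top_k_chunks(text: str, query: str, k: int = 3, size: int = 200) -> list[str]:
    """Return the k chunks most relevant to the query, best first."""
    chunks = chunk_words(text, size)
    ranked = sorted(chunks, key=lambda ch: overlap_score(query, ch), reverse=True)
    return ranked[:k]
```

Only the returned chunks go into the prompt, so the request stays inside the model's reliable recall range regardless of how long the source document is.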

For RAG specifically, embedding cost dominates the index-build bill — see our embedding cost calculator for current per-model embedding prices.


Worked example 1: a 250k-token contract review

Say you need to review a 250,000-token document — a 600-page contract bundle with exhibits. Which window fits?

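Eligible by raw window: every model in the table except OpenAI o4-reasoning (200k) and Claude Haiku 4.5 (200k); Mistral Large 3, DeepSeek V4, and Grok 4 (256k each) fit with almost no headroom. Eligible by effective recall (assuming dense content): gpt-5.5-pro, Claude Opus 4.8, Claude Fable 5, Gemini 3.x, and Gemini 2.5 Pro, with Claude Sonnet 4.6 and Gemini 2.5 Flash right at their ~250k limit. Qwen 3 Max has the window (1M) but its strong recall stops near 200k.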

Cost comparison for a single review with a 250k-input and 2k-output budget. gpt-5.5: $1.31 ($1.25 input + $0.06 output). Claude Sonnet 4.6: $0.78 ($0.75 + $0.03). Gemini 2.5 Pro: $0.33 ($0.3125 + $0.02). Claude Opus 4.8: $1.30 ($1.25 + $0.05). Gemini 3.1 Pro Preview: $0.52.

Same content, $0.33-$1.31 per review depending on model choice. If you run 1,000 reviews per month, the difference compounds to $330 vs $1,310 per month — a $980 monthly delta for the same workload. Match the model to required recall depth, then pick the cheapest option that hits the recall bar. For prompt-quality strategies that survive a cheaper tier, our meta-description generator helps compress retrieval queries.
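
The per-review figures can be reproduced with a short calculation. The per-million-token rates below are assumptions back-computed from the dollar amounts quoted in this example, not rates taken from vendor pricing pages; confirm live pricing before relying on them.

```python
# USD per 1M tokens as (input, output). Assumption: back-computed from the
# per-review figures in this worked example, not from vendor price lists.
RATES = {
    "gpt-5.5": (5.00, 30.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Claude Opus 4.8": (5.00, 25.00),
}

def review_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at flat per-token rates."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000

for model in RATES:
    cost = review_cost(model, 250_000, 2_000)
    print(f"{model}: ${cost:.2f} per review, ${cost * 1_000:,.0f} per 1,000 reviews")
```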


Worked example 2: a long-form 50k-token output

You need to generate a 50,000-token document — a long-form report, a translated novel, a generated codebase. Which models can return that in one response?

Models that can return 50k tokens in a single call: Claude Opus 4.8 (64k output cap), Claude Sonnet 4.6 (64k), Claude Fable 5 (128k), Gemini 2.5/3.x family (65k), OpenAI o4-reasoning (100k shared with reasoning, so ~30-50k visible after reasoning). Most others cap at 8-32k output.
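
The same filter can be expressed as a lookup against the output caps in the table above. This is a sketch: the dictionary simply copies the table, and a reasoning model's cap is a shared budget, so its visible output may fall short of the number shown.

```python
# Max output tokens per model, copied from the table in this article.
OUTPUT_CAPS = {
    "OpenAI gpt-5.5-pro": 32_000, "OpenAI gpt-5.5": 16_000,
    "OpenAI gpt-5.4": 16_000, "OpenAI gpt-5.4-mini": 16_000,
    "OpenAI o4-reasoning": 100_000,  # shared with hidden reasoning tokens
    "Anthropic Claude Opus 4.8": 64_000, "Anthropic Claude Sonnet 4.6": 64_000,
    "Anthropic Claude Haiku 4.5": 16_000, "Anthropic Claude Fable 5": 128_000,
    "Google Gemini 3.5 Flash": 65_536, "Google Gemini 3.1 Pro (Preview)": 65_536,
    "Google Gemini 2.5 Pro": 65_536, "Google Gemini 2.5 Flash": 65_536,
    "Google Gemini 2.5 Flash-Lite": 32_000,
    "Meta Llama 4 Maverick": 8_192, "Meta Llama 4 Scout": 8_192,
    "Mistral Large 3": 16_384, "DeepSeek V4": 16_384,
    "Qwen 3 Max": 32_768, "xAI Grok 4": 32_000,
}

def single_call_candidates(required_output: int, caps=OUTPUT_CAPS) -> list[str]:
    """Models whose advertised output cap is at least the required output size."""
    return sorted(model for model, cap in caps.items() if cap >= required_output)

print(single_call_candidates(50_000))
```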

If your model caps below 50k, you must chunk: generate the first 16k, ask the model to continue from where it left off, repeat. Chunking introduces continuity risk — the second chunk can repeat content, lose thread, or change voice. Single-shot generation in a model with a higher output cap is almost always cleaner.

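Cost note: at 50k output, Claude Sonnet 4.6 bills about $0.75 per generation ($0.003 input on a small prompt + $0.75 output). On gpt-5.5, the 16k output cap means four calls (three would yield only 48k tokens), and each continuation re-sends the text generated so far as input. At the rates implied by worked example 1 ($5 per 1M input, $30 per 1M output), output alone costs $1.50 and the replayed 96k input tokens add about $0.48, so the bill lands near $2.00 before any caching discount.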
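
A small model of the continuation pattern makes the replay cost explicit. It assumes every continuation call re-sends the original prompt plus all text generated so far, with no prompt caching; the rates passed in at the bottom are placeholders implied by worked example 1, not vendor-confirmed prices.

```python
import math

def chunked_generation_cost(total_output: int, output_cap: int, prompt_tokens: int,
                            rate_in: float, rate_out: float):
    """Return (calls, total_input_tokens, cost_usd) for a chunked generation.

    Assumes each call replays the prompt plus everything generated so far.
    Rates are USD per 1M tokens.
    """
    calls = math.ceil(total_output / output_cap)
    input_tokens = 0
    produced = 0
    for _ in range(calls):
        input_tokens += prompt_tokens + produced      # context replay
        produced += min(output_cap, total_output - produced)
    cost = (input_tokens * rate_in + total_output * rate_out) / 1_000_000
    return calls, input_tokens, cost

# 50k-token target, 16k output cap, 1k-token prompt, placeholder rates.
print(chunked_generation_cost(50_000, 16_000, 1_000, rate_in=5.0, rate_out=30.0))
```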


Long context vs RAG: when to switch

The rule of thumb for 2026: under 100k tokens of relevant content, stuffing context is usually simpler and gives better answers. Between 100k and 500k, it depends on query density — a single targeted question is best served by RAG, while a multi-faceted analysis benefits from full context. Above 500k, RAG almost always wins on cost, latency, and accuracy.
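
The rule of thumb can be written down as a tiny decision function. This is only a sketch of the thresholds stated above; the `multi_faceted` flag is a hypothetical stand-in for "the analysis needs the whole document at once".

```python
def choose_strategy(relevant_tokens: int, multi_faceted: bool = False) -> str:
    """Apply the 100k / 500k rule of thumb from this section."""
    if relevant_tokens < 100_000:
        return "long-context"
    if relevant_tokens <= 500_000:
        # Middle band: depends on query density.
        return "long-context" if multi_faceted else "rag"
    return "rag"

print(choose_strategy(60_000))                        # long-context
print(choose_strategy(300_000))                       # rag
print(choose_strategy(300_000, multi_faceted=True))   # long-context
print(choose_strategy(800_000, multi_faceted=True))   # rag
```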

Cost math: a single Gemini 2.5 Pro call at 1M input tokens costs $1.25 in input. Querying the same document 10 times in a session costs $12.50. Building an embedding index of the same 1M tokens with text-embedding-3-small ($0.02/1M) costs $0.02 once; each query then pulls only the top-k chunks (typically 5-20k tokens) at $0.0063-$0.025 of input, which is 50-200x cheaper per query than re-sending the full document.
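
The session-level comparison can be sketched as two functions. The sketch counts input tokens only and ignores output tokens, query-embedding cost, and any long-context surcharge; the rates are the ones quoted in this section.

```python
def long_context_session(doc_tokens: int, queries: int, rate_in: float) -> float:
    """Input cost in USD when the full document is re-sent on every query."""
    return doc_tokens * queries * rate_in / 1_000_000

def rag_session(doc_tokens: int, queries: int, retrieved_tokens: int,
                rate_in: float, embed_rate: float) -> float:
    """One-time index build plus per-query retrieval cost, in USD."""
    index_cost = doc_tokens * embed_rate / 1_000_000
    per_query = retrieved_tokens * rate_in / 1_000_000
    return index_cost + queries * per_query

full = long_context_session(1_000_000, 10, rate_in=1.25)
for retrieved in (5_000, 20_000):
    rag = rag_session(1_000_000, 10, retrieved, rate_in=1.25, embed_rate=0.02)
    print(f"{retrieved:,} retrieved tokens: full ${full:.2f} vs RAG ${rag:.4f}")
```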

Latency math: long-context calls take seconds to first token (often 5-20s on 1M-token inputs). RAG queries with a 10k-token retrieval typically return first token in under 1s. The cumulative UX difference at scale is large.

When to defy the rule: documents with cross-cutting references that no chunk-level retrieval will surface — long legal contracts where clauses reference each other across the document, multi-document financial analyses where you need correlations across all sources, code reviews on tightly coupled systems. There, the full context buys you something RAG cannot.


Pricing implications of large windows

Most providers charge a flat per-token rate regardless of window size, but some apply a surcharge above a threshold. As of June 2026, OpenAI charges its standard rate up to the full window on most models. Anthropic charges the same rate across the full 500k window on Sonnet 4.6 and Opus 4.8. Google charges the same rate up to 200k on Gemini 2.5 Pro and Gemini 3.1 Pro Preview, with a modest surcharge above that threshold (confirm on Gemini pricing).

The bigger cost factor is simply that long-context calls process more tokens. A 1M-token Gemini 2.5 Pro call costs $1.25 just for input, regardless of how many tokens the model actually uses. If you fill the window every call, your input bill scales linearly with window size — at 100k calls per month, $125,000.

Prompt caching changes this dramatically. Anthropic and OpenAI both offer cache discounts that bill the cached portion at 10% of the standard rate. For repeated queries against the same large document — a knowledge base, a contract, a codebase — caching turns a $1.25 call into $0.125. See Anthropic Claude pricing and OpenAI API pricing for the cache mechanics in detail.
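
A simplified model of the cache discount is sketched below. It assumes cached tokens bill at 10% of the standard input rate, as described above, and it ignores cache-write charges and cache expiry, both of which vary by vendor.

```python
def call_cost(input_tokens: int, rate_in: float, cached_tokens: int = 0,
              cache_multiplier: float = 0.10) -> float:
    """Input cost in USD for one call; cached tokens bill at a reduced rate."""
    fresh_tokens = input_tokens - cached_tokens
    billed = fresh_tokens * rate_in + cached_tokens * rate_in * cache_multiplier
    return billed / 1_000_000

uncached = call_cost(1_000_000, rate_in=1.25)
cached = call_cost(1_000_000, rate_in=1.25, cached_tokens=1_000_000)
print(f"uncached ${uncached:.3f}, fully cached ${cached:.3f}")
print(f"100k uncached calls per month: ${uncached * 100_000:,.0f}")
```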


Token-to-word ratios across languages and content types — and why long-context budgets vary by 3x

Every estimate in the table above assumes English prose at roughly 0.75 words per token. That ratio is convenient for napkin math, but it is one number out of a distribution that runs roughly 3x wide depending on language, character set, and content type. If you are budgeting a 200k or 1M-token window for a multilingual workload, planning at the English rate will routinely undershoot the real token count by 50-200%. The same window that holds 150,000 English words holds only 80,000-133,000 Chinese characters on cl100k_base (at 1.5-2.5 tokens per character), 40-50,000 lines of pretty-printed JSON, and somewhere between 8,000 and 12,000 lines of Python depending on style.

Start with the tokenizers themselves. OpenAI's GPT-4 uses cl100k_base, a byte-pair encoding (BPE) trained primarily on English web text with 100,277 tokens in the vocabulary; newer OpenAI models moved to the larger o200k_base encoding, so check which encoding your target model uses before reusing cl100k_base figures. Anthropic's Claude uses its own BPE tokenizer with comparable but not identical merges, and token counts between OpenAI and Claude for the same English passage typically differ by 1-4% in either direction. Google's Gemini family uses SentencePiece with a vocabulary of roughly 256k tokens, which compresses non-Latin scripts more aggressively than cl100k_base. Llama 4 uses a 128k SentencePiece variant. The vocabulary size and training distribution determine how efficiently a given language compresses, and the gap between models on the same non-English text can hit 30-40%.

English compresses well because BPE tokenizers see enormous English training text and merge frequent substrings ('ing', 'tion', 'the ') into single tokens. The empirical English rate is 0.73-0.78 words per token across modern frontier tokenizers, or about 4 characters per token. Romance languages (Spanish, French, Italian, Portuguese) sit slightly worse, at 0.65-0.72 words per token, because BPE training data skews English. German runs 0.55-0.65 because of long compound nouns that often fragment into 2-4 tokens. Russian and other Cyrillic-script languages typically run 0.4-0.55 words per token. Arabic, with morphologically rich words and a script that BPE vocabularies cover less thoroughly, often runs 0.35-0.5.
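
To turn these ranges into a budget, a rough sketch like the following converts a word count into a low and high token estimate. The rates are the ones quoted in the paragraph above and should be treated as planning figures, not measurements.

```python
import math

# Words per token as (low, high), taken from the ranges quoted above.
WORDS_PER_TOKEN = {
    "English": (0.73, 0.78),
    "Romance": (0.65, 0.72),
    "German": (0.55, 0.65),
    "Russian": (0.40, 0.55),
    "Arabic": (0.35, 0.50),
}

def token_range(words: int, language: str) -> tuple[int, int]:
    """Return (best_case, worst_case) token counts for a text of `words` words."""
    low_rate, high_rate = WORDS_PER_TOKEN[language]
    return math.ceil(words / high_rate), math.ceil(words / low_rate)

for language in WORDS_PER_TOKEN:
    best, worst = token_range(100_000, language)
    print(f"{language}: 100,000 words ≈ {best:,}-{worst:,} tokens")
```

Planning against the worst-case figure is the safer choice when the window is tight.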

Logographic and syllabic scripts are the punishing case. On cl100k_base, a typical Chinese character costs 1.5-2.5 tokens, meaning 100k tokens of Chinese fits only 40,000-65,000 characters, roughly a single 200-page novel, where the same 100k tokens hold about 250 pages of English prose. Japanese is slightly worse than Chinese because kanji, hiragana, and katakana each tokenize differently. Korean Hangul runs 1.2-1.8 tokens per syllable block on cl100k_base. SentencePiece tokenizers (Gemini, Llama 4) cut this roughly in half, with Gemini handling a Chinese character at closer to 0.8-1.2 tokens, which is a real reason teams running CJK workloads gravitate toward Gemini or models with similar tokenizers.

Content type matters as much as language. Code is token-efficient per character (roughly 1 token per 3.5-4.5 characters), yet token-heavy per line because identifiers, punctuation, and whitespace all consume tokens. A pragmatic rule: a 200k-token window holds 8,000-12,000 lines of densely-commented Python or 6,000-9,000 lines of Java or C#. TypeScript with JSX usually fits fewer lines than either because markup and type annotations add tokens to each line, and minified JavaScript is better budgeted by character count than by line count. JSON and XML are token-expensive because every quote, brace, and tag is its own token or two. A 200k-token window holds roughly 40,000-55,000 lines of formatted JSON or 4,000-6,000 lines of XML. Markdown sits between prose and code; mathematical notation in LaTeX is among the worst, running 0.3-0.5 'concepts' per token because every backslash command, brace pair, and subscript fragments heavily.

Worked example. Applying the rates above, a 200k-token context window holds approximately: 150,000 English words (about 500 pages), 130,000-144,000 Spanish words (at 0.65-0.72 words per token), 80,000-133,000 Chinese characters under cl100k_base (at 1.5-2.5 tokens per character), 167,000-250,000 Chinese characters under Gemini's tokenizer (at 0.8-1.2 tokens per character), 8,000-12,000 lines of Python, 4,000-6,000 lines of XML, or 45,000-55,000 lines of compact JSON. A 1M-token Gemini 2.5 Pro window holds roughly 750,000 English words or about 830,000-1,250,000 Chinese characters, roughly twice what the cl100k_base rate would fit in the same budget. The actionable rule for multilingual workloads is to budget at 1.5-2x the English token rate for non-Latin scripts on OpenAI and Claude, and roughly 1.2-1.5x on Gemini and Llama 4.

The practical advice: never commit to a window size based on character counts or word counts alone. Run your real content through the model's own tokenizer — OpenAI's tiktoken library for GPT, Anthropic's count_tokens endpoint for Claude, Google's count_tokens API for Gemini — on a representative sample of 5-10 real documents, then plan with a 20-30% safety buffer on top of the measured rate. The cost of mis-estimating is concrete: a workflow designed for 150k English words that actually runs on Chinese will hit the 200k window cap at document 1, fail silently or truncate, and ship broken responses to users. Measure first, then choose the window.
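
A minimal sketch of the measure-then-budget step is below. The token counter is passed in as a function so the same code works with any vendor's tokenizer; `rough_count` is a crude placeholder (about 4 characters per token) used only so the example runs without extra dependencies, and the 25% buffer is one point in the 20-30% range suggested above.

```python
import math

def rough_count(text: str) -> int:
    """Placeholder counter (~4 characters per token). Swap in a real tokenizer."""
    return max(1, len(text) // 4)

# With tiktoken installed, a real counter for an OpenAI encoding would be:
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   count_tokens = lambda text: len(enc.encode(text))

def measured_rate(samples: list[str], count_tokens=rough_count) -> float:
    """Tokens per character across a representative sample of documents."""
    total_tokens = sum(count_tokens(s) for s in samples)
    total_chars = sum(len(s) for s in samples)
    return total_tokens / total_chars

def budget_for(chars: int, rate: float, buffer: float = 0.25) -> int:
    """Token budget for a document of `chars` characters, with a safety buffer."""
    return math.ceil(chars * rate * (1 + buffer))

samples = ["abcd" * 100, "efgh" * 250]
rate = measured_rate(samples)
print(rate, budget_for(1_000_000, rate))
```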


How to choose a window size for your workload

Start with the largest single piece of content your workload processes — a document, a conversation history, a codebase chunk. Add the system prompt, tool definitions, conversation memory, and a 20% safety buffer. That is your minimum required window.

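If the answer is under 50k, almost every model works. If 50k-200k, every model in the table still has the raw window, but Haiku 4.5 and o4-reasoning (200k each) leave no headroom near the top of that range. If 200k-500k, eliminate Haiku 4.5 and o4-reasoning outright, drop Mistral Large 3, DeepSeek V4, and Grok 4 above 256k, and drop gpt-5.5, gpt-5.4, and gpt-5.4-mini above 400k. If 500k+, only the Gemini family, Claude Fable 5, gpt-5.5-pro (1M), Qwen 3 Max (1M), and Llama 4 (1M-10M) make the cut.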
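
The sizing and filtering steps can be sketched against the input windows from the table above. The function names are illustrative, and the filter checks the advertised window only, not effective recall.

```python
import math

# Max input tokens per model, copied from the table in this article.
INPUT_WINDOWS = {
    "OpenAI gpt-5.5-pro": 1_000_000, "OpenAI gpt-5.5": 400_000,
    "OpenAI gpt-5.4": 400_000, "OpenAI gpt-5.4-mini": 400_000,
    "OpenAI o4-reasoning": 200_000,
    "Anthropic Claude Opus 4.8": 500_000, "Anthropic Claude Sonnet 4.6": 500_000,
    "Anthropic Claude Haiku 4.5": 200_000, "Anthropic Claude Fable 5": 1_000_000,
    "Google Gemini 3.5 Flash": 1_000_000, "Google Gemini 3.1 Pro (Preview)": 2_000_000,
    "Google Gemini 2.5 Pro": 1_000_000, "Google Gemini 2.5 Flash": 1_000_000,
    "Google Gemini 2.5 Flash-Lite": 1_000_000,
    "Meta Llama 4 Maverick": 1_000_000, "Meta Llama 4 Scout": 10_000_000,
    "Mistral Large 3": 256_000, "DeepSeek V4": 256_000,
    "Qwen 3 Max": 1_000_000, "xAI Grok 4": 256_000,
}

def required_window(content: int, system: int = 0, tools: int = 0,
                    memory: int = 0, buffer_pct: int = 20) -> int:
    """Largest content plus overhead, padded by a safety buffer (default 20%)."""
    total = content + system + tools + memory
    return math.ceil(total * (100 + buffer_pct) / 100)

def models_that_fit(required: int, windows=INPUT_WINDOWS) -> list[str]:
    """Models whose advertised input window covers the requirement."""
    return sorted(model for model, window in windows.items() if window >= required)

need = required_window(300_000, system=3_000, tools=2_000)
print(need, models_that_fit(need))
```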

Then test effective recall. Place a known fact at the 50%, 75%, and 90% positions of your typical max input, ask the model to retrieve it, and verify. If recall drops below 85% past your operational window, switch to RAG instead of pushing the model to its advertised limit.
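
A sketch of that recall check follows. `ask_model` is a hypothetical callable (document, question) -> reply that you would wire to your provider's API; it is not part of any vendor SDK. A single pass per position is a smoke test, so repeat it with different needles before drawing conclusions about a percentage.

```python
def insert_needle(filler: list[str], needle: str, position: float) -> list[str]:
    """Place the needle sentence at a fractional position (0.0-1.0) in the filler."""
    index = int(len(filler) * position)
    return filler[:index] + [needle] + filler[index:]

def recall_at_positions(filler: list[str], needle: str, question: str, answer: str,
                        ask_model, positions=(0.5, 0.75, 0.9)) -> dict:
    """Return {position: True/False} for whether the reply contains the answer."""
    results = {}
    for position in positions:
        document = " ".join(insert_needle(filler, needle, position))
        reply = ask_model(document, question)   # hypothetical model call
        results[position] = answer.lower() in reply.lower()
    return results

def recall_rate(results: dict) -> float:
    """Fraction of tested positions where the fact was retrieved."""
    return sum(results.values()) / len(results)
```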

For most teams the right move is: pick a model whose effective recall covers 80% of expected document sizes, use RAG for the long tail. See our GPT vs Claude vs Gemini cost calculator for a side-by-side cost breakdown at each window size.

Frequently Asked Questions

What is the largest LLM context window in 2026?

Meta Llama 4 Scout advertises a 10,000,000-token input window, though published needle-in-haystack benchmarks show recall degrades past ~500k tokens. The largest with strong recall above 500k is Google Gemini 3.1 Pro Preview at 2M tokens. See Llama docs and Gemini docs.

Is a larger context window always better?

No. Effective recall typically falls off well before the advertised maximum on every model, so a 1M-token window with 200k of strong recall often outperforms a 10M-token window with mixed recall past 500k. Match window to actual workload, not headline number.

What is the difference between input window and output cap?

Input window is the maximum tokens the model accepts in a request (prompt + history + tools). Output cap is the maximum it returns in one response. Claude Sonnet 4.6 has a 500k input but caps output at 64k — you can read a long document but cannot generate one as long in a single call.

Do reasoning models share output between thinking and answer?

Yes. OpenAI o4-reasoning has a 100k 'output' budget split between hidden reasoning tokens and visible answer. A model that thinks for 80k tokens has only 20k left for the response. Plan output caps with this in mind.

What is the cheapest model with a 1M+ token window?

Gemini 2.5 Flash-Lite at $0.10 input / $0.40 output per 1M tokens with a 1M-token window. It is the cheapest large-window option in 2026, though effective recall is more limited than Gemini 2.5 Pro or 3.x. Confirm at Gemini pricing.

Should I use long context or RAG for a 500k-token document?

Generally RAG, unless the query requires cross-document correlation that no chunk-level retrieval can surface. A typical single-question lookup through RAG (an embedded index plus a 10-20k token retrieval) sends 25-50x fewer input tokens than stuffing the full 500k-token document into context, and the input bill falls in proportion.

How many words is 1M tokens?

About 750,000 English words, or roughly 2,500 pages of dense prose. The full Harry Potter series (~1.1M words) is too long for a 1M-token window; it needs about 1.5M tokens. For code, the ratio runs closer to 4-5 characters per token, so 1M tokens hold approximately 50-80k lines of code depending on language.

Do all providers charge the same per-token rate at long context?

Mostly. OpenAI and Anthropic charge a flat per-token rate across the full window. Google applies a modest surcharge above 200k input tokens on Gemini 2.5 Pro and 3.1 Pro Preview. Confirm rates on each vendor's live pricing page before designing a long-context workflow.

Why does a 200k-token window hold less Chinese content than English content?

Tokenizers like OpenAI's cl100k_base are trained primarily on English and merge frequent English substrings into single tokens, so English compresses at roughly 0.75 words per token. Chinese characters on the same tokenizer cost 1.5-2.5 tokens each, so 200k tokens holds about 150k English words but only about 80-133k Chinese characters. Gemini's SentencePiece tokenizer roughly halves the cost, to about 0.8-1.2 tokens per character.

How many lines of code fit in a 200k-token context window?

Roughly 8,000-12,000 lines of densely-commented Python, 4,000-6,000 lines of XML, or 45,000-55,000 lines of compact JSON. Lower-density languages like Java or C# fall closer to 6,000-9,000 lines per 200k tokens. The variance comes from whitespace, identifier length, and punctuation density rather than raw character count — measure with the model's own tokenizer (tiktoken for OpenAI, count_tokens for Claude and Gemini) on a real sample before committing.

Do OpenAI, Anthropic, and Google tokenizers produce the same token count for the same text?

No. For English prose the three frontier tokenizers typically agree within 1-5%, but for non-English text or code the gap can hit 30-40%. OpenAI's cl100k_base BPE, Anthropic's Claude BPE, and Google's SentencePiece (used by Gemini) compress non-Latin scripts very differently. Always measure your real workload with the target model's tokenizer rather than assuming GPT-derived counts will hold elsewhere.
