Every estimate in the table above assumes English prose at roughly 0.75 words per token. That ratio is convenient for napkin math, but it is one number out of a distribution that runs roughly 3x wide depending on language, character set, and content type. If you are budgeting a 200k or 1M-token window for a multilingual workload, the real token count will routinely come in 50-200% above an estimate made at the English rate. A 200k-token window that holds 150,000 English words holds only about 80,000-130,000 Chinese characters under cl100k_base (at 1.5-2.5 tokens per character), 40,000-55,000 lines of pretty-printed JSON, and somewhere between 8,000 and 12,000 lines of Python depending on style.
Start with the tokenizers themselves. OpenAI uses cl100k_base for GPT-3.5 and GPT-4, a byte-pair encoding (BPE) trained primarily on English web text with 100,277 tokens in the vocabulary; GPT-4o and later models use the larger o200k_base encoding, so the cl100k_base figures in this section are the pessimistic case for OpenAI models. Anthropic's Claude uses its own BPE tokenizer with comparable but not identical merges, and token counts between OpenAI and Claude for the same English passage typically differ by 1-4% in either direction. Google's Gemini family uses SentencePiece with a vocabulary of roughly 256k tokens, which compresses non-Latin scripts more aggressively than cl100k_base. Meta's Llama moved from SentencePiece to a tiktoken-style BPE with a 128k vocabulary in Llama 3, and Llama 4 uses a larger vocabulary still. Vocabulary size and training distribution determine how efficiently a given language compresses, and the gap between models on the same non-English text can reach 30-40%.
English compresses well because BPE tokenizers are trained on enormous amounts of English text and merge frequent substrings ('ing', 'tion', 'the ') into single tokens. The empirical English rate is 0.73-0.78 words per token across modern frontier tokenizers, or about 4 characters per token. Romance languages (Spanish, French, Italian, Portuguese) sit slightly worse, at 0.65-0.72 words per token, because BPE training data skews English. German runs 0.55-0.65 because long compound nouns often fragment into 2-4 tokens. Russian and other Cyrillic-script languages typically run 0.4-0.55 words per token. Arabic often runs 0.35-0.5; the cost comes from its morphologically rich words and its smaller share of the training data, not from the right-to-left writing direction, which the tokenizer never sees.
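The rates above convert directly into a token estimate. The following is a minimal sketch of that napkin math, not a substitute for a real tokenizer: the `WORDS_PER_TOKEN` table simply copies the ranges quoted in this section, and the function name `estimate_tokens` and the language keys are illustrative choices, not part of any library.

```python
import math

# Words-per-token ranges quoted in the paragraph above, as (low, high).
# The keys are labels invented for this sketch.
WORDS_PER_TOKEN = {
    "english": (0.73, 0.78),
    "romance": (0.65, 0.72),
    "german": (0.55, 0.65),
    "russian": (0.40, 0.55),
    "arabic": (0.35, 0.50),
}


def estimate_tokens(word_count: int, language: str) -> tuple[int, int]:
    """Return a (best_case, worst_case) token estimate for a word count."""
    low_rate, high_rate = WORDS_PER_TOKEN[language]
    # More words per token means fewer tokens, so the high rate is the best case.
    best_case = math.ceil(word_count / high_rate)
    worst_case = math.ceil(word_count / low_rate)
    return best_case, worst_case


if __name__ == "__main__":
    for lang in WORDS_PER_TOKEN:
        best, worst = estimate_tokens(50_000, lang)
        print(f"{lang:8s} 50,000 words -> {best:,} to {worst:,} tokens")
```

Planning against the worst-case figure rather than the midpoint keeps the estimate on the safe side when the language mix is uncertain.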
Logographic and syllabic scripts are the punishing case. On cl100k_base, a typical Chinese character costs 1.5-2.5 tokens, meaning 100k tokens of Chinese fits only about 40,000-67,000 characters, or roughly the length of a single 200-page novel; the same 100k tokens holds about 75,000 English words, around 250 pages. Japanese is slightly worse than Chinese because kanji, hiragana, and katakana each tokenize differently. Korean Hangul runs 1.2-1.8 tokens per syllable block on cl100k_base. Larger-vocabulary tokenizers (Gemini, Llama 4) cut this roughly in half, with Gemini handling a Chinese character at closer to 0.8-1.2 tokens, which is a real reason teams running CJK workloads gravitate toward Gemini or models with similar tokenizers.
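The capacity figures in this paragraph are a single division: window size over tokens per character. A small sketch of that arithmetic follows; the function name `chars_that_fit` is hypothetical, and the per-character rates passed in are the ranges quoted above, not measurements.

```python
def chars_that_fit(
    window_tokens: int, tokens_per_char: tuple[float, float]
) -> tuple[int, int]:
    """Return the (fewest, most) characters that fit in a token window.

    tokens_per_char is a (cheapest, costliest) range, e.g. (1.5, 2.5).
    """
    cheapest, costliest = tokens_per_char
    fewest = int(window_tokens / costliest)  # every character at the costly rate
    most = int(window_tokens / cheapest)     # every character at the cheap rate
    return fewest, most


if __name__ == "__main__":
    # Rates quoted in the text for cl100k_base and for Gemini.
    for label, rate in [("Chinese, cl100k_base", (1.5, 2.5)),
                        ("Korean, cl100k_base", (1.2, 1.8)),
                        ("Chinese, Gemini", (0.8, 1.2))]:
        fewest, most = chars_that_fit(100_000, rate)
        print(f"{label}: {fewest:,} to {most:,} characters per 100k tokens")
```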
Content type matters as much as language. Code tokenizes at roughly 3.5-4.5 characters per token, about the same as English prose per character, yet it is token-heavy per line because identifiers, punctuation, and indentation all consume tokens. A pragmatic rule: a 200k-token window holds 8,000-12,000 lines of densely commented Python, 6,000-9,000 lines of Java or C#, or 4,000-7,000 lines of TypeScript with JSX. Minified JavaScript is better budgeted by characters than by lines, because a minifier can emit an entire file on a single line. JSON and XML behave differently: they are token-expensive per unit of data because every quote, brace, and tag is its own token or two, but their lines are short, so many lines fit. A 200k-token window holds roughly 40,000-55,000 lines of pretty-printed JSON or 25,000-35,000 lines of XML. Markdown sits between prose and code; mathematical notation in LaTeX is among the worst, running 0.3-0.5 'concepts' per token because every backslash command, brace pair, and subscript fragments heavily.
Worked example. Applying the rates above, a 200k-token context window holds approximately: 150,000 English words (about 500 pages), 130,000-144,000 Spanish words, 80,000-130,000 Chinese characters under cl100k_base, 165,000-250,000 Chinese characters under Gemini's tokenizer, 8,000-12,000 lines of Python, 25,000-35,000 lines of XML, or 40,000-55,000 lines of pretty-printed JSON. A 1M-token Gemini 2.5 Pro window holds roughly 750,000 English words or, at 0.8-1.2 tokens per character, roughly 830,000-1,250,000 Chinese characters, about twice what cl100k_base would fit in the same budget. The actionable rule for multilingual workloads is to budget at 1.5-2x the English token rate for non-Latin scripts on OpenAI and Claude, and roughly 1.2-1.5x on Gemini and Llama 4.
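The budgeting rule at the end of this paragraph can be written down as a small helper. This is a sketch under the stated rule of thumb only: the multipliers are the ones quoted above, while the dictionary keys and the function names `non_latin_budget` and `fits_window` are hypothetical labels chosen for illustration.

```python
# Non-Latin budget multipliers from the rule of thumb above, as (low, high).
# The dictionary keys are hypothetical labels chosen for this sketch.
NON_LATIN_MULTIPLIER = {
    "openai_or_claude": (1.5, 2.0),
    "gemini_or_llama4": (1.2, 1.5),
}


def non_latin_budget(english_tokens: int, family: str) -> tuple[int, int]:
    """Scale an English-rate token estimate to a (low, high) non-Latin budget."""
    low, high = NON_LATIN_MULTIPLIER[family]
    return round(english_tokens * low), round(english_tokens * high)


def fits_window(english_tokens: int, family: str, window_tokens: int) -> bool:
    """True only if even the pessimistic budget fits inside the window."""
    _, high = non_latin_budget(english_tokens, family)
    return high <= window_tokens


if __name__ == "__main__":
    estimate = 120_000  # tokens, computed at the English rate
    for family in NON_LATIN_MULTIPLIER:
        low, high = non_latin_budget(estimate, family)
        verdict = "fits" if fits_window(estimate, family, 200_000) else "overflows"
        print(f"{family}: {low:,} to {high:,} tokens, {verdict} a 200k window")
```

Checking against the high end of the range is a deliberate choice here: a workload that only fits at the optimistic multiplier is one long document away from overflowing.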
The practical advice: never commit to a window size based on character counts or word counts alone. Run your real content through the model's own tokenizer (OpenAI's tiktoken library for GPT, Anthropic's count_tokens endpoint for Claude, Google's count_tokens API for Gemini) on a representative sample of 5-10 real documents, then plan with a 20-30% safety buffer on top of the measured rate. The cost of mis-estimating is concrete: a workflow sized for 150,000 English words that actually receives 150,000 Chinese characters needs roughly 225,000-375,000 tokens on cl100k_base, so it exceeds the 200k window on the first document, then either errors out or silently truncates and ships broken responses to users. Measure first, then choose the window.
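One way to turn that advice into a repeatable check is sketched below. It is deliberately tokenizer-agnostic: `count_tokens` is any callable that returns a token count for a string, so the same code works with whichever provider's counter you wrap. The function names, the default 25% buffer (the midpoint of the 20-30% range above), and the stand-in counter in the demo are assumptions made for this example.

```python
import math
from typing import Callable, Iterable


def measured_tokens_per_char(
    samples: Iterable[str], count_tokens: Callable[[str], int]
) -> float:
    """Average tokens per character across a sample of real documents."""
    total_tokens = 0
    total_chars = 0
    for text in samples:
        total_tokens += count_tokens(text)
        total_chars += len(text)
    if total_chars == 0:
        raise ValueError("samples contain no text")
    return total_tokens / total_chars


def planned_tokens(
    corpus_chars: int, tokens_per_char: float, buffer: float = 0.25
) -> int:
    """Projected token count for a corpus, padded by a safety buffer."""
    return math.ceil(corpus_chars * tokens_per_char * (1 + buffer))


if __name__ == "__main__":
    # Stand-in counter for the demo only: one token per 4 characters.
    # Replace it with a real tokenizer call before trusting any number.
    def fake_counter(text: str) -> int:
        return math.ceil(len(text) / 4)

    docs = ["example document text " * 50, "another sample " * 80]
    rate = measured_tokens_per_char(docs, fake_counter)
    need = planned_tokens(600_000, rate)
    print(f"measured {rate:.3f} tokens/char, plan for {need:,} tokens")
    print("fits 200k window" if need <= 200_000 else "exceeds 200k window")
```

With tiktoken installed, `count_tokens` could be `lambda text: len(tiktoken.get_encoding("cl100k_base").encode(text))`; for Claude or Gemini, wrap the provider's token-counting call in a function with the same shape.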