What Gemini 1.5 Pro's 1-million-token context window actually means
A token is not a word; it is the sub-word unit that a language model's tokenizer produces. For English prose, a common rule of thumb is about 750 words per 1,000 tokens, or roughly 1.3 tokens per word. At 1,048,576 tokens, Gemini 1.5 Pro's GA context window therefore holds approximately 786,000 words of plain English text. At about 300 words per page, that is a book of roughly 2,600 pages. It is also about 15 short novels of around 50,000 words each (closer to 8 to 10 at a more typical 80,000 to 100,000 words), or a sizable archive of corporate email threads.
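The conversion above is simple enough to check by hand, but a small sketch makes the assumptions explicit. The 0.75 words-per-token ratio is the rule of thumb from the text; the 300 words-per-page figure and the function names are illustrative assumptions, and real token counts depend on the tokenizer and the text.

```python
# Rough capacity estimate for an English-prose context window.
# WORDS_PER_TOKEN is the rule of thumb quoted in the text;
# WORDS_PER_PAGE is an assumed typical page length, not a measured value.
CONTEXT_TOKENS = 1_048_576
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 300


def tokens_to_words(tokens, words_per_token=WORDS_PER_TOKEN):
    """Approximate number of English words that fit in `tokens`."""
    return int(tokens * words_per_token)


def words_to_pages(words, words_per_page=WORDS_PER_PAGE):
    """Approximate number of full pages for a given word count."""
    return words // words_per_page


words = tokens_to_words(CONTEXT_TOKENS)
pages = words_to_pages(words)
print(f"{words:,} words, about {pages:,} pages")
```

Changing `WORDS_PER_TOKEN` is the quickest way to see how sensitive the headline figure is: text in other languages, or text heavy with numbers and symbols, usually yields fewer words per token.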
Code generally costs more tokens per byte than prose. Python files tokenize at roughly 200–350 tokens per kilobyte, depending on comment density and variable-name length. At 1M tokens, that works out to approximately 3,000–5,000 kilobytes (3–5 MB) of source code: enough for a substantial monorepo, an entire microservices layer, or multiple years of git history for a single project. Google's own technical report on Gemini 1.5 reports near-perfect recall on 'needle-in-a-haystack' tasks at up to 1M tokens across text, audio, and video.
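As a rough illustration of the source-code arithmetic, the sketch below turns a file or repository size into a low and high token estimate using the 200–350 tokens-per-kilobyte range quoted above. It assumes 1 kilobyte means 1,024 bytes, and the function names are hypothetical; an exact figure requires running the model's actual tokenizer.

```python
# Estimate how many tokens a body of source code might consume,
# using the 200-350 tokens/KB range from the text (a heuristic, not a measurement).
CONTEXT_TOKENS = 1_048_576
LOW_TOKENS_PER_KB = 200   # sparse comments, short identifiers
HIGH_TOKENS_PER_KB = 350  # dense comments, long identifiers


def estimate_token_range(size_bytes):
    """Return (low, high) token estimates for `size_bytes` of source code."""
    kilobytes = size_bytes / 1024  # assumption: 1 KB = 1,024 bytes
    return round(kilobytes * LOW_TOKENS_PER_KB), round(kilobytes * HIGH_TOKENS_PER_KB)


def fits_in_context(size_bytes, context_tokens=CONTEXT_TOKENS):
    """True only if even the pessimistic (high) estimate fits in the window."""
    _, high = estimate_token_range(size_bytes)
    return high <= context_tokens


two_mib = 2 * 1024 * 1024
print(estimate_token_range(two_mib))   # low and high estimate for 2 MiB of code
print(fits_in_context(two_mib))        # 2 MiB fits even at the high rate
print(fits_in_context(4 * 1024 * 1024))  # 4 MiB does not fit at the high rate
```

Checking against the high estimate is a deliberately conservative choice: a repository that only fits at 200 tokens per kilobyte may overflow the window once real comments and long identifiers are tokenized.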
For multimodal content, the token math changes. A 720p video frame at medium quality consumes roughly 258 tokens, so at 1 frame per second, 1M tokens holds about 4,000 frames, or a little over an hour of video. Audio is tokenized and priced at its own rate in the Gemini API, and it shares the same context window as text and images. This multimodal long-context capability, not just text recall, is what distinguishes Gemini 1.5 Pro from most competitors.
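The video figure follows directly from the per-frame cost. The sketch below uses the 258 tokens-per-frame and 1 frame-per-second figures from the text; it ignores audio tokens and any per-request overhead, and the `reserved_tokens` parameter is a hypothetical knob for leaving room for a prompt and a response.

```python
# Estimate how much video fits in a context window at a fixed per-frame cost.
# Audio tokens and request overhead are deliberately ignored in this sketch.
CONTEXT_TOKENS = 1_048_576
TOKENS_PER_FRAME = 258  # per-frame figure quoted in the text


def max_video_seconds(context_tokens=CONTEXT_TOKENS, fps=1.0,
                      tokens_per_frame=TOKENS_PER_FRAME, reserved_tokens=0):
    """Seconds of video that fit after setting aside `reserved_tokens`."""
    available = max(context_tokens - reserved_tokens, 0)
    frames = available // tokens_per_frame  # whole frames only
    return frames / fps


seconds = max_video_seconds()
print(f"{seconds:.0f} s, about {seconds / 60:.1f} minutes")  # 4064 s, about 67.7 minutes
```

Sampling at a higher frame rate shrinks the budget proportionally: at 2 frames per second, the same window covers about half as much footage.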