Tokenizer basics: why ~4 characters and ~0.75 words per token
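Every major LLM converts raw text into a sequence of tokens before processing it. Tokens are not words: they are subword units, typically produced by a byte-pair encoding (BPE) algorithm that repeatedly merges frequent character or byte sequences, trading vocabulary size against sequence length. OpenAI's tiktoken library, used by the GPT-4o and GPT-5 families, encodes text for those models with `o200k_base`, a vocabulary of roughly 200k tokens; the older GPT-4 and GPT-3.5 models used the ~100k-token `cl100k_base`. Anthropic's Claude tokenizer uses a similar BPE scheme. Google's Gemini 2.5 Pro uses a SentencePiece-based tokenizer, which is likewise a subword scheme but has its own vocabulary and merge rules, so the same text produces a different token count on each platform.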
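To make the merge idea concrete, here is a minimal toy sketch of BPE training and encoding. It is not any vendor's tokenizer: it works on characters rather than bytes, skips the pre-tokenization step that production tokenizers use, and the function names (`learn_bpe_merges`, `merge_pair`, `tokenize`) and the tiny corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words: list, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules by repeatedly merging the most frequent adjacent pair."""
    # Each distinct word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word: str, merges: list) -> list:
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (an assumption for illustration only).
corpus = ["low", "low", "lower", "lowest", "newest", "newest", "widest"]
merges = learn_bpe_merges(corpus, num_merges=6)
print(merges)
print(tokenize("lowest", merges))
```

More merges mean a larger vocabulary and shorter token sequences, and fewer merges mean the opposite. That is the trade-off that leaves real vocabularies in the 100k to 200k range.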
The rule of thumb that holds across all three is approximately **4 characters per token** and **0.75 words per token** for standard English prose. That means a 1,000-word blog post is roughly 1,333 tokens (1,000 ÷ 0.75); a 500-word system prompt is roughly 667 tokens. These ratios shift meaningfully in a few cases: (1) code, where identifiers, brackets, and whitespace often tokenize differently than prose, and where minified JavaScript, or Python with long variable names, can run 30–50% more tokens per character than plain English; (2) non-Latin scripts, where Chinese, Japanese, and Arabic text can consume 1–3 tokens per character rather than roughly 0.25, so the same number of characters costs 4–12x more tokens; (3) numbers and special characters, where a 10-digit phone number might be 5–10 tokens depending on how the tokenizer groups digits and separators.
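The arithmetic behind those estimates can be sketched in a few lines. The function names here are hypothetical, and the two constants are the English-prose rule of thumb from the text, not values read from any tokenizer, so treat the results as rough budgets rather than exact counts.

```python
import math

# Rule-of-thumb ratios for standard English prose (approximations, not tokenizer output).
CHARS_PER_TOKEN = 4.0
WORDS_PER_TOKEN = 0.75


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate tokens from a word count using ~0.75 words per token."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_tokens_from_text(text: str) -> int:
    """Estimate tokens from raw text using ~4 characters per token."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


print(estimate_tokens_from_words(1000))  # 1333
print(estimate_tokens_from_words(500))   # 667
```

For code, non-Latin scripts, or digit-heavy text, these estimates will undercount, which is a reason to measure with a real tokenizer before relying on them.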
You can measure your token count for free using OpenAI's Tokenizer playground or the tiktoken Python package: `len(tiktoken.encoding_for_model('gpt-4o').encode(text))` gives the count, since `encode` returns a list of token IDs. For Claude, Anthropic's token counting API endpoint returns a count for any message array; Anthropic's documentation describes it as an estimate that can differ slightly from the input tokens actually billed. For Gemini, the `countTokens` method in the Google AI SDK does the same. These tools take 30 seconds to use and will immediately tell you whether your prompt is 200 tokens or 20,000 tokens.
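As a minimal sketch of the tiktoken route, the helper below counts tokens locally and falls back to the ~4 characters per token estimate when tiktoken is not installed or its vocabulary file cannot be loaded (for example, offline on first use). The name `count_tokens` and the fallback behavior are assumptions for illustration; `o200k_base` is the encoding tiktoken associates with GPT-4o.

```python
import math


def count_tokens(text: str, encoding_name: str = "o200k_base") -> tuple:
    """Return (token_count, method), where method is "tiktoken" or "estimate"."""
    try:
        import tiktoken  # third-party package: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        # encode() returns a list of integer token IDs; its length is the token count.
        return len(encoding.encode(text)), "tiktoken"
    except Exception:
        # tiktoken missing, vocabulary unavailable, or text rejected by the encoder:
        # fall back to the rough ~4 characters per token rule of thumb.
        return math.ceil(len(text) / 4), "estimate"


prompt = "You are a helpful assistant. Answer concisely."
count, method = count_tokens(prompt)
print(f"{count} tokens ({method})")
```

The returned method label makes it obvious whether a number came from the real tokenizer or from the heuristic, which matters when the count feeds a cost or context-window calculation. The result applies to OpenAI models only; Claude and Gemini counts should come from their own counting endpoints.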