Step 1 — Map your hard constraints before you look at benchmarks
The most common mistake teams make when choosing an LLM for production is starting from benchmark leaderboards instead of their own system constraints. A model that ranks in the top three on MMLU but has a 2-second P99 latency is the wrong choice for a real-time chat product. A model with a 4k context window is the wrong choice for a legal document review pipeline that processes 50-page contracts.
Map these four constraints first. They will often eliminate half the candidates or more before you run a single benchmark.

- **Latency SLA:** what is your acceptable P99 response time? Real-time UIs typically need sub-800ms time-to-first-token. Async pipelines can tolerate 5-30 seconds. Background batch jobs can run overnight.
- **Context window:** how many tokens do your longest inputs realistically reach? Add a 30% buffer for retrieval and tool outputs.
- **Data residency / privacy:** does your data processing agreement allow sending data to OpenAI, Anthropic, or Google? Or do you need self-hosted open weights?
- **Rate limits:** what is your peak requests-per-minute, and can you absorb the tier 1 limits, or do you need an enterprise agreement? See our LLM rate limits guide for the full table.
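The 30% context buffer is simple arithmetic, but it helps to write it down once so everyone on the team uses the same number. The following is a minimal sketch: the function name `required_context_tokens` and the 40,000-token input are illustrative assumptions, not measurements from any real workload.

```python
def required_context_tokens(longest_input_tokens: int, buffer_percent: int = 30) -> int:
    """Return the minimum context window needed, rounded up to a whole token.

    longest_input_tokens: token count of your longest realistic input (measure this
        with the tokenizer of the model you are evaluating).
    buffer_percent: headroom for retrieval results and tool outputs (30 per the text).
    """
    if longest_input_tokens < 0 or buffer_percent < 0:
        raise ValueError("token count and buffer must be non-negative")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (longest_input_tokens * (100 + buffer_percent) + 99) // 100


# Hypothetical example: the longest input is 40,000 tokens.
print(required_context_tokens(40_000))  # 52000
```

Any model whose context window is below the returned value fails the context constraint, regardless of how it scores elsewhere.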
Only models that satisfy all four hard constraints should proceed to capability evaluation. This step typically narrows 15+ options down to 3-5 realistic candidates, which is a manageable set to actually evaluate with your own data.
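One way to make this filter explicit is to encode the four constraints as data and check each candidate against them. This is a sketch under stated assumptions: the class names, field names, model names, and every number below are hypothetical placeholders, and you would fill them in from your own measurements and each provider's published limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardConstraints:
    max_p99_ttft_ms: int         # latency SLA: acceptable P99 time-to-first-token
    min_context_tokens: int      # longest input plus buffer
    requires_self_hosting: bool  # True if data may not leave your infrastructure
    peak_rpm: int                # peak requests per minute you must sustain


@dataclass(frozen=True)
class Candidate:
    name: str
    p99_ttft_ms: int      # measured by you, not taken from a leaderboard
    context_tokens: int
    self_hostable: bool
    rpm_limit: int        # limit at the tier you can actually get


def failed_constraints(c: Candidate, hc: HardConstraints) -> list[str]:
    """Return the names of the hard constraints this candidate violates."""
    failures = []
    if c.p99_ttft_ms > hc.max_p99_ttft_ms:
        failures.append("latency")
    if c.context_tokens < hc.min_context_tokens:
        failures.append("context")
    if hc.requires_self_hosting and not c.self_hostable:
        failures.append("residency")
    if c.rpm_limit < hc.peak_rpm:
        failures.append("rate_limit")
    return failures


def shortlist(candidates: list[Candidate], hc: HardConstraints) -> list[Candidate]:
    """Keep only candidates that satisfy every hard constraint."""
    return [c for c in candidates if not failed_constraints(c, hc)]


# Hypothetical constraints and candidates, for illustration only.
constraints = HardConstraints(
    max_p99_ttft_ms=800, min_context_tokens=52_000,
    requires_self_hosting=False, peak_rpm=300,
)
candidates = [
    Candidate("model-a", p99_ttft_ms=600, context_tokens=128_000, self_hostable=False, rpm_limit=500),
    Candidate("model-b", p99_ttft_ms=2000, context_tokens=128_000, self_hostable=False, rpm_limit=1000),
    Candidate("model-c", p99_ttft_ms=500, context_tokens=4_096, self_hostable=True, rpm_limit=10_000),
    Candidate("model-d", p99_ttft_ms=700, context_tokens=200_000, self_hostable=False, rpm_limit=100),
]

for c in candidates:
    print(c.name, failed_constraints(c, constraints) or "passes")
```

Returning the list of failed constraints, rather than a bare pass/fail, keeps a record of why each model was dropped, which is useful when a constraint later changes (for example, after negotiating a higher rate limit).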