The LPU architecture — why Groq is fast and what that has to do with rate limits
Groq's hardware is the **Language Processing Unit (LPU)**, a deterministic-latency tensor processor designed for sequential token generation. Where a batch-oriented accelerator (an H100 or MI300X GPU, or a TPU v5) optimizes for batch parallelism, with many requests sharing the same matmul kernels, an LPU optimizes for **single-request latency**: the model weights live in on-chip SRAM, the schedule is statically compiled, and each output token is emitted at the same wall-clock interval regardless of prompt length.
The throughput numbers are real and reproducible. Llama 3.1 8B Instant: **~560 tokens per second per request**, end to end. Llama 3.3 70B Versatile: **~280 tok/s**. OpenAI's open-weight GPT-OSS 20B (Groq-hosted, Apache 2.0-licensed): **~1,000 tok/s**. GPT-OSS 120B: **~500 tok/s**. For comparison, OpenAI's hosted gpt-5.4-mini lands around 80-120 tok/s, Anthropic's Claude Haiku 4.5 around 130-180 tok/s, and Llama 3.3 70B on Together AI's H100 cluster around 60-100 tok/s. On the same open-weight model (Llama 3.3 70B), that makes Groq roughly **3-5x faster**.
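The 3-5x figure follows directly from the Llama 3.3 70B rates quoted above. A minimal sketch of that arithmetic, assuming a steady decode rate and ignoring time-to-first-token; the function names and the 500-token response size are illustrative, not part of any Groq API:

```python
# Per-request decode rates quoted in the text, in tokens per second.
GROQ_LLAMA_70B_TOK_S = 280.0
TOGETHER_LLAMA_70B_TOK_S = (60.0, 100.0)  # (low, high) end of the quoted range


def generation_seconds(output_tokens: int, tok_per_s: float) -> float:
    """Seconds to stream `output_tokens` at a steady rate (time-to-first-token ignored)."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return output_tokens / tok_per_s


def speedup_range(fast: float, slow_range: tuple[float, float]) -> tuple[float, float]:
    """Return (min, max) speedup of `fast` over a provider quoted as a (low, high) range."""
    low, high = slow_range
    # The smallest speedup is against the competitor's best case, and vice versa.
    return fast / high, fast / low


if __name__ == "__main__":
    lo, hi = speedup_range(GROQ_LLAMA_70B_TOK_S, TOGETHER_LLAMA_70B_TOK_S)
    print(f"speedup on Llama 3.3 70B: {lo:.1f}x to {hi:.1f}x")
    # Hypothetical 500-token response, chosen only for illustration.
    print(f"500 tokens at 280 tok/s: {generation_seconds(500, GROQ_LLAMA_70B_TOK_S):.2f}s")
    print(f"500 tokens at 60 tok/s:  {generation_seconds(500, 60.0):.2f}s")
```

The ratio works out to about 2.8x at the low end and about 4.7x at the high end, which the text rounds to 3-5x.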
The catch is total throughput. An LPU rack serves one request at world-record speed; serving 10,000 concurrent users requires racks of LPUs. Groq's hosted-API business runs a finite pool of LPU capacity that is shared across the entire customer base, which is why the rate-limit ceiling per account is intentionally tight relative to GPU-hosted competitors. **Groq's pitch is speed for low-volume workloads**, not unlimited throughput. If you need 50K TPM on a 70B model, OpenAI or Together is the cheaper place to buy it; if you need sub-200ms time-to-first-token plus 280 tok/s, Groq is the only place to buy it.
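To see why a fast decode rate and a tokens-per-minute (TPM) ceiling interact badly, it helps to convert tok/s into tokens per minute. The sketch below is a rough back-of-the-envelope calculation, not Groq's actual accounting: it assumes only output tokens count against the limit (real limits typically count prompt tokens too), and the 50K TPM figure is the illustrative number from the paragraph above, not a published quota.

```python
def tokens_per_minute(tok_per_s: float) -> float:
    """Tokens one continuously streaming request produces in 60 seconds."""
    return tok_per_s * 60.0


def sustained_streams(tpm_limit: int, tok_per_s: float) -> int:
    """Whole number of back-to-back streams that fit under `tpm_limit`.

    Simplifying assumption: only output tokens count against the limit.
    """
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return int(tpm_limit // tokens_per_minute(tok_per_s))


if __name__ == "__main__":
    rate = 280.0        # Llama 3.3 70B rate quoted in the text
    tpm_limit = 50_000  # illustrative budget, taken from the paragraph above
    print(f"one stream burns {tokens_per_minute(rate):,.0f} tokens/min")
    print(f"{tpm_limit:,} TPM sustains {sustained_streams(tpm_limit, rate)} continuous stream(s)")
```

At 280 tok/s a single uninterrupted stream consumes 16,800 tokens per minute, so under these assumptions a 50K TPM budget covers only two continuous streams. The faster the hardware, the sooner a fixed per-minute ceiling is reached.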