Throughput benchmarks: tok/s is the headline number
Fast inference is sold on throughput: the number of output tokens per second (tok/s) a model can produce. Higher tok/s means lower end-to-end latency for the user, especially on long generations, and higher achievable requests-per-minute (RPM) and tokens-per-minute (TPM) ceilings on the same hardware budget.
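To make the link between tok/s and user-visible latency concrete, here is a minimal sketch of the arithmetic. The function name `decode_seconds` and the three sample rates are illustrative assumptions, not measurements, and the calculation deliberately ignores time to first token and network overhead.

```python
def decode_seconds(output_tokens: int, tok_per_s: float) -> float:
    """Time spent generating the output at a sustained rate.

    Ignores time to first token, queueing, and network overhead.
    """
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return output_tokens / tok_per_s


# Illustrative rates only: a GPU-class endpoint and two dedicated-chip endpoints.
for rate in (100, 275, 2200):
    seconds = decode_seconds(1000, rate)
    print(f"{rate} tok/s: {seconds:.2f} s to generate 1,000 tokens")
```

At these rates a 1,000-token answer takes about 10 s, 3.6 s, and 0.45 s respectively, which is why the gap is most visible on long generations.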
**Cerebras WSE on Llama 3.3 70B**: ~2,200 tok/s, the fastest publicly available 70B inference on the market. The wafer-scale architecture removes the inter-chip communication overhead that limits GPU clusters. For long generations (>500 output tokens), where decode time dominates total latency, Cerebras finishes 5-10x faster end-to-end than the next-fastest provider.
**Groq LPU on Llama 3.3 70B**: ~275 tok/s. That is extremely fast by industry standards and ~5-10x faster than typical GPU-based inference (closer to 1.5-4.5x against the optimized GPU endpoints quoted below), but ~8x slower than Cerebras. The LPU architecture's strength is cost-per-token efficiency, not absolute speed; Groq is the throughput-per-dollar winner.
**Together AI on H100/H200 (Llama 3.3 70B Turbo)**: ~120-180 tok/s depending on cluster configuration. Slower than the dedicated-chip providers but competitive with hyperscaler GPU inference. Together's bet is on model catalog breadth and fine-tuning flexibility, not raw speed.
**For context**: typical OpenAI gpt-4o output speed is ~80-120 tok/s, and Anthropic Claude Sonnet 4.6 is ~60-100 tok/s. Both Groq and Cerebras meaningfully exceed these hyperscaler defaults: Groq by roughly 2-5x, Cerebras by roughly 18-37x.
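The relative speeds can be checked directly from the figures quoted in this section. The sketch below is only arithmetic on those approximate numbers; the `RATES` table and the `speedup_range` helper are hypothetical names introduced for illustration, and single-value figures are stored as a range with equal ends.

```python
# Approximate output speeds quoted in this section, as (low, high) tok/s.
RATES = {
    "Cerebras": (2200, 2200),
    "Groq": (275, 275),
    "Together": (120, 180),
    "OpenAI gpt-4o": (80, 120),
    "Claude Sonnet 4.6": (60, 100),
}


def speedup_range(fast, slow):
    """Return the (min, max) speedup of `fast` over `slow`.

    The minimum pairs the slowest `fast` figure with the fastest `slow`
    figure; the maximum pairs the opposite ends of each range.
    """
    return fast[0] / slow[1], fast[1] / slow[0]


for fast_name in ("Cerebras", "Groq"):
    for slow_name in ("Together", "OpenAI gpt-4o", "Claude Sonnet 4.6"):
        lo, hi = speedup_range(RATES[fast_name], RATES[slow_name])
        print(f"{fast_name} vs {slow_name}: {lo:.1f}x to {hi:.1f}x")
```

On these inputs Groq lands between roughly 1.5x and 4.6x the GPU-served figures, and Cerebras between roughly 12x and 37x. The comparison mixes different models, so treat the ratios as order-of-magnitude guidance.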
**Throughput vs latency**: tok/s measures sustained output speed. TTFT (time to first token) is a separate dimension: Cerebras and Groq both have sub-200 ms TTFT on cached prompts; Together is typically 300-500 ms. For voice agents and real-time UX, both numbers matter, because short replies are dominated by TTFT and long ones by tok/s.
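A simple way to see how the two numbers interact is to model end-to-end latency as TTFT plus decode time. The sketch below assumes that additive model; the function names and both sample profiles are hypothetical (200 ms is the upper bound of the "sub-200 ms" figure above, while 400 ms and 150 tok/s are midpoints of the Together ranges).

```python
def end_to_end_ms(ttft_ms: float, output_tokens: int, tok_per_s: float) -> float:
    """Estimated total latency: time to first token plus decode time."""
    return ttft_ms + 1000.0 * output_tokens / tok_per_s


def ttft_share(ttft_ms: float, output_tokens: int, tok_per_s: float) -> float:
    """Fraction of total latency spent waiting for the first token."""
    return ttft_ms / end_to_end_ms(ttft_ms, output_tokens, tok_per_s)


# Hypothetical profiles built from the approximate figures in this section.
PROFILES = {
    "dedicated chip": {"ttft_ms": 200.0, "tok_per_s": 2200.0},
    "GPU endpoint": {"ttft_ms": 400.0, "tok_per_s": 150.0},
}

for tokens in (30, 1000):  # a short voice-agent reply vs a long generation
    for name, p in PROFILES.items():
        total = end_to_end_ms(p["ttft_ms"], tokens, p["tok_per_s"])
        share = ttft_share(p["ttft_ms"], tokens, p["tok_per_s"])
        print(f"{tokens} tokens, {name}: {total:.0f} ms total, {share:.0%} is TTFT")
```

Under these assumptions a 30-token reply is mostly TTFT on either profile, so a voice agent benefits more from shaving first-token delay than from extra tok/s. At 1,000 tokens the decode term dominates and throughput decides the outcome.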