By The DDH Team · Digital Dashboard Hub

LLM Output Speed in 2026: Tokens Per Second Across Every Major Model


Output speed is measured two ways: time to first token (TTFT), how long after the request the model takes to start streaming, and tokens per second (tok/s), how fast it streams after the first token. As of June 2026, median TTFT ranges from ~100ms (Groq, Cerebras) through a few hundred milliseconds (Gemini Flash family) to ~3-10s (large reasoning models on a cold cache), and median streaming tok/s ranges from ~30-40 (gpt-5.5-pro, Claude Opus 4.8) to 1,200-2,000+ (Groq Llama 3.3-70B, Cerebras Llama 4 Scout).

Latency materially affects UX. Below 200ms TTFT, users perceive the response as instant; above 1s they notice the wait; above 3s many abandon. For chat apps, TTFT matters more than tok/s; for batch generation, tok/s matters more. Below is the per-model benchmark, the underlying infrastructure that determines speed, and worked examples for typical workloads. For prompt-length strategy that improves TTFT, our ChatGPT prompt generator and code prompt builder help compress inputs; the free latency cheat sheet PDF prints the full benchmark table.

LLM speed benchmark — median TTFT & tokens/sec, June 2026

| Model | TTFT median (ms) | Tokens/sec (median) | Tokens/sec (p95) | Where to run |
| --- | --- | --- | --- | --- |
| OpenAI gpt-5.5-pro | ~800-1,500 | ~30 | ~50 | OpenAI API only |
| OpenAI gpt-5.5 | ~400-700 | ~65 | ~95 | OpenAI API |
| OpenAI gpt-5.4 | ~300-500 | ~85 | ~120 | OpenAI API |
| OpenAI gpt-5.4-mini | ~200-350 | ~150 | ~210 | OpenAI API |
| OpenAI gpt-5.4-nano | ~150-250 | ~250 | ~320 | OpenAI API |
| OpenAI o4-reasoning | ~2,000-6,000 (incl. reasoning) | ~45 | ~70 | OpenAI API |
| Anthropic Claude Opus 4.8 | ~900-1,800 | ~40 | ~60 | Anthropic API + Bedrock + Vertex |
| Anthropic Claude Sonnet 4.6 | ~500-900 | ~75 | ~110 | Anthropic API + Bedrock + Vertex |
| Anthropic Claude Haiku 4.5 | ~200-400 | ~165 | ~230 | Anthropic API + Bedrock + Vertex |
| Anthropic Claude Fable 5 | ~3,000-9,000 (incl. reasoning) | ~50 | ~75 | Anthropic API |
| Google Gemini 3.1 Pro Preview | ~600-1,200 | ~85 | ~125 | Gemini API + Vertex |
| Google Gemini 2.5 Pro | ~500-1,000 | ~95 | ~140 | Gemini API + Vertex |
| Google Gemini 2.5 Flash | ~250-450 | ~210 | ~290 | Gemini API + Vertex |
| Google Gemini 2.5 Flash-Lite | ~150-300 | ~310 | ~420 | Gemini API + Vertex |
| Mistral Large 3 | ~400-800 | ~80 | ~120 | Mistral API + Together + Bedrock |
| Llama 4 Maverick (Together) | ~300-600 | ~120 | ~180 | Together + Fireworks + Self-host |
| Llama 4 Scout (Cerebras) | ~80-150 | ~2,000+ | ~2,400 | Cerebras Inference |
| Llama 3.3-70B (Groq) | ~80-150 | ~1,200-1,500 | ~1,800 | Groq Cloud |
| Llama 3.3-70B (Together) | ~250-500 | ~110 | ~165 | Together AI |
| DeepSeek V4 | ~600-1,200 | ~70 | ~100 | DeepSeek API + Together |

Sources, as of June 2026: ArtificialAnalysis.ai latency dashboard (https://artificialanalysis.ai/), Groq cloud benchmarks (https://groq.com/benchmark), Cerebras inference benchmarks (https://cerebras.ai/inference), provider-internal reports where public, and DDH's own June 2026 measurements at 1,000-token prompts. TTFT includes network round-trip from US-East; varies by region. Speed varies by load and time of day — published numbers are medians over multi-hour samples. Reasoning models include hidden chain-of-thought tokens in TTFT measurement.

What 'tokens per second' actually means

Tokens per second is the rate at which the model streams tokens after the first one arrives. It is measured post-TTFT — the first token does not count toward the streaming rate.

Total response latency = TTFT + (output_tokens / tokens_per_second). A 500-token answer at 75 tok/s on Claude Sonnet 4.6 (TTFT ~700ms) takes 700ms + (500/75)s = 700ms + 6.67s = ~7.4 seconds end to end.
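The formula above can be sketched as a small helper. The function name and signature are illustrative choices, not part of any provider SDK; the inputs are the Claude Sonnet 4.6 figures quoted in the paragraph.

```python
def total_latency_s(ttft_ms: float, output_tokens: int, tok_per_s: float) -> float:
    """End-to-end seconds: TTFT plus the time to stream the output tokens."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return ttft_ms / 1000.0 + output_tokens / tok_per_s


# Claude Sonnet 4.6 figures from the text: TTFT ~700 ms, ~75 tok/s, 500 tokens.
sonnet_s = total_latency_s(700, 500, 75)
print(f"{sonnet_s:.1f} s")  # prints "7.4 s"
```

Keeping TTFT in milliseconds and the rate in tokens per second mirrors how the benchmark table reports them, so table values can be passed in unchanged.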

TTFT varies with prompt size. A 1k-token prompt typically hits TTFT in 300-500ms on a fast tier model; a 100k-token prompt can push TTFT to 5-15 seconds because the model has to process the entire input before producing the first output token. Caching helps — cached prefixes skip the prefill pass.

Tok/s varies much less with prompt size, since decoding proceeds one token at a time whatever the input length. A 30 tok/s model stays close to 30 tok/s whether the prompt is 100 tokens or 100k tokens, with only a marginal slowdown at very long contexts.


Worked example 1: chat UX at typical sizes

Reference workload: chat app with 1,500-token input (system prompt + history), 200-token response.

OpenAI gpt-5.5 (TTFT ~500ms, ~65 tok/s): 500ms + (200/65)s = 500ms + 3.08s = ~3.6s end to end. Tolerable for chat; users see streaming begin in half a second.

Claude Sonnet 4.6 (TTFT ~700ms, ~75 tok/s): 700ms + 2.67s = ~3.4s. Close to gpt-5.5.

Gemini 2.5 Flash (TTFT ~300ms, ~210 tok/s): 300ms + 0.95s = ~1.25s. Substantially faster, especially the TTFT.

Groq Llama 3.3-70B (TTFT ~100ms, ~1,300 tok/s): 100ms + 0.15s = ~250ms. Sub-300ms end-to-end — feels instant.

Cerebras Llama 4 Scout (TTFT ~120ms, ~2,000 tok/s): 120ms + 0.10s = ~220ms. Similar feel to Groq.

For chat UX, the difference between 1.2s (Gemini Flash) and 3.5s (Claude Sonnet) is felt as 'fast' vs 'normal.' The difference between 250ms (Groq) and 1.2s (Gemini Flash) is felt as 'instant' vs 'fast.' Match the speed tier to your UX goal.


Worked example 2: long-form streaming generation

Reference workload: generate a 4,000-token essay from a 500-token outline.

Claude Opus 4.8 (TTFT ~1,200ms, ~40 tok/s): 1.2s + (4000/40)s = 1.2s + 100s = ~101s. That is about 1 minute 40 seconds; users will wait but the experience drags.

Claude Sonnet 4.6 (TTFT ~700ms, ~75 tok/s): 0.7s + 53.3s = ~54s. Still slow but tolerable for premium long-form.

gpt-5.4 (TTFT ~400ms, ~85 tok/s): 0.4s + 47s = ~47s.

Gemini 2.5 Flash (TTFT ~300ms, ~210 tok/s): 0.3s + 19s = ~19s. The Flash family wins long-form generation by a wide margin.

Groq Llama 3.3-70B (TTFT ~100ms, ~1,300 tok/s): 0.1s + 3.1s = ~3.2s. The same essay in under 4 seconds — useful for high-volume content workflows.

Cerebras Llama 4 Scout (TTFT ~120ms, ~2,000 tok/s): 0.12s + 2s = ~2.1s.

For long-form generation, the difference between 50s (Sonnet 4.6) and 3s (Groq) is the difference between 'walk away and come back' and 'wait at the screen.' Different products want different things — premium magazines often prefer the slower, better-tuned premium models; high-volume content pipelines prefer the fastest tier.
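To rerun both worked examples with different output sizes, a minimal sketch follows. The TTFT and tok/s values are the point estimates used in the two examples above; the `Speed` tuple and `rank` helper are names invented for this illustration.

```python
from typing import NamedTuple


class Speed(NamedTuple):
    ttft_ms: float
    tok_per_s: float


# Point estimates copied from the two worked examples above.
MODELS = {
    "Claude Opus 4.8": Speed(1200, 40),
    "gpt-5.5": Speed(500, 65),
    "Claude Sonnet 4.6": Speed(700, 75),
    "gpt-5.4": Speed(400, 85),
    "Gemini 2.5 Flash": Speed(300, 210),
    "Groq Llama 3.3-70B": Speed(100, 1300),
    "Cerebras Llama 4 Scout": Speed(120, 2000),
}


def end_to_end_s(speed: Speed, output_tokens: int) -> float:
    """TTFT plus streaming time, in seconds."""
    return speed.ttft_ms / 1000 + output_tokens / speed.tok_per_s


def rank(output_tokens: int) -> list:
    """Return (model, seconds) pairs sorted fastest first."""
    rows = [(name, end_to_end_s(s, output_tokens)) for name, s in MODELS.items()]
    return sorted(rows, key=lambda row: row[1])


for workload, tokens in (("chat reply", 200), ("long-form essay", 4000)):
    print(f"{workload} ({tokens} output tokens)")
    for name, secs in rank(tokens):
        print(f"  {name:24s} {secs:7.2f} s")
```

The sketch ignores the way TTFT grows with input size, so it is only comparable across models when the prompt length is held fixed, as it is within each worked example.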


Why some providers are 10-20x faster than others

Three infrastructure choices drive the speed spread.

Custom inference silicon. Cerebras runs Llama 4 on its wafer-scale CS-3 chips; Groq runs models on its LPU architecture. Both are purpose-built for LLM inference and skip many of the optimization compromises a general-purpose GPU has to make. The result: 5-20x the per-token throughput of NVIDIA H100/H200 deployments at lower batch sizes.

Model size and architecture. Larger models (gpt-5.5-pro, Claude Opus 4.8) are inherently slower per token than smaller models (Haiku 4.5, gpt-5.4-mini, Gemini 2.5 Flash) because each token requires more computation. Mixture-of-experts architectures (DeepSeek V4, Llama 4) can run faster than dense models of similar nominal parameter count because only some experts activate per token.

Speculative decoding and other acceleration tricks. Modern providers run speculative decoding (a small draft model generates candidate tokens that the big model verifies in parallel), which can raise throughput 2-4x on workloads with predictable text. Most providers run this transparently as of 2026.
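The draft-then-verify loop behind speculative decoding can be illustrated with a toy greedy version. Everything here is a simplification for illustration: `toy_target` and `toy_draft` are stand-in functions rather than real models, the verification loop stands in for what a real system does in one batched forward pass, and the bonus token that production implementations emit after a fully accepted draft is omitted.

```python
from typing import Callable, List, Tuple

NextFn = Callable[[List[str]], str]


def speculative_decode(target_next: NextFn, draft_next: NextFn,
                       prompt: List[str], max_new: int,
                       k: int = 4) -> Tuple[List[str], int]:
    """Greedy speculative decoding. Returns (new_tokens, target_passes)."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < max_new:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal: List[str] = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2) The target checks each proposed position. A real system scores
        #    all k positions in one batched pass; we count it as one pass.
        passes += 1
        accepted: List[str] = []
        for tok in proposal:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the target's token, stop here
                break
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + max_new], passes


# Toy models: the draft agrees with the target except at every fifth position.
TEXT = "the quick brown fox jumps over the lazy dog".split()


def toy_target(seq: List[str]) -> str:
    return TEXT[len(seq) % len(TEXT)]


def toy_draft(seq: List[str]) -> str:
    return "?" if len(seq) % 5 == 4 else toy_target(seq)


tokens, passes = speculative_decode(toy_target, toy_draft, [], max_new=9)
print(" ".join(tokens))
print(passes)  # prints 3, versus 9 passes for plain token-by-token decoding
```

The output is identical to what the target would produce alone; the saving comes from needing fewer target passes when the draft guesses well, which is why the gain is largest on predictable text.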

The infrastructure choice often matters more than the model choice for speed. A 70B Llama 3.3 on Groq runs 10x faster than the same model self-hosted on a single H100. If sub-second latency matters, the question is not 'which model' but 'which provider for which model.'


TTFT vs tok/s: which matters when

Chat apps: TTFT matters most. Users perceive responsiveness from the first character of streaming. A model with 200ms TTFT and 60 tok/s feels faster than a model with 800ms TTFT and 200 tok/s, even though the latter finishes a 200-token response sooner.

Voice agents: TTFT is critical. Speech recognition + LLM + TTS round-trips need to stay under 800-1,000ms total for natural conversation flow. The LLM portion of that budget is typically 200-400ms, which only the Gemini Flash family, Haiku 4.5, gpt-5.4-mini, Groq, and Cerebras consistently hit.

Long-form generation (essays, code, reports): tok/s matters most. TTFT difference of 500ms is invisible when the total generation takes 30+ seconds. Optimize for the steady-state rate.

Batch generation: only end-to-end time matters. TTFT and tok/s both feed into the time per completion. Choose the model that finishes a batch of N completions fastest, usually a Flash-tier model on a major cloud, or an open-weights model on Groq/Cerebras where its quality is sufficient.

Reasoning workloads: tok/s of visible output matters less because hidden reasoning accounts for much of the total latency. An o4-reasoning model that reasons for 5s and then streams 200 tokens at 45 tok/s spends ~5s reasoning and ~4.4s streaming, about 9.4 seconds in total. Match the reasoning depth to the task, not the visible output size.
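The chat-app comparison above (200ms TTFT at 60 tok/s versus 800ms TTFT at 200 tok/s) implies a break-even output length beyond which the slower-starting model finishes first. A minimal sketch of that calculation, with a helper name chosen for this example:

```python
def break_even_tokens(ttft_a_ms: float, rate_a: float,
                      ttft_b_ms: float, rate_b: float) -> float:
    """Output length at which B (later start, faster stream) ties A.

    Solves ttft_a + n / rate_a == ttft_b + n / rate_b for n.
    """
    if not (ttft_b_ms > ttft_a_ms and rate_b > rate_a):
        raise ValueError("expects B to start later but stream faster than A")
    gap_s = (ttft_b_ms - ttft_a_ms) / 1000.0
    return gap_s / (1.0 / rate_a - 1.0 / rate_b)


n = break_even_tokens(200, 60, 800, 200)
print(round(n))  # prints 51
```

For responses longer than about 51 tokens the 800ms model finishes sooner, even though the 200ms model starts streaming earlier and so feels more responsive in a chat UI.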


Speed-cost tradeoff: when to pay for fast

Speed and cost interact in two ways in 2026: through model choice within one provider's lineup and through specialty providers.

Per-token: faster models within a single provider's lineup are usually cheaper (gpt-5.4-mini at $0.75/$4.50 is both faster and cheaper than gpt-5.5 at $5/$30). So 'faster' often means 'smaller cheaper model' — a free upgrade if quality holds.

Specialty providers (Groq, Cerebras, SambaNova) can charge a premium over commodity inference for the same open-weights model, though not always: Groq Llama 3.3-70B is typically $0.59/$0.79 per 1M tokens (input/output), which is below Together's $0.88/$0.88 for the same model. Where a premium does apply, it buys roughly 10x the speed.

When the premium is worth it: voice agents (sub-second is hard requirement), interactive coding assistants (long output, latency kills flow), tight UX loops (each second of latency costs conversion). When it is not: batch jobs (you do not care if the work finishes in 5 minutes vs 50), non-real-time content generation (publish queue absorbs latency).

Hybrid pattern that works: use Groq or Cerebras for the user-facing inference where latency is felt; use Together or AWS Bedrock for the batch or background work where it is not.


Streaming and partial-display strategies

Streaming lets users see content arrive in real time, masking the gap between TTFT and end-of-response. A 3.5-second total response feels much faster when the user can read the first sentence at 1 second.

All major providers support streaming on their chat endpoints. The pattern is well-established: open a server-sent-events connection, accumulate tokens client-side, render progressively. SDKs (OpenAI, Anthropic, Google) all include streaming-first methods.
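The accumulate-and-render pattern starts with splitting the event stream into payloads. The sketch below parses the generic server-sent-events wire format (`data:` fields, blank-line event boundaries, `:` comment lines). The JSON payload shape with a `text` key is a hypothetical example; each provider defines its own chunk schema, and in practice the official SDKs do this parsing for you.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete server-sent event."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                    # a blank line ends the current event
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):        # comment or keep-alive line
            continue
        elif line.startswith("data:"):
            value = line[5:]
            # The format allows one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    # An event not terminated by a blank line is dropped, as in the SSE format.


# Hypothetical payloads; real providers use their own JSON schema.
sample = [
    'data: {"text": "Hel"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"text": "lo"}\n',
    "\n",
]
rendered = ""
for payload in iter_sse_data(sample):
    rendered += json.loads(payload)["text"]  # append and re-render progressively
print(rendered)  # prints "Hello"
```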

Several UX patterns reduce perceived latency further: show a typing indicator before the first token (skeleton state); render structured responses progressively, showing the headline immediately and lazy-loading the body; stream into pre-allocated layout to prevent reflows; and cache the prompt server-side so repeats are faster.

Caching strategy interacts with TTFT. Prompt caching reduces TTFT on long inputs by skipping the prefill pass for cached portions. On a 100k-token cached prefix, TTFT can drop from 10s to under 1s. See Anthropic Claude pricing and OpenAI API pricing for caching cost mechanics.
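A rough way to reason about the caching effect is a linear prefill model: TTFT is a fixed overhead plus the uncached tokens divided by a prefill rate. This is a simplification, and both constants below are assumptions picked so that an uncached 100k-token prompt lands near the 10s figure in the text; they are not measured values for any provider.

```python
def estimate_ttft_s(prompt_tokens: int, cached_tokens: int,
                    prefill_tok_per_s: float, overhead_s: float = 0.3) -> float:
    """Linear TTFT estimate: fixed overhead plus prefill of uncached tokens."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    return overhead_s + uncached / prefill_tok_per_s


# Assumed: 10,000 tok/s prefill and 0.3 s fixed overhead (illustrative only).
cold = estimate_ttft_s(100_000, 0, 10_000)
warm = estimate_ttft_s(100_000, 95_000, 10_000)
print(f"cold {cold:.1f} s, warm {warm:.1f} s")  # prints "cold 10.3 s, warm 0.8 s"
```

Under these assumptions, caching 95% of the prompt brings TTFT under 1s, which matches the order of magnitude described above. Measure on your own workload before relying on the constants.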


Throughput vs latency: how to design for batch generation at maximum tokens per dollar per second

Single-request tokens per second is the wrong metric for batch workloads. A model that streams at 210 tok/s per request can deliver tens of thousands of tokens per second in aggregate when you fan out across concurrent requests — and that aggregate number is what determines whether you finish a 100M-token job in an hour, a night, or a week. Designing for throughput is a different exercise than designing for latency, and the optimal model, provider, and concurrency level often flip when you switch goals.

Worked example: a Gemini 2.5 Flash deployment running 200 parallel requests, each streaming at the published 210 tok/s median, yields roughly 42,000 tok/s of aggregate output throughput — about 151M tokens per hour. The per-request latency does not change (each user still waits ~1.25s end to end for a 200-token answer); what changes is how much work the API tier processes in a given wall-clock minute. The same Flash tier serving a synchronous chat app might only see 5-20 concurrent requests at peak; the same tier behind a batch classification job might sustain 200-500 concurrent requests for hours.
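The aggregate arithmetic in this worked example can be checked in a few lines. The helper names are illustrative, and the model is idealized: it assumes every slot streams continuously at the median rate, ignoring TTFT gaps between requests and any rate limiting.

```python
def aggregate_tok_per_s(concurrent_requests: int, per_request_tok_per_s: float) -> float:
    """Idealized aggregate output rate across parallel streams."""
    return concurrent_requests * per_request_tok_per_s


def minutes_to_finish(job_tokens: int, aggregate_rate: float) -> float:
    """Wall-clock minutes to emit job_tokens at the aggregate rate."""
    return job_tokens / aggregate_rate / 60


rate = aggregate_tok_per_s(200, 210)           # Gemini 2.5 Flash median from the table
tokens_per_hour = rate * 3600
minutes = minutes_to_finish(100_000_000, rate)  # the 100M-token job mentioned above

print(rate)                       # prints 42000
print(f"{tokens_per_hour:,}")     # prints "151,200,000"
print(f"{minutes:.1f} min")       # prints "39.7 min"
```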

Concurrency sizing for batch jobs is bounded by provider rate limits, not by per-call speed. As of June 2026, the OpenAI tier-5 limit on gpt-5.4-mini is 30,000 RPM and 150M TPM; the Anthropic tier-4 limit on Haiku 4.5 is 4,000 RPM and 400k input TPM with a separate output TPM cap; the Google Vertex tier on Gemini 2.5 Flash sits at 2,000 RPM per project per region with quota increases up to 10,000 RPM on request. The right concurrency for a batch job is the number that saturates whichever of these limits you hit first, usually TPM rather than RPM. A useful rule of thumb: target_concurrency = TPM_limit / (avg_tokens_per_request × requests_per_minute_per_worker), with the worker rate expressed per minute so the units match TPM. For most flagship-tier APIs, 50 to 200 concurrent workers saturate the limits without sustained 429s.
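The sizing rule can be sketched as a function that checks both limits and reports which one binds. The worker rate is expressed per minute here so that it divides cleanly into per-minute limits. The 30 requests per minute per worker (about 2s per request) is an assumption for illustration; the limits are the Anthropic tier-4 figures quoted above, and the 1,500 input tokens per request is an assumed request size.

```python
import math
from typing import Tuple


def target_concurrency(tpm_limit: float, rpm_limit: float,
                       avg_tokens_per_request: float,
                       requests_per_minute_per_worker: float) -> Tuple[int, str]:
    """Return (workers, binding_limit) for the tighter of the TPM and RPM caps."""
    by_tpm = tpm_limit / (avg_tokens_per_request * requests_per_minute_per_worker)
    by_rpm = rpm_limit / requests_per_minute_per_worker
    if by_tpm <= by_rpm:
        return max(1, math.floor(by_tpm)), "TPM"
    return max(1, math.floor(by_rpm)), "RPM"


# Haiku 4.5 tier-4 figures from the text: 400k input TPM, 4,000 RPM.
# Assumed workload: 1,500 input tokens per request, 30 requests/min per worker.
workers, binding = target_concurrency(400_000, 4_000, 1_500, 30)
print(workers, binding)  # prints "8 TPM"
```

With these inputs the TPM cap binds long before the RPM cap. Rounding down leaves a little headroom below the limit, which helps avoid sustained 429s.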

Diminishing returns set in beyond about 200 concurrent requests on most providers. Above that point, you start hitting TPM caps and seeing rate-limit responses, retry storms, and tail-latency inflation as the provider's internal queues fill. Adding more workers does not increase throughput; it only increases the failure rate. The fix is not more concurrency but multi-region deployment (a US-East and a US-West project each gets its own quota), multi-key fanout where the provider's terms allow it, or moving the spillover to a second provider with a cross-provider router. Real production batch pipelines at scale almost always run multi-region and often multi-provider for exactly this reason.

Specialty providers (Groq, Cerebras, SambaNova) excel on single-request speed but have lower aggregate throughput than commodity inference clouds because they run on smaller fleets with fewer concurrent slots. Groq's published per-key concurrency limits sit around 30-60 concurrent requests on most models at standard tiers; Cerebras is similar. Their per-request speed is 10-20x faster, but their concurrent slot count is often 5-10x lower. The aggregate number can still win for medium-sized jobs (10M-100M tokens) but the math flips for very large jobs (1B+ tokens) where commodity inference at higher concurrency wins on total wall-clock time and almost always on cost.

The right answer for 'I need to process 1M items overnight' is almost always a Batch API, not synchronous high-concurrency. OpenAI Batch, Anthropic Message Batches, and Google Vertex batch prediction all accept a JSONL file, process it within a 24-hour SLA, and charge 50% of the synchronous rate. For a 1M-item classification job at 1,500 input tokens and 50 output tokens per item, that is 1.55B tokens of work. On gpt-5.4-mini synchronous at $0.75/$4.50 per 1M tokens, the bill is about $1,350; on the Batch API equivalent, it is about $675 and you skip the rate-limit engineering entirely. The 24-hour window is not a constraint when the job is genuinely overnight work.
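The cost comparison can be reproduced with a short calculation. The function name is illustrative; the prices and the 50% batch discount are the figures stated in the paragraph above.

```python
def job_cost_usd(items: int, in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float,
                 discount: float = 0.0) -> float:
    """Total cost for a job, with prices given per 1M tokens."""
    in_cost = items * in_tokens / 1_000_000 * in_price_per_m
    out_cost = items * out_tokens / 1_000_000 * out_price_per_m
    return (in_cost + out_cost) * (1.0 - discount)


# 1M items, 1,500 input + 50 output tokens each, gpt-5.4-mini at $0.75/$4.50.
sync_cost = job_cost_usd(1_000_000, 1_500, 50, 0.75, 4.50)
batch_cost = job_cost_usd(1_000_000, 1_500, 50, 0.75, 4.50, discount=0.5)
print(f"sync ${sync_cost:,.0f}, batch ${batch_cost:,.0f}")  # prints "sync $1,350, batch $675"
```

Input tokens account for $1,125 of the synchronous bill and output for $225, so trimming the prompt saves more here than trimming the response.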

The right answer for 'I need to process 1M items in 1 hour' is high-concurrency synchronous, but only with a rate-limit-aware client and multi-region failover. The minimum throughput requirement is roughly 280 items per second sustained, which at 1,550 tokens per item means ~430k tokens per second (about 26M tokens per minute) of aggregate work. That fits under the largest quotas quoted above but far exceeds smaller ones such as a 400k TPM tier, so in most cases it requires spreading load across multiple TPM tiers, regions, or providers. Practical architecture: a queue (SQS, Pub/Sub, or Redis streams) draining into a worker pool of 150-300 concurrent processes per region, each running an exponential-backoff client that re-routes on 429 to a secondary region or provider, with a circuit breaker that pulls a region out of rotation if its error rate exceeds 5%. The cost premium over Batch is 2x on rates plus the engineering time, but you get the result in an hour instead of a day.
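The client side of that architecture can be sketched as follows. This is a minimal illustration under stated assumptions, not a production client: `send` is a hypothetical callable that takes a region name and returns an HTTP-like status code, the region names are placeholders, and the breaker has no half-open probe, so a real implementation would need a way to readmit a region after a cool-down.

```python
import random
import time
from collections import deque


class RegionBreaker:
    """Tracks recent outcomes for one region; opens above an error-rate threshold."""

    def __init__(self, window=100, max_error_rate=0.05, min_samples=20):
        self.outcomes = deque(maxlen=window)   # True = success, False = 429
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, ok):
        self.outcomes.append(ok)

    def is_open(self):
        if len(self.outcomes) < self.min_samples:
            return False                        # not enough data to judge yet
        errors = self.outcomes.count(False)
        return errors / len(self.outcomes) > self.max_error_rate


def backoff_delay_s(attempt, base_s=0.5, cap_s=30.0, rng=random.random):
    """Exponential backoff with full jitter."""
    return rng() * min(cap_s, base_s * 2 ** attempt)


def call_with_failover(send, regions, breakers, max_attempts=5, sleep=time.sleep):
    """Try regions in rotation, skipping open breakers; re-route on 429."""
    for attempt in range(max_attempts):
        healthy = [r for r in regions if not breakers[r].is_open()]
        if not healthy:
            raise RuntimeError("all regions are out of rotation")
        region = healthy[attempt % len(healthy)]
        status = send(region)
        breakers[region].record(status != 429)
        if status != 429:
            return region, status
        sleep(backoff_delay_s(attempt))
    raise RuntimeError("rate limited on every attempt")


# Demo with a stub: the first region always returns 429, the second succeeds.
regions = ["us-east", "us-west"]                # placeholder region names
breakers = {r: RegionBreaker() for r in regions}
stub_send = lambda region: 429 if region == "us-east" else 200
result = call_with_failover(stub_send, regions, breakers, sleep=lambda s: None)
print(result)  # prints "('us-west', 200)"
```

Full jitter spreads retries out so that many workers rate-limited at the same moment do not all retry together, which is the retry-storm failure mode described earlier.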


How to benchmark speed on your real workload

Step 1: pick a representative request — your typical input size and expected output size, run from your production region against the model's normal endpoint. Time it 100 times across different hours.

Step 2: record three numbers per request: TTFT (time from sending the request to receiving the first streamed token), tokens/sec (output tokens after the first / streaming duration), and total end-to-end time. Compute median and p95 across the 100 samples.

Step 3: compare against the published medians in the table above. Large gaps mean either your prompt shape is unusual (long input slows TTFT), your region is far from the provider's deployment, or you are hitting load-induced slowdowns. Adjust accordingly.

Step 4: track over time. Provider speed can drift — a model that was 75 tok/s at launch can drop to 40 tok/s after a few months of optimization tradeoffs, or jump after an infrastructure upgrade. Re-benchmark quarterly.

Step 5: if speed matters and you cannot move providers, optimize the prompt. Shorter prompts cut TTFT roughly in proportion to the tokens removed. Caching cuts TTFT on long stable prefixes by 80%+. Smaller models cut TTFT and raise tok/s, usually at some cost in quality, so eval on your real workload before deciding.
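The measurement in steps 1 and 2 can be sketched without tying it to any SDK. `measure_stream` takes any iterable that yields the token count of each streamed chunk as it arrives; wiring it to a real client is left out because chunk formats differ by provider. Following the definition earlier in the article, the streaming rate excludes the first chunk. The p95 uses a nearest-rank rule, which is one of several reasonable conventions.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def measure_stream(chunks: Iterable[int],
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Time one streamed response; chunks yields tokens per arriving chunk."""
    start = clock()
    first = None
    first_tokens = 0
    last = start
    total = 0
    for n in chunks:
        now = clock()
        if first is None:
            first, first_tokens = now, n
        total += n
        last = now
    if first is None:
        raise ValueError("stream produced no chunks")
    streaming = last - first
    # Post-TTFT rate: tokens after the first chunk over the streaming window.
    rate = (total - first_tokens) / streaming if streaming > 0 else float("nan")
    return {"ttft_s": first - start, "tok_per_s": rate,
            "total_s": last - start, "output_tokens": total}


def summarize(values: List[float]) -> Dict[str, float]:
    """Median and nearest-rank p95 of a list of samples."""
    ordered = sorted(values)
    rank = max(1, (95 * len(ordered) + 99) // 100)   # integer ceil(0.95 * n)
    return {"median": statistics.median(ordered), "p95": ordered[rank - 1]}


# Deterministic demo: a fake clock stands in for real network timing.
ticks = iter([0.0, 0.5, 1.0, 1.5])
sample = measure_stream([1, 10, 10], clock=lambda: next(ticks))
print(sample["ttft_s"], sample["tok_per_s"])  # prints "0.5 20.0"
```

In a real run, call `measure_stream` once per request across the 100 samples and pass each list of `ttft_s` and `tok_per_s` values to `summarize`.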

Frequently Asked Questions

What is the fastest LLM in 2026?

On raw tokens/sec, Cerebras Llama 4 Scout and Groq Llama 3.3-70B lead at 1,200-2,000+ tokens/sec — roughly 10-20x faster than commodity-hosted versions of the same models. Among proprietary models, Gemini 2.5 Flash-Lite leads at ~310 tok/s. Confirm against ArtificialAnalysis.ai for current numbers.

What is a good TTFT for a chat app?

Under 500ms feels responsive. Under 200ms feels instant. Above 1 second feels slow. Voice agents need sub-300ms TTFT to keep conversation flow natural. Most Flash-tier models, Groq, and Cerebras consistently hit these targets; mid-tier models like Claude Sonnet 4.6 and gpt-5.5 typically run 500-900ms TTFT.

Does prompt length affect tokens per second?

Marginally for streaming rate (tok/s stays roughly constant), but heavily for TTFT — a 100k-token input can push TTFT to 5-15 seconds because the model has to process the prefix before producing the first output token. Caching reduces this dramatically when the prefix is stable.

Why is Groq so much faster than OpenAI?

Groq runs on custom LPU silicon designed specifically for LLM inference, while OpenAI runs on NVIDIA GPUs optimized for both training and inference. The specialized hardware skips many of the compromises a general-purpose GPU has to make, yielding 10-20x per-token throughput. Cerebras (CS-3 wafer-scale) achieves similar results through different architecture.

Do reasoning models report tok/s during reasoning?

No — reasoning tokens are hidden, and TTFT to visible output includes the full reasoning duration. An o4-reasoning model with 5 seconds of reasoning before a 200-token visible output reports TTFT ~5,000ms; the streaming rate of visible output is typically 40-70 tok/s.

Should I use Groq or Cerebras for my production workload?

Both excel on speed for the open-weights models they host (mostly Llama variants). Cerebras typically wins on raw tok/s; Groq has broader model selection and better startup pricing. Per-token prices relative to commodity hosts (Together, Fireworks) vary by model, so compare current rate cards; either way you get roughly 10x the per-request speed, which pays off for latency-critical workloads.

How do I make my LLM responses feel faster?

Three levers: stream the response (mask TTFT-to-end gap with progressive rendering), cache the prompt prefix (cuts TTFT on long inputs by 80%+), and use a smaller faster model where quality holds. Most chat apps gain more from UX-side streaming and skeleton states than from switching models.

Where can I see live LLM speed benchmarks?

ArtificialAnalysis.ai tracks per-model TTFT and tok/s continuously across providers and regions. Groq and Cerebras publish their own per-model speed numbers. Always re-measure on your specific region and prompt shape; published medians can drift.

How many concurrent requests should I run for a batch job?

Start at 50 concurrent workers, monitor TPM usage and 429 rates, then double until you hit either your TPM cap or sustained 5%+ rate-limit errors. Most flagship-tier APIs saturate between 50 and 200 concurrent workers; pushing past 200 typically degrades throughput because retries pile up. If you need more aggregate throughput, add a second region rather than more workers in the first region — each project/region gets its own quota bucket.

Should I use the Batch API or synchronous high-concurrency calls?

Use the Batch API (OpenAI Batch, Anthropic Message Batches, Vertex batch prediction) whenever the work is genuinely asynchronous and tolerates a 24-hour SLA — you pay 50% of the synchronous rate and skip the rate-limit engineering. Use synchronous high-concurrency only when you need results within hours, when results feed a downstream pipeline that cannot wait, or when the workload is too small for batch overhead to amortize (under ~10k requests).

Why does Groq have lower aggregate throughput than OpenAI even though it is faster per request?

Per-request speed and concurrent-slot count are separate axes. Groq's LPU fleet is smaller than OpenAI's GPU fleet, so per-key concurrency is typically capped at 30-60 simultaneous requests on most models. OpenAI's standard tiers allow hundreds of concurrent requests per key. For latency-sensitive single-user workloads Groq wins; for high-volume batch jobs where aggregate tokens-per-second matters more than per-request latency, OpenAI or Google's commodity tiers usually deliver more total throughput per hour.
