By The DDH Team · Digital Dashboard Hub

Groq API Rate Limits 2026: RPM, TPM, RPD, TPD per Model — Free vs Dev vs Enterprise

By The DDH Team at Digital Dashboard Hub·Updated June 20, 2026

Stop writing AI prompts from scratch.

Tell us your business + your task + your model. We write the prompt — perfectly tuned for ChatGPT, Claude, Grok, Gemini, Midjourney, or any model. Plus 500+ pre-built prompts in your library.

Groq runs custom **Language Processing Units (LPUs)** instead of GPUs, which is why a single Llama 3.3 70B request streams at roughly **280 tokens per second** and an 8B model hits **560 tok/s** — multiple times faster than any GPU-hosted equivalent on OpenAI, Anthropic, or Together. The architecture is deterministic latency: each token is the same wall-clock distance from the prior token because the model weights are sharded across SRAM on-chip rather than streamed from HBM. For interactive workloads — chat, agents, voice — this is genuinely transformative.

The trade-off is rate limits. Groq enforces **four simultaneous dimensions** on every model: **RPM** (requests per minute), **TPM** (tokens per minute), **RPD** (requests per day), and **TPD** (tokens per day). Any of the four can bind first, and on the Free tier the binding limit is almost always TPM or TPD — not RPM. A Free-tier account on Llama 3.3 70B Versatile gets **30 RPM, 12K TPM, 1K RPD, 100K TPD** as of June 2026. The 12K TPM ceiling means you can fire one or two long-context requests per minute before throttling, regardless of how fast Groq's LPUs can serve them.

Below: the canonical per-model rate-limit table, the four-dimension math (which one binds in practice), what the Developer tier actually buys you, how reasoning models like DeepSeek R1 Distill 70B change the TPM economics, and the engineering moves — Batch API, multi-org parallelization, prompt compression — that get production traffic past the caps. Live-check your own limits at console.groq.com/settings/limits. For neighboring providers, see Together AI rate limits and Fireworks AI rate limits.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card. →

Groq API rate limits — Free vs Developer tier — June 2026 (Llama 3.3 70B baseline)

Feature	Free RPM	Free TPM	Developer TPM
Llama 3.3 70B Versatile	30 RPM	12,000 TPM	~300,000 TPM
Llama 3.1 8B Instant	30 RPM	6,000 TPM	~250,000 TPM
DeepSeek R1 Distill Llama 70B	30 RPM	6,000 TPM	~200,000 TPM
Qwen 2.5 32B	60 RPM	6,000 TPM	~200,000 TPM
Whisper Large v3 Turbo	20 RPM	7,200 ASH	~28,800 ASH

Source, as of June 2026: Groq rate-limits documentation (https://console.groq.com/docs/rate-limits) and console.groq.com/settings/limits. **Free-tier numbers are official and stable**; Developer-tier TPM values are not published as a single table — Groq lists them per-org on the live settings page and they scale based on payment history and usage patterns. The Developer numbers shown are typical first-month allocations sourced from community-reported live values across 30+ accounts. Whisper Large v3 Turbo is metered in **ASH (Audio Seconds per Hour)** and **ASD (Audio Seconds per Day)** instead of TPM — 7,200 ASH = 2 hours of audio per hour. DeepSeek R1 Distill Llama 70B is in deprecation (sunset October 2, 2025 per Groq's deprecations page); Groq recommends migration to llama-3.3-70b-versatile or openai/gpt-oss-120b. Cached input tokens do not count toward TPM/TPD.

The LPU architecture — why Groq is fast and what that has to do with rate limits

Groq's hardware is the **Language Processing Unit (LPU)**, a deterministic-latency tensor processor designed exclusively for sequential token generation. Where a GPU (H100, MI300X, TPU v5) optimizes for batch parallelism — many requests sharing the same matmul kernels — an LPU optimizes for **single-request latency**: the model weights live in on-chip SRAM, the schedule is statically compiled, and each token rolls off the rack at the same wall-clock interval regardless of prompt length.

The throughput numbers are real and reproducible. Llama 3.1 8B Instant: **~560 tokens per second per request**, end to end. Llama 3.3 70B Versatile: **~280 tok/s**. The new flagship OpenAI GPT-OSS 20B (Groq-hosted, MIT-licensed): **~1,000 tok/s**. GPT-OSS 120B: **~500 tok/s**. For comparison, OpenAI's hosted gpt-5.4-mini lands around 80-120 tok/s, Anthropic's Claude Haiku 4.5 at 130-180 tok/s, and Llama 3.3 70B on Together AI's H100 cluster around 60-100 tok/s. Groq is **3-5x faster** on the same open-weight model.

The catch is total throughput. An LPU rack serves one request at world-record speed; serving 10,000 concurrent users requires racks of LPUs. Groq's hosted-API business runs a finite pool of LPU capacity that is shared across the entire customer base, which is why the rate-limit ceiling per account is intentionally tight relative to GPU-hosted competitors. **Groq's pitch is speed for low-volume workloads**, not unlimited throughput. If you need 50K TPM on a 70B model, OpenAI or Together is the cheaper place to buy it; if you need sub-200ms time-to-first-token plus 280 tok/s, Groq is the only place to buy it.

The four rate-limit dimensions — and which one actually binds first

Every Groq model is gated on four simultaneous limits: **RPM** (requests per minute), **TPM** (tokens per minute, input + output combined), **RPD** (requests per day), and **TPD** (tokens per day). The most restrictive of the four is the binding constraint at any given moment. Hit any one and the API returns **HTTP 429 Too Many Requests** with a `retry-after` header (in seconds).

**In practice, TPM binds first on the Free tier for any reasoning or chat workload.** Llama 3.3 70B Versatile's 12K TPM on Free means a single 8,000-token input prompt + 2,000-token output completion consumes 10K tokens — 83% of the per-minute budget — and you can't fit a second request in the same minute. The 30 RPM ceiling is irrelevant because TPM bottoms out first. For very short requests (tweet classification, embeddings-style scoring), RPM becomes the binder: 30 short calls in 60 seconds will trip RPM long before TPM.

**RPD (1,000/day) and TPD (100,000/day) bind for sustained traffic.** A team running steady 70 RPM and 12K TPM hits the 1,000 RPD ceiling at roughly minute 14 of the calendar day, then the API is dead until 00:00 UTC the next day. This is the single most common surprise for teams new to Groq: passing the per-minute tests in dev, then production crashing 14 minutes after deploy on day 2.

**On the Developer tier, TPM is still the most common binder for long-context chat workloads** but the ceiling is roughly **25x higher** (300K TPM on Llama 3.3 70B per typical first-month allocation, scaling further with payment history). RPD usually stops being a problem at Developer-tier scale because daily ceilings move proportionally. Audio (Whisper) is metered separately in **ASH (Audio Seconds per Hour)** and **ASD (Audio Seconds per Day)** — a 7,200 ASH ceiling on Free Whisper = 2 hours of audio processable per hour of wall-clock time.

Free tier reality — what you can actually build

Groq's Free tier is **genuinely usable for production** at the right scale, which is unusual — most LLM providers' free tiers are demo-only. The headline numbers (30 RPM, 12K TPM, 1K RPD, 100K TPD on Llama 3.3 70B Versatile) support a real workload, just a tightly scoped one.

**What fits inside the Free tier**: a personal assistant chatbot serving ~30 users/day at 3 turns each (900 requests, well under 1K RPD), a code-review bot for a small team running 50-100 diffs/day on 8B models (Llama 3.1 8B has 500K TPD — far more headroom), a Discord/Slack bot answering 200-500 questions/day on short prompts, a weekly digest summarizer running 100-200 long-context calls in a single overnight batch.

**What doesn't fit**: any consumer-facing SaaS with more than ~50 active users (you'll hit RPD or TPD in early afternoon), real-time agents that fire 10+ LLM calls per user task (RPM binds first), document-processing pipelines on long PDFs (TPM binds first — a single 80K-context call blows the entire per-minute budget), voice transcription above ~2 hours/hour (Whisper ASH ceiling).

**Free is also rate-limited at the org level**, not per API key. Spinning up 5 keys for 5 microservices does not give you 5x the budget; they all share the same 30 RPM. This catches teams whose first Groq deploy assumed per-key quotas.

Developer tier — what billing actually buys you

Adding a payment method at console.groq.com/settings/billing promotes the org to the Developer tier immediately. There is no waiting period equivalent to OpenAI's 30-day Tier 5 clock — first successful charge unlocks Developer-tier ceilings within minutes. Pricing is pay-as-you-go: **$0.59 input / $0.79 output per 1M tokens on Llama 3.3 70B**, **$0.05 / $0.08 per 1M on Llama 3.1 8B Instant**, **$0.15 / $0.60 per 1M on GPT-OSS 120B**, **$0.075 / $0.30 per 1M on GPT-OSS 20B**.

**Typical first-month Developer ceilings** (community-reported, will vary by account): **Llama 3.3 70B → ~300K TPM, ~6,000 RPM, 1M+ RPD, 10M+ TPD**. **Llama 3.1 8B → ~250K TPM, similar RPM**. **DeepSeek R1 Distill 70B → ~200K TPM**. Whisper → roughly 4x the Free ASH. These are not contractual — Groq adjusts them per-org based on observed usage patterns, with payment history smoothing the curve over the first 60 days.

**Developer-tier 429s still happen** if you spike above the per-org ceiling. The 25x jump from Free TPM is not infinite throughput — it's enough headroom for a real product but not enough for a viral-traffic spike. Plan on client-side token-bucket throttling at 80% of the ceiling, same as you would on any production LLM provider.

**The fastest cost-to-throughput dial on Developer**: model selection. Dropping from Llama 3.3 70B Versatile to Llama 3.1 8B Instant gives you a model that is **2x the per-token TPM budget** (8B inference is cheaper to provision), **2x the raw tok/s**, and **~10x cheaper per token**. For classification, extraction, routing, and short-form generation, the 8B is fine and the savings are massive.

Per-model rate differences — why 8B gets more budget than 70B

Groq allocates rate-limit budgets per model based on **how much LPU capacity each request consumes**. A Llama 3.3 70B request uses roughly 10x the LPU-seconds of a Llama 3.1 8B request to generate the same output token count (similar token rate but 10x the weight memory footprint per token), which is why the TPM ceiling on the 70B is half the 8B even though pricing is 10x higher.

**The size-budget pattern across the lineup** (Free tier TPM, June 2026): Llama 3.3 70B Versatile = **12K TPM**. Llama 3.1 8B Instant = **6K TPM** but with much higher RPD (14,400 vs 1,000). Qwen 2.5 32B = **6K TPM** with 60 RPM ceiling (higher than the 70B's 30 RPM). GPT-OSS 20B (Groq-hosted, available June 2026) lands around 8B-tier budgets with the 1,000 tok/s headline speed.

**Why this matters for architecture**: route by task complexity, not by model preference. Classification, intent detection, extraction, simple QA — Llama 3.1 8B or Qwen 2.5 32B, with TPM headroom and lower latency. Complex reasoning, multi-step planning, long-context analysis — Llama 3.3 70B or DeepSeek R1 reasoning. Most production teams running on Groq route 70%+ of traffic through 8B-tier models specifically to preserve the 70B-tier TPM budget for the cases that need it.

**Compound and Compound Mini** (Groq's agentic systems with built-in web search and code execution) run at **200K tokens/min** rate limits at ~450 tok/s. They consume a separate budget from the base chat models, which means they can be a useful overflow path when your Llama 3.3 70B TPM is saturated and the workload tolerates the agentic-system latency overhead.

Reasoning models — DeepSeek R1 Distill and the TPM math you need to do

Reasoning models change the rate-limit calculus because they generate **5-20x more output tokens per request** than non-reasoning models. A DeepSeek R1 Distill Llama 70B query that would produce 200 tokens on Llama 3.3 70B typically produces 3,000-8,000 tokens on R1 — the reasoning chain is the output. **A single R1 call can consume 50%+ of a Free-tier minute's TPM budget on its own**.

**Free-tier R1 Distill 70B math**: 6,000 TPM, with each request averaging ~5,000 reasoning tokens + ~500 final-answer tokens + ~1,000 input tokens = ~6,500 tokens per request. You can fit **slightly less than one R1 request per minute** before throttling. RPM (30) is irrelevant; TPM dominates.

**Developer-tier R1 Distill 70B math**: ~200K TPM, same per-request token cost. You can fit **~30 requests per minute** — a real production rate for reasoning workloads. The cost: at ~6,500 tokens per request and (typical) $0.75/$0.99 per 1M reasoning pricing, each R1 call lands at $0.005-0.01 — manageable but 10-20x the cost of a non-reasoning equivalent on Llama 3.3 70B.

**Deprecation note**: DeepSeek R1 Distill Llama 70B is in announced sunset on Groq (October 2, 2025 per the deprecations page), with recommended migration to **llama-3.3-70b-versatile** or **openai/gpt-oss-120b**. GPT-OSS 120B supports a reasoning mode and runs at ~500 tok/s on Groq's LPU, making it the natural R1 successor for teams that want reasoning capability without migrating off Groq. If you're starting a new reasoning workload in June 2026, build on GPT-OSS 120B with explicit reasoning prompts, not on R1.

Enterprise tier — when it's worth a call

Groq's Enterprise tier is contract-negotiated: committed monthly spend (typically $5,000+/month in 2026), custom per-model TPM/RPM ceilings, SLA, and a named technical contact. Unlike OpenAI's Tier 5 which is automatic at $1,000 spend + 30 days, Groq's Enterprise is a **commercial relationship** — you sign a contract, you get the ceiling you negotiated.

**When Enterprise is the right call**: sustained traffic above ~200K TPM on a 70B model (Developer-tier default), latency-critical user-facing workloads where 429s are unacceptable (Enterprise gets dedicated capacity rather than shared pool), regulated industries needing data-handling commitments (Groq supports BAA/HIPAA conversations at Enterprise level), or any workload that would otherwise require multi-org parallelization at >2 orgs.

**When Enterprise isn't worth it**: development and prototyping (use Developer), low-volume production (Developer's 300K TPM on Llama 3.3 70B serves a real product), batch / async work (use the Batch API on Developer tier instead — 50% cheaper and a separate quota pool).

**Enterprise turnaround**: 1-2 weeks from first contact to signed contract for standard cases, 4-6 weeks if data-handling or regional-residency terms are involved. Contact via groq.com/contact-sales. One thing Groq is famous for: deprecation protection. **Enterprise contracts include guaranteed model availability** — when DeepSeek R1 Distill 70B was sunset on Free and Developer tiers, Enterprise customers under a committed-spend contract kept access. This matters if you've built a production workload around a specific model that the rest of the market is moving away from.

Handling 429s and 503s — the production retry pattern for Groq

Hit any of the four rate limits and Groq returns **HTTP 429 Too Many Requests** with a `retry-after` header in seconds and a JSON body of `{ "error": { "message": "...", "type": "rate_limit_exceeded", "code": "rate_limit_exceeded" } }`. The `retry-after` value is honest — wait that many seconds and the next request will succeed (assuming you don't immediately spike again).

**HTTP 503 Service Unavailable** is a different signal: capacity-side, not quota-side. 503 means the LPU pool serving your model is briefly saturated by aggregate traffic across all customers, not that you specifically hit a limit. Groq's 503 rate has trended down sharply since 2025 (LPU capacity expansions in Q1 and Q3 2025), but spikes still occur during industry-wide demand events. Treat 503 with **exponential backoff at 2s, 4s, 8s, 16s, capped at 30s**, and consider falling back to a different model on the second 503 (Llama 3.3 70B → GPT-OSS 120B, for example — different LPU pool).

**The production retry pattern**: (1) Token-bucket throttle outbound at 80% of your tier's TPM and RPM. Never voluntarily trigger 429s — they pollute logs and trigger downstream alerts. (2) On 429, honor `retry-after` exactly; don't add jitter on Groq specifically (the LPU scheduling is deterministic, so jittering helps less than on GPU providers). (3) On 503, exponential backoff with jitter, max 3 retries. (4) On the third failure, fall back to a secondary provider (Together AI on the same Llama 3.3 70B weights, or Cerebras if you have an account) or queue the job for batch processing.

**Rate-limit headers** on every Groq response include `x-ratelimit-limit-tokens`, `x-ratelimit-remaining-tokens`, `x-ratelimit-reset-tokens` (seconds until refill), plus the matching `-requests` versions. Pipe these into your observability stack and you can predict throttling 30-60 seconds before it happens — far cheaper than reacting to 429s after the fact.

Groq vs Cerebras vs Together — the fast-inference market in 2026

Three providers compete for the 'fastest hosted open-weight inference' market: **Groq** (LPU), **Cerebras** (Wafer-Scale Engine), and **Together AI** (GPU pool with custom scheduling). Each has a different architectural bet.

**Groq** wins on: single-request latency (lowest time-to-first-token in the market, sub-100ms for short prompts on 8B models), wide model coverage (Llama 3.1, 3.3, GPT-OSS, Qwen, Whisper, DeepSeek lineage), generous Free tier that's actually usable, mature SDK and OpenAI-compatible API. Loses on: per-account TPM ceilings (tightest among the three), Mixtral and some specialty models deprecated, sparse documentation on Developer-tier ceilings.

**Cerebras** wins on: pure inference speed at the absolute frontier (Llama 3.1 8B at ~2,100 tok/s, Llama 3.3 70B at ~440 tok/s — faster than Groq on both), wafer-scale architecture supports very long context with minimal latency degradation. Loses on: narrower model lineup, less mature API and SDK, higher per-token pricing for Developer tier, smaller free tier.

**Together AI** wins on: highest per-account TPM ceilings (GPU pool scales horizontally), widest model coverage (200+ open models including specialty fine-tunes), competitive pricing per token, fine-tuning available. Loses on: slower per-request speed than either Groq or Cerebras (typical 60-100 tok/s on Llama 3.3 70B vs Groq's 280), latency consistency is less predictable.

**Decision tree**: latency-critical user-facing chat or voice → Groq (best latency, sufficient throughput for most products). High-volume async processing or long-context analysis → Together (best TPM, widest model selection). Absolute speed benchmark for marketing or demo → Cerebras (the headline numbers). Most teams running production-scale fast inference end up with **Groq for hot interactive paths + Together for batch/background work** — different tools for different latency requirements.

Groq Batch API — the throughput escape hatch

Groq's **Batch API** runs on a separate quota pool from real-time chat completions, with **50% off list pricing** and a target completion window of up to **7 days** (most jobs complete in hours). The mechanism: submit a JSONL file of requests via the batch endpoint, poll for completion, download results. Same OpenAI-compatible request format as real-time, same models, half the cost.

**Batch is the right tool when**: you have async work (overnight precompute, weekly evaluation runs, training-set generation, document-classification at scale, vector-index rebuilds) and don't need sub-minute response. Your real-time TPM budget on Llama 3.3 70B is precious; pushing 80% of your token volume through Batch preserves it for the user-facing 20% that actually needs sub-second latency.

**Batch limits scale with tier**. Developer-tier batch capacity is generous — Groq does not publish a single number but community reports indicate **millions of tokens per batch job** are routinely processed. The 7-day window is forgiving enough that batch jobs effectively never queue beyond a few hours in practice.

**The classic Groq architecture pattern**: real-time chat through Llama 3.3 70B Versatile on Developer tier (consuming 100-200K TPM steady), nightly batch job through the same model for everything that doesn't need real-time (consuming millions of tokens at 50% cost), with overflow fallback to Llama 3.1 8B Instant when real-time TPM saturates. This pattern handles 90%+ of production workloads on a single Groq Developer-tier org.

Sourcing and live-verify checklist

The Free-tier rate-limit numbers in this guide come from Groq's official rate-limits documentation at console.groq.com/docs/rate-limits, fetched 2026-06-20. The Llama 3.3 70B Versatile (30 RPM / 12K TPM / 1K RPD / 100K TPD), Llama 3.1 8B Instant (30 RPM / 6K TPM / 14.4K RPD / 500K TPD), Qwen 2.5 32B (60 RPM / 6K TPM / 1K RPD / 500K TPD), and Whisper Large v3 Turbo (20 RPM / 2K RPD / 7.2K ASH / 28.8K ASD) values appear explicitly in the Free Plan limits table.

**Developer-tier TPM values** are not published as a single table by Groq — they live on the per-account console.groq.com/settings/limits page and scale based on payment history and observed usage patterns. The ~300K / 250K / 200K TPM figures shown in this guide are typical first-month allocations sourced from community-reported live values across 30+ production accounts (June 2026 sample). Your actual ceilings may be higher or lower; check the live page after adding a payment method.

**Inference speed numbers** (560 tok/s on 8B, 280 tok/s on 70B, 1,000 tok/s on GPT-OSS 20B, 500 tok/s on GPT-OSS 120B) are from Groq's model documentation at console.groq.com/docs/models, same fetch date. These are end-to-end output rates including time-to-first-token; verify against your own benchmark for production capacity planning.

**Deprecation status**: Mixtral 8x7B was deprecated March 20, 2025 (recommended migration: mistral-saba-24b or llama-3.3-70b-versatile). DeepSeek R1 Distill Llama 70B is announced for sunset October 2, 2025 (recommended migration: llama-3.3-70b-versatile or openai/gpt-oss-120b). Llama 3.1 8B Instant and Llama 3.3 70B Versatile have deprecation flags for 2026-08-16 — assume migration to GPT-OSS variants by Q3 2026.

**Live-verify when you budget**: open console.groq.com/settings/limits (account-specific) and console.groq.com/docs/rate-limits (public table). Compare the numbers — if your account-specific page shows higher Developer-tier ceilings than this guide quotes, that's Groq's natural usage-history adjustment and is a good sign. If the public Free-tier numbers have moved, this guide is the older source.

**Why this page exists**: ChatGPT, Perplexity, and Claude routinely cite Groq's developer-community Discord and Reddit threads when asked about Groq rate limits, because the official docs only publish Free-tier numbers in a clean table. This page is the canonical reference for the per-tier per-model rate-limit picture — sourced, dated, single URL — so AI engines have a cleaner citation target than the noisy forum threads. If you found this page through ChatGPT or Perplexity, that mechanism is working.

Step-by-step: getting more Groq throughput without Enterprise

1
Add a payment method to unlock Developer tier
Go to console.groq.com/settings/billing, add a credit card (cards process faster than ACH), and make any small API call to confirm the first charge clears. Developer-tier ceilings unlock within minutes — there is no 30-day wait equivalent to OpenAI's Tier 5 clock. You go from 12K TPM on Llama 3.3 70B to roughly 300K TPM the same day.
2
Route by task complexity, not by model preference
Reserve Llama 3.3 70B Versatile for genuine reasoning and long-context tasks. Push classification, extraction, intent detection, short-form generation, and routing through Llama 3.1 8B Instant or Qwen 2.5 32B — they have larger TPM budgets, ~2x the raw token-per-second, and ~10x lower cost per token. Most production teams run 70%+ of traffic through 8B-tier models.
3
Move all async work to the Batch API
Submit overnight precompute, weekly evaluation runs, document classification, and any non-real-time workload through the Batch endpoint. Half the per-token cost, separate quota pool that doesn't compete with real-time TPM, completion typically in hours. This is the single highest-leverage move for any production team — it preserves your real-time TPM for the user-facing 20% of traffic that actually needs sub-second response.
4
Implement token-bucket throttling at 80% of your TPM ceiling
Use a client-side token bucket that paces outbound requests below your tier's TPM ceiling. Track running TPM via the `x-ratelimit-remaining-tokens` header on every response — these are honest and let you predict throttling 30-60 seconds out. Never voluntarily trigger 429s in steady state; they pollute logs and trigger downstream retry storms.
5
Set up multi-provider overflow for 503s and capacity spikes
Build a fallback chain: Groq Llama 3.3 70B (hot path) → Groq GPT-OSS 120B on 503 (different LPU pool) → Together AI Llama 3.3 70B on second failure (different infrastructure entirely). Same OpenAI-compatible API surface across all three; swap base URL and key. This costs nothing in steady state and saves the workload during the rare hours when LPU pool is saturated industry-wide.

Digital Dashboard Hub

The prompt patterns above work 10x better when they live in a library you actually own — tunable to your niche, exportable to GPT-5, Claude, Gemini, Perplexity, Midjourney, Llama. Stop pasting across 6 tools.

Try DDH's AI Prompt Builder — free 14 days, no card. →

Related calculators

OpenAI Pricing Calculator →GPT-5.5, 5.4, mini, nano — full per-call cost in one input.Claude Pricing Calculator →Opus 4.8, Sonnet 4.6, Haiku 4.5, Fable 5 — input + output combined.Context Window Comparison →Max input length and price per 1M for every current model.

Related prompt tools

Together AI rate limits→Fireworks AI rate limits→OpenAI cost (alternative)→Low-latency prompt generator→

Frequently Asked Questions

Which Groq rate limit binds first in practice — RPM, TPM, RPD, or TPD?

On the Free tier for any chat or reasoning workload, **TPM binds first** (12K TPM on Llama 3.3 70B = one to two long-context requests per minute before throttling). For very short requests, RPM (30) binds first. For sustained all-day traffic, RPD (1,000) becomes the binder around midday. On the Developer tier, TPM is still the most common binder for long-context workloads but the ceiling is roughly 25x higher (~300K TPM on Llama 3.3 70B in typical first-month allocations).

Can I really build a real product on Groq's Free tier?

Yes, at the right scale. Free-tier (30 RPM / 12K TPM / 1K RPD / 100K TPD on Llama 3.3 70B) supports a personal assistant for ~30 users/day, a code-review bot for a small team, a Discord/Slack bot at 200-500 queries/day, or an overnight summarization batch. **It does not support consumer-facing SaaS above ~50 active users**, real-time agents with 10+ LLM calls per task, or long-PDF document pipelines. The Free tier is genuinely usable, not just a demo — but stay within the realistic envelope.

How do I upgrade from Free to Developer tier on Groq?

Add a payment method at console.groq.com/settings/billing and make any small API call to confirm the first charge clears. Developer-tier ceilings unlock within minutes — Groq does not have a 30-day waiting period equivalent to OpenAI's Tier 5 clock. Pay-as-you-go pricing: $0.59/$0.79 per 1M tokens on Llama 3.3 70B, $0.05/$0.08 on Llama 3.1 8B, $0.15/$0.60 on GPT-OSS 120B.

Can I really get 300 tok/s on Groq's Free tier?

Yes — the inference speed is identical across Free and Developer tiers. Llama 3.3 70B Versatile streams at ~280 tok/s per request regardless of tier; Llama 3.1 8B Instant at ~560 tok/s; GPT-OSS 20B at ~1,000 tok/s. The tier difference is **how many requests you can fit per minute**, not how fast each request runs. The marketed Groq speed is real on Free — what's gated is volume, not latency.

How does TPM math work on reasoning models like DeepSeek R1 Distill 70B?

Reasoning models generate 5-20x more output tokens per request (the reasoning chain is the output). A typical R1 Distill call produces ~5,000 reasoning tokens + ~500 final-answer tokens + ~1,000 input = ~6,500 tokens per request. **On Free tier (6K TPM), you can fit slightly less than one R1 call per minute**. On Developer tier (~200K TPM), about 30/minute. Note: DeepSeek R1 Distill Llama 70B is announced for sunset October 2, 2025 — migrate to llama-3.3-70b-versatile or openai/gpt-oss-120b (which supports reasoning mode at ~500 tok/s).

What is the Groq Batch API and when should I use it?

Groq Batch is an async endpoint that runs at **50% off list pricing** with a target completion window of up to 7 days (most jobs finish in hours). It runs on a separate quota pool from real-time chat, so it doesn't consume your TPM budget. Use it for any non-real-time workload: overnight precompute, weekly evaluation runs, document classification at scale, training-set generation, vector-index rebuilds. The classic Groq pattern is real-time chat through Developer-tier Llama 3.3 70B + nightly batch through the same model — preserves real-time TPM for user-facing traffic, half the cost on the async portion.

How does Groq compare to Cerebras for fast inference in 2026?

Cerebras is faster on raw tok/s (~2,100 on Llama 3.1 8B, ~440 on Llama 3.3 70B) but has narrower model coverage, smaller free tier, and less mature SDK/tooling. Groq is slightly slower per request (~560 on 8B, ~280 on 70B) but has wider model lineup (Llama 3.1, 3.3, GPT-OSS, Qwen, Whisper, DeepSeek), better Developer-tier ceilings, and a more usable Free tier. For pure speed-benchmark demos → Cerebras. For production interactive workloads with real model variety → Groq. For high-volume batch or long-context → neither; use Together AI's GPU pool.

How long does Groq Enterprise tier take to negotiate?

Typically **1-2 weeks** from first contact at groq.com/contact-sales to signed contract for standard cases (committed monthly spend $5,000+, custom per-model ceilings, SLA, named technical contact). **4-6 weeks** if data-handling terms, BAA/HIPAA, or regional residency are involved. Enterprise contracts also include **deprecation protection** — when models sunset on Free/Developer tiers, Enterprise customers under committed-spend contracts keep access, which matters if you've built around a specific model like DeepSeek R1.

Groq's speed is real. Prompt size is what kills it.

300 tok/s on Llama 3.3 70B disappears if your prompt is 8k tokens of bloat. Our AI Prompt Generator writes low-latency prompts (front-loaded, capped output, model-tuned for Llama / DeepSeek / Qwen) based on YOUR business + task. 14-day free trial, no card.

Browse all prompt tools →