Skip to contentNew: Does ChatGPT recommend your brand? Free 60-second AI visibility check →
By The DDH Team · Digital Dashboard Hub

LLM Rate Limits 2026: RPM, TPM, and Concurrency Caps Across Every Provider

By The DDH Team at Digital Dashboard HubUpdated

Stop writing AI prompts from scratch.

Tell us your business + your task + your model. We write the prompt — perfectly tuned for ChatGPT, Claude, Grok, Gemini, Midjourney, or any model. Plus 500+ pre-built prompts in your library.

14 days, no card. Cancel in 2 clicks.

LLM providers cap usage three ways: requests per minute (RPM), tokens per minute (TPM), and (sometimes) concurrent requests. The caps scale with usage tier — most providers auto-promote accounts based on cumulative spend and time, while a few require contacting sales. As of June 2026, RPM caps range from 60 (free trial tiers) to 30,000+ (high-tier enterprise) and TPM caps range from 30,000 to 100,000,000+, with concurrent request limits running 50-1,000 on flagship models.

Hitting rate limits is the single most common production incident with LLM APIs. The error returns instantly (HTTP 429), but the workload often does not recover gracefully — retries pile up, latency spikes, and downstream queues backfill. Below is the per-provider, per-tier table sourced from each vendor's docs, plus worked examples of when typical workloads hit which cap. For cost-side workload planning paired with these limits, see our GPT vs Claude vs Gemini cost calculator, or grab the free PDF rate-limit cheat sheet.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card.

LLM rate limits by provider and tier — June 2026 (flagship-tier models)

Feature
RPM (requests/min)
TPM (tokens/min)
Concurrent / Batch
Tier promotion criteria
OpenAI Tier 1 (free)50030,000StandardAccount creation
OpenAI Tier 2 ($50+ paid)5,000450,000Standard$50 cumulative spend, 7+ days
OpenAI Tier 3 ($100+ paid)5,000800,000Standard$100 cumulative spend, 7+ days
OpenAI Tier 4 ($250+ paid)10,0002,000,000Standard$250 cumulative spend, 14+ days
OpenAI Tier 5 ($1k+ paid)30,00030,000,000Standard$1,000 cumulative spend, 30+ days
Anthropic Tier 15040,000 (in) / 8,000 (out)Account creation
Anthropic Tier 21,00080,000 (in) / 16,000 (out)$40 deposit, 7+ days
Anthropic Tier 32,000160,000 (in) / 32,000 (out)$200 deposit, 14+ days
Anthropic Tier 44,000400,000 (in) / 80,000 (out)$400 deposit, 30+ days
Anthropic Custom (enterprise)NegotiatedNegotiatedContact sales
Google Gemini Free10 (2.5 Flash) / 5 (2.5 Pro)1,000,000 (Flash) / 250,000 (Pro)Free tier
Google Gemini Paid Tier 12,000 (Flash) / 1,000 (Pro)4,000,000 (Flash) / 2,000,000 (Pro)Billing enabled
Google Gemini Paid Tier 210,000 (Flash) / 5,000 (Pro)10,000,000 (Flash) / 5,000,000 (Pro)$250 cumulative spend, 30+ days
Google Gemini Paid Tier 330,000+ (negotiated)100,000,000+ (negotiated)Contact sales / Vertex AI
Mistral Free Tier1 RPS (60 RPM)500,000Account creation
Mistral Pro Tier5,0002,000,000Paid plan
Together AI Standard6,000Model-dependent200-500 concurrentPaid account
Together AI DedicatedUnlimited (capacity-bound)Unlimited (capacity-bound)Reserved capacityDedicated endpoint plan

Sources, as of June 2026: OpenAI rate limits (https://platform.openai.com/docs/guides/rate-limits), Anthropic rate limits (https://docs.claude.com/en/api/rate-limits), Google Gemini rate limits (https://ai.google.dev/gemini-api/docs/rate-limits), Mistral rate limits (https://docs.mistral.ai/deployment/laplateforme/tier/), Together AI rate limits (https://docs.together.ai/docs/rate-limits). RPM and TPM caps apply per model; high-volume models often have separate higher caps than newer or premium models. Confirm against each provider's live page before designing a workload — tier definitions and promotion criteria change often.

The three limits every provider enforces

Requests per minute (RPM) caps how many API calls you can issue in a 60-second window. The cap resets on a rolling basis — burst behavior is allowed within the window, but sustained high RPM triggers 429s. Most production workloads hit RPM caps first.

Tokens per minute (TPM) caps the total tokens (input + output, on most providers; some count input-only) flowing through your account per minute. Long-context calls eat the TPM budget quickly: a single 200k-input call on a 200k TPM cap leaves zero budget for other requests in that minute.

Concurrent requests caps how many requests can be in-flight simultaneously. OpenAI does not publish a hard concurrency cap on standard tiers (limited indirectly by TPM/RPM). Together AI publishes 200-500 concurrent on the standard tier. Hitting concurrency caps shows up as a different error path than RPM/TPM — typically a 503 instead of a 429.

All three caps reset per model. gpt-5.5 and gpt-5.4-mini have independent quotas; running gpt-5.5 at the cap does not affect your gpt-5.4-mini headroom. This is useful for fallback patterns — see the resilience section below.


Worked example 1: when does a chatbot hit the cap?

Reference workload: a customer-support chatbot averaging 1,500 input + 500 output tokens per call.

On OpenAI Tier 2 (gpt-5.5: 5,000 RPM / 450,000 TPM): 5,000 RPM is the binding constraint at this token shape, since 5,000 calls × 2,000 tokens = 10M tokens/min — well above TPM. So the cap is 5,000 calls/min = 83 calls/second. A burst of 100 concurrent users sending one message each, with the model taking ~5 seconds to respond, sits comfortably under the cap.

Same workload on Anthropic Tier 2 (Claude Sonnet 4.6: 1,000 RPM / 80,000 input TPM / 16,000 output TPM): 1,000 RPM ÷ 60 = 17 RPS. But input TPM is the real bind here — 1,000 calls × 1,500 input tokens = 1.5M input tokens, well above 80k TPM. The actual cap is 80,000 / 1,500 = 53 calls/min on input — far tighter than the headline 1,000 RPM. You either upgrade to Tier 3 or move the chatbot to a higher tier model with looser caps.

On Google Gemini Paid Tier 1 (Gemini 2.5 Pro: 1,000 RPM / 2,000,000 TPM): 2M TPM / 2k tokens per call = 1,000 calls/min — exactly matching RPM. Tier 1 sustains roughly 17 calls/second; sufficient for a small-to-mid app.

Plan for the binding constraint, not the headline number. TPM frequently caps before RPM on long-context workloads.


Worked example 2: batch jobs and concurrency

Reference workload: a one-shot enrichment of 1M records, each requiring a 500-token-in / 100-token-out classification call.

Synchronous on OpenAI Tier 4 (10,000 RPM / 2,000,000 TPM): 10k RPM ÷ 60 = 167 RPS. 1M calls / 167 RPS = ~100 minutes of sustained burst — or 1 hour 40 minutes if you can run flat-out. TPM at 600 tokens × 10k calls = 6M, well above the 2M TPM cap, so TPM is the bind. Real throughput: 2M TPM / 600 tokens = 3,333 calls/min, so 1M calls / 3,333 = 300 minutes = 5 hours.

Same job on the Batch API: submit 1M calls in a JSONL file, get results in up to 24 hours, at 50% off both input and output. No RPM or TPM concern — the batch queue handles throttling internally. Cost drops from $0.005 × 1M = $5,000 (gpt-5.4-mini standard) to $2,500.

For one-off enrichment passes, batch is almost always the right answer — same cost cut as a synchronous tier upgrade, simpler ops, no rate-limit engineering. For continuous ingestion, synchronous on a higher tier is usually right.


How rate limits scale with usage tier

OpenAI auto-promotes between tiers based on cumulative spend and account age. Tier 1 → 2 at $50 in 7+ days, Tier 2 → 3 at $100 in 7+ days, Tier 3 → 4 at $250 in 14+ days, Tier 4 → 5 at $1,000 in 30+ days. The progression is automatic; no support ticket needed.

Anthropic uses deposits rather than spend. Tier 1 → 2 at $40 deposit in 7+ days, Tier 2 → 3 at $200 in 14+ days, Tier 3 → 4 at $400 in 30+ days. For higher caps, contact sales for a custom plan.

Google Gemini uses cumulative spend on the paid tier. Free tier is harshly limited (10 RPM on Flash, 5 on Pro). Paid Tier 1 is enabled on billing setup. Paid Tier 2 at $250 cumulative in 30+ days. Tier 3 requires contacting sales or moving to Vertex AI.

The practical implication: a production deployment should sit on Tier 3+ within the first month. If you launch on Tier 1 or 2 and traffic spikes, you will hit caps and 429s before tier auto-promotion kicks in. The fastest way to skip the wait is to deposit the full amount upfront — most providers honor the higher tier within hours of detection.


What happens when you hit a limit

All major providers return HTTP 429 (Too Many Requests) when an RPM, TPM, or concurrency cap is exceeded. The response includes Retry-After in seconds, which is the suggested backoff before retrying. Honoring Retry-After is the difference between a graceful degradation and a cascading queue backlog.

Bad retry pattern: immediate retry without backoff. Causes the same call to fail repeatedly and amplifies the load on the provider's rate-limit system. Often triggers a temporary IP ban on aggressive retry storms.

Good retry pattern: exponential backoff with jitter. Start at the Retry-After value (or 1 second if missing), double on each retry up to a max (typically 60 seconds), add 0-25% random jitter to prevent thundering herd. Most production HTTP clients (the OpenAI SDK, anthropic SDK, google-generativeai SDK) implement this by default; verify it is enabled.

Better pattern: rate-limit awareness at the queue level. If you have 10,000 calls to make and a 5,000 RPM cap, spread them across 2+ minutes proactively rather than firing all 10k and letting half 429. Use a leaky-bucket or token-bucket rate limiter at your API client layer.

Best pattern at scale: a multi-tier fallback chain. Primary model on its own quota, secondary (cheaper) model on its own quota for overflow, batch queue for non-urgent traffic. When primary 429s, fall back to secondary; when secondary 429s, drop to batch.


Resilience patterns for hitting limits gracefully

Pattern 1: model fallback. Each model has independent quotas. When gpt-5.5 RPM caps, retry on gpt-5.4. When Claude Sonnet 4.6 caps, retry on Claude Haiku 4.5. Quality drops slightly but availability stays at 100%. Implement with a simple retry-on-429 router in your client.

Pattern 2: provider fallback. Cross-provider redundancy with the AI Gateway or Portkey or custom routing. Primary on OpenAI, secondary on Anthropic, tertiary on Gemini. When one provider has an outage or rate-limits, route to the next. Adds eval complexity (each provider's responses differ slightly) but eliminates single-provider risk.

Pattern 3: client-side throttling. Use a leaky-bucket rate limiter (e.g., aiolimiter in Python, bottleneck in Node) sized at 80% of your tier cap. Prevents bursting into 429s in the first place.

Pattern 4: spend-tier acceleration. If you are 6 days away from a tier promotion that would solve your rate issue, pre-deposit or make a one-time API call run to trip the promotion threshold faster.

Pattern 5: batch where possible. Anything not synchronous-user-facing belongs on the Batch API. Both OpenAI and Anthropic Batch endpoints have separate quota pools that do not affect your synchronous limits.

For the cost side of these patterns, see GPT vs Claude vs Gemini cost calculator, which compares fallback chains end-to-end.


Tier promotion: how to get higher limits fast

Method 1: spend through the threshold. The cheapest path: run real traffic to hit the cumulative-spend criterion. Burn the required dollar amount through legitimate workload over the required days. Most teams sit at the next tier within 30-60 days of launch.

Method 2: pre-deposit. Some providers (Anthropic) accept pre-deposits that count toward tier criteria immediately, accelerating promotion without waiting for usage to accumulate.

Method 3: contact sales. The fastest path for enterprise volume. OpenAI, Anthropic, Google, Mistral, and Together all have sales teams that can authorize custom higher-tier limits with a discussion of expected volume, use case, and term commitment. Lead time: typically days to weeks.

Method 4: dedicated endpoints. Together AI, Anthropic (via Bedrock), and Google (via Vertex AI) all offer reserved-capacity endpoints where rate limits effectively disappear in exchange for committed monthly capacity payments. Useful at sustained high volume with predictable load shapes.

Method 5: cross-account distribution. Some teams shard production traffic across multiple accounts (typically per-environment or per-feature). Each account gets its own quota. Be cautious — providers' terms of service usually prohibit using multiple accounts to evade caps; legitimate use cases (genuinely separate apps or environments) are fine.


Multi-region failover and the multi-cloud strategy for LLM rate limits

Rate-limit headroom is not a single number — it is a number per region per provider. Every major LLM provider exposes its flagship models through more than one endpoint, and each endpoint enforces its own independent RPM and TPM quota. A team running against only the default endpoint is leaving 2x to 3x of usable capacity on the table, often without realizing it. The multi-region pattern treats each regional endpoint as a parallel quota bucket and routes traffic across them with a failover policy.

Anthropic is the most flexible here. Claude is available on the direct Anthropic API, on AWS Bedrock in us-east-1, us-west-2, eu-west-1, eu-central-1, ap-southeast-1, ap-northeast-1, and several newer regions, and on Google Cloud Vertex AI in us-east5, europe-west1, and asia-southeast1. Each of those endpoints has a separate quota. A workload that hits the direct-API Tier 3 ceiling of 2,000 RPM can route overflow to Bedrock us-east-1 (separate per-account quota negotiated against AWS) and Vertex AI us-east5 (negotiated against GCP). The same underlying Claude Sonnet 4.6 model serves all three with the same prompt schema, so the eval-difference risk that exists in cross-provider fallback is effectively zero.

OpenAI is more constrained on the direct API — it presents one global endpoint with a single quota — but Azure OpenAI Service replicates GPT-5.x across regional deployments (East US, East US 2, West US, West US 3, North Central US, South Central US, North Europe, West Europe, Sweden Central, France Central, UK South, Japan East, Australia East, and others). Each Azure region has its own RPM and TPM quota assigned at deployment creation. A team blocked at OpenAI Tier 4's 10,000 RPM cap can deploy GPT-5.5 in three Azure regions at 3,000 RPM each and route between them, instantly adding 9,000 RPM of side-channel capacity without waiting for tier auto-promotion.

Google Gemini follows the same pattern through Vertex AI. The AI Studio API has one shared quota; Vertex AI publishes regional endpoints (us-central1, us-east1, us-east4, us-west1, europe-west1, europe-west4, asia-southeast1, asia-northeast1, and more), each with independent quotas configurable per project. Vertex AI quotas also tend to be higher than the AI Studio paid tier at the same spend level, so the migration is doubly worth it for high-volume workloads.

The math on a three-region setup rarely yields a perfect 3x. Imperfect load balancing — uneven traffic shapes, retry storms concentrating on the primary, region-pinned customers in regulated workloads — typically delivers a 2.6x to 2.8x effective multiplier on most realistic chatbot and ingestion workloads. Use 2.7x as a planning rule of thumb. A worked example: a chatbot at a 30,000 TPM ceiling per region, deployed primary in us-east-1, secondary in eu-west-1, tertiary in ap-southeast-1, sustains roughly 80,000 TPM aggregate before any region 429s. That is the equivalent of a full tier upgrade, achievable in hours rather than the 14 to 30 days a spend-based promotion would require, and with no minimum-deposit commitment.

Monitoring is the part teams underinvest in. Each region needs its own headroom dashboard, its own 429 rate alert, and its own retry budget tracked separately — aggregating across regions hides the region that is actually saturated. Tag every request with its target region at the client layer, log the regional rate-limit headers (Azure returns x-ratelimit-remaining-requests per deployment; Bedrock returns x-amzn-bedrock-quota-* headers; Vertex returns standard Google quota headers) into your observability stack, and graph each region as a separate series. The failover router should select the region with the highest remaining headroom rather than a fixed primary, which smooths utilization and pushes the effective multiplier closer to the theoretical 3x. For implementations on Vercel's AI Gateway, the regional routing logic can sit in a thin middleware layer in front of the gateway and pass through to the chosen endpoint.


Monitoring rate limit headroom

Most providers return rate-limit headers on every successful response. OpenAI: x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, x-ratelimit-reset-requests, x-ratelimit-reset-tokens. Anthropic: anthropic-ratelimit-requests-remaining, anthropic-ratelimit-tokens-remaining. Google: x-goog-api-client (less detailed; query the API for quota status).

Log these headers per request and build a dashboard showing rolling 1-minute and 5-minute headroom on RPM and TPM. When headroom regularly drops below 20% on a sustained basis, the tier is your real production cap; plan a promotion before traffic outgrows it.

Alert on three signals: 429 rate above 0.1% of total traffic, sustained sub-20% headroom for >5 minutes, and any 503 (concurrency) errors. Each signal indicates a different remediation: 429 = bump tier or smooth burst; sustained low headroom = tier upgrade required; 503 = lower concurrency in your client or upgrade to dedicated.

Cost monitoring should align: if your rate-limit dashboard shows you regularly bumping the TPM ceiling, you are at a tier where the marginal cost of upgrading is far smaller than the cost of dropped or delayed requests. For provider cost comparison at scale, see OpenAI API pricing and Anthropic Claude pricing.

Frequently Asked Questions

What is the difference between RPM and TPM?

RPM is requests per minute — how many API calls you can make. TPM is tokens per minute — total input + output tokens flowing through your account. TPM frequently caps before RPM on long-context workloads.

How do I increase my OpenAI rate limit?

OpenAI auto-promotes tiers based on cumulative spend: $50/7 days for Tier 2, $100/7 days for Tier 3, $250/14 days for Tier 4, $1,000/30 days for Tier 5. For higher limits, contact sales. Confirm current tier promotion criteria on OpenAI's rate limits page.

Why am I getting 429 errors?

A 429 means you hit one of three caps: requests per minute, tokens per minute, or concurrent requests. The error response includes Retry-After in seconds. Implement exponential backoff with jitter, honor Retry-After, and consider tier promotion or a rate-limiter on your client.

Does the Batch API have separate rate limits?

Yes. OpenAI and Anthropic Batch endpoints have separate quota pools that do not affect synchronous limits. You can run a large batch job without consuming any of your synchronous TPM or RPM headroom. Confirm against each provider's batch documentation.

What is the cheapest way to get higher rate limits?

Auto-tier-promotion via real spend is free — just keep using the API and the tier bumps automatically. Pre-depositing accelerates the timeline. For enterprise volume, dedicated endpoints (Together, Bedrock, Vertex) trade rate limits for capacity commitments.

Can I use multiple accounts to bypass rate limits?

Most providers' terms of service prohibit using multiple accounts to evade caps. Legitimate separation (per-environment, per-product) is fine; deliberately sharding to dodge limits is not. The right path is tier promotion or dedicated endpoints.

Do rate limits apply per model or across all models?

Per model on every major provider. Hitting your gpt-5.5 cap does not affect your gpt-5.4-mini or text-embedding-3-small headroom. This is the basis for model-fallback resilience patterns.

How do I monitor my rate limit headroom?

Most providers return rate-limit headers (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, etc.) on every response. Log them, build a rolling 1-minute and 5-minute headroom dashboard, alert below 20% sustained headroom. Bump tier before traffic outgrows the cap.

Does each AWS Bedrock or Azure OpenAI region have its own rate limit?

Yes. Bedrock quotas are set per AWS region and per model, so us-east-1 and eu-west-1 hold completely independent RPM and TPM caps for the same Claude model. Azure OpenAI quotas are assigned at deployment creation per region — East US, North Europe, Sweden Central, and so on each carry their own RPM and TPM. This is the basis for the multi-region failover pattern that effectively multiplies capacity without a tier promotion.

How much extra capacity does a multi-region setup actually deliver?

Plan for roughly 2.7x on a three-region deployment, not the theoretical 3x. Imperfect load balancing, retry concentration on the primary region, and region-pinned customers in regulated workloads cost about 10% of the headline number. For a workload capped at 30,000 TPM per region, expect to sustain about 80,000 TPM aggregate before any single region begins returning 429s.

Is Claude available on AWS Bedrock and Google Vertex AI with separate quotas?

Yes. Anthropic distributes Claude on the direct Anthropic API, AWS Bedrock (us-east-1, us-west-2, eu-west-1, eu-central-1, ap-southeast-1, ap-northeast-1, and others), and Google Cloud Vertex AI (us-east5, europe-west1, asia-southeast1). Each endpoint enforces its own RPM and TPM quota — and the model behavior is identical across them, so cross-endpoint fallback carries effectively zero eval drift.

Get the 2026 rate-limit cheat sheet

One-page PDF with every provider's tier-by-tier RPM, TPM, and promotion criteria — free, no signup gate.

Browse all prompt tools →