By The DDH Team · Digital Dashboard Hub

Gemini API Rate Limits 2026: Free Tier, Paid Tiers, Per-Model Quotas

By The DDH Team at Digital Dashboard Hub · Updated June 20, 2026

Google's Gemini API runs two parallel quota systems and most teams hit production without understanding which one applies to them. **Google AI Studio** (ai.google.dev / aistudio.google.com) uses a four-tier ladder — Free, Tier 1, Tier 2, Tier 3 — gated on cumulative paid usage and days since first payment. **Vertex AI** (cloud.google.com/vertex-ai) is a separate path with GCP project-level quotas requested through the Cloud Console. Same underlying models, different rate-limit reality.

As of June 2026, the AI Studio tier thresholds are: **Free** (active project, rate-limited but no payment required), **Tier 1** (billing account linked and active, no minimum spend), **Tier 2** ($250 cumulative paid + 3 days from first successful payment, with monthly cap raised to ~$2,000), **Tier 3** ($1,000 cumulative paid + 30 days from first successful payment, with monthly cap raised to ~$20,000-$100,000+ depending on use case). The per-model ceilings — RPM, TPM, RPD — scale with the tier and are live-visible at aistudio.google.com/rate-limit for the logged-in account.

Below: the canonical per-tier table for Gemini 2.5 Flash and 2.5 Pro, then 10 sections covering AI Studio vs Vertex AI, the free tier reality, the Tier 1 unlock (billing on, no minimum), Tier 2/3 thresholds and their 3-day and 30-day waits, per-model quota breakdown, thinking-mode + image-gen + context-caching impact on TPM, 429 + 503 handling, the Vertex AI path past Tier 3 ceilings, Batch API + Context Caching as cost-reduction levers, and sourcing with a live-verify checklist. For broader cost modeling see our OpenAI API cost calculator, Claude API cost calculator, and the sibling OpenAI Tier 5 unlock requirements page.

Gemini API rate limits by tier — June 2026

Tier	2.5 Flash RPM	2.5 Pro RPM	2.5 Pro RPD
Free	10 RPM	5 RPM	100 RPD
Tier 1	1,000 RPM	150 RPM	10,000 RPD
Tier 2	2,000 RPM	1,000 RPM	50,000 RPD
Tier 3	10,000 RPM	2,000 RPM	Unlimited

Source, as of June 2026: Google AI for Developers rate-limits documentation (https://ai.google.dev/gemini-api/docs/rate-limits) and the live account view at https://aistudio.google.com/rate-limit. Tier qualification: Free = active project, Tier 1 = billing enabled, Tier 2 = $250 paid + 3 days, Tier 3 = $1,000 paid + 30 days. Per-model RPM / RPD values shown are typical published ceilings for the indicated tier; the live account page reflects any account-specific adjustments and is authoritative for billing decisions. Gemini 2.5 Flash-Lite ceilings are materially higher than 2.5 Flash on every tier (e.g., 4,000 RPM at Tier 1, 30,000 RPM at Tier 3). Limits are applied per project, not per API key, so creating extra keys in the same project does not add headroom.

AI Studio quotas vs Vertex AI quotas — two different systems, same models

The single most expensive misunderstanding among teams shipping Gemini in 2026: assuming that hitting Tier 3 on Google AI Studio gives them the same effective throughput as a properly-quota'd Vertex AI project. It does not — they are two separate quota systems, with different request paths, different SDKs, different billing surfaces, and different ceilings.

**Google AI Studio** (the ai.google.dev / aistudio.google.com surface, called via the `google-genai` SDK or the `generativelanguage.googleapis.com` REST endpoint) uses the four-tier ladder. Quota is applied per project, and the tier follows the Cloud billing account linked to that project. Promotion is automatic against cumulative paid spend + time. Region routing is opaque: Google picks the data-plane region.

**Vertex AI** (the cloud.google.com/vertex-ai surface, called via the `google-cloud-aiplatform` SDK or the Vertex REST endpoint) uses GCP project quotas. Quota is project-level and regional — you request `generate_content_requests_per_minute_per_project_per_base_model` for `us-central1` separately from `europe-west4`. There is no tier ladder. Default per-project quotas are typically generous (60 RPM on Gemini 2.5 Pro in most regions at project creation), and increases are requested through the Cloud Console quota page or through your Google Cloud account team.

Which surface to use: **AI Studio** if you are prototyping, building a SaaS feature with moderate traffic, or want the simplest billing relationship. **Vertex AI** if you need data-residency controls, VPC-SC perimeters, customer-managed encryption keys (CMEK), HIPAA / FedRAMP compliance, or throughput above what Tier 3 AI Studio comfortably provides. Most production teams above ~50M tokens/day end up on Vertex AI for the quota headroom alone.

The Gemini free tier is genuinely usable — not a 30-day trial

Unlike most LLM provider free tiers in 2026, Google's Gemini Free tier is a **permanent** rate-limited tier, not a credit-burn trial. You can run an active project against Gemini 2.5 Flash at **10 RPM and ~250k TPM** indefinitely without ever adding a payment method. There is no 30-day clock, no credit balance counting down to zero.

Real use cases that fit comfortably in the Free tier: internal CRUD-app summarization (a customer-success agent generating 200-300 prompts/day), a hobby chatbot serving 50-100 users with moderate engagement, a research / academic project with intermittent batch needs, a small-team coding assistant. Free tier also covers **Gemini 2.5 Pro** at a much tighter ceiling — **5 RPM, 100 RPD** — enough for evaluation runs and quality comparisons but not for production traffic.

The catch: **Free tier data is used to improve Google's products** (the standard ai.google.dev terms apply). For any application handling user PII, regulated data, or proprietary content, move to a paid tier even if your usage fits the free ceiling; paid tiers (Tier 1+) carry the data-not-used-for-training commitment. Other providers draw their own lines between free and paid data terms, so check each provider's current policy rather than assuming parity; Google is at least explicit about which surface carries which terms.

Free is also rate-limited at the *minute* and *day* level. Hitting **10 RPM** on Gemini 2.5 Flash is fine if you pace requests at one every 6 seconds; if you burst all 10 in the first second, any further request in that minute returns a 429. The **100 RPD** ceiling on Gemini 2.5 Pro is the harder limit for evaluation work: you cannot run a 500-prompt eval against Pro on Free in a single day.
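A minimal sketch of client-side pacing for the 10 RPM case, assuming a single-process client. `MinIntervalPacer` is a hypothetical helper, not part of any Google SDK, and even spacing is only one way to stay under a per-minute ceiling. The clock and sleep functions are injectable so the logic can be checked without actually waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalPacer:
    """Hypothetical helper: spaces calls so at most `rpm` start per minute."""

    def __init__(self, rpm: int,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if rpm <= 0:
            raise ValueError("rpm must be positive")
        self.interval = 60.0 / rpm  # 10 RPM -> one call every 6 seconds
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None  # earliest start of next call

    def wait(self) -> float:
        """Block until the next call may start; return the seconds slept."""
        now = self._clock()
        start = now if self._next_allowed is None else max(now, self._next_allowed)
        delay = start - now
        if delay > 0:
            self._sleep(delay)
        self._next_allowed = start + self.interval
        return delay
```

Call `pacer.wait()` immediately before each request. A burst of calls then gets spread across the minute instead of tripping the ceiling on request 11.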

Tier 1: link a billing account, no minimum spend required

The Tier 1 unlock on Gemini is the cleanest 'first paid step' across all major LLM providers in 2026. The requirement: **a linked, active Google Cloud billing account**. There is no $5 minimum payment, no spend threshold, no waiting period. The moment you link billing and accept the paid-API terms, your account is Tier 1 and ceilings jump by roughly 30x to over 100x depending on the model and the limit (30x on 2.5 Pro RPM, 100x on 2.5 Flash RPM and 2.5 Pro RPD).

Concrete jump: Gemini 2.5 Flash goes from **10 RPM (Free) → 1,000 RPM (Tier 1)**. Gemini 2.5 Pro goes from **5 RPM → 150 RPM**, and the daily Pro ceiling jumps from 100 RPD to **10,000 RPD**. For most teams shipping a production feature this is the only tier you need for the first 3-6 months.

Tier 1 also comes with a **monthly spend cap of $250** by default. The cap is a soft guard against runaway bills; you can raise it from the billing console once you've established usage patterns. Critically, Tier 1 carries the paid-API terms (no training on your data), so this is the minimum tier for any production app handling user content.

Step-by-step to Tier 1: (1) go to aistudio.google.com/api-keys → 'Set up Billing'; (2) link an existing Google Cloud billing account or create a new one; (3) accept the paid-API terms; (4) your default project is now Tier 1 — confirm at aistudio.google.com/rate-limit. Promotion is immediate, not delayed.

Tier 2 + Tier 3: the $250 / $1,000 thresholds and the 30-day wait

**Tier 2** unlocks at **$250 in cumulative paid usage + 3 days from first successful payment**. Both conditions must clear. The throughput gains are real: Gemini 2.5 Flash moves from 1,000 RPM (Tier 1) to **2,000 RPM**, Gemini 2.5 Pro moves from 150 RPM to **1,000 RPM**, and the Pro daily ceiling jumps from 10,000 to **50,000 RPD**. The monthly spend cap rises to **$2,000**.

**Tier 3** unlocks at **$1,000 in cumulative paid usage + 30 days from first successful payment**. This is the production tier. Gemini 2.5 Flash hits **10,000 RPM**, Gemini 2.5 Pro hits **2,000 RPM**, and the Pro daily request ceiling effectively becomes **unlimited**. The monthly spend cap rises to **$20,000-$100,000+** depending on workload (the high end requires a brief account-team review).

The **30-day clock** on Tier 3 starts at your *first successful payment*, not at signup and not at the largest payment. Both conditions must hold, so promotion happens on whichever clears last. A team whose first payment clears in April, lets the account sit, then runs $1,200 of inference in early June has already served its 30 days by May; it promotes as soon as cumulative paid usage crosses $1,000 in June. The more common case runs the other way: most teams hit the $1,000 spend faster than the 30 days and end up waiting out the calendar.
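The two-condition rule can be sketched as a date calculation. This is illustrative only: `earliest_promotion` is a hypothetical helper, the example dates are invented, and the exact day-boundary behavior of Google's promotion job is an assumption, not something the rate-limits page specifies.

```python
from datetime import date, timedelta
from typing import Optional


def earliest_promotion(first_payment: date,
                       threshold_crossed: Optional[date],
                       wait_days: int) -> Optional[date]:
    """Earliest date on which both the spend and the time condition hold.

    Returns None while the cumulative spend threshold has not been crossed.
    """
    if threshold_crossed is None:
        return None
    clock_done = first_payment + timedelta(days=wait_days)
    # Promotion waits for whichever condition clears last.
    return max(clock_done, threshold_crossed)


# Tier 3, spend-bound: first payment in April, $1,000 crossed in June.
print(earliest_promotion(date(2026, 4, 10), date(2026, 6, 5), 30))  # 2026-06-05

# Tier 3, calendar-bound: $1,000 spent within the first week.
print(earliest_promotion(date(2026, 6, 1), date(2026, 6, 8), 30))   # 2026-07-01
```

The same function covers Tier 2 with `wait_days=3`.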

Same pattern as OpenAI's Tier 5 wait: there is no shortcut on the standard path. Three legitimate workarounds: (a) negotiate enterprise quota with the Google Cloud account team before day 30; (b) front your traffic through a **Vertex AI** project, which has its own quota system independent of AI Studio tiers; (c) parallelize across multiple billing-isolated projects if your architecture supports it. Options (b) and (c) are far more common than (a) for mid-market teams.

Per-model rate limits: 2.5 Pro is the most-constrained, Flash-Lite is the most generous

Gemini's three 2.5-family models have materially different rate-limit profiles. **Gemini 2.5 Pro** — the highest-quality, longest-thinking model — has the tightest ceilings on every tier (5 RPM Free → 2,000 RPM Tier 3). **Gemini 2.5 Flash** — the production workhorse balancing quality, speed, and cost — sits in the middle (10 RPM Free → 10,000 RPM Tier 3). **Gemini 2.5 Flash-Lite** — the lowest-latency, lowest-cost model — has the most generous ceilings (typically 30 RPM Free → 30,000 RPM Tier 3, with proportionally higher TPM).

TPM ceilings track RPM but at different model-specific rates. Gemini 2.5 Pro is allotted high TPM per request (because Pro is typically called with long contexts and reasoning outputs) — on Tier 1 typically **2M TPM**, scaling to **8M TPM** at Tier 3. Gemini 2.5 Flash gets **1M TPM** on Tier 1 → **4M TPM** at Tier 3. Live values appear at aistudio.google.com/rate-limit for your account.

**RPD (requests per day)** is the silent ceiling. Many teams plan capacity against RPM, deploy, then discover their Pro evaluation job is rate-limited not by the per-minute ceiling but by the per-day cap. A single eval run of 8,000 prompts against Gemini 2.5 Pro on **Tier 1's 10,000 RPD** is fine; the same job at 12,000 prompts is not. Plan against the binding constraint for *your* workload.

Which ceiling binds first depends on workload shape: **interactive / chat** workloads typically hit RPM first (bursty, short contexts). **Batch / eval / summarization** workloads typically hit TPM first (sustained, long contexts). **Long-running evaluation** workloads typically hit RPD first. The right answer is to instrument all three and watch which utilization curve hits 80% earliest.
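As a rough sketch of that instrumentation, the snippet below computes utilization against all three ceilings and reports the highest. The limits are the Tier 1 Gemini 2.5 Pro figures quoted in this article; the three workload shapes are invented for illustration, and `Limits`, `utilization`, and `binding_constraint` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Limits:
    rpm: int  # requests per minute
    tpm: int  # tokens per minute
    rpd: int  # requests per day


def utilization(limits: Limits, peak_requests_per_min: float,
                avg_tokens_per_request: float,
                requests_per_day: float) -> Dict[str, float]:
    """Fraction of each ceiling a workload would consume at peak."""
    return {
        "RPM": peak_requests_per_min / limits.rpm,
        "TPM": peak_requests_per_min * avg_tokens_per_request / limits.tpm,
        "RPD": requests_per_day / limits.rpd,
    }


def binding_constraint(u: Dict[str, float]) -> str:
    """Name of the ceiling with the highest utilization."""
    return max(u, key=u.get)


TIER1_PRO = Limits(rpm=150, tpm=2_000_000, rpd=10_000)

chat = utilization(TIER1_PRO, 120, 1_500, 5_000)       # bursty, short contexts
summaries = utilization(TIER1_PRO, 20, 90_000, 6_000)  # sustained, long contexts
evals = utilization(TIER1_PRO, 10, 2_000, 12_000)      # long-running eval job

print(binding_constraint(chat))       # RPM
print(binding_constraint(summaries))  # TPM
print(binding_constraint(evals))      # RPD
```

Any value above 1.0 means the workload does not fit the tier; the 80% mark mentioned above is a reasonable alerting threshold.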

Thinking mode, image generation, and context caching change the TPM math

**Gemini 2.5 Pro and 2.5 Flash both support thinking mode** — the model generates internal reasoning tokens before producing the final answer. Thinking tokens are billed and they count against your TPM. A 'simple' Gemini 2.5 Pro call with a high thinking budget can consume **10-30x** the tokens of the visible prompt + response, exhausting TPM far faster than the RPM ceiling.

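Practical impact: a Tier 1 account with **2M TPM** on Gemini 2.5 Pro that issues 150 RPM (the Tier 1 RPM ceiling) at an average 2,000 visible tokens per call + 8,000 thinking tokens per call consumes 150 × 10,000 = ~1.5M TPM = 75% utilization. The same workload with no thinking tokens would be 150 × 2,000 = ~300k TPM = 15% utilization. Treat that as a floor: Google's thinking documentation lets 2.5 Flash run with a budget of 0, while 2.5 Pro enforces a small minimum budget and cannot disable thinking entirely. Set `thinkingConfig.thinkingBudget` explicitly for production workloads rather than relying on the default, which lets the model decide how much to think.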
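The arithmetic can be checked in a few lines. The figures are the ones from the paragraph above; `max_thinking_budget` is a hypothetical helper that inverts the same formula to find the largest per-call thinking allowance that keeps TPM under a target utilization.

```python
def tokens_per_minute(rpm: int, visible_tokens: int, thinking_tokens: int) -> int:
    """Tokens consumed per minute when every call uses the given averages."""
    return rpm * (visible_tokens + thinking_tokens)


def max_thinking_budget(tpm_limit: int, rpm: int, visible_tokens: int,
                        target_utilization: float = 0.8) -> int:
    """Largest per-call thinking allowance that stays under the target."""
    per_call_allowance = int(tpm_limit * target_utilization / rpm)
    return max(per_call_allowance - visible_tokens, 0)


TPM_LIMIT = 2_000_000  # Tier 1 Gemini 2.5 Pro figure quoted in this article

with_thinking = tokens_per_minute(150, 2_000, 8_000)
without_thinking = tokens_per_minute(150, 2_000, 0)

print(with_thinking, with_thinking / TPM_LIMIT)        # 1500000 0.75
print(without_thinking, without_thinking / TPM_LIMIT)  # 300000 0.15
print(max_thinking_budget(TPM_LIMIT, 150, 2_000))      # 8666
```

At the full 150 RPM, an 80% TPM target leaves room for roughly 8,600 thinking tokens per call on top of 2,000 visible tokens.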

**Image generation** (Gemini 2.5 Flash Image, the native image-out mode) bills generated images as output tokens at a fixed token cost per image. A 1024×1024 image bills as ~1,290 output tokens. High-volume image generation will saturate TPM faster than RPM — plan against the TPM ceiling, not the RPM ceiling, when image generation is in the call path.

**Context caching** (the `cached_content` API) goes the other way: it *reduces* effective TPM consumption for repeated long-context calls by 50-80% in steady state. Cached input tokens bill at ~25% of standard input price and, critically, do not re-consume the cached content's tokens against your TPM ceiling on subsequent calls. For a 500k-token system prompt called 1,000 times/day, context caching is the difference between Tier 3 throughput being sufficient and Tier 3 throughput being exhausted. See Google's context-caching guide for the API surface.

429s and 503s: how Gemini signals rate-limit and capacity events

When you exceed any Gemini rate-limit ceiling (RPM, TPM, or RPD), the API returns **HTTP 429** with a structured error body: `{ "error": { "code": 429, "status": "RESOURCE_EXHAUSTED", "message": "..." } }`. The `RESOURCE_EXHAUSTED` status is the canonical Google API rate-limit signal — clients (the official `google-genai` SDK and any well-built community client) should retry with exponential backoff on this status.

Distinguish 429 from **HTTP 503 UNAVAILABLE** — 503 indicates Google-side capacity shedding (the model is temporarily overloaded across all customers, independent of your tier). 503 is most common on Gemini 2.5 Pro during peak hours (US daytime), especially in the first few weeks after a model release. The retry strategy differs: 429 means your traffic shape is the problem (slow down); 503 means Google's traffic shape is the problem (retry with longer backoff and maybe fall back to Gemini 2.5 Flash).
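A small sketch of telling the two signals apart from a raw HTTP response, assuming the error body follows the structured shape shown earlier. `classify_error` and its return labels are hypothetical; an SDK will normally surface this as a typed exception instead.

```python
import json


def classify_error(http_status: int, body: str) -> str:
    """Map an error response to 'rate_limited', 'capacity', or 'other'."""
    try:
        status = json.loads(body).get("error", {}).get("status", "")
    except (json.JSONDecodeError, AttributeError):
        status = ""  # body was not the expected JSON object
    if http_status == 429 or status == "RESOURCE_EXHAUSTED":
        return "rate_limited"  # your traffic shape: slow down
    if http_status == 503 or status == "UNAVAILABLE":
        return "capacity"      # provider-side load: back off longer, consider fallback
    return "other"


body = '{"error": {"code": 429, "status": "RESOURCE_EXHAUSTED", "message": "..."}}'
print(classify_error(429, body))  # rate_limited
```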

Production retry pattern: on 429, exponential backoff at 1s / 2s / 4s / 8s / 16s with ±25% jitter, capped at 60s. On 503, exponential backoff at 5s / 15s / 45s with ±50% jitter, capped at 5 minutes, with automatic fallback to a lighter model after 2 failed retries. The official `google-genai` SDK exposes configurable retry options; confirm which defaults your SDK version applies before relying on them, and if you have built a custom client, mirror the SDK's behavior rather than inventing your own.
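For a custom client, the schedule above might be sketched as follows. This is not the SDK's implementation: `ApiError`, `POLICIES`, `backoff_delay`, and `call_with_retry` are hypothetical names, and the request functions are plain callables you would wrap around your own HTTP or SDK call.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ApiError(Exception):
    """Hypothetical error type carrying the HTTP status code."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(f"{status}: {message}")
        self.status = status


# Backoff parameters per status, taken from the schedule described above.
POLICIES = {
    429: {"base": 1.0, "factor": 2.0, "cap": 60.0, "jitter": 0.25, "retries": 5},
    503: {"base": 5.0, "factor": 3.0, "cap": 300.0, "jitter": 0.50, "retries": 3},
}


def backoff_delay(status: int, attempt: int, rng: random.Random) -> float:
    """Delay before retry number `attempt` (0-based), jittered and capped."""
    p = POLICIES[status]
    nominal = min(p["base"] * p["factor"] ** attempt, p["cap"])
    jittered = nominal * (1.0 + rng.uniform(-p["jitter"], p["jitter"]))
    return min(jittered, p["cap"])


def call_with_retry(primary: Callable[[], T],
                    fallback: Optional[Callable[[], T]] = None,
                    sleep: Callable[[float], None] = time.sleep,
                    rng: Optional[random.Random] = None) -> T:
    """Run `primary`, retrying on 429/503; switch to `fallback` on repeated 503s."""
    rng = rng or random.Random()
    call = primary
    failures = {status: 0 for status in POLICIES}
    while True:
        try:
            return call()
        except ApiError as err:
            policy = POLICIES.get(err.status)
            if policy is None:
                raise  # not a rate-limit or capacity error
            attempt = failures[err.status]
            if attempt >= policy["retries"]:
                raise  # retry budget exhausted
            failures[err.status] += 1
            # Two retries have already failed on 503: use the lighter model.
            if err.status == 503 and attempt == 2 and fallback is not None:
                call = fallback
            sleep(backoff_delay(err.status, attempt, rng))
```

Pass a `fallback` that issues the same request against a lighter model, for example Gemini 2.5 Flash when the primary is 2.5 Pro. Injecting `sleep` and `rng` keeps the function testable without real delays.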

**Streaming requests** can fail mid-stream with `RESOURCE_EXHAUSTED` if the response would push you over TPM during generation. The SDK surfaces this as a structured error on the stream iterator — handle it the same way as a request-time 429. Streaming does not give you any rate-limit relief; the TPM accounting is identical.

Vertex AI as the path past Tier 3 ceilings

Once you hit Tier 3 on AI Studio and still need more throughput — typically above 10k RPM on Gemini 2.5 Flash or 2k RPM on Gemini 2.5 Pro — the canonical next step is to move workload onto **Vertex AI**. Same underlying models (Gemini 2.5 Pro, Flash, Flash-Lite all available on Vertex), same per-token pricing, but a completely separate quota system that does not honor your AI Studio tier.

Vertex quotas are **per-project, per-region, per-base-model**. Default quotas at project creation are usually generous (60 RPM on Gemini 2.5 Pro in `us-central1`, comparable on Flash) but the real upside is the quota-increase process: file a quota request via console.cloud.google.com/iam-admin/quotas, filter to `aiplatform.googleapis.com`, and request the specific metric (e.g., `generate_content_requests_per_minute_per_project_per_base_model`) at the new ceiling.

Quota increases typically process in **24-72 hours** for reasonable amounts (2-5x default) and **5-10 days** for large jumps (10x+). For very large requests, the Google Cloud account team reviews and approves — having an account team relationship accelerates this materially. Enterprise customers with **Provisioned Throughput** (Google's reserved-capacity SKU) bypass the standard quota path entirely and get guaranteed RPM / TPM with SLA.

Vertex AI also unlocks compliance surface that AI Studio does not have: **VPC-SC perimeters**, **CMEK**, **HIPAA-eligible** workloads, **FedRAMP High** regions for US public sector, **EU data residency** via the `europe-*` regions. For any team where compliance is in scope, Vertex is not optional — it is the only path.

Batch API and Context Caching: the two cost-reduction levers that don't require a tier change

Google offers two production-grade cost levers on Gemini that work at every paid tier (Tier 1+) and that most teams under-use. Both materially reduce effective cost-per-task without changing the model or the tier.

**Gemini Batch API** runs asynchronous jobs at **50% off** both input and output token prices, with a 24-hour completion window. Batch jobs do not consume your real-time RPM / TPM budget — they run against a separate enqueued-token ceiling that scales by tier (Tier 1 = 5M / 3M / 10M enqueued tokens for 2.5 Pro / 2.5 Flash / 2.5 Flash-Lite respectively; Tier 3 = 1B enqueued tokens across all three models). Use Batch for evaluation runs, dataset generation, weekly classification batches, or any non-interactive workload — see Google's Batch API documentation.

**Context Caching** (the `cached_content` API) lets you upload a large reusable context (system prompt, document corpus, code repository) once and reference it from subsequent calls at **~25% of standard input pricing** for the cached portion. Cache TTL is configurable (default 1 hour, extendable). For workloads with a stable large prompt prefix called many times per hour, context caching cuts effective input cost by 60-80% and reduces TPM consumption proportionally — see Google's context-caching guide.
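To see where the 60-80% range comes from, here is a back-of-envelope sketch that takes this article's ~25% cached-token price at face value. It ignores cache storage fees and cache-creation cost, both of which are assumptions you should replace with the figures from Google's pricing page. `cached_input_cost_ratio` is a hypothetical helper.

```python
def cached_input_cost_ratio(cached_tokens: int, fresh_tokens: int,
                            cached_price_ratio: float = 0.25) -> float:
    """Input cost with caching divided by input cost without it.

    Assumes cached tokens bill at `cached_price_ratio` of the standard
    input price and ignores storage fees (illustrative only).
    """
    with_cache = cached_tokens * cached_price_ratio + fresh_tokens
    without_cache = cached_tokens + fresh_tokens
    return with_cache / without_cache


# 200k-token cached prefix plus 50k fresh tokens per call: 60% cheaper input.
print(cached_input_cost_ratio(200_000, 50_000))  # 0.4

# 500k-token cached prefix plus 2k fresh tokens per call: about 75% cheaper.
print(round(cached_input_cost_ratio(500_000, 2_000), 3))  # 0.253
```

The saving approaches 75% as the cached prefix dominates the prompt and shrinks as the fresh portion grows, which is why cache hit rate and prefix stability matter more than the headline discount.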

Combined: a team running Gemini 2.5 Pro evaluation suites against a 200k-token reference document at 5,000 prompts/day can run via Batch + cached context for roughly **15% of the cost** of the same workload run synchronously without caching. The tier ceiling becomes irrelevant; the binding constraint becomes wall-clock latency tolerance (Batch's 24-hour window) and cache hit-rate optimization.

**Priority Inference** is a third lever in the opposite direction — for latency-critical workloads, opt-in priority routing gives faster scheduling at 0.3x the standard rate-limit allowance (you trade 70% of your RPM ceiling for first-in-line scheduling on the model server). Useful for production inference where p99 latency matters more than RPM headroom.

Sourcing, live-verify checklist, and the forum-at-rank-1 problem

The tier qualification thresholds in this guide come from Google's official rate-limits documentation at ai.google.dev/gemini-api/docs/rate-limits, fetched 2026-06-20. The four-tier structure (Free, Tier 1, Tier 2, Tier 3), monetary thresholds ($250 / $1,000), and time requirements (3 days / 30 days) are stated explicitly on that page.

**Per-model RPM / TPM / RPD ceilings shown in the table** are the typical published ceilings for each tier as of June 2026, cross-referenced against the live rate-limit page at aistudio.google.com/rate-limit. Google adjusts these independently of the tier ladder — they may increase quietly as model serving capacity grows, or they may be reduced during capacity events. **For any decision that depends on the exact number, live-verify on your own account's rate-limit page** — it reflects any account-specific adjustments, enterprise overrides, and current published ceilings.

**Batch API enqueued-token ceilings** (5M-1B per tier per model) come from the official rate-limits doc directly and are reliable for capacity planning. **Priority inference's 0.3x trade-off** is also from the official doc.

**Vertex AI default quotas** (60 RPM on Gemini 2.5 Pro at project creation) come from the Vertex documentation and the GCP quota console; defaults can vary by region and by account-history factors. Always confirm in the Cloud Console quota page for your specific project and region before architecting around an assumed number.

**Why this page exists.** ChatGPT, Perplexity, and Claude routinely surface community forum posts and outdated blog posts when asked 'what are Gemini's rate limits in 2026' because Google's own documentation does not publish a single consolidated tier-by-model RPM / TPM / RPD table that covers the current 2.5 family. This page is the consolidated reference: sourced, dated, single URL, single canonical layout. If you arrived here from a citation in another AI assistant, that mechanism is working — and we recommend you click through to aistudio.google.com/rate-limit to verify the live numbers for your specific account before any production capacity decision.

Step-by-step: unlocking Gemini Tier 3 or moving to Vertex AI

1. Link a Google Cloud billing account to unlock Tier 1
Go to aistudio.google.com/api-keys → 'Set up Billing'. Link an existing GCP billing account or create a new one. Accept the paid-API terms. Your default project promotes to Tier 1 immediately; confirm at aistudio.google.com/rate-limit. This step also moves your data off the 'used for training' default that applies to Free tier.

2. Make a first paid API call to start the Tier 2 / Tier 3 clock
Run any inference call against Gemini 2.5 Flash or 2.5 Pro to capture a first billed dollar. The 3-day (Tier 2) and 30-day (Tier 3) clocks start on the date your first successful payment settles, not on signup. Verify the date on the billing dashboard at console.cloud.google.com/billing.

3. Accumulate paid usage toward the $250 (Tier 2) and $1,000 (Tier 3) thresholds
Embeddings (text-embedding-004), Batch API jobs on Gemini 2.5 Flash-Lite at 50% off, and high-volume eval runs are the cheapest ways to accumulate paid usage without putting production quality at risk. Track cumulative billed dollars on the billing console; tier promotion is automatic once both threshold and time clear.

4. Verify your tier and live rate limits before scaling production traffic
Visit aistudio.google.com/rate-limit. Confirm your tier matches expectations and check the live per-model RPM / TPM / RPD for Gemini 2.5 Pro, 2.5 Flash, and 2.5 Flash-Lite. Identify which ceiling will bind first for your workload shape (interactive → RPM, batch → TPM, eval → RPD). Instrument utilization against all three.

5. Move to Vertex AI for throughput above Tier 3 or for compliance requirements
Create a GCP project, enable the Vertex AI API, and route traffic via the google-cloud-aiplatform SDK against your chosen region (us-central1 for default Gemini 2.5 availability). File quota increases via console.cloud.google.com/iam-admin/quotas → aiplatform.googleapis.com for any specific metric you need raised. Vertex unlocks VPC-SC, CMEK, HIPAA, FedRAMP, and EU data residency that AI Studio does not offer.

Frequently Asked Questions

Is the Gemini API free tier actually usable in 2026, or is it a trial?

It is a permanent rate-limited tier, not a 30-day trial. You can run an active project against Gemini 2.5 Flash at 10 RPM and Gemini 2.5 Pro at 5 RPM / 100 RPD indefinitely without ever adding a payment method. The trade-off: Free tier data is used to improve Google's products, so for any app handling user PII, regulated data, or proprietary content, move to Tier 1+ (which carries the data-not-used-for-training commitment).

What does it take to unlock Gemini Tier 1 in 2026?

Link an active Google Cloud billing account and accept the paid-API terms at aistudio.google.com/api-keys. There is no minimum spend, no 30-day wait, no $5 first payment requirement. Promotion to Tier 1 is immediate, with throughput on Gemini 2.5 Flash jumping from 10 RPM (Free) to 1,000 RPM, and Gemini 2.5 Pro from 5 RPM to 150 RPM.

When should I use Vertex AI instead of Google AI Studio for Gemini?

Three triggers: (1) throughput needs above what AI Studio Tier 3 provides (typically above 10k RPM on Flash or 2k RPM on Pro); (2) compliance requirements that need VPC-SC, CMEK, HIPAA, FedRAMP, or EU data residency; (3) integration with an existing GCP workload (IAM, audit logs, billing rollup into a Cloud account). Vertex uses a separate quota system — project-level and regional — that does not honor your AI Studio tier.

Which Gemini rate limit binds first — RPM, TPM, or RPD?

It depends on workload shape. Interactive / chat workloads (bursty, short contexts) typically hit RPM first. Batch / summarization workloads (sustained, long contexts) hit TPM first. Long-running evaluation jobs hit RPD first — especially on Gemini 2.5 Pro where Free is 100 RPD and Tier 1 is 10,000 RPD. Instrument all three and watch which utilization curve hits 80% earliest for your traffic.

Why is Gemini 2.5 Pro more rate-limited than 2.5 Flash or Flash-Lite?

Pro is the highest-cost, longest-thinking model and Google rations server-side capacity more tightly. As of June 2026, Gemini 2.5 Pro caps at 2,000 RPM at Tier 3 vs Gemini 2.5 Flash at 10,000 RPM and Gemini 2.5 Flash-Lite at ~30,000 RPM. For high-volume production traffic where Pro quality is not strictly required, Flash is the right default; reserve Pro for the calls where reasoning quality moves a user-visible metric.

Does Gemini's thinking mode count against my TPM ceiling?

Yes — thinking tokens are billed and counted against TPM exactly like visible output tokens. A Gemini 2.5 Pro call with a high thinking budget can consume 10-30x the tokens of the visible prompt + response. Set thinkingConfig.thinkingBudget explicitly for production workloads; the default is high enough to materially affect TPM utilization at scale.

How much does context caching reduce my Gemini bill and rate-limit pressure?

Cached input tokens bill at roughly 25% of standard input price and do not re-consume the cached content's tokens against your TPM ceiling on subsequent calls. For a 500k-token system prompt called 1,000 times/day, that is the difference between Tier 3 throughput being sufficient and Tier 3 throughput being exhausted. Combined with Batch API (50% off, separate quota), production teams routinely run at 15-25% of the naive cost of the same workload.

What HTTP response does Gemini return when I exceed a rate limit?

HTTP 429 with status RESOURCE_EXHAUSTED in the structured error body. Separate from that, HTTP 503 UNAVAILABLE signals Google-side capacity shedding (most common on Gemini 2.5 Pro during US daytime peaks). Production pattern: exponential backoff with jitter on 429 (1s/2s/4s/8s/16s, capped at 60s); longer backoff with fallback to Gemini 2.5 Flash on 503 (5s/15s/45s, capped at 5 minutes). The official google-genai SDK exposes configurable retry options; confirm which defaults your SDK version applies before relying on them.
