By The DDH Team · Digital Dashboard Hub

Azure OpenAI Quota Management 2026: TPM, PTU, Regional Caps & Increase Requests

**Azure OpenAI quota is not the same product as OpenAI direct rate limits.** OpenAI's tier ladder is account-level — once you hit Tier 5 ($1,000 paid + 30 days), every model and every region inherits the same RPM/TPM ceiling. Azure flips every assumption: quota is scoped per **Azure subscription**, per **region**, per **model**, per **deployment type (SKU)**, and is now allocated under a separate **Quota Tier** system (Tier 0 through Tier 6) introduced in late 2025 to replace the older Default/Enterprise binary. As of June 2026, an Enterprise Agreement customer in Tier 4 has dramatically more headroom than a credit-card subscription in Tier 1 — even at the same workload.

The trade for that complexity is the **deployment SKU menu**: a set of deployment types that let you trade regional residency for throughput. **Standard** keeps inference in a single Azure region. **Global Standard** dynamically routes across Azure's global footprint for the best capacity. **Data Zone Standard** stays inside a geographic zone (US or EU), which makes it the compromise SKU when you need residency but not single-region pinning. **Provisioned (PTU-Managed, Data Zone Provisioned, Global Provisioned)** trades the per-token meter for a flat hourly commit on dedicated capacity, the right answer for latency-critical or sustained-high-volume production. The OpenAI direct API has none of these knobs.

Below is the canonical reference: the default Quota Tier 1 TPM/RPM allocations for **gpt-5.5**, **gpt-5.4**, and the broader family across deployment SKUs (sourced from Microsoft Learn's Azure OpenAI quotas and limits page fetched 2026-06-20), the quota increase request flow, PTU sizing math and hourly billing, Dynamic Quota and spillover, and the migration decision between OpenAI direct and Azure. For the OpenAI direct side of the comparison see OpenAI Tier 5 unlock requirements; for cost modeling, the OpenAI API cost calculator.

Azure OpenAI default quotas by deployment type — June 2026 (Quota Tier 1, gpt-5.5)

| Deployment type | Default TPM (gpt-5.5) | Default RPM | Max via increase request |
|---|---|---|---|
| Standard (regional) | Not GA in Tier 1* | Not GA in Tier 1* | Form-gated |
| Global Standard | 0 TPM (Tier 1)† / 10M TPM (Tier 5) | 0 RPM (Tier 1)† / 10,000 RPM (Tier 5) | Form-gated, EA priority |
| Data Zone Standard (US/EU) | 0 TPM (Tier 1)† / 3M TPM (Tier 5) | 0 RPM (Tier 1)† / 3,000 RPM (Tier 5) | Form-gated |
| PTU-Managed (Regional Provisioned) | Granted in PTU units, not TPM | TPM derived from PTU count × per-model ratio | Up to 100,000 PTU per deployment |
| Global Provisioned (Managed) | Granted in PTU units, not TPM | TPM derived from PTU count × per-model ratio | Up to 100,000 PTU per deployment |

Source, as of 2026-06-20: Microsoft Learn — Azure OpenAI quotas and limits (https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits), Quota Tier reference. *Standard (single-region) gpt-5.5 isn't listed in Microsoft's Quota Tier 1 reference; gpt-5.4-mini Standard is available at 6,000 TPM Tier 1. †gpt-5.5 Tier 1 default is **0 TPM / 0 RPM** across both Global Standard and Data Zone Standard — the model requires manual capacity allocation via the quota increase form on a Tier 1 subscription. Tier 5 (highest non-Enterprise tier) defaults rise to 10M TPM / 10,000 RPM Global Standard and 3M TPM / 3,000 RPM Data Zone Standard. EA / MCA-E subscriptions typically start in a higher tier with non-zero gpt-5.5 allocations. Always verify against [ai.azure.com/resource/quota](https://ai.azure.com/resource/quota) for your live subscription numbers.

Azure OpenAI vs OpenAI direct — why the quota systems aren't comparable

**OpenAI direct** rate-limits accounts on a single global tier ladder (Free → Tier 5) tied to cumulative paid usage and days-since-first-payment. The same Tier 5 account inherits the same RPM/TPM ceilings on every model in every region. Promotion is automatic; the only knobs are 'wait' or 'sign an enterprise contract'.

**Azure OpenAI** rate-limits *subscriptions* on a per-region, per-model, per-deployment-type basis. The same subscription can have **5,000,000 TPM on gpt-4.1 Global Standard in East US** and simultaneously **300,000 TPM on gpt-4.1 Data Zone Standard in Sweden Central** — no inheritance between them. The throughput you can deliver is the sum across all subscriptions × regions × deployment types your application is wired to.

Microsoft introduced **Quota Tiers** in late 2025 — seven tiers (Tier 0 through Tier 6) that scale automatically with consumption, layered on top of the per-region/per-model/per-SKU scoping. A Tier 1 subscription with gpt-5.5 sees **0 TPM defaults** until it requests allocation; a Tier 5 subscription sees **10M TPM Global Standard** out of the box. Enterprise Agreement (EA) or MCA-E customers get assigned to higher tiers based on Microsoft relationship status, not just consumption.

**Practical implication**: on OpenAI direct, capacity is bound to who you are. On Azure, capacity is bound to where (region) × what (model) × how (SKU) × who (tier). Architecting around Azure means picking the SKU and region mix that fits your needs, not waiting on a single tier promotion.


The per-subscription / per-region / per-model scoping model

Quota is allocated in **Tokens Per Minute (TPM)** with a corresponding **Requests Per Minute (RPM)** ceiling enforced proportionally. Per Microsoft Learn: 'Tokens per minute (TPM) and requests per minute (RPM) limits are defined per region, per subscription, and per model or deployment type.' A single subscription can stack quota across regions — there's no single global cap the way there is on OpenAI direct.

**Worked example**: a Tier 1 subscription with gpt-5.1 has **10,000 RPM / 1,000,000 TPM Global Standard** per region. Deploy in East US, Sweden Central, and UK South — **30,000 RPM / 3,000,000 TPM total**, same subscription, no increase request needed. Multi-region routing (Azure Front Door, custom load balancer) is the cleanest pattern for teams hitting single-region ceilings.
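The stacking arithmetic in the worked example can be sketched in a few lines of Python. This is only an illustration: the `RegionalQuota` type and `stacked_ceiling` helper are made-up names, and the region identifiers are placeholders for the three regions mentioned above.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RegionalQuota:
    """Quota for one subscription, model, and SKU in a single region (hypothetical type)."""
    region: str
    tpm: int
    rpm: int


def stacked_ceiling(quotas: Iterable[RegionalQuota]) -> Tuple[int, int]:
    """Sum TPM and RPM across regions, since quota does not inherit between them."""
    quotas = list(quotas)
    return sum(q.tpm for q in quotas), sum(q.rpm for q in quotas)


# Figures from the worked example: 1,000,000 TPM / 10,000 RPM per region.
regions = ["eastus", "swedencentral", "uksouth"]  # placeholder region identifiers
quotas = [RegionalQuota(r, tpm=1_000_000, rpm=10_000) for r in regions]
total_tpm, total_rpm = stacked_ceiling(quotas)
print(total_tpm, total_rpm)  # 3000000 30000
```

The sum is only reachable if your routing layer actually spreads traffic across all three deployments; a single endpoint still sees one region's ceiling.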

Each **resource** within a subscription/region pair gets its own deployments — up to **30 Azure OpenAI resources per region per subscription**, up to **32 standard deployments per resource**. TPM allocated to a deployment subtracts from the regional pool; deleting a deployment frees it back (with a **48-hour purge delay** if the resource is deleted via REST API without first deleting deployments).

**The RPM-to-TPM ratio is model-specific.** Older chat models: **6 RPM per 1,000 TPM**. o1/o1-preview: **1 RPM per 6,000 TPM**. o3 and o4-mini: **1 RPM per 1,000 TPM**. o3-mini / o1-mini / o3-pro: **1 RPM per 10,000 TPM**. The quota form accepts one 'capacity units' number; RPM and TPM derive from it. Scripts trying to set RPM and TPM independently will silently misallocate.
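A small lookup makes the derivation concrete. The sketch below encodes only the ratios quoted in this section; the `"older-chat"` key is a made-up label for the "older chat models" bucket, and rounding down to a whole RPM is an assumption rather than documented behaviour.

```python
from fractions import Fraction

# RPM granted per 1 TPM, taken from the ratios quoted in this section.
RPM_PER_TPM = {
    "older-chat": Fraction(6, 1000),   # hypothetical label for "older chat models"
    "o1": Fraction(1, 6000),
    "o1-preview": Fraction(1, 6000),
    "o3": Fraction(1, 1000),
    "o4-mini": Fraction(1, 1000),
    "o3-mini": Fraction(1, 10000),
    "o1-mini": Fraction(1, 10000),
    "o3-pro": Fraction(1, 10000),
}


def derived_rpm(model_key: str, tpm: int) -> int:
    """Derive the RPM ceiling implied by a TPM allocation (rounded down, an assumption)."""
    if tpm < 0:
        raise ValueError("tpm must be non-negative")
    return int(tpm * RPM_PER_TPM[model_key])


print(derived_rpm("older-chat", 120_000))  # 720
print(derived_rpm("o1", 60_000))           # 10
print(derived_rpm("o3-mini", 500_000))     # 50
```

Treating TPM as the single input mirrors the quota form: you pick one capacity number and read the RPM off it, rather than setting both.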


Deployment SKUs explained: Standard, Global Standard, Data Zone, PTU-Managed, Provisioned

Azure OpenAI exposes several SKUs that trade residency, throughput, and billing predictability. Pick wrong and you're either burning money on stranded capacity or chasing 429s through support.

**Standard (regional)**: inference stays in the deployed region. Highest residency control, lowest throughput ceilings. Available on a subset of models — gpt-4.1-mini Standard is **6,000 TPM Tier 1 → 150,000 TPM Tier 5**; gpt-5.1 Standard is **300,000 TPM Tier 1 → 3,000,000 TPM Tier 5**. Use when single-region residency is a regulatory requirement.

**Global Standard**: routes dynamically across Azure's global footprint for best capacity. Highest throughput at lowest cost; trades single-region residency for elasticity. **gpt-5.1 Global Standard** Tier 1: **1,000,000 TPM / 10,000 RPM**. Tier 5: **10,000,000 TPM / 100,000 RPM**. Default for any workload without strict residency.

**Data Zone Standard (US or EU)**: routes within a Microsoft-defined zone. Compromise SKU for 'US data stays in US' or 'EU data stays in EU' with elastic intra-zone routing. **gpt-5.1 Data Zone Standard** Tier 1: **300,000 TPM / 3,000 RPM**. Tier 5: **3,000,000 TPM / 30,000 RPM**. On these figures that is 30% of the gpt-5.1 Global Standard ceiling and level with the gpt-5.1 Standard figures quoted above.

**PTU-Managed (Regional Provisioned, `ProvisionedManaged`)**: dedicated single-region capacity in **PTUs** instead of TPM. Flat hourly billing, capacity reserved for your deployment. For latency-sensitive production. **Global Provisioned (`GlobalProvisionedManaged`) and Data Zone Provisioned (`DataZoneProvisionedManaged`)** are the provisioned versions of their Standard cousins. Max PTU per deployment: **100,000**.

**Picking the SKU**: dev/test → Global Standard. Production, no residency → Global Standard until shared-pool 429s, then add PTU-Managed for the floor with **spillover** to Global Standard for bursts. Production, US-only or EU-only → Data Zone Standard (Data Zone Provisioned for latency SLA). Strict single-region or HIPAA / FedRAMP → Standard or Regional Provisioned in a compliance-approved region.
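The decision path above can be written down as a small function. Treat this as a sketch of the guidance, not an official selector: the `environment` and `residency` labels are hypothetical, and using `latency_sla` to choose between Standard and Regional Provisioned for single-region workloads is an assumption, since the text lists both without a tie-breaker.

```python
def pick_sku(environment: str, residency: str,
             latency_sla: bool = False, seeing_429s: bool = False) -> str:
    """Map workload traits to a deployment type, following the guidance above.

    environment: "dev" or "prod" (hypothetical labels).
    residency: "none", "zone", or "single-region" (hypothetical labels).
    """
    if environment == "dev":
        # Dev/test always starts on Global Standard.
        return "Global Standard"
    if residency == "single-region":
        # Assumption: a latency SLA tips the choice toward provisioned capacity.
        return "Regional Provisioned" if latency_sla else "Standard"
    if residency == "zone":
        return "Data Zone Provisioned" if latency_sla else "Data Zone Standard"
    if residency != "none":
        raise ValueError(f"unknown residency: {residency!r}")
    if seeing_429s:
        # Shared-pool 429s are the trigger for adding a provisioned floor.
        return "PTU-Managed with spillover to Global Standard"
    return "Global Standard"


print(pick_sku("prod", "zone", latency_sla=True))  # Data Zone Provisioned
```

Keeping the rule as code is mostly useful for reviews: the branch order makes it explicit that residency is checked before throughput.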


PTUs explained: what 1 PTU buys, sizing math, when to switch from pay-as-you-go

A **Provisioned Throughput Unit (PTU)** is Azure OpenAI's unit of dedicated model processing capacity. Per Microsoft's definition: 'A PTU represents a fixed amount of model processing capacity. Foundry reserves that amount of compute and holds it for your deployment.'

**Three properties that surprise new PTU teams.** PTUs are **model-independent at the quota level** — the same PTU pool deploys any supported model, but TPM-per-PTU varies (gpt-5.1 needs more PTUs than gpt-5.1-nano for equivalent TPM). PTUs are **region-specific** — East US quota doesn't help in West Europe. And **having PTU quota doesn't guarantee capacity** — capacity is a separate, dynamically-changing resource allocated at deployment time. Approved quota + zero regional capacity = deployment failure.

**Minimum PTU commitments** are per-model (typical 15-50 PTUs for chat, higher for reasoning models — see the per-model parameters table). **Maximum per deployment: 100,000 PTUs**. Sizing requires expected RPM, average input + output token size, and **prompt cache hit rate** (cached tokens don't consume PTU capacity, so high cache rates linearly reduce required PTUs).

**Pay-as-you-go vs PTU break-even.** Hourly PTU billing fits benchmarks and short-term scale events. For sustained production, **Azure Reservations** (1-month or 1-year) discount the $/PTU/hr meter by roughly **30-65%** depending on SKU and term. Rule of thumb: if your steady-state utilization on Standard exceeds **~50% of equivalent PTU capacity for 12+ hours/day across 5+ days/week**, Reservation math beats per-token billing. Below that, stay on Global Standard with spillover to a small PTU floor for latency-critical traffic. Microsoft explicitly warns against scaling PTUs with traffic: capacity may not be available on scale-back-up, and continuous hourly billing at high utilization exceeds Reservation pricing.
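The rule of thumb reduces to three thresholds, sketched here as a predicate. The function name is made up, and the thresholds are the approximate ones quoted above, so treat a `True` result as a prompt to run real Reservation pricing rather than as a decision.

```python
def reservation_worth_modeling(utilization: float, hours_per_day: float,
                               days_per_week: float) -> bool:
    """Apply the rule of thumb: >50% of equivalent PTU capacity, 12+ h/day, 5+ days/week.

    utilization is the steady-state load as a fraction of equivalent PTU capacity.
    """
    return utilization > 0.5 and hours_per_day >= 12 and days_per_week >= 5


print(reservation_worth_modeling(0.6, 14, 5))  # True
print(reservation_worth_modeling(0.6, 8, 5))   # False
```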


Requesting an Azure OpenAI quota increase — the form, the inputs, the turnaround

Quota increase requests go through one form: **aka.ms/oai/stuquotarequest** — also accessible from the Foundry portal Quota page. Covers OpenAI models sold by Azure, Foundry Models sold directly by Azure, and Anthropic models on Azure. (Other partner/community models don't support quota increases.)

**Form inputs**: subscription ID, target region(s), model(s), deployment type, requested TPM or PTU count, current TPM and utilization, business justification, and usage projections for large requests. Microsoft's **'priority goes to customers who actively use their existing quota allocation'** clause is real — requests from subscriptions at <20% of current quota are commonly denied or down-scoped. Hit 80%+ utilization before requesting more, or be ready to explain the spike.

**Turnaround** (no published SLA): **1-3 business days** for typical mid-size increases on EA / MCA-E subscriptions, **5-10 business days** for credit-card-billed subscriptions, **same-day** for emergency PTU capacity asks from named-account enterprise customers. Approvals come via email; the increase shows up on the Foundry portal Quota page.

**Quota Tier auto-upgrade as an alternative.** Microsoft auto-promotes subscriptions between Quota Tiers based on consumption and relationship status — no form needed. If usage is genuinely growing, auto-upgrade often arrives before a form-based increase clears. Check current tier programmatically via the **quotaTiers** control plane API (`GET .../providers/Microsoft.CognitiveServices/quotaTiers?api-version=2025-10-01-preview`); opt out by patching `tierUpgradePolicy: 'NoAutoUpgrade'` if you use quota as a billing throttle.


Regional capacity strategy: which regions have most spare gpt-5.5 capacity

Azure OpenAI capacity is finite and dynamically changing. For Global Standard and Data Zone Standard, Azure handles routing. For Standard and Provisioned regional, you pick — and the choice determines whether deployment succeeds.

**Broadest gpt-5.x family availability as of June 2026**: **East US**, **East US 2**, **South Central US**, **West US 3** (US); **Sweden Central**, **Switzerland North**, **France Central**, **West Europe**, **UK South** (Europe); **Japan East**, **Australia East** (APAC). Newer models (gpt-5.5, gpt-5.4-pro) launch in East US 2 and Sweden Central first; older models (gpt-4o family, o-series) have wider footprint.

**Programmatic capacity discovery.** The **Model Capacities API** returns available deployment capacity by location and SKU: `GET .../providers/Microsoft.CognitiveServices/modelCapacities?api-version=2024-10-01&modelFormat=OpenAI&modelName=gpt-5.5&modelVersion={version}`. The `availableCapacity` field is the actual deployable headroom — not just quota approval. Pre-check before deployment.
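A pre-check script mostly needs two pieces: the request URL and a filter over the results. The sketch below builds the URL from the query parameters quoted above and leaves the elided scope prefix as a caller-supplied argument. The response handling is an assumption: `regions_with_headroom` expects already-flattened rows with hypothetical `location`, `sku`, and `availableCapacity` keys, so adapt it to the payload you actually receive.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlencode


def model_capacities_url(scope_prefix: str, model_name: str, model_version: str,
                         api_version: str = "2024-10-01",
                         model_format: str = "OpenAI") -> str:
    """Build the Model Capacities request URL under a caller-supplied scope prefix."""
    query = urlencode({
        "api-version": api_version,
        "modelFormat": model_format,
        "modelName": model_name,
        "modelVersion": model_version,
    })
    base = scope_prefix.rstrip("/")
    return f"{base}/providers/Microsoft.CognitiveServices/modelCapacities?{query}"


def regions_with_headroom(rows: Iterable[Dict], sku: str, needed: int) -> List[str]:
    """Return locations whose availableCapacity covers `needed` for the given SKU.

    Assumes flattened rows with 'location', 'sku', 'availableCapacity' keys (hypothetical shape).
    """
    return sorted(
        row["location"] for row in rows
        if row["sku"] == sku and row["availableCapacity"] >= needed
    )


# Illustrative rows only; not real capacity data.
sample_rows = [
    {"location": "eastus2", "sku": "GlobalStandard", "availableCapacity": 500},
    {"location": "swedencentral", "sku": "GlobalStandard", "availableCapacity": 50},
    {"location": "eastus2", "sku": "ProvisionedManaged", "availableCapacity": 0},
]
print(regions_with_headroom(sample_rows, "GlobalStandard", 100))  # ['eastus2']
```

Separating URL construction from filtering keeps the authenticated HTTP call (not shown) swappable between the Azure SDK, `requests`, or the CLI.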

**Capacity-shifting pattern.** Mission-critical teams deploy across 3-4 regions of the same data zone (e.g., East US + East US 2 + South Central US for US-only) with health checks and a routing layer that shifts traffic on 429s or latency. Cost: 3-4x deployment management overhead. Benefit: surviving a single-region capacity squeeze without an incident.
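At its core the routing layer is an ordered failover loop. The sketch below shows only that loop, with the transport injected as a callable so it stays independent of any SDK; `send` and its `(status, body)` return shape are assumptions, and a production router would add health checks and latency tracking.

```python
from typing import Callable, Sequence, Tuple


def call_with_failover(regions: Sequence[str],
                       send: Callable[[str], Tuple[int, str]]) -> Tuple[str, str]:
    """Try each region in order and return the first non-429 answer.

    `send(region)` is a caller-supplied function returning (status_code, body).
    """
    for region in regions:
        status, body = send(region)
        if status != 429:
            return region, body
    raise RuntimeError("all regions returned 429")


# Fake transport for illustration: the first region is throttled.
def fake_send(region: str) -> Tuple[int, str]:
    return (429, "") if region == "eastus" else (200, f"served by {region}")


print(call_with_failover(["eastus", "eastus2", "southcentralus"], fake_send))
# ('eastus2', 'served by eastus2')
```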


Dynamic Quota, Spillover, and shared-pool 429s

**Spillover** is the cleanest Azure-native pattern for provisioned-deployment overflow. When a PTU deployment is saturated and returns 429, **spillover automatically reroutes overflow to a paired Standard deployment in the same Foundry resource**. Configurable globally or per-request via `x-ms-spillover-deployment` header. Available on all Azure OpenAI Foundry Models that support PTU (not yet DeepSeek or Meta Llama). The pattern: small PTU for the steady-state floor → spillover to Global Standard for bursts → both billed only for what they serve.
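For the per-request form, the only client-side change is one extra header. A minimal sketch, assuming a hypothetical spillover deployment named `gpt-global-standard` and a plain dict of request headers:

```python
from typing import Dict, Mapping


def with_spillover(headers: Mapping[str, str], spillover_deployment: str) -> Dict[str, str]:
    """Return a copy of the request headers with the per-request spillover target set."""
    merged = dict(headers)
    merged["x-ms-spillover-deployment"] = spillover_deployment
    return merged


base_headers = {"Content-Type": "application/json"}
request_headers = with_spillover(base_headers, "gpt-global-standard")  # hypothetical name
print(request_headers["x-ms-spillover-deployment"])  # gpt-global-standard
```

Copying instead of mutating keeps a shared default header set safe to reuse for requests that should not spill over.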

**Shared-pool throttling on Standard deployments.** Microsoft's docs are unusually direct: 'Standard (pay-as-you-go) deployments share a common resource pool across customers. When demand approaches capacity limits, the system might temporarily reduce the effective rate limit for that deployment.' You'll see `x-ratelimit-limit-tokens` reporting *lower than your configured TPM* during the adjustment. It typically resolves within hours — escalating through the increase form won't help. Spillover or PTU is the architectural answer.
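One way to spot this adjustment in telemetry is to compare the reported limit header against the TPM you configured. The helper below is a sketch with a made-up name; it assumes the header value is a plain integer and treats a missing or unparsable header as "no signal".

```python
from typing import Mapping


def effective_limit_reduced(headers: Mapping[str, str], configured_tpm: int) -> bool:
    """True when x-ratelimit-limit-tokens reports less than the configured TPM."""
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get("x-ratelimit-limit-tokens")
    try:
        reported = int(raw)
    except (TypeError, ValueError):
        # Header absent or not an integer: nothing to conclude.
        return False
    return reported < configured_tpm


print(effective_limit_reduced({"x-ratelimit-limit-tokens": "80000"}, 100_000))  # True
```

Logging this flag alongside 429 counts helps separate a temporary shared-pool reduction from a genuine quota shortfall.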


Provisioned Throughput billing math: hourly commit vs per-token

Provisioned deployments bill at a flat **$/PTU/hr** rate regardless of token volume. The meter starts at deployment creation, stops only on deletion — scaling a PTU deployment to zero isn't a thing (delete + re-create when needed, with no capacity guarantee on the re-create).

**Break-even calculation.** Worked example for gpt-5.1 with a hypothetical 50 PTU deployment delivering ~2.5M TPM steady-state: at $1/PTU/hr × 50 PTUs × 24h × 30d = **$36,000/month**. At the same 2.5M TPM on Global Standard pay-per-token, you'd consume ~108 billion tokens/month — **multiples of $36k** at typical gpt-5.1 rates. Provisioned wins decisively at high sustained utilization; loses badly at low utilization (paying $36k/mo for idle capacity).
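The arithmetic in this example is easy to rerun with your own numbers. The sketch below reproduces it using the same hypothetical inputs ($1/PTU/hr, 50 PTUs, 2.5M TPM) and adds one derived figure: the blended per-million-token price at which pay-per-token would cost the same as the PTU commit. It assumes the deployment runs at full utilization all month.

```python
def ptu_monthly_cost(ptus: int, rate_per_ptu_hour: float, days: int = 30) -> float:
    """Flat hourly commit: PTUs x $/PTU/hr x 24 hours x days."""
    return ptus * rate_per_ptu_hour * 24 * days


def monthly_tokens(tpm: int, days: int = 30) -> int:
    """Tokens served in a month at a constant tokens-per-minute rate."""
    return tpm * 60 * 24 * days


def break_even_price_per_million(ptus: int, rate_per_ptu_hour: float, tpm: int) -> float:
    """Blended $/1M tokens at which per-token billing equals the PTU commit."""
    return ptu_monthly_cost(ptus, rate_per_ptu_hour) / (monthly_tokens(tpm) / 1_000_000)


print(ptu_monthly_cost(50, 1.0))    # 36000.0
print(monthly_tokens(2_500_000))    # 108000000000
print(round(break_even_price_per_million(50, 1.0, 2_500_000), 4))  # 0.3333
```

At lower utilization the break-even price rises proportionally: at 25% utilization the same commit only pays off if your blended per-token price is four times higher.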

**Azure Reservations** (1-month or 1-year) discount the PTU meter further. They are **financial discounts applied to the meter, not to specific deployments** — one Reservation scoped to a subscription or resource group applies against any matching PTU consumption inside that scope. Critical: **Reservations don't guarantee capacity**. Per Microsoft: 'First create deployments to confirm that capacity is available, then purchase the reservation to lock in the discounted rate.' Buying a 1-year Reservation for a region/SKU with no capacity is a paid mistake.

Track PTU economics via Cost Management — amortized billing, reservation utilization, chargeback tools at the Cost Management docs. Reservation utilization <80% for two consecutive months is leaving discount on the table.


Azure vs OpenAI direct (Tier 5) — the migration decision

The OpenAI-direct-vs-Azure decision pivots on three axes: **compliance**, **scale economics**, and **operational complexity**. Per-token pricing is nearly identical between the two paths, so the choice rarely turns on raw cost.

**Compliance**: Azure OpenAI is the only path for **HIPAA**, **FedRAMP High**, **DoD IL5**, **ITAR**, **CJIS**, and most country-specific data residency (EU, UK, Australia, Japan via regional deployments). It also fits cleanly inside an existing Azure enterprise security review — most large-enterprise security teams in 2026 will not pass an OpenAI-direct integration through procurement, but will pass an Azure deployment inside the customer's existing Azure tenant. If any of those compliance bars apply, Azure is the path, full stop.

**Scale economics**: above ~$50k/month in OpenAI direct spend, Azure's PTU + Reservation pricing typically beats per-token rates by 15-35% on equivalent throughput — at the cost of capacity-planning rigor. Below ~$10k/month, OpenAI direct's lower operational overhead usually wins. The middle band ($10k-$50k) is judgment: with an Azure billing relationship and an SRE team, lean Azure; small team chasing fast iteration, stay on OpenAI direct.

**Operational complexity**: OpenAI direct is one API key, one tier, one ladder. Azure is N subscriptions × M regions × K SKUs × Quota Tiers. Multi-region deployment, capacity-aware routing, PTU sizing, Reservation management, Foundry resource lifecycle all become operational concerns. Teams that can't staff that complexity end up with stranded PTU capacity or unplanned 429s — wiping out the savings that motivated the migration.

**Hybrid pattern**: many production teams run both, with Azure for compliance-bound or sustained high-volume traffic (PTUs + spillover to Global Standard) and OpenAI direct for experimental work where iteration speed beats operational tidiness. The two paths share model names and (mostly) identical APIs, so porting is mostly a change of base URL, credentials, and deployment name, not a rewrite.


What counts toward Azure OpenAI quota (and what doesn't)

**TPM enforcement uses estimated maximum tokens at request time**, not billed tokens. The estimate includes the prompt, `max_tokens`, and `best_of`. Setting `max_tokens=4000` for a response that averages 200 tokens reserves 20x the output budget the request actually needs, on top of the prompt. Tune `max_tokens` to expected size. Failed requests also count: 400s and shadow-throttled requests still hit TPM. Always respect `retry-after-ms`, because retry storms worsen throttling.
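A rough estimator shows how much `max_tokens` dominates the reservation. The exact formula is not given here, so the one below (prompt plus `max_tokens` multiplied by `best_of`) is an assumption made for illustration, and the function name is made up.

```python
def estimated_reserved_tokens(prompt_tokens: int, max_tokens: int, best_of: int = 1) -> int:
    """Approximate tokens counted against TPM at request time.

    Assumed formula: prompt + max_tokens * best_of. The real estimate may differ.
    """
    if min(prompt_tokens, max_tokens) < 0 or best_of < 1:
        raise ValueError("token counts must be non-negative and best_of at least 1")
    return prompt_tokens + max_tokens * best_of


# Same 500-token prompt, two different output caps.
loose = estimated_reserved_tokens(500, 4000)
tight = estimated_reserved_tokens(500, 200)
print(loose, tight)  # 4500 700
```

Under this assumption, a 10,000 TPM deployment fits two of the loosely capped requests per minute but fourteen of the tightly capped ones.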

**Batch API runs on separate quota.** Global Batch and Data Zone Batch use independent 'enqueued tokens' that don't compete with real-time TPM/RPM. EA/MCA-E: **5B-15B enqueued tokens**; default: **200M-1B**; credit-card: **50M cap**. Use Batch for any async job tolerating 24-hour completion.

**Cache-hit tokens don't consume PTU capacity** on Provisioned. A 70% cache hit rate effectively triples PTU throughput. On Standard, cached tokens are billed at a discount but still count toward TPM. The PTU cache advantage is unique to Provisioned — strongest argument for prompt structure that front-loads stable content.
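The "triples" figure follows from a simple model: if a fraction of tokens is served from cache at no PTU cost, the remaining fraction is all that consumes capacity. The sketch below encodes that model. It is a simplification that treats every token as cacheable input and ignores output tokens, so real gains will be smaller.

```python
def cache_throughput_multiplier(cache_hit_rate: float) -> float:
    """Effective throughput gain when cached tokens consume no PTU capacity.

    Simplified model: multiplier = 1 / (1 - cache_hit_rate).
    """
    if not 0.0 <= cache_hit_rate < 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - cache_hit_rate)


print(round(cache_throughput_multiplier(0.7), 2))  # 3.33
print(round(cache_throughput_multiplier(0.5), 2))  # 2.0
```

The curve is steep near the top: moving from a 50% to a 70% hit rate adds more effective throughput than moving from 0% to 50%.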


Sourcing, live-verify, and what changes after this page

All quota and limit numbers come from Microsoft Learn, fetched **2026-06-20**. Primary sources: Azure OpenAI quotas and limits (updated 2026-06-05), Manage Azure OpenAI quota, and Provisioned throughput for Foundry Models.

**Verifiable from those sources**: Quota Tier 1-6 reference tables (RPM/TPM per model × SKU), deployment SKU definitions, per-model RPM-to-TPM ratios, spillover behavior, Reservation economics, **gpt-5.5 Tier 1 default of 0 TPM / 0 RPM** requiring manual allocation, and the 100,000 PTU per-deployment max.

**Directional, not exact**: PTU $/hr rates (pull from your EA pricing sheet or the Azure pricing calculator), quota increase turnaround (community-reported, not an SLA), exact regional capacity at any moment (use the Model Capacities API).

**Live-verify checklist**: (1) check your Quota Tier via the `quotaTiers` REST API or Foundry portal Quota page; (2) check consumption vs limits via the Usages API; (3) check available capacity via the Model Capacities API; (4) run at >80% of current allocation for a week before requesting an increase — priority goes to active utilizers.

**Why this canonical page exists**: AI engines routinely cite incomplete Stack Overflow threads on Azure OpenAI quota — the official docs are correct but heavy, and the Quota Tier system is recent enough that pre-2025 content is wrong. This page is the clean single-URL reference. If you arrived from a ChatGPT or Perplexity citation, the mechanism is working.

Step-by-step: requesting an Azure OpenAI quota increase

1. Check your current Quota Tier and consumption first

    Open the Foundry portal Quota page and filter to the model and region you want to scale. Note the configured TPM/RPM limit and the current consumption. If you're below ~50% utilization, the increase request is likely to be denied or down-scoped — Microsoft prioritizes 'customers who actively use their existing quota allocation'. Either grow into the existing quota first, or be ready to attach usage projections (growth chart, planned launch, etc.) to the form.

2. Verify capacity exists in your target region(s)

    Run the Model Capacities API against your subscription for the target model + version: `GET .../providers/Microsoft.CognitiveServices/modelCapacities?api-version=2024-10-01&modelFormat=OpenAI&modelName=gpt-5.5&modelVersion={ver}`. The `availableCapacity` field per location × SKU is the actual deployable headroom. Requesting quota in a region with zero available capacity gets approved on paper but blocks at deployment time.

3. Open the quota increase request form

    Navigate to aka.ms/oai/stuquotarequest (also linked from the Foundry portal Quota page). Fill in: subscription ID, target region(s), specific model(s) and version(s), deployment type (Standard / Global Standard / Data Zone Standard / Provisioned), requested TPM or PTU count, current TPM and utilization, business justification (~3-5 sentences on the workload and why current quota isn't enough), and traffic projections if requesting >2x current allocation.

4. Submit and monitor email + portal for approval

    Approvals come via email to the billing/account contact. Typical turnaround: 1-3 business days for EA / MCA-E subscriptions, 5-10 business days for credit-card-billed subscriptions, same-day for urgent PTU asks from named-account customers. Once approved, the new limit reflects on the Foundry portal Quota page within minutes.

5. Allocate the new quota to specific deployments

    Approved quota lives at the subscription/region/model level — it doesn't automatically flow to existing deployments. Go to **Management → Model quota** in the Foundry portal and assign TPM (in 1,000-unit increments) to each deployment that needs it. Quota allocated to a deployment is what gets enforced as TPM/RPM at the API edge; quota sitting unallocated in the pool is dormant.
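The allocation step is simple bookkeeping, and it helps to plan it before touching the portal. The sketch below splits an approved pool across deployments in 1,000-TPM increments and reports what stays dormant. The deployment names are hypothetical, and rounding each request down to the nearest unit is an assumption about how you would want to plan, not portal behaviour.

```python
from typing import Dict, Mapping, Tuple

UNIT_TPM = 1_000  # quota is assigned in 1,000-TPM increments


def allocate_quota(pool_tpm: int, requested: Mapping[str, int]) -> Tuple[Dict[str, int], int]:
    """Round each request down to a whole unit and return (allocation, unallocated TPM)."""
    allocation = {name: (tpm // UNIT_TPM) * UNIT_TPM for name, tpm in requested.items()}
    total = sum(allocation.values())
    if total > pool_tpm:
        raise ValueError(f"requested {total} TPM exceeds pool of {pool_tpm} TPM")
    return allocation, pool_tpm - total


plan, dormant = allocate_quota(1_000_000, {"chat-prod": 600_500, "chat-batch": 250_000})
print(plan)     # {'chat-prod': 600000, 'chat-batch': 250000}
print(dormant)  # 150000
```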

Frequently Asked Questions

How is Azure OpenAI quota different from OpenAI direct rate limits?

Three structural differences. **Scoping**: OpenAI direct is account-level (one tier applies everywhere); Azure is per-subscription, per-region, per-model, per-deployment-type. **Tier ladder**: OpenAI uses Free → Tier 5 based on cumulative paid usage; Azure uses Quota Tier 0 → Tier 6 based on consumption + Microsoft relationship status (EA / MCA-E customers get assigned higher tiers automatically). **Increase path**: OpenAI tier promotion is automatic and time-gated (30 days for Tier 5); Azure increases arrive either through Quota Tier auto-upgrade or through a form submission via aka.ms/oai/stuquotarequest with 1-10 business day turnaround.

What is a PTU (Provisioned Throughput Unit) and what does 1 PTU buy?

A PTU is Azure OpenAI's unit of dedicated model processing capacity. One PTU represents a fixed amount of compute reserved exclusively for your deployment, billed hourly regardless of token consumption. The TPM each PTU delivers varies by model — heavier models like gpt-5.1 require more PTUs to serve the same TPM as a lighter model like gpt-5.1-nano. See the per-model deployment parameters table for exact PTU-to-TPM ratios per model. Minimum PTU commitments are model-specific (typically 15-50 PTUs); maximum is 100,000 PTUs per deployment.

Can I share Azure OpenAI quota across regions?

No. Quota is allocated per region per subscription per model per deployment type. East US quota does not apply to West Europe. But you can **stack** capacity across regions: deploy gpt-5.1 in 3 US regions and you have 3x the regional ceiling for the same subscription. Multi-region routing in your application layer is the cleanest pattern for teams that need throughput above any single region's quota ceiling.

Why did my Azure OpenAI deployment fail with a quota error even though I have quota approved?

Quota and capacity are separate concepts. **Quota** is the policy limit Azure approved for your subscription. **Capacity** is the actual deployable headroom in a specific region × SKU pair at deployment time — and it changes dynamically based on global demand. Having approved PTU quota but no regional capacity returns a deployment failure with a 'try another region' suggestion. Use the Model Capacities API to check `availableCapacity` per location before deploying.

Should I use Global Standard or Data Zone Standard deployment?

**Global Standard** gives the highest throughput at the lowest cost by routing inference dynamically across Azure's global footprint — choose this when you have no data residency requirements. **Data Zone Standard** restricts routing to a Microsoft-defined zone (US or EU) — choose this when you need US-only or EU-only residency but still want elastic routing within that zone. Tier 5 ceilings: gpt-5.1 Global Standard at 10M TPM / 100K RPM, Data Zone Standard at 3M TPM / 30K RPM. Single-region Standard is the strictest residency option but has the lowest throughput ceilings.

When should I switch from pay-as-you-go to Provisioned (PTU) billing?

Three triggers. **Latency**: pay-as-you-go uses a shared resource pool and is subject to temporary throttling during high demand; PTU guarantees consistent latency. **Sustained high utilization**: rule of thumb — if your steady-state TPM equals 50%+ of equivalent PTU capacity for 12+ hours/day across 5+ days/week, PTU + an Azure Reservation (1-month or 1-year, 30-65% discount vs hourly) beats per-token billing. **Mission-critical SLAs**: PTU deployments have a defined latency target per model; Standard deployments don't.

How long does an Azure OpenAI quota increase request take?

No published SLA. Community-reported turnaround across 2025-2026: **1-3 business days** for typical mid-size increases on EA / MCA-E subscriptions, **5-10 business days** for credit-card-billed subscriptions, **same-day** for emergency PTU capacity asks from named-account enterprise customers. Microsoft's docs state 'priority goes to customers who actively use their existing quota allocation' — submitting at <50% utilization commonly results in denial or down-scoped approval.

Do failed requests count toward Azure OpenAI quota?

Yes. Rate-limit enforcement is calculated at request time on estimated maximum tokens (prompt + `max_tokens` parameter), not billed tokens after response. A 400 Bad Request that never reaches the model still counts against TPM. A 429 throttled response still counts. Retry storms make throttling worse — always respect the `retry-after-ms` response header. The single biggest TPM waste in production code: setting `max_tokens=4000` for responses that average 200 tokens (20x rate-limit budget consumption per request).
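Respecting `retry-after-ms` takes only a few lines. The sketch below parses the header into seconds and wraps a caller-supplied `send` function in a bounded retry loop. The `(status, headers, body)` return shape, the one-second fallback, and the attempt cap are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Mapping, Tuple


def retry_delay_seconds(headers: Mapping[str, str], fallback: float = 1.0) -> float:
    """Convert the retry-after-ms header to seconds, or use the fallback if unusable."""
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        return max(0.0, float(lowered.get("retry-after-ms")) / 1000.0)
    except (TypeError, ValueError):
        return fallback


def call_with_retry(send: Callable[[], Tuple[int, Mapping[str, str], str]],
                    max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call `send` until it stops returning 429 or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            break
        if attempt < max_attempts - 1:
            # Wait exactly as long as the service asked before trying again.
            sleep(retry_delay_seconds(headers))
    return status, body
```

Injecting `sleep` keeps the loop testable without real delays, and the attempt cap prevents a throttled caller from turning into the retry storm described above.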
