By The DDH Team · Digital Dashboard Hub

OpenAI Embeddings Rate Limits 2026: Tiers, Batch API, and Matryoshka Dimensions

By The DDH Team at Digital Dashboard Hub · Updated June 21, 2026

OpenAI's text embedding models are the most widely used embedding source for production RAG systems, and their rate limits are the most common bottleneck for teams doing bulk document indexing. The limits operate on three independent axes — requests per minute (RPM), tokens per minute (TPM), and tokens per day (TPD) — and which axis constrains you depends on your request size and batching strategy. A team sending many small requests will hit RPM first; a team sending large batches will hit TPM; a team running continuous high-volume indexing may eventually hit TPD. Understanding which constraint applies to your workload determines your optimization strategy.

As of June 2026, OpenAI actively supports two embedding models: text-embedding-3-small at $0.02 per million tokens and text-embedding-3-large at $0.13 per million tokens. Both models support Matryoshka representation learning, which allows truncating the output vector to a smaller dimension count without re-training, a feature that meaningfully affects storage costs and retrieval speed for large indexes. The embeddings cost calculator can help you compare total cost across these models and dimensions for your expected token volume. For a comparison of OpenAI's embeddings against alternatives from Cohere and Voyage, see the Cohere vs Voyage vs OpenAI embeddings comparison.

The most impactful rate limit feature that many developers overlook is the Batch API. For bulk indexing workloads (initial corpus ingestion, periodic re-indexing, or offline document processing), the Batch API provides a 50% price discount and operates on a separate quota from the synchronous API, with a 24-hour completion SLA instead of synchronous response times. This means that a team saturating its synchronous TPM limit can continue bulk indexing through the Batch API without affecting its real-time embedding capacity. The OpenAI Batch API limits reference covers the Batch API quotas in detail. All rate limit figures in this document are sourced from platform.openai.com/docs/guides/rate-limits and openai.com/api/pricing, verified June 2026.

OpenAI Embedding Model Rate Limits by Tier (June 2026)

Tier	RPM	TPM	TPD	Promotion threshold / notes
Tier 1	3,000	1,000,000	3,000,000	$5 spent OR 7 days since first payment
Tier 2	5,000	10,000,000	Not published	$50 spent
Tier 3	5,000	50,000,000	Not published	$100 spent
Tier 4	10,000	100,000,000	Not published	$250 spent
Tier 5	10,000	250,000,000	Not published (or unlimited)	$1,000 spent
Batch API	Separate quota	Separate quota	No published cap	50% price discount on all tiers

Sources: platform.openai.com/docs/guides/rate-limits, openai.com/api/pricing. Limits shown are for text-embedding-3-small and text-embedding-3-large; these models share the same RPM/TPM limits. The Tier 1 TPD of 3,000,000 tokens applies specifically to embedding models. TPD figures for Tiers 2-5 were not consistently published by OpenAI as of June 2026; verify them in the platform rate limits dashboard for your account. Promotion thresholds are cumulative spend since account creation. All figures verified June 2026.

What RPM, TPM, and TPD Mean — and Which One Is Actually Constraining You

OpenAI enforces three independent rate limit dimensions for embedding API calls. RPM (requests per minute) limits how many API calls you can make regardless of their size. TPM (tokens per minute) limits the total token volume across all requests in a rolling one-minute window. TPD (tokens per day) limits total token consumption in a 24-hour period. When you hit any one of these limits, the API returns a 429 Too Many Requests error with a response header indicating which limit was exceeded and when it resets.

The binding constraint in practice depends on your batching strategy. The OpenAI embeddings API accepts batches of up to 2048 input strings in a single request. A developer sending one string per request with frequent calls will hit RPM before TPM — at 3,000 RPM with an average of 100 tokens per request, they max out at 300,000 tokens per minute, well below the 1,000,000 TPM ceiling. The same developer batching 200 strings per request (each ~100 tokens) would need only 15 requests per minute to generate 300,000 tokens per minute, using 0.5% of their RPM but 30% of their TPM. In general, maximizing batch size is the right strategy to avoid RPM limits — the embedding API is designed for batch inputs, and single-string requests are inefficient both in cost and quota utilization.
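The comparison above can be sketched as a small calculation. This is an illustrative helper, not part of any OpenAI library; the function name and parameters are assumptions, and the limits are the Tier 1 figures from the table.

```python
def binding_constraint(rpm_limit, tpm_limit, inputs_per_request, tokens_per_input):
    """Return (max_tokens_per_minute, limit_name) for a fixed request shape."""
    tokens_per_request = inputs_per_request * tokens_per_input
    # Token volume you would reach if only the request-count limit applied.
    tokens_if_rpm_bound = rpm_limit * tokens_per_request
    if tokens_if_rpm_bound <= tpm_limit:
        return tokens_if_rpm_bound, "RPM"
    return tpm_limit, "TPM"


# Tier 1: 3,000 RPM and 1,000,000 TPM.
single = binding_constraint(3_000, 1_000_000, inputs_per_request=1, tokens_per_input=100)
batched = binding_constraint(3_000, 1_000_000, inputs_per_request=200, tokens_per_input=100)
print(single)   # (300000, 'RPM')
print(batched)  # (1000000, 'TPM')
```

With one string per request the request-count limit caps throughput at 300,000 tokens per minute; with 200 strings per request the full 1,000,000 TPM becomes reachable and TPM is the limit to watch.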

TPD is the most relevant constraint for teams doing large initial corpus indexing runs on Tier 1 accounts. At 3,000,000 tokens per day and an average document chunk size of 300 tokens, a Tier 1 account can index approximately 10,000 document chunks per day through the synchronous API. This is meaningful: a corpus of 100,000 chunks would take 10 days to index at Tier 1 without the Batch API. Teams that need to accelerate this either need to upgrade their tier (which requires spending to reach Tier 2 and above) or use the Batch API, which has a separate daily quota. The Batch API approach is almost always the right answer for initial indexing workloads, as discussed in the Batch API section below.
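As a rough sketch of the arithmetic in this paragraph, the helper below estimates how many days a corpus takes under a daily token cap. The function name is hypothetical, and the calculation assumes every chunk is the same size.

```python
import math


def days_to_index(total_chunks, tokens_per_chunk, tokens_per_day):
    """Whole days needed to embed a corpus under a tokens-per-day cap."""
    total_tokens = total_chunks * tokens_per_chunk
    return math.ceil(total_tokens / tokens_per_day)


chunks_per_day = 3_000_000 // 300
print(chunks_per_day)                          # 10000
print(days_to_index(100_000, 300, 3_000_000))  # 10
```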

Tier 1 Through Tier 5: The Full Rate Limit Progression

OpenAI's tier system is based on cumulative account spend rather than a subscription tier you select. New accounts with a verified payment method start at Tier 1 after their first $5 spend (or 7 days after the first payment, whichever comes first). Tier 2 requires $50 in total spend; Tier 3 requires $100; Tier 4 requires $250; Tier 5 requires $1,000. Tier promotion happens automatically without action required from the developer — the platform checks spend thresholds and upgrades limits accordingly. You can see your current tier and specific rate limits in the platform rate limits dashboard at platform.openai.com/settings/organization/limits.

The most significant limit jump for embedding workloads is the TPM increase from Tier 1 (1,000,000 TPM) to Tier 2 (10,000,000 TPM) — a 10x increase. This corresponds to a 10x increase in sustainable embedding throughput through the synchronous API. A team that genuinely needs to index millions of documents quickly and cannot use the Batch API for their use case should prioritize reaching Tier 2 (by spending $50 across any OpenAI API usage, not specifically embeddings) early in their development cycle. Note that because tier promotion is based on total account spend, teams using the API for other purposes (completions, chat, image generation) will advance through tiers faster than a team using only cheap embedding calls.

Tier 4 and Tier 5 represent the limits relevant for production applications with substantial real-time embedding demand, such as embedding queries as users submit them in a live application. At 250,000,000 TPM and an average of 250 tokens per input, a Tier 5 account has a theoretical ceiling of 1,000,000 inputs per minute, or roughly 1.4 billion inputs per day, through the synchronous API. Reaching that ceiling requires batching at least 100 inputs per request to stay within 10,000 RPM, and it assumes that no TPD cap applies. Very few applications outside large enterprise search or recommendation systems approach this scale. Teams operating at this volume should also be in contact with OpenAI's enterprise team to discuss custom rate limits, which are available beyond Tier 5 thresholds for qualifying accounts.

The Batch API: 50% Off and a Separate Quota — the Throughput Escape Hatch

The Batch API is the most underutilized feature for teams doing bulk embedding workloads. Batch API calls are submitted as a JSONL file of individual API requests, processed asynchronously with a 24-hour completion SLA, and priced at 50% of the synchronous API rate. text-embedding-3-small at $0.02/1M tokens drops to $0.01/1M tokens through the Batch API. text-embedding-3-large at $0.13/1M tokens drops to $0.065/1M tokens. For initial corpus indexing runs that might process hundreds of millions of tokens, this 50% reduction represents meaningful savings.
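To make the savings concrete, here is a minimal cost sketch using only the prices quoted in this article. The dictionary, constant, and function names are illustrative assumptions; check current pricing before relying on the numbers.

```python
# USD per 1M tokens, synchronous rates quoted in this article.
PRICE_PER_MILLION = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}
BATCH_DISCOUNT = 0.5  # Batch API is priced at 50% of the synchronous rate


def embedding_cost(model, tokens, batch=False):
    """Estimated cost in USD for embedding `tokens` input tokens."""
    rate = PRICE_PER_MILLION[model]
    if batch:
        rate *= BATCH_DISCOUNT
    return tokens / 1_000_000 * rate


for model in PRICE_PER_MILLION:
    sync_cost = embedding_cost(model, 500_000_000)
    batch_cost = embedding_cost(model, 500_000_000, batch=True)
    print(f"{model}: sync ${sync_cost:.2f}, batch ${batch_cost:.2f}")
```

For a 500-million-token indexing run this works out to $10.00 versus $5.00 on text-embedding-3-small, and $65.00 versus $32.50 on text-embedding-3-large.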

More importantly for throughput-constrained teams, the Batch API operates on its own quota that is separate from the synchronous API quotas. This means that running a large Batch API indexing job does not consume any of your RPM or TPM budget for the synchronous embedding API. A production system can process real-time user query embeddings through the synchronous API at full Tier 1 limits while simultaneously running a Batch API job to index new documents without either workload affecting the other. The Batch API supports up to 50,000 input strings per file upload, and there is no documented limit on the number of batch jobs you can queue simultaneously.

The 24-hour SLA means the Batch API is not appropriate for interactive or real-time embedding needs. However, the actual completion time is typically faster than the SLA — many batch jobs complete within minutes to a few hours, depending on current system load and file size. For document indexing pipelines that run nightly or on a schedule, the 24-hour SLA is almost never a practical constraint. The workflow is: prepare a JSONL file of embedding requests (each JSON line contains the model, input text, and a custom ID for your tracking), submit via the /v1/batches endpoint, poll the batch status until complete, then download the results file. The OpenAI Batch API limits reference documents the specific file size limits and status polling behavior.
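The first step of that workflow, preparing the JSONL file, might look like the sketch below. The helper names and chunk IDs are hypothetical. The envelope fields (custom_id, method, url, body) follow the Batch API request format as we understand it; confirm them against the current API reference before building on this.

```python
import json

MAX_REQUESTS_PER_FILE = 50_000  # per-file limit quoted in this article


def build_batch_lines(chunks, model="text-embedding-3-small"):
    """Turn (chunk_id, text) pairs into Batch API request lines."""
    lines = []
    for chunk_id, text in chunks:
        request = {
            "custom_id": str(chunk_id),  # your own tracking ID
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request))
    return lines


def split_into_files(lines, max_per_file=MAX_REQUESTS_PER_FILE):
    """Group request lines so that no file exceeds the per-file limit."""
    return [lines[i:i + max_per_file] for i in range(0, len(lines), max_per_file)]


chunks = [("doc1-chunk0", "First paragraph."), ("doc1-chunk1", "Second paragraph.")]
lines = build_batch_lines(chunks)
print(lines[0])
```

Each group returned by split_into_files would be written to its own .jsonl file, one request per line, and submitted as a separate batch job.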

Matryoshka Dimensions: What They Are and Why They Affect Cost

Matryoshka Representation Learning (MRL) is a training technique that produces embedding models where shorter vector prefixes retain meaningful semantic information. In practical terms, it means you can truncate a text-embedding-3-large vector from its full 3072 dimensions down to 256, 512, or 1024 dimensions and still retrieve semantically relevant results, at reduced quality compared to the full vector but often at acceptable quality for the application. Commonly used sizes are 256, 512, 1024, and 3072 (full) for text-embedding-3-large, and 512 and 1536 (full) for text-embedding-3-small. These are not the only options: the dimensions request parameter accepts other values up to the model's full size, such as the 768-dim configuration discussed later in this guide.

The cost impact of Matryoshka truncation operates on the storage side, not the API billing side. OpenAI charges for tokens processed (input text tokens), not for output vector dimensions. Whether you request a 256-dim or a 3072-dim output from text-embedding-3-large, the API cost is identical for the same input text. The savings come from storing smaller vectors in your vector database, which reduces RAM requirements in Qdrant, reduces storage costs in Pinecone serverless, and reduces memory pressure in self-hosted solutions. A 256-dim float32 vector occupies 1 KB instead of 12 KB for a full 3072-dim vector — a 12x reduction. At one million vectors, this translates from 12 GB to 1 GB of pure vector storage.
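The storage figures above follow from four bytes per float32 value. A minimal sketch of that arithmetic, counting raw vector data only (index structures and metadata in a real vector database add overhead on top); the names are illustrative:

```python
BYTES_PER_FLOAT32 = 4


def vector_storage_bytes(num_vectors, dims):
    """Raw float32 storage for num_vectors vectors of dims dimensions."""
    return num_vectors * dims * BYTES_PER_FLOAT32


print(vector_storage_bytes(1, 256))    # 1024
print(vector_storage_bytes(1, 3072))   # 12288

full = vector_storage_bytes(1_000_000, 3072)
small = vector_storage_bytes(1_000_000, 256)
print(full // small)                   # 12
```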

The recall quality trade-off from Matryoshka truncation is model and task dependent. OpenAI's benchmarks show that text-embedding-3-large at 256 dimensions still outperforms text-embedding-ada-002 at its full 1536 dimensions on the MTEB benchmark, which provides a useful baseline for teams evaluating truncation. However, MTEB benchmark performance does not always correlate to performance on your specific domain or task. The recommended evaluation approach is to take a representative sample of your actual queries, generate embeddings at both full and truncated dimensions, run nearest-neighbor searches in your vector database with both, and measure overlap in the top-k results. If the top-5 results overlap at 85% or more between full and truncated dimensions, the quality loss is likely acceptable for RAG applications. If overlap is lower, you need more dimensions — or the underlying embedding model may not be a good fit for your domain. The embedding model leaderboard tracks quality benchmarks across models and dimension settings.
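The overlap measurement described above can be sketched offline once you have full-dimension embeddings for a query sample and a corpus sample. This is a minimal brute-force version using only the standard library; the function names are hypothetical, it assumes the full vectors are unit length so the dot product ranks like cosine similarity, and the toy one-hot vectors stand in for real embeddings.

```python
import math


def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def top_k_ids(query, corpus, k):
    """IDs of the k corpus vectors with the highest dot product to query."""
    scored = sorted(
        corpus.items(),
        key=lambda item: sum(q * v for q, v in zip(query, item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


def overlap_at_k(queries, corpus, dims, k=5):
    """Mean fraction of full-dimension top-k results kept after truncation."""
    short_corpus = {i: truncate_and_normalize(v, dims) for i, v in corpus.items()}
    total = 0.0
    for q in queries:
        full = set(top_k_ids(q, corpus, k))
        short = set(top_k_ids(truncate_and_normalize(q, dims), short_corpus, k))
        total += len(full & short) / k
    return total / len(queries)


# Toy data: four one-hot "documents" and two queries.
corpus = {
    "a": [1.0, 0.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0, 0.0],
    "c": [0.0, 0.0, 1.0, 0.0],
    "d": [0.0, 0.0, 0.0, 1.0],
}
queries = [[0.8, 0.5, 0.1, 0.3], [0.1, 0.2, 0.9, 0.4]]
print(overlap_at_k(queries, corpus, dims=4, k=2))  # 1.0
print(overlap_at_k(queries, corpus, dims=2, k=2))  # 0.5
```

In the toy data the second query depends entirely on the dimensions that truncation removes, so its overlap drops to zero and the mean falls to 0.5, well under the 85% guideline. On a real corpus you would run the same comparison through your vector database rather than a brute-force sort.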

Migrating from text-embedding-ada-002: The Deprecation Timeline and Migration Path

text-embedding-ada-002 (ada-002) is OpenAI's previous-generation embedding model, released in late 2022 and widely used for early RAG systems. OpenAI has indicated that ada-002 is deprecated in favor of text-embedding-3-small and text-embedding-3-large. No hard shutdown date had been announced as of June 2026, but teams should assume that continued availability is not guaranteed and that the risk of a shutdown on short notice increases over time. OpenAI has historically provided advance notice before retiring models, but waiting for the announcement before beginning migration puts your system in a reactive posture.

The migration is not a simple model name swap. ada-002 produces 1536-dimensional vectors. text-embedding-3-small also produces 1536-dimensional vectors by default, but the vector representations are not compatible — you cannot query text-embedding-3-small embeddings against an index populated with ada-002 embeddings, as the semantic geometry is different. A full re-embedding of your corpus is required when migrating to either text-embedding-3-small or text-embedding-3-large. This means: new index creation in your vector database, re-processing every document through the new model, verifying result quality, and cutover. For teams with large corpora, the Batch API is the cost-efficient path for re-embedding — at $0.01/1M tokens through the Batch API for text-embedding-3-small, re-embedding 10 million tokens costs $0.10.

The quality case for migrating is strong independent of deprecation risk. text-embedding-3-small outperforms ada-002 on the MTEB benchmark while costing 5x less ($0.02/1M versus the $0.10/1M that ada-002 cost when it was the current model). text-embedding-3-large provides the highest quality OpenAI embedding available at $0.13/1M. For most RAG applications, text-embedding-3-small at its default 1536 dimensions is the right choice: it is cheaper than ada-002, better quality, and supports Matryoshka truncation for storage optimization. Teams on ada-002 should plan migration in their next technical debt sprint. Waiting until a forced shutdown creates schedule pressure and risks production downtime. The chunking strategies for RAG guide covers how chunk size affects both embedding quality and token costs for the new models.

Practical Throughput Math: How Many Documents per Minute at Each Tier?

Translating rate limits into practical throughput requires knowing your average chunk size in tokens. A typical RAG chunk (a paragraph of text, a product description, a support ticket snippet) ranges from 100 to 500 tokens. Using 250 tokens as a representative middle value: at Tier 1 with 1,000,000 TPM, the maximum sustained throughput is 4,000 chunks per minute, or approximately 240,000 chunks per hour. That corresponds to 60 million tokens per hour, but note that the 3,000,000 TPD limit would be hit after about 3 minutes of sustained maximum throughput. The TPD ceiling, not the TPM limit, is therefore the primary constraint for bulk indexing at Tier 1.

At Tier 2 with 10,000,000 TPM, and assuming the unpublished TPD ceiling is high enough not to bind, the same 250-token chunks can be embedded at up to 40,000 per minute or 2.4 million per hour. A 1-million-chunk corpus would take about 25 minutes of sustained maximum throughput. This assumes the Batch API is not used; with the Batch API, even Tier 1 accounts can process large corpora efficiently because the Batch API quota is separate. At Tier 5 with 250,000,000 TPM, the theoretical throughput is 1,000,000 chunks per minute (250 million tokens per minute), but actual achieved throughput will be limited by your client's ability to batch and submit requests at that rate, not by OpenAI's API limits.
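The per-tier arithmetic can be tabulated with a few lines of code. This sketch uses the TPM values from the table earlier in the article and the 250-token chunk assumption; the names are illustrative, and it ignores RPM, TPD, and client-side bottlenecks.

```python
# TPM per tier, as listed in the rate limit table in this article.
TPM_BY_TIER = {
    1: 1_000_000,
    2: 10_000_000,
    3: 50_000_000,
    4: 100_000_000,
    5: 250_000_000,
}
TOKENS_PER_CHUNK = 250  # representative chunk size used in the text


def chunks_per_minute(tier):
    """Theoretical TPM-bound throughput in chunks per minute."""
    return TPM_BY_TIER[tier] // TOKENS_PER_CHUNK


def minutes_to_index(tier, total_chunks):
    """Minutes of sustained maximum throughput needed for a corpus."""
    return total_chunks / chunks_per_minute(tier)


for tier in TPM_BY_TIER:
    print(tier, chunks_per_minute(tier), minutes_to_index(tier, 1_000_000))
```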

The practical recommendation for teams doing initial corpus indexing: use the Batch API regardless of your tier. Submit files of up to 50,000 embedding requests per batch, queue as many batch jobs as needed, and let the 24-hour SLA work in your favor. This uses a separate quota, costs 50% less, and allows your synchronous embedding budget to remain fully available for real-time user requests during the indexing process. Only fall back to synchronous indexing if your use case requires real-time confirmation that each document has been indexed before proceeding.

Sourcing, Verification, and Live Data Caveats

All rate limit figures in this document are sourced from platform.openai.com/docs/guides/rate-limits and openai.com/api/pricing, verified June 2026. OpenAI's rate limits are documented per model and tier, and the documentation is updated when limits change. The figures most likely to have changed since this writing are the exact RPM and TPM values for Tier 2 and above, as OpenAI has increased these limits several times in 2024 and 2025. Always verify your current limits in the platform dashboard rather than relying on documentation or third-party references.

The model pricing figures ($0.02/1M for text-embedding-3-small, $0.13/1M for text-embedding-3-large) were accurate as of June 2026. OpenAI has reduced embedding prices several times since the models launched, and further reductions are plausible. If the gap between text-embedding-3-small and text-embedding-3-large pricing has narrowed since this writing, the quality advantage of -3-large becomes more compelling relative to its cost.

The Matryoshka dimension benchmarks referenced in this document (text-embedding-3-large at 256 dimensions outperforming ada-002 at 1536 dimensions) are based on OpenAI's published MTEB results at launch. MTEB scores are available at huggingface.co/spaces/mteb/leaderboard and are updated as new models and evaluations are submitted. For the most current performance comparisons, consult the live leaderboard rather than static benchmark figures. The Batch API per-file limit of 50,000 inputs and the 24-hour SLA are documented at platform.openai.com/docs/api-reference/batch and have been stable since the Batch API launched, but verify these limits remain current before building production pipelines that depend on them.

Managing OpenAI Embedding Rate Limits in Production

1. Calculate your TPM needs before choosing a tier
Before your first production deploy, estimate your peak token consumption per minute: multiply your peak requests per minute by the average number of tokens per request (inputs per request times average chunk size in tokens). Compare this against the TPM limit for your current tier. If your peak need is below 60% of your tier's TPM limit, you have acceptable headroom. If it exceeds 80%, you will experience periodic 429 errors under peak load and should either upgrade your tier (by spending to the next threshold), increase your batching to reduce RPM while staying within TPM, or shift bulk workloads to the Batch API. Also calculate your daily token volume to check against the TPD limit; Tier 1's 3,000,000 TPD is the most common unexpected constraint for new teams.
2. Implement exponential backoff for 429 errors
A 429 response from the OpenAI API includes a Retry-After header indicating when the rate limit window resets. Your embedding client should implement exponential backoff: on first 429, wait the Retry-After value (or a minimum of 1 second if the header is absent); on second consecutive 429, double the wait time; continue doubling up to a maximum wait of 60-120 seconds. Do not implement a tight retry loop without backoff — this will not improve throughput and may trigger additional rate limiting. The OpenAI Python and Node.js client libraries include built-in retry with exponential backoff for rate limit errors, which is the simplest implementation path. Always log 429 occurrences with the model, request size, and timestamp to identify whether you are hitting RPM or TPM limits.
3. Use the Batch API for bulk indexing
For any embedding workload that is not interactive — initial corpus indexing, nightly re-indexing, scheduled content processing — use the Batch API instead of the synchronous API. Prepare your input as a JSONL file where each line is a complete embedding request object (model, input text, custom_id for tracking). Upload via /v1/files with purpose='batch', create a batch job via /v1/batches, then poll /v1/batches/{batch_id} until the status is 'completed'. Download the output file via the output_file_id. This costs 50% less than synchronous calls, uses a separate quota that does not affect your real-time embedding capacity, and scales to 50,000 inputs per file. For corpora larger than 50,000 chunks, submit multiple batch jobs in sequence or parallel.
4. Implement Matryoshka truncation to reduce storage costs
If you are using text-embedding-3-small or text-embedding-3-large and your vector database storage costs are meaningful, evaluate Matryoshka truncation. Set the 'dimensions' parameter in your embedding API request to a smaller value (e.g., 512 for text-embedding-3-large instead of 3072). Measure recall quality on a representative set of your actual queries by comparing top-k results between full and truncated dimensions. If recall overlap is above 85%, the truncated dimensions are likely acceptable. For a 768-dim deployment on text-embedding-3-large (a reasonable middle ground), you reduce vector storage by 4x relative to full 3072 dimensions at modest quality cost. Remember: API billing is identical regardless of dimension — the savings are entirely on the storage side in your vector database.
5. Monitor and tune with the usage dashboard
The OpenAI platform dashboard at platform.openai.com/usage shows token consumption by model, time period, and API key. Monitor your embedding model usage weekly during the early weeks of production to identify whether your consumption patterns are approaching tier limits. The rate limits page at platform.openai.com/settings/organization/limits shows your current tier, all per-model limits, and your spend-to-date toward the next tier promotion threshold. Set up a billing alert in the platform to notify you when monthly spend reaches 50% and 80% of your budget — this catches unexpected usage spikes before they become costly surprises. If you are approaching Tier 2 and your workload genuinely needs higher TPM, it is sometimes cost-efficient to run a modest amount of deliberate API usage (like a documentation indexing job you were planning anyway) to push your cumulative spend over the $50 Tier 2 threshold.
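The headroom check from the first step can be sketched as a small helper. The function name is hypothetical, and the "watch closely" label for the band between the 60% and 80% thresholds is our own assumption, since the guidance above only names the two ends.

```python
def tpm_headroom(peak_requests_per_minute, avg_tokens_per_request, tpm_limit):
    """Return (utilization, verdict) using the 60% and 80% thresholds."""
    utilization = peak_requests_per_minute * avg_tokens_per_request / tpm_limit
    if utilization > 0.80:
        verdict = "at risk of 429s"
    elif utilization <= 0.60:
        verdict = "acceptable headroom"
    else:
        verdict = "watch closely"  # assumed label for the in-between band
    return utilization, verdict


# Tier 1 example: 1,000,000 TPM with 10,000-token requests.
print(tpm_headroom(40, 10_000, 1_000_000))  # (0.4, 'acceptable headroom')
print(tpm_headroom(90, 10_000, 1_000_000))  # (0.9, 'at risk of 429s')
```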

Frequently Asked Questions

What happens when I hit a 429 rate limit on the OpenAI embeddings API?

The API returns a 429 Too Many Requests HTTP status code. The response body contains an error object with a 'type' of 'requests' (for RPM limit) or 'tokens' (for TPM or TPD limit). The response includes a Retry-After header with the number of seconds until the limit window resets. Well-behaved clients should honor this header and back off for at least that duration before retrying. If you do not implement backoff and immediately retry, you will continue receiving 429s until the window resets, wasting request attempts. Your application should log these events with the limit type and timestamp to diagnose which constraint is being hit.
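A backoff policy along these lines can be sketched without tying it to a particular HTTP client. RateLimited below is a hypothetical stand-in for whatever 429 exception your client raises, and the default limits are assumptions chosen for illustration. The official OpenAI client libraries already retry rate limit errors for you, so a hand-rolled loop like this mainly applies to custom HTTP code.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a 429 error from your HTTP client."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds from Retry-After, if present


def call_with_backoff(call, max_attempts=6, min_wait=1.0, max_wait=60.0,
                      sleep=time.sleep):
    """Run call(); on RateLimited, wait and retry with doubling delays."""
    wait = 0.0
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimited as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            hinted = err.retry_after if err.retry_after is not None else min_wait
            # First 429: use the hint. Later 429s: double the previous wait,
            # but never wait less than the server's latest hint.
            wait = hinted if wait == 0.0 else max(wait * 2, hinted)
            wait = min(wait, max_wait)
            sleep(wait)


# Demo with a fake call that is rate limited three times, then succeeds.
# Passing waits.append as `sleep` records the delays instead of sleeping.
waits = []
state = {"calls": 0}


def flaky_call():
    state["calls"] += 1
    if state["calls"] <= 3:
        raise RateLimited()
    return "ok"


result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)  # ok [1.0, 2.0, 4.0]
```

Adding random jitter to each wait is a common refinement when many workers share one quota, since it keeps them from retrying in lockstep.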

How do I move from Tier 1 to Tier 2 on the OpenAI API?

Tier promotion is automatic and based on cumulative account spend. Tier 2 requires $50 in total spend since account creation — this includes all API usage across all models (completions, chat, embeddings, image generation), not just embedding usage. Tier promotion happens within a few hours of crossing the threshold, without any action required from you. You can see your current tier and the spend needed for the next tier in the rate limits section of your organization settings at platform.openai.com/settings/organization/limits. There is no way to manually request tier promotion — it is entirely spend-driven.

Can I use the Batch API for embeddings, or is it only for chat completions?

The Batch API supports the /v1/embeddings endpoint as of June 2026. You can submit batch files containing embedding requests for text-embedding-3-small and text-embedding-3-large. The 50% price discount and separate quota apply identically to embedding batch requests as they do to chat completion batch requests. The only constraint is the 50,000 input limit per batch file. For large corpora, submit multiple batch files — there is no documented limit on concurrent batch jobs per account, though very high numbers of simultaneous jobs may experience longer processing times.

Should I migrate from text-embedding-ada-002 to the new models?

Yes, and you should do it proactively rather than waiting for a forced shutdown. OpenAI has indicated that text-embedding-ada-002 is deprecated in favor of the text-embedding-3 models, although no shutdown date had been announced as of June 2026. text-embedding-3-small is cheaper (roughly 5x lower pricing), better quality on standard benchmarks, and supports Matryoshka truncation. text-embedding-3-large is the highest quality OpenAI embedding model currently available. Migration requires re-embedding your entire corpus because the vector representations are not compatible between models: you cannot mix ada-002 and text-embedding-3 vectors in the same index. Use the Batch API for the re-embedding run to minimize cost and avoid impacting your synchronous quota.

Does using fewer dimensions with Matryoshka affect the token count billed by OpenAI?

No. OpenAI charges for input tokens (the text you are embedding), not for output vector dimensions. Whether you request a 256-dimensional or 3072-dimensional output from text-embedding-3-large, the API cost is identical for the same input text. The savings from Matryoshka truncation are entirely on the downstream storage side — smaller vectors mean lower RAM requirements in your vector database, lower storage costs in managed vector databases that charge by storage volume, and faster nearest-neighbor search due to reduced vector size. If you are on a tight embedding budget, the right optimization is choosing text-embedding-3-small over text-embedding-3-large rather than Matryoshka truncation.
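For reference, requesting shortened vectors is a single extra request parameter. The wrapper below is a sketch: embed_texts is a hypothetical helper, and it assumes a client object shaped like the official OpenAI Python client, with an embeddings.create method that accepts model, input, and dimensions. The stub client exists only so the example runs offline.

```python
from types import SimpleNamespace


def embed_texts(client, texts, model="text-embedding-3-large", dimensions=None):
    """Embed texts, optionally asking the API for shortened vectors."""
    kwargs = {"model": model, "input": texts}
    if dimensions is not None:
        kwargs["dimensions"] = dimensions  # billing is per input token either way
    response = client.embeddings.create(**kwargs)
    return [item.embedding for item in response.data]


class StubEmbeddings:
    """Offline stand-in that mimics the response shape with zero vectors."""

    def create(self, **kwargs):
        self.last_kwargs = kwargs
        size = kwargs.get("dimensions", 3072)
        items = [SimpleNamespace(embedding=[0.0] * size) for _ in kwargs["input"]]
        return SimpleNamespace(data=items)


stub_client = SimpleNamespace(embeddings=StubEmbeddings())
vectors = embed_texts(stub_client, ["first chunk", "second chunk"], dimensions=512)
print(len(vectors), len(vectors[0]))  # 2 512
```

With the real library you would pass an OpenAI() client instead of the stub; the request's token count, and therefore its cost, is the same whichever dimensions value you choose.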

What is the RPM limit for text-embedding-3-large specifically?

As of June 2026, text-embedding-3-large shares the same RPM limits as text-embedding-3-small: 3,000 RPM on Tier 1, 5,000 RPM on Tiers 2 and 3, and 10,000 RPM on Tiers 4 and 5. The TPM limits are the same across both embedding models within each tier. The models do not have different rate limit schedules from each other — they share the same quota pool for requests and tokens. Verify this remains true for your account in the platform rate limits dashboard, as OpenAI occasionally adjusts limits per model based on demand.

How does concurrent request batching help with embedding throughput?

Sending multiple concurrent requests (each containing a large batch of inputs) allows you to approach your TPM ceiling faster than sequential requests. If your embedding pipeline runs one request at a time, throughput is capped at one batch per request round trip: a pipeline whose requests take one second each can never submit more than 60 batches per minute, however much TPM headroom remains. With 5 concurrent requests each containing 500 inputs at 200 tokens each, you submit 500,000 tokens in a single second of wall time, assuming the API handles them at full speed. Concurrency does not raise your rate limits; it helps you utilize your full TPM budget more efficiently by reducing the idle time between requests. Most embedding pipeline frameworks support configurable worker concurrency. Start at 3-5 concurrent workers for Tier 1 and increase if you see headroom in your TPM utilization metrics.
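A minimal version of this pattern, using a thread pool from the standard library, is sketched below. The function names are hypothetical and fake_embed_batch stands in for a real API call, so the example shows only the batching and ordering logic, not actual network behavior.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(inputs, batch_size):
    """Split inputs into consecutive batches of at most batch_size items."""
    return [inputs[i:i + batch_size] for i in range(0, len(inputs), batch_size)]


def embed_concurrently(inputs, embed_batch, batch_size=500, workers=5):
    """Embed inputs in batches across a small worker pool, preserving order."""
    batches = make_batches(inputs, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(embed_batch, batches)  # map yields in input order
        return [vector for batch_vectors in results for vector in batch_vectors]


def fake_embed_batch(batch):
    # Stand-in for a real API call: one tiny "vector" per input string.
    return [[float(len(text))] for text in batch]


texts = [f"chunk {i}" for i in range(1200)]
vectors = embed_concurrently(texts, fake_embed_batch)
print(len(vectors))  # 1200
```

Threads suit this workload because each worker spends most of its time waiting on the network. In a real pipeline each worker would also need the 429 handling described earlier.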

What is the maximum batch size per embedding API request?

The OpenAI embeddings API accepts up to 2048 input strings per synchronous request as of June 2026. This is the batch size for a single /v1/embeddings call. For the Batch API, the limit per file is 50,000 input objects; each object is one embedding request, which itself could contain a single string. The practical optimal synchronous batch size for throughput is the largest batch that keeps individual request latency under your application's timeout threshold and keeps total tokens per request within any per-request token cap the API enforces. For most use cases, batches of 100-500 inputs per request provide good throughput without excessive latency per call.

