By The DDH Team · Digital Dashboard Hub

OpenAI Assistants API v2 Rate Limits 2026: Files, Threads, Runs, and Cost

Before diving into Assistants-specific limits, it's worth comparing them to Claude's equivalent system. Our Claude API rate limits 2026 page covers Anthropic's tier-by-tier TPM, RPM, and ITPM ceilings — Claude's direct API has a simpler architecture than Assistants (no managed threads, no vector stores), and for many use cases the tradeoff between control and convenience is the central architectural decision.

The OpenAI Assistants API v2 — documented at https://platform.openai.com/docs/assistants/overview — is OpenAI's managed agent framework. It handles thread state persistence, automatic context truncation, vector store retrieval for file_search, and sandboxed Python execution via code_interpreter. In exchange for this convenience, you accept a more complex rate-limit surface: Assistants limits apply at the vector store level, the run level, the thread level, and the API tier level simultaneously. Understanding which limit applies where is the foundation of a reliable Assistants deployment.

This page is the complete reference for Assistants API v2 quotas as of June 2026. For cost modeling, see our OpenAI API cost calculator. If you're evaluating whether to build on Assistants or raw chat completions, the OpenAI to Claude migration tutorial compares both approaches. And our Claude API rate limits 2026 reference covers the Claude-side numbers if you're doing a cross-platform comparison.

OpenAI Assistants API v2 limits — June 2026

| Feature | Limit / Value | Notes |
|---|---|---|
| Max files per vector store | 10,000 files | Hard limit per vector store; create multiple vector stores if needed |
| Max file size for file_search | 512 MB per file | Supported types: PDF, DOCX, HTML, JSON, MD, TXT, and others |
| Max vector store storage | 1 GB free, then $0.10/GB/day | Free tier resets monthly; storage billed daily beyond 1 GB |
| Max messages per thread | No hard limit | Real constraint is the 128k context window per run; older messages auto-truncated |
| Max context tokens per run | 128,000 tokens | For gpt-5.5 and gpt-5.4; set max_prompt_tokens to prevent overages |
| Max tokens per message | Counted against run context | No per-message cap; the run's context window is the constraint |
| Parallel function tool calls | Up to 128 per run step | Model can fan out multiple function calls in a single required_action state |
| code_interpreter execution limit | 120 seconds, 512 MB memory | Python only; session billed at $0.03/session for duration of run |
| Max function tools per assistant | 128 | Higher than the 64-tool limit for direct chat completions API |
| Run expiry | 10 minutes | Runs expire if not polled/completed; required_action tool outputs must be submitted within 10 min |
| Thread and message retention | 60 days | Threads and messages retained 60 days unless explicitly deleted via API |
| File retention | Until deleted by user | Files in vector store persist until you call the delete endpoint |
| Rate limits | Inherit API key tier | Each model call in a run counts as a chat completion for tier RPM/TPM purposes |
| Annotation tokens | Billed as output tokens | File citations and annotations in assistant responses billed at standard output token rates |

Sources, fetched 2026-06-21: https://platform.openai.com/docs/assistants/overview, https://platform.openai.com/docs/guides/rate-limits, https://openai.com/api/pricing/

Assistants API v2 architecture: threads, runs, and what limits where

The Assistants API v2 is built around three core abstractions: an Assistant (which holds tools, instructions, and model configuration), Threads (which hold the message history for a conversation), and Runs (which execute the assistant against a thread to generate a response). Understanding which limit lives at which layer is essential for diagnosing production issues — a vector store quota error, a run expiry, and a TPM rate limit look very different and require different fixes.

**Vector store limits apply at the assistant level**: each assistant can attach up to 50 vector stores for file_search, each vector store holds up to 10,000 files, and each file can be up to 512 MB. These limits govern your document retrieval capability — if you're building a knowledge base assistant, the vector store architecture decisions (how many vector stores, how to partition files across them) happen at assistant creation time, not at run time. Vector store storage is billed separately: 1 GB free, then $0.10/GB/day beyond that.

**Context window limits apply at the run level**: each Run executes the assistant against a thread and has access to a finite context window — 128,000 tokens for gpt-5.5 and gpt-5.4. The full content of this window includes the assistant's system instructions, all messages in the thread (up to the window limit), any file_search retrieved chunks, function tool outputs, and code_interpreter results. **The automatic truncation strategy OpenAI applies to long threads is important to understand**: by default, older messages are dropped to make room for newer ones and the assistant's response. You can customize this with the truncation_strategy parameter on the run.

**Retention limits apply at the thread and message level**: threads and their messages are retained for 60 days from the last activity, then automatically deleted. Files in vector stores persist until you explicitly delete them via the API. Runs expire after 10 minutes, which is a hard limit on how long a run can remain in the requires_action state (waiting for you to submit tool outputs). If your server doesn't submit tool outputs within 10 minutes, the run transitions to expired and cannot be resumed.

The v1 to v2 migration introduced several important changes: improved file_search with semantic chunking options, the ability to attach vector stores at the thread level (not just the assistant level), better streaming support via the streaming API (eliminating the need for polling), and improved token usage reporting. **If you're running any v1 Assistants code, migration to v2 is strongly recommended** — the v1 API is deprecated and the improved v2 architecture makes both cost optimization and rate-limit management significantly easier. See https://platform.openai.com/docs/assistants/overview for the full v2 migration guide.


File search limits: vector stores, max files, and storage pricing

File search is the Assistants API's managed RAG (retrieval-augmented generation) system. When you add files to a vector store, OpenAI automatically chunks, embeds, and indexes them. When an assistant run invokes file_search, the API retrieves semantically relevant chunks and injects them into the context window. The entire pipeline — chunking, embedding, storage, retrieval — is managed for you, but it comes with specific limits that govern how much you can store and retrieve.

**The 10,000-file ceiling per vector store is the hard limit** you'll hit first if you're building a large document repository. At 512 MB per file maximum, a fully loaded vector store could theoretically hold up to 5 TB of documents (10,000 × 512 MB), though storage costs ($0.10/GB/day) make this prohibitive at scale. In practice, most production use cases work well within these limits — a customer support knowledge base with 5,000 product documentation pages, a legal research assistant with 8,000 case documents.

Storage pricing is structured as: 1 GB free per month, then $0.10/GB/day. This daily billing model means costs compound quickly for large stores. A 10 GB vector store costs $0.90/day ($27/month) just in storage. A 100 GB store costs $9.90/day ($297/month). **Budget vector store storage separately from model token costs** — for document-heavy applications, storage can become the dominant cost line. Set auto-expiry policies on vector stores that aren't actively used to prevent storage cost accumulation from stale data.
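The storage arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the figures quoted in this article (1 GB free, $0.10/GB/day) and a 30-day month; the function name is illustrative, not part of any SDK.

```python
FREE_GB = 1.0            # free allowance quoted in this article
RATE_PER_GB_DAY = 0.10   # USD per GB per day beyond the free allowance


def vector_store_storage_cost(size_gb: float, days: int = 30) -> tuple[float, float]:
    """Return (daily_cost, period_cost) in USD for a store of size_gb."""
    billable_gb = max(0.0, size_gb - FREE_GB)  # only storage above 1 GB is billed
    daily = round(billable_gb * RATE_PER_GB_DAY, 2)
    return daily, round(daily * days, 2)


for size in (1, 10, 100):
    daily, monthly = vector_store_storage_cost(size)
    print(f"{size:>4} GB: ${daily:.2f}/day, ${monthly:.2f} per 30 days")
```

Running the same function over your real store sizes gives a quick storage line item to track next to token spend.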

File type support is broad: PDF, DOCX, HTML, JSON, Markdown, plain text, and several programming language formats. The file_search tool handles chunking automatically, and you can override it through the vector store's chunking_strategy setting (max_chunk_size_tokens and chunk_overlap_tokens under the static strategy). The default chunk size is 800 tokens with 400-token overlap; this works well for most prose documents but may not be optimal for code files or structured data. For technical documentation with many short code snippets, smaller chunk sizes (300–400 tokens) often improve retrieval precision.

Each assistant can attach up to 50 vector stores, but only 1 vector store can be attached at the thread level (for per-conversation context). This architectural distinction matters: if you want different users to have different document contexts, you can create per-user vector stores and attach them at the thread level rather than the assistant level. This also keeps the assistant-level vector store stable (good for caching), while allowing per-thread variability.

Supported file types as of June 2026 include: .pdf, .docx, .doc, .html, .json, .md, .txt, .tex, .py, .js, .ts, .java, .cs, .cpp, .c, .rb, .go, and more. Image files and video files are NOT supported for file_search — those require the vision and audio capabilities in the main chat completions API. Check the current list at https://platform.openai.com/docs/assistants/overview as supported types continue to expand.


Thread and message limits: context window is the real constraint

There is no hard limit on the number of messages you can add to an Assistants API thread. A thread can accumulate hundreds or thousands of messages over its 60-day retention window. However, **the context window per run is the real constraint**: each Run has access to at most 128,000 tokens of context on gpt-5.5 and gpt-5.4. When the total token count of the thread's messages, instructions, tool outputs, and retrieved file chunks exceeds this limit, OpenAI's automatic truncation kicks in.

The default truncation strategy drops older messages from the thread to fit within the context window. The most recent messages are preserved; the oldest are removed. This is often the correct behavior for chat applications where recent context is most relevant, but it can cause subtle failures in applications where historical messages contain important state (user preferences set at the start of the conversation, a document analyzed several turns ago). **For stateful agents, the truncation strategy should be explicitly configured, not left at default.**

You can control truncation with the truncation_strategy parameter on each Run: set type: "last_messages" with a last_messages count to specify exactly how many messages to retain. Alternatively, set max_prompt_tokens to an absolute token budget for the prompt (system instructions + message history + tool outputs), and max_completion_tokens to cap the assistant's response length. Setting both parameters gives you predictable, bounded token costs per run rather than variable costs driven by thread growth.
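As a rough illustration of the run-level controls described above, the sketch below assembles the keyword arguments you might pass when creating a run. The helper name `build_run_params` is hypothetical, and the exact request shape (in particular the key that carries the message count inside truncation_strategy) is an assumption to verify against the current API reference.

```python
from typing import Any, Optional


def build_run_params(
    assistant_id: str,
    max_prompt_tokens: int,
    max_completion_tokens: int,
    keep_last_messages: Optional[int] = None,
) -> dict[str, Any]:
    """Build bounded run parameters (hypothetical helper, not an SDK call)."""
    if max_prompt_tokens <= 0 or max_completion_tokens <= 0:
        raise ValueError("token budgets must be positive")
    params: dict[str, Any] = {
        "assistant_id": assistant_id,
        "max_prompt_tokens": max_prompt_tokens,          # cap on input tokens per run
        "max_completion_tokens": max_completion_tokens,  # cap on generated tokens
    }
    if keep_last_messages is None:
        # Leave truncation to the service default.
        params["truncation_strategy"] = {"type": "auto"}
    else:
        # Assumed request shape: keep only the N most recent messages.
        params["truncation_strategy"] = {
            "type": "last_messages",
            "last_messages": keep_last_messages,
        }
    return params


print(build_run_params("asst_example", 20000, 500, keep_last_messages=10))
```

Building the parameters in one place makes it harder for a code path to create an unbounded run by accident.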

**The context window limit also applies to file_search retrieved chunks.** Each file_search invocation within a run retrieves the top-N most relevant chunks from your vector store and injects them into the context. OpenAI automatically limits the total retrieved chunk size to stay within the context window, which means that very long threads leave less room for retrieved content. For applications that rely heavily on file retrieval, keep threads shorter or use the truncation_strategy to aggressively prune message history.

Thread retention is 60 days from the last activity on the thread. After 60 days of inactivity, the thread and all its messages are automatically deleted. If your application needs to maintain conversation context beyond 60 days (for long-running projects, customer accounts, etc.), you must either refresh the thread's activity within the window, export and reconstruct the relevant context into a new thread, or store your own message history externally and reinject it as needed. The 60-day limit is not configurable — it applies to all API tiers including Enterprise.
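If you track each thread's last activity timestamp yourself, a small helper can flag threads that are approaching the retention window so you can refresh or export them in time. This is a sketch under the 60-day figure quoted above; the function names and the 7-day warning margin are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION = timedelta(days=60)  # retention window quoted in this article


def deletion_time(last_activity: datetime) -> datetime:
    """When a thread with this last-activity time would be deleted."""
    return last_activity + RETENTION


def needs_refresh(
    last_activity: datetime,
    warn_days: int = 7,
    now: Optional[datetime] = None,
) -> bool:
    """True when the thread is within warn_days of its deletion time."""
    now = now or datetime.now(timezone.utc)
    return deletion_time(last_activity) - now <= timedelta(days=warn_days)


last = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(deletion_time(last).date())
```

A daily job that runs `needs_refresh` over your stored thread records is usually enough to avoid losing long-lived conversations silently.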


Function tools: 128 tools per assistant, parallel calls

The Assistants API supports up to 128 function tools per assistant — notably higher than the 64-tool limit on the direct chat completions API. This makes the Assistants API more accommodating for large-scale agent designs that need broad tool coverage. However, the same practical advice applies: **most production agents work best with 10–30 focused tools, not 128**, because the model's tool selection accuracy degrades as the tool count grows and the average per-call overhead of tool definitions increases with count.

Function tool schemas in Assistants follow the same JSON Schema draft-07 constraints as the direct chat completions API: type, properties, required, enum, nested objects, and array types are all supported. $ref and recursive schemas are not supported. Tool names must be unique within an assistant, and the same naming conventions apply: snake_case, ASCII alphanumeric plus underscores, 1–64 characters. The tool schema format for Assistants uses the function object structure identical to the chat completions API, making migration between the two straightforward.

Parallel function calling in Assistants works through the requires_action run status. When the model decides to call multiple functions simultaneously, the run transitions to requires_action, and its required_action field includes a submit_tool_outputs object containing all the pending tool calls. Your server must handle all pending tool calls and submit all results in a single submit_tool_outputs API call; you cannot submit partial results for some tools and leave others pending. **All tool outputs for a single requires_action step must be submitted together** before the run can continue.

The 10-minute run expiry deadline is most critical while a run is waiting in requires_action. If your function tools make slow external API calls (database queries, third-party webhooks, email sends), you must complete all of them and submit the results within 10 minutes of the run entering the requires_action state. If you miss this deadline, the run expires and cannot be resumed. The only recovery is to create a new run, which means the assistant will need to re-reason about what tools to call. **Design your tool implementations with a total execution time budget of 8 minutes (leaving 2 minutes of margin)** and implement timeouts on all external calls.
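One way to enforce that budget is to run every pending tool call concurrently under a shared timeout and turn any timeout or failure into an error payload, so the batch can still be submitted in one go. The sketch below uses only asyncio; the flattened tool-call dicts (`id`, `name`, `arguments`) and the `handlers` mapping are simplified, hypothetical stand-ins for whatever your client library delivers.

```python
import asyncio
import json
from typing import Awaitable, Callable

TOOL_BUDGET_SECONDS = 8 * 60  # 8-minute budget suggested in the text above

Handler = Callable[[str], Awaitable[str]]


async def run_tool_calls(
    tool_calls: list[dict],
    handlers: dict[str, Handler],
    budget_seconds: float = TOOL_BUDGET_SECONDS,
) -> list[dict]:
    """Run all tool calls concurrently; always return one output per call."""

    async def run_one(call: dict) -> dict:
        handler = handlers.get(call["name"])
        if handler is None:
            output = json.dumps({"error": f"unknown tool {call['name']}"})
        else:
            try:
                # Every call starts at the same moment, so a shared timeout
                # bounds the whole batch to roughly budget_seconds.
                output = await asyncio.wait_for(
                    handler(call["arguments"]), timeout=budget_seconds
                )
            except asyncio.TimeoutError:
                output = json.dumps({"error": "tool timed out"})
            except Exception as exc:  # report the failure instead of stalling the run
                output = json.dumps({"error": str(exc)})
        return {"tool_call_id": call["id"], "output": output}

    return list(await asyncio.gather(*(run_one(c) for c in tool_calls)))
```

The returned list has one entry per pending call, in the original order, which matches the all-or-nothing submission rule: a timed-out tool yields an error string the model can react to rather than an expired run.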

The streaming API is the recommended approach for handling requires_action in production. With streaming, your server receives an event when the run enters the requires_action state, can immediately begin executing tool calls, and submits results as soon as they're available, all without the latency overhead of polling. Streaming also eliminates the polling loop entirely for non-tool-calling runs, reducing both latency and API call count. See https://platform.openai.com/docs/assistants/overview for the streaming event format and the submit_tool_outputs endpoint.

For function tools that might take a long time, consider implementing a 'start task and poll' pattern: your tool starts an async job, returns a job_id immediately, and provides a separate check_job_status tool that polls for completion. This keeps each individual tool execution fast (under 30 seconds) while supporting longer-running background tasks. The assistant can call check_job_status in subsequent turns without hitting the 10-minute run expiry on the initial requires_action state.


code_interpreter: resource limits and use cases

code_interpreter gives your assistant access to a sandboxed Python execution environment. The assistant can write Python code, execute it, observe the output (including generated files), and iterate — all within a single run. It's one of the most powerful Assistants features for data analysis, report generation, and image manipulation, but it comes with specific resource limits that define what's feasible in a single session.

**The hard resource limits**: 120 seconds of execution time per code block, 512 MB of memory, Python-only execution (no Node.js, Ruby, etc.). The 120-second limit applies per code execution step — if the assistant generates code that runs for more than 120 seconds (complex data processing, large model inference, slow nested loops), it will time out and the code_interpreter will return an error. **The practical workaround for tasks that exceed 120 seconds is to break them into multiple smaller runs**, with intermediate state persisted as files and uploaded back to the vector store or returned as attachments.

The cost model for code_interpreter is $0.03 per session. A session lasts for the duration of the run — so if a single run invokes code_interpreter 10 times (the assistant iterates on its code), that's still one session billed at $0.03. However, if the run expires and you create a new run to continue the analysis, that new run incurs a new $0.03 session charge. **At $0.03/session, code_interpreter is very affordable for occasional use but becomes meaningful at scale**: 1,000 sessions/day = $30/day, 30,000 sessions/month = $900/month.

File handling in code_interpreter is one of its most powerful features. You can upload files to the run (CSV, Excel, JSON, images, PDFs) and the assistant's code can read and process them within the sandbox. The assistant can also generate output files (charts as PNGs, processed CSVs, generated PDFs) and return them as run attachments. **This file I/O capability makes code_interpreter ideal for data pipeline tasks**: uploading a dirty CSV, having the assistant clean and transform it with Python, and downloading the result — all within a single API interaction.

Use cases where code_interpreter excels: data analysis and visualization (pandas + matplotlib), PDF generation from structured data (reportlab, WeasyPrint), image manipulation (Pillow, OpenCV), mathematical calculations and simulations (numpy, scipy), and code execution for educational applications where you want to show the output of user-provided code. Use cases where code_interpreter is NOT the right tool: tasks requiring more than 512 MB RAM (large dataset processing, ML model inference), tasks requiring more than 120 seconds of compute, or tasks requiring non-Python runtimes.

Pricing at https://openai.com/api/pricing/ confirms the $0.03/session rate. Model tokens used within a code_interpreter session are billed at standard chat completion rates for the model — so a gpt-5.5 run that involves extensive back-and-forth code iteration will accumulate meaningful token costs in addition to the $0.03 session fee. Monitor the usage field in run responses to understand the full token cost of code-interpreter-heavy agents.


Rate limits for Assistants API: how tiers apply

The Assistants API does not have its own separate rate limit tier — it inherits the rate limits of your API key and the underlying model. Every call that the Assistants API makes to a language model (during a run) counts as a chat completion API call for rate-limit purposes. **If your Assistants runs invoke gpt-5.5, those model calls count against your gpt-5.5 RPM and TPM limits, exactly as if you had called the chat completions endpoint directly.**

The Assistants-specific endpoints (thread.create, message.create, run.create, etc.) have their own separate RPM limits that are independent of the model tier limits. These endpoint-specific limits are typically lower than the model RPM limits — documentation at https://platform.openai.com/docs/guides/rate-limits shows these as roughly 3,000–5,000 RPM for most tiers on the management endpoints. For applications making many thread or message operations per minute (a high-volume chat application), these management endpoint limits can become a bottleneck before the model-level limits are reached.

Rate limit tiers for OpenAI are documented at https://platform.openai.com/docs/guides/rate-limits and run from Tier 1 (new accounts, limited throughput) through Tier 5 (high-spend accounts, highest throughput). Each tier increase requires meeting a spending threshold over time. For gpt-5.5 on Tier 5, TPM limits are high enough for most production workloads — but Tier 1 restrictions can be a significant constraint for early-stage production deployments.

**The most common rate-limit failure mode in Assistants is not the model TPM limit, but the per-minute limits on the run and message management endpoints.** If your application is creating many threads and runs in a short window (a batch document analysis job, a load test, a traffic spike), you can hit the thread.create or run.create RPM limit while staying well under the model's TPM ceiling. Implement exponential backoff with jitter on all Assistants API calls — not just the model-facing ones — to handle these endpoint-specific limits gracefully.
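Exponential backoff with jitter can be wrapped around any Assistants call, including the management endpoints. The sketch below is library-agnostic: `RateLimited` is a hypothetical stand-in for whatever rate-limit (HTTP 429) exception your client raises, and the delay constants are illustrative defaults rather than recommended values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for your client's rate-limit (429) error."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn, retrying on RateLimited with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)  # "full jitter": random delay between 0 and cap
    raise AssertionError("unreachable")
```

Injecting `sleep` and `rng` keeps the retry logic testable without real waiting; in production you would leave both at their defaults and pass a zero-argument callable that performs the API request.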

Token counting for Assistants runs is reported in the usage field of the completed run object: prompt_tokens (all input tokens consumed across all model calls in the run, including message history, instructions, retrieved chunks, and tool results), completion_tokens (all output tokens generated), and total_tokens. **Note that a single run may make multiple model calls** (if the assistant calls tools, tool results are submitted, and the assistant continues reasoning) — the usage field reports the aggregate across all calls within that run. This is different from the direct chat completions API where each call has its own separate usage object.

For cost attribution in multi-tenant Assistants deployments, use run metadata to tag each run with a user ID or organization ID. OpenAI's usage dashboard at https://platform.openai.com/usage allows filtering by API key and model, but not by run metadata — so for per-user cost attribution, you'll need to maintain your own mapping of run IDs to users and aggregate costs from the run usage fields. This is particularly important for B2B SaaS applications where you need to pass through or monitor per-customer API consumption.


Run state machine: polling, streaming, and expiry

Every Assistants API run transitions through a defined set of states: queued (waiting to execute), in_progress (executing), requires_action (waiting for tool outputs), completed (finished successfully), failed (encountered an error), expired (timed out before completion or tool outputs were submitted), and cancelled (explicitly cancelled). Understanding this state machine, and specifically the expired and requires_action states, is essential for building reliable Assistants-based applications.

The 10-minute run expiry is the sharpest edge in the state machine. A run enters expired if it remains in queued or in_progress state for more than 10 minutes without progressing, or if it enters requires_action state and tool outputs are not submitted within 10 minutes. **The second case is the most common production failure mode**: a function tool makes a slow external API call, your server is under load, or a network timeout occurs, and the run expires before tool outputs are submitted. Recovery requires creating a new run, losing all the progress the current run had made.

Traditional polling is the simple approach: after creating a run with run.create, poll run.retrieve every 1–2 seconds until the run reaches a terminal state (completed, failed, expired, cancelled) or the requires_action state. The downside is that this generates many API calls (60–120 calls for a 1–2 minute run) and adds latency because you're polling on an interval rather than reacting immediately to state changes. **The streaming API is strongly preferred for production deployments**: it delivers events in real time as they occur, eliminates polling overhead, and allows immediate reaction to requires_action states.
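For reference, the polling approach reduces to a small loop. The sketch below takes a `retrieve_status` callable instead of a concrete SDK client, so the wiring to run.retrieve is left to you; the state names are the ones listed in this article, and the 600-second default mirrors the 10-minute expiry.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}


def poll_run(
    retrieve_status: Callable[[], str],
    interval: float = 1.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the run is terminal or needs tool outputs; return that status."""
    deadline = clock() + timeout
    while True:
        status = retrieve_status()
        if status in TERMINAL_STATES or status == "requires_action":
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout} seconds")
        sleep(interval)
```

A caller would typically branch on the returned status: handle tool calls on "requires_action" and poll again afterwards, read messages on "completed", and log or retry on the remaining states.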

With the streaming API, you open a Server-Sent Events connection when you create the run. OpenAI sends events for each state transition, each text delta (as the assistant generates its response), each tool_calls event (when the assistant decides to call a function), and the final done event. Your server can process these events in real time and submit tool outputs immediately when the requires_action event arrives, typically within seconds and well inside the 10-minute deadline. This approach also dramatically reduces the perceived latency from the user's perspective, since they see the assistant's response streaming character-by-character rather than waiting for the full response.

For requires_action handling with streaming, the pattern is: listen for the thread.run.requires_action event, extract the tool_calls from the event data, execute all tool calls concurrently (using Promise.all or asyncio.gather), and submit all results to the runs.submitToolOutputs streaming endpoint, which keeps the stream open and continues delivering events. The full streaming event reference is documented at https://platform.openai.com/docs/assistants/overview.

A run becomes cancelled when you explicitly call run.cancel. This is useful for building responsive user-facing applications where the user can interrupt a long-running agent mid-generation. Note that cancelled runs are not billed differently: all tokens consumed before cancellation are still charged. Build cancellation logic into your UI and server to allow users to stop expensive runs that are going in the wrong direction before they consume the full context window.


Cost model for Assistants API: what gets billed

The Assistants API has a multi-component cost model that's more complex than the direct chat completions API. Understanding each billing component is essential for accurate cost modeling and avoiding surprise charges at end of month. **The four main billing components are: model tokens, vector store storage, code_interpreter sessions, and vision tokens.**

Model token billing covers all tokens consumed by the language model during a run — the assistant's system instructions, all messages in the thread's context window, file_search retrieved chunks, function tool descriptions and tool outputs, and the assistant's generated response. These are billed at the standard chat completion rates for the model: GPT-5.5 at $5/M input and $25/M output, GPT-5.4 at $2.50/M input and $15/M output, GPT-5 mini at $0.30/M input and $1.20/M output (current rates per https://openai.com/api/pricing/). **Note that token costs in Assistants runs are typically higher than equivalent direct chat completion calls** because the Assistants API always injects system context, thread history, and potentially retrieved file chunks — even for simple queries.

Vector store storage billing is $0.10/GB/day beyond the 1 GB free tier. This is a daily recurring cost: a 5 GB vector store costs $0.40/day ($12/month). For applications that upload and retain large document sets, storage costs can become the dominant cost line. Implement vector store expiry policies: set the expires_after parameter on vector stores to automatically delete stores after a specified number of days of inactivity. This is particularly important for per-user or per-session vector stores that aren't needed after the conversation ends.

code_interpreter billing is $0.03 per session. A session is created when a run that uses code_interpreter starts and ends when the run completes or expires. Multiple code executions within a single run share one session charge. If you're building a data analysis assistant that users interact with over multiple runs, each run incurs a separate $0.03 session charge — there's no session continuity across runs. At moderate volumes ($0.03 × 500 sessions/day = $15/day), this is manageable; at high volumes ($0.03 × 10,000 sessions/day = $300/day), it becomes significant.

**Comparing Assistants API cost to raw chat completions**: for a simple question-answering system with 10 messages of thread history and no tools, the Assistants API adds overhead versus raw chat completions — the thread management API calls, the instruction injection, and the context window management all consume tokens that a direct API call might avoid. The Assistants API earns its premium when you need managed thread persistence, vector store retrieval, or code execution — replacing infrastructure you'd have to build yourself. For stateless, one-shot completions, raw chat completions (or Claude's direct API) are more cost-effective. The threshold calculation: if you'd otherwise spend more than $30/month on storage infrastructure, RAG pipeline, and thread management code, Assistants API overhead pays for itself.

Building an Assistants API v2 production deployment

  1. Create your assistant with only the tools you need

    When creating your assistant, only enable the tools your specific use case requires. If you're building a Q&A assistant over documents, enable file_search only — do not also enable code_interpreter or function tools you don't plan to use. Each enabled tool adds to the assistant's instructions context and the model's tool selection overhead on every run. A document retrieval assistant with only file_search enabled runs faster and cheaper than one with all three tool types enabled. Set your assistant instructions to be concise — every token in the instructions is included in every run's context window, billed at the input rate. Trim instructions to the minimum effective guidance.

  2. Design your vector store for file_search efficiency

    Group logically related files into focused vector stores rather than creating one giant store for everything. A customer support assistant might have separate vector stores for product documentation, pricing pages, and troubleshooting guides. Focused stores improve retrieval precision. Respect the 10,000-file ceiling by implementing a rotation policy for high-volume document ingestion: archive and delete the oldest files when approaching the limit. Keep the automatic chunking strategy for most prose documents and set an explicit chunk size for technical documentation with short code snippets (300–400 tokens works better than the 800-token default for code-heavy content). Set expires_after policies on stores that are session-scoped or rarely accessed to prevent storage cost accumulation.

  3. Manage thread context window with explicit token budgets

    Set max_prompt_tokens and max_completion_tokens on every run rather than relying on the default auto-truncation. This gives you predictable, bounded token costs per run. A document Q&A assistant with concise answers might set max_prompt_tokens: 20000 (enough for instructions + recent messages + retrieved chunks) and max_completion_tokens: 500. Configure truncation_strategy to type: 'last_messages' with a specific last_messages count for conversational applications where recent context is most relevant. For analytical applications where historical context matters, implement your own context management: summarize older messages and inject the summary as a system message rather than retaining all original messages.

  4. Handle required_action runs with the streaming API

    Switch from polling to streaming for all Assistants API runs in production. With the streaming API, your server receives a thread.run.requires_action event the instant the model decides to call function tools — no polling interval overhead, no missed windows. Immediately extract all pending tool_calls from the event, execute them all concurrently (Promise.all / asyncio.gather), and submit all results via the streaming submit_tool_outputs endpoint. Build a strict 8-minute timeout on all tool executions (leaving 2 minutes before the 10-minute run expiry), with a fallback that submits an error message for any tool that times out rather than letting the whole run expire. The streaming API reference at https://platform.openai.com/docs/assistants/overview includes the complete event format.

  5.

    Monitor token consumption and cost per run

    Log the usage field from every completed run: prompt_tokens, completion_tokens, and total_tokens. Tag each run with user_id, session_id, and feature_name metadata at creation time so you can attribute costs accurately in your analytics. Set per-run cost alerts: if a single run exceeds your target budget threshold (for example, $0.10 per run for a gpt-5.4 assistant), log it as a high-cost run for investigation. Track the distribution of prompt_tokens across runs — a long tail of very high-prompt-token runs typically indicates thread history is not being truncated aggressively enough. Check the OpenAI usage dashboard at https://platform.openai.com/usage monthly to reconcile your logged costs with actual billing.
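
As a rough sketch of the per-run cost check described in step 5, the snippet below estimates cost from a run's prompt_tokens and completion_tokens and flags runs above the $0.10 example budget. The per-million-token prices are placeholder assumptions, not published rates, and the names RunUsage, estimate_run_cost, and is_high_cost are hypothetical.

```python
from dataclasses import dataclass

# Placeholder prices in USD per 1M tokens. These are assumptions for
# illustration only; substitute the current published rates for your model.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00
HIGH_COST_THRESHOLD_USD = 0.10  # the example per-run budget from the text


@dataclass
class RunUsage:
    """Token counts copied from the usage field of a completed run."""
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def estimate_run_cost(usage: RunUsage) -> float:
    """Estimate the USD cost of one run from its token counts."""
    input_cost = usage.prompt_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = usage.completion_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


def is_high_cost(usage: RunUsage, threshold: float = HIGH_COST_THRESHOLD_USD) -> bool:
    """True when the estimated cost exceeds the per-run budget."""
    return estimate_run_cost(usage) > threshold
```

In practice you would log the estimate alongside the user_id, session_id, and feature_name metadata so that flagged runs can be traced back to a feature.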

Frequently Asked Questions

How many files can I add to an Assistants v2 vector store?

Each vector store in the Assistants API v2 supports up to 10,000 files, with a maximum size of 512 MB per file. Vector store storage is free for the first 1 GB, then billed at $0.10 per GB per day beyond that. A single assistant can attach up to 50 vector stores for file_search. If you need more than 10,000 files, split them across multiple vector stores and attach all of them to your assistant. File types supported include PDF, DOCX, HTML, JSON, Markdown, plain text, and most programming language files. Check https://platform.openai.com/docs/assistants/overview for the current full list of supported file types, which continues to expand.
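
The arithmetic above can be sketched as two small helpers: one for the daily storage charge and one for how many vector stores a file set needs. This assumes the 1 GB free allowance applies to your total stored size, which is an assumption to confirm against your bill; the function names are hypothetical.

```python
import math

FREE_GB = 1.0                 # free storage allowance quoted above
PRICE_PER_GB_DAY = 0.10       # USD per GB per day beyond the allowance
MAX_FILES_PER_STORE = 10_000  # per-store file ceiling quoted above


def daily_storage_cost(total_gb: float) -> float:
    """USD per day for total_gb of vector store storage."""
    billable_gb = max(0.0, total_gb - FREE_GB)
    return billable_gb * PRICE_PER_GB_DAY


def stores_needed(file_count: int) -> int:
    """Minimum number of vector stores needed to hold file_count files."""
    return math.ceil(file_count / MAX_FILES_PER_STORE)
```

For example, 6 GB of stored vectors would come to roughly $0.50 per day under these figures, and 25,000 files would need three stores.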

Is there a limit on messages per thread in the Assistants API?

There is no hard limit on the number of messages in an Assistants API thread. However, the practical constraint is the context window per run: gpt-5.5 and gpt-5.4 have a 128,000-token context window. When the cumulative token count of thread messages, system instructions, file_search retrieved chunks, and tool outputs exceeds this window, older messages are automatically truncated. The default truncation strategy drops older messages first. For production applications with long-running threads, explicitly configure the truncation_strategy parameter and set max_prompt_tokens on each run to get predictable, bounded token costs rather than variable costs driven by thread growth. Threads and messages are retained for 60 days from last activity.
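
A minimal sketch of the explicit budgeting described above: a helper that builds the run parameters once so every run carries the same bounds. The helper name build_run_budget is hypothetical, and the truncation_strategy shape shown is our understanding of the v2 run-creation parameters, so verify it against the current API reference.

```python
from typing import Any, Dict, Optional


def build_run_budget(
    max_prompt_tokens: int,
    max_completion_tokens: int,
    last_messages: Optional[int] = None,
) -> Dict[str, Any]:
    """Return keyword arguments that bound the token spend of one run."""
    if max_prompt_tokens <= 0 or max_completion_tokens <= 0:
        raise ValueError("token budgets must be positive")
    if last_messages is not None and last_messages <= 0:
        raise ValueError("last_messages must be positive")

    if last_messages is None:
        # Let the API decide which older messages to drop.
        truncation = {"type": "auto"}
    else:
        # Keep only the N most recent messages in the prompt.
        truncation = {"type": "last_messages", "last_messages": last_messages}

    return {
        "max_prompt_tokens": max_prompt_tokens,
        "max_completion_tokens": max_completion_tokens,
        "truncation_strategy": truncation,
    }
```

The returned dictionary is meant to be unpacked into the run-creation call alongside the thread and assistant IDs, for instance `build_run_budget(20000, 500, last_messages=10)` for a concise document Q&A assistant.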

What happens when an Assistants run expires?

An Assistants run expires after 10 minutes if it does not complete or transition through required_action successfully within that window. The most common cause of expiry is a run entering required_action state (waiting for function tool outputs) and your server failing to submit those outputs within 10 minutes — due to slow external API calls, a server crash, or a network timeout. Once expired, the run cannot be resumed. Recovery requires creating a new run, which re-invokes the model and re-reasons about which tools to call, consuming additional tokens and incurring additional cost. The streaming API is strongly recommended to minimize required_action handling time, since it delivers the requires_action event immediately rather than waiting for a polling interval.

Can I use parallel function calls in Assistants API v2?

Yes. The Assistants API v2 supports parallel function calling: the model can decide to call multiple function tools simultaneously within a single run step. When this happens, the run enters required_action state with a submit_tool_outputs object containing all pending tool calls. Your server must handle all pending tool calls and submit all results in a single submit_tool_outputs API call — partial submissions are not supported. Execute all pending tool calls concurrently (using Promise.all or asyncio.gather) to minimize the time spent in required_action state and reduce the risk of hitting the 10-minute run expiry. The Assistants API supports up to 128 function tools per assistant, versus the 64-tool limit on the direct chat completions API.
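
The concurrent handling described above might look like the following sketch. It runs every pending tool call with asyncio.gather, applies a per-tool timeout well under the 10-minute run expiry, and returns an error payload for any tool that fails so the full set can still be submitted in one call. The tool-call dictionaries here are a simplified, assumed shape (id, name, already-parsed arguments), and run_one_tool and run_all_tools are hypothetical names.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Dict, List

Handler = Callable[..., Awaitable[Any]]


async def run_one_tool(
    call: Dict[str, Any], handlers: Dict[str, Handler], timeout: float
) -> Dict[str, str]:
    """Run one tool call; always return an output, even on failure."""
    handler = handlers.get(call["name"])
    if handler is None:
        output = json.dumps({"error": "unknown tool"})
    else:
        try:
            result = await asyncio.wait_for(handler(**call["arguments"]), timeout=timeout)
            output = json.dumps(result)
        except asyncio.TimeoutError:
            # Submit an error instead of letting the whole run expire.
            output = json.dumps({"error": "tool timed out"})
        except Exception as exc:
            output = json.dumps({"error": str(exc)})
    return {"tool_call_id": call["id"], "output": output}


async def run_all_tools(
    tool_calls: List[Dict[str, Any]],
    handlers: Dict[str, Handler],
    timeout: float = 480.0,  # 8 minutes, leaving headroom before run expiry
) -> List[Dict[str, str]]:
    """Execute all pending tool calls concurrently, preserving their order."""
    tasks = [run_one_tool(call, handlers, timeout) for call in tool_calls]
    return list(await asyncio.gather(*tasks))
```

The returned list has one entry per pending call, which is the shape needed for a single submit_tool_outputs request. Catching failures inside each task, rather than letting gather raise, is what keeps one slow or broken tool from blocking the others.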

How is code_interpreter billed in Assistants API?

code_interpreter sessions are billed at $0.03 per session. A session lasts for the duration of a single run — if the assistant executes Python code 5 times within one run (iterating on its analysis), that's still one session at $0.03. If a run expires and you create a new run to continue, that new run incurs a fresh $0.03 session charge. Model tokens consumed during a code_interpreter run are billed separately at standard chat completion rates. code_interpreter runs in a sandboxed Python environment with a 120-second execution limit per code block and 512 MB memory — for tasks requiring longer computation, break the work into multiple runs with intermediate state persisted as files.

Do Assistants API calls count against my rate limit?

Yes. Each model call that the Assistants API makes during a run counts as a chat completion API call for your tier's RPM and TPM limits. If you're using gpt-5.5-based assistants, every run that invokes the model counts against your gpt-5.5 RPM and TPM ceilings, just as if you'd called the chat completions endpoint directly. In addition, the Assistants management endpoints (thread.create, message.create, run.create, etc.) have their own separate RPM limits — typically 3,000–5,000 RPM for most tiers — which are independent of the model-level limits. For high-volume applications, the management endpoint RPM limits can become a bottleneck before model-level limits are reached. Implement exponential backoff with jitter on all Assistants API calls to handle these limits gracefully.
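
Exponential backoff with jitter, as recommended above, can be sketched in a few lines. This version uses "full jitter" (a random delay between zero and the capped exponential bound), which is one common variant rather than the only option. RateLimited is a stand-in exception for this example; with the OpenAI Python SDK you would catch its rate-limit error instead. The function names and default values are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Stand-in for the SDK's rate-limit (HTTP 429) error."""


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full jitter: random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimited with jittered exponential delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            sleep(backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

Wrapping both the management calls (thread, message, and run creation) and the model-invoking calls in the same helper keeps the retry policy consistent across the two sets of limits.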

How long are threads and files retained in Assistants API?

Threads and their messages are retained for 60 days from the last activity on the thread, then automatically deleted. Files uploaded to vector stores are retained indefinitely — until you explicitly delete them via the API. Runs expire after 10 minutes if not completed. There is no way to extend thread retention beyond 60 days; if your application needs longer-lived conversation history, you must store the relevant context externally (in your own database) and reinject it into a new thread as needed. For files, implement explicit deletion policies for files that are no longer needed — abandoned vector store files continue to incur storage charges ($0.10/GB/day after 1 GB) even if they're never queried.
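
If you archive conversation history externally, a small date check can tell you which threads are close to the 60-day cutoff. The sketch below assumes you record each thread's last-activity timestamp yourself; the function names and the 7-day safety margin are illustrative choices, not part of the API.

```python
from datetime import datetime, timedelta, timezone

THREAD_RETENTION = timedelta(days=60)  # retention window quoted above


def thread_expiry(last_activity: datetime) -> datetime:
    """When a thread is due for automatic deletion."""
    return last_activity + THREAD_RETENTION


def needs_archiving(
    last_activity: datetime,
    now: datetime,
    safety_margin: timedelta = timedelta(days=7),
) -> bool:
    """True once the thread is within safety_margin of its expiry."""
    return now >= thread_expiry(last_activity) - safety_margin
```

A nightly job could run this check over your thread index and copy any flagged conversation into your own database before the thread is deleted.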

What's the difference between Assistants API and raw chat completions for building agents?

The Assistants API manages thread persistence, vector store document retrieval, and code execution infrastructure for you — eliminating the need to build those components yourself. Raw chat completions give you maximum control: you manage your own message history, build your own RAG pipeline, and design your own code execution sandbox, but you pay only for the tokens you explicitly include in each request, with no managed-service overhead. The Assistants API makes practical sense when you need all three managed features (persistent threads + file search + code interpreter), are building a product quickly, or want to avoid the infrastructure complexity. Raw chat completions are more cost-effective for stateless one-shot queries, when you need a non-Python code runtime, or when you want fine-grained control over exactly what goes into each context window.
