Assistants API v2 architecture: threads, runs, and what limits where
The Assistants API v2 is built around three core abstractions: an Assistant (which holds tools, instructions, and model configuration), Threads (which hold the message history for a conversation), and Runs (which execute the assistant against a thread to generate a response). Understanding which limit lives at which layer is essential for diagnosing production issues — a vector store quota error, a run expiry, and a TPM rate limit look very different and require different fixes.
**Vector store limits apply at the assistant level**: each assistant can attach one vector store for file_search (a thread can attach one more of its own), each vector store holds up to 10,000 files, and each file can be up to 512 MB. These limits govern your document retrieval capability. If you're building a knowledge base assistant, the vector store architecture decisions (which files go into the assistant's store, and which belong in a per-thread store) are made when you configure the assistant, not at run time. Vector store storage is billed separately: 1 GB free, then $0.10/GB/day beyond that.
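The storage pricing and per-store limits above reduce to simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes the free 1 GB is applied once to your total vector store storage and that sizes are measured in GB.

```python
# Figures taken from the paragraph above; treat them as assumptions to re-check
# against current pricing before relying on the output.
FREE_GB = 1.0               # free storage allowance
PRICE_PER_GB_DAY = 0.10     # USD per GB per day beyond the allowance
MAX_FILES_PER_STORE = 10_000
MAX_FILE_MB = 512


def daily_storage_cost(total_gb: float) -> float:
    """Estimate the daily vector store storage charge in USD."""
    if total_gb < 0:
        raise ValueError("total_gb must be non-negative")
    billable_gb = max(0.0, total_gb - FREE_GB)
    return billable_gb * PRICE_PER_GB_DAY


def fits_store_limits(file_count: int, largest_file_mb: float) -> bool:
    """Check a planned vector store against the file-count and file-size limits."""
    return file_count <= MAX_FILES_PER_STORE and largest_file_mb <= MAX_FILE_MB
```

For example, 6 GB of stored vectors leaves 5 GB billable, so `daily_storage_cost(6.0)` comes to about $0.50 per day, or roughly $15 over a 30-day month.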
**Context window limits apply at the run level**: each Run executes the assistant against a thread and has access to a finite context window: 128,000 tokens for gpt-5.5 and gpt-5.4. That window has to hold the assistant's system instructions, the thread's messages (as many as fit), any chunks retrieved by file_search, function tool outputs, and code_interpreter results. **The automatic truncation strategy OpenAI applies to long threads is important to understand**: by default, older messages are dropped to make room for newer ones and the assistant's response. You can customize this with the truncation_strategy parameter on the run, for example by switching from the default auto type to last_messages to keep only the most recent N messages.
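As a minimal sketch of how the truncation_strategy parameter might be set, the helper below builds the keyword arguments for a run. The helper name `build_run_params` is hypothetical; the `auto` and `last_messages` strategy types follow the Assistants v2 run parameters, but check the current API reference before depending on the exact shape.

```python
from typing import Any, Dict, Optional


def build_run_params(assistant_id: str, keep_last: Optional[int] = None) -> Dict[str, Any]:
    """Build run-creation kwargs with an explicit truncation strategy.

    keep_last=None keeps the default automatic truncation; a positive integer
    asks for only the most recent `keep_last` thread messages to be included.
    """
    params: Dict[str, Any] = {"assistant_id": assistant_id}
    if keep_last is None:
        params["truncation_strategy"] = {"type": "auto"}
    else:
        if keep_last < 1:
            raise ValueError("keep_last must be at least 1")
        params["truncation_strategy"] = {
            "type": "last_messages",
            "last_messages": keep_last,
        }
    return params


# Assumed usage with the official `openai` Python SDK (not executed here):
#   client.beta.threads.runs.create(thread_id=thread_id, **build_run_params("asst_123", 10))
```

Keeping the strategy in one helper makes it easier to tune the message count per use case without touching every call site.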
**Retention limits apply at the thread and message level**: threads and their messages are retained for 60 days from the last activity, then automatically deleted. Files in vector stores persist until you explicitly delete them via the API. Runs expire after 10 minutes, which caps how long a run can wait in the required_action state for you to submit tool outputs. If your server doesn't submit tool outputs within that window, the run transitions to expired and cannot be resumed.
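One way to guard against the 10-minute expiry is to check the remaining time before starting a slow tool call. The sketch below assumes the run object exposes an `expires_at` Unix timestamp in seconds; the function names and the 30-second safety margin are illustrative choices, not part of the API.

```python
import time
from typing import Optional


def seconds_until_expiry(expires_at: float, now: Optional[float] = None) -> float:
    """Return the seconds left before the run expires (0.0 if already expired)."""
    current = time.time() if now is None else now
    return max(0.0, expires_at - current)


def has_time_for_tool_call(
    expires_at: float,
    estimated_tool_seconds: float,
    safety_margin: float = 30.0,
    now: Optional[float] = None,
) -> bool:
    """Decide whether a tool call can finish and be submitted before expiry."""
    remaining = seconds_until_expiry(expires_at, now)
    return remaining >= estimated_tool_seconds + safety_margin
```

If the check fails, it is usually cheaper to cancel the run and start a new one than to let a long tool call finish against a run that can no longer accept its output.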
The v1 to v2 migration introduced several important changes: improved file_search with configurable chunking options, the ability to attach vector stores at the thread level (not just the assistant level), better streaming support via the streaming API (eliminating the need for polling), and improved token usage reporting. **If you're running any v1 Assistants code, migration to v2 is strongly recommended**: the v1 API is deprecated, and the v2 architecture makes both cost optimization and rate-limit management significantly easier. See https://platform.openai.com/docs/assistants/overview for the full v2 migration guide.
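If you call the REST endpoints directly rather than through an SDK, part of the migration is opting in to v2 with the `OpenAI-Beta` header. The sketch below only builds the headers; the helper name is hypothetical and the API key is a placeholder.

```python
from typing import Dict


def assistants_v2_headers(api_key: str) -> Dict[str, str]:
    """Build HTTP headers for a direct Assistants API v2 request."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        # Opts the request in to v2 behaviour; v1 code sent "assistants=v1".
        "OpenAI-Beta": "assistants=v2",
    }
```

Searching a codebase for the old `assistants=v1` value is a quick way to find call sites that still need migrating.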