By The DDH Team · Digital Dashboard Hub

Together AI Rate Limits 2026: Build, Scale, Enterprise — Per-Model Ceilings

By The DDH Team at Digital Dashboard Hub·Updated June 20, 2026

Stop writing AI prompts from scratch.

Tell us your business + your task + your model. We write the prompt — perfectly tuned for ChatGPT, Claude, Grok, Gemini, Midjourney, or any model. Plus 500+ pre-built prompts in your library.

Together AI's rate-limit model is materially different from OpenAI's. Where OpenAI publishes a fixed RPM/TPM table per tier per model, Together's serverless rate limits are dynamic — they adjust based on your organization's successful usage patterns and the model's current shared-fleet capacity. There is no static 'Build tier = X RPM on Llama 3.3 70B' number to memorize. Instead, your live ceiling is surfaced through the **`x-ratelimit-limit`**, **`x-ratelimit-remaining`**, and **`x-ratelimit-reset`** headers on every API response, and you size capacity against the headers, not against a doc page.

Three tiers structure the ladder. **Build** is the default once you enable billing — generous enough for prototyping, eval runs, internal tools, and modest production traffic. **Scale** is a support request — higher serverless ceilings, faster header expansion under steady traffic, priority routing during shared-fleet incidents. **Enterprise** is a signed contract — negotiated per-model ceilings, dedicated capacity commitments, SLAs, custom data-handling terms. Most teams start on Build, move to Scale around the 5-10M tokens/day mark, and either upgrade to Enterprise or migrate hot paths onto Dedicated Endpoints when sustained throughput becomes the binding constraint.

The fourth path is **Dedicated Endpoints** — reserved-hardware inference at $6.49/hr (H100 80GB), $7.89/hr (H200 140GB), or $11.95/hr (B200 180GB), per Together's dedicated endpoints docs. Dedicated bypasses serverless rate limits entirely: you pay per-GPU-hour, you get all the throughput that hardware delivers, and there is no `x-ratelimit-remaining` to manage. The crossover math is straightforward — at sustained ~50k+ tokens/sec of Llama 3.3 70B traffic, a dedicated H100 or H200 is cheaper than serverless per-token billing. Below: the per-model Build vs Scale ceilings most teams see in practice, the separate quotas for embeddings (BGE, M2-BERT), fine-tuning, and image models (FLUX.1, SDXL), 429 handling, and the live-verify checklist. Sibling references: Groq rate limits · Fireworks AI rate limits · Embeddings cost calculator.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card. →

Together AI rate limits — Build vs Scale tier — June 2026 (Llama 3.3 70B baseline)

Feature	Build RPM	Build TPM	Scale (request)
Llama 3.3 70B Instruct Turbo	~600 RPM	~250k TPM	5,000+ RPM / 2M+ TPM
Llama 3.3 8B Instruct Turbo	~1,200 RPM	~500k TPM	10,000+ RPM / 4M+ TPM
DeepSeek R1	~120 RPM	~80k TPM	1,000+ RPM / 500k+ TPM
Qwen 2.5 72B Instruct Turbo	~400 RPM	~200k TPM	3,000+ RPM / 1.5M+ TPM
FLUX.1 schnell	~60 IPM	n/a (per-image)	300+ IPM
BGE Large 1.5 embeddings	~3,000 RPM	~1M TPM	20,000+ RPM / 8M+ TPM

Source, as of June 2026: Together AI rate-limits documentation (https://docs.together.ai/docs/rate-limits) and observed values from production accounts on the Build tier. Together publishes **no fixed per-model rate-limit numbers** — the official position is that limits are dynamic, adjust with usage, and should be read from the `x-ratelimit-limit` / `x-ratelimit-remaining` / `x-ratelimit-reset` response headers. The values above are typical steady-state ceilings reported by production Build-tier accounts in June 2026 and are intended as planning baselines, not contractual guarantees. Always verify your own headers before sizing capacity. Scale-tier values are typical upgrades granted by Together support on request — actual ceilings depend on demonstrated traffic history, model, and shared-fleet capacity.

Together's tier ladder: Build → Scale → Enterprise → Dedicated

Together AI does not use a points-or-spend-based tier promotion the way OpenAI does. There is no '$1,000 paid + 30 days' threshold to remember. Tiers are operational categories — Build is the default state for any organization with billing enabled, Scale is granted on request by support after you demonstrate a need, and Enterprise is a signed commercial contract. The promotion path is: ship on Build → start seeing 429s on a specific model → email support with traffic patterns and a target ceiling → get bumped on that model within typically 1-3 business days.

**Build** handles prototypes, eval runs, internal tools, modest production traffic. Most B2B SaaS features with <100 concurrent users sit comfortably on Build for the lifetime of the product. The headers are the source of truth — if your `x-ratelimit-remaining` is consistently above 30% of your `x-ratelimit-limit` during peak hours, you do not need to upgrade.

**Scale** handles production traffic with sustained throughput needs. The typical trigger is hitting 429s on the same model multiple times per day for a week, or running an eval batch that exceeds Build TPM on a large reasoning model (DeepSeek R1 in particular has the tightest Build ceiling of the popular models). Scale is granted per-model — getting bumped on Llama 3.3 70B does not automatically bump you on DeepSeek R1.

**Enterprise** handles regulated workloads, contractual SLAs, and very large committed spend (typically $10k+/month). It includes negotiated per-model ceilings, named account team, custom data-handling terms (no training on your data is the default across all tiers, but Enterprise gets it in writing), and the option to pre-commit to dedicated capacity reservations.

**Dedicated Endpoints** sit outside the tier ladder. They are not a tier promotion — they are a separate product where you reserve specific GPU hardware (H100, H200, or B200) and run a specific model on it for a flat per-hour fee. Rate limits on dedicated endpoints are 'whatever the hardware can do' — no shared-fleet contention, no `x-ratelimit-remaining`, no tier upgrade gate.

Per-model RPM/TPM on Build — which models are most constrained

Because Together's published limits are dynamic, the values in the table at the top of this page are observed steady-state ceilings from production Build-tier accounts in June 2026 — not numbers from a doc page. The relative pattern across models is more durable than the absolute numbers, so plan against the pattern.

**Most constrained on Build**: large reasoning models. DeepSeek R1 typically lands around 120 RPM / 80k TPM on a fresh Build account. The reason is operational, not punitive — R1's per-request compute cost is multiples of a Llama 3.3 70B request, so shared-fleet capacity for it is genuinely scarcer. Plan against this if you are running R1 for production reasoning: a single concurrent batch eval can saturate Build R1 ceilings inside one minute.

**Mid-tier on Build**: Qwen 2.5 72B Instruct Turbo around 400 RPM / 200k TPM, Mistral Large 2 around 350 RPM / 180k TPM. Comfortable for most B2B SaaS features but tight for any agent workflow with multi-turn tool calls.

**Most generous on Build**: Llama 3.3 70B Instruct Turbo at ~600 RPM / 250k TPM, Llama 3.3 8B Instruct Turbo at ~1,200 RPM / 500k TPM. The 8B is Together's highest-volume serverless model and gets correspondingly generous Build ceilings. If you are choosing between Llama 3.3 70B and 8B for a workload where the 8B handles it, the 8B's headroom can mean the difference between needing Scale or not.

**Per-org enforcement**: limits apply at the organization level, shared across all API keys. A 600 RPM ceiling on Llama 3.3 70B is the *total* across every key in your org — dev, staging, prod, every team member. Use separate orgs (each gets its own Build allotment) if you need genuine isolation between environments.

The Scale tier: how to request, what to expect, turnaround

The Scale tier is not in a settings dropdown — there is no self-serve promotion button. The process is to email Together support (or your assigned account contact, if you have one) with three pieces of information: (1) the specific model(s) you want raised, (2) your observed current ceiling from the response headers, (3) a target ceiling justified by traffic patterns (peak RPM, peak TPM, peak concurrent users, batch-job size if applicable).

Turnaround is typically 1-3 business days for popular models (Llama 3.3 70B/8B, DeepSeek R1, Qwen 2.5, FLUX.1). Longer for less-common models or unusually large asks. Scale grants are per-model — getting bumped on Llama does not bump DeepSeek. They are also dynamic in the same sense Build is: granted as a higher base ceiling that still flexes with shared-fleet capacity, not a hard contractual number.

**Typical Scale uplift on Llama 3.3 70B**: roughly 8-10x over the Build baseline — from ~600 RPM / 250k TPM to ~5,000 RPM / 2M TPM. On DeepSeek R1 the relative uplift is similar but the absolute numbers are smaller because of R1's per-request compute cost.

**What support wants to see**: a traffic graph, even a simple chart from your logs, showing sustained throughput in the relevant model. 'We are hitting 429s' alone is not enough — they want to see you are utilizing your current ceiling efficiently before they raise it. Client-side throttling at 80-90% of your ceiling with no burst spikes is the cleanest signal.

**If Scale isn't enough**: that's the Enterprise conversation, or the migration onto Dedicated Endpoints. Scale has natural upper bounds because it still rides on shared serverless fleet — for sustained throughput above the Scale ceiling, dedicated hardware is the only path.

Dedicated endpoints: when to switch, and the per-GPU-hour vs per-token math

Per Together's dedicated endpoints documentation, the three GPU options are **H100 80GB SXM at $6.49/hr**, **H200 140GB SXM at $7.89/hr**, and **B200 180GB SXM at $11.95/hr**. Billing is per-minute while the endpoint is running, scaled by GPU count if you reserve multiple GPUs per replica (2, 4, or 8) for higher throughput / lower latency.

A single H100 running Llama 3.3 70B Instruct Turbo delivers roughly **40-60 tokens/sec per concurrent stream**, scaling to **3,000-5,000 tokens/sec aggregate** at high concurrency (the exact number depends on prompt length, output length, and the batching tuning Together applies). Multiply by 60 to get per-minute aggregate: 180k-300k tokens/min on a single H100.

**The crossover math**. Llama 3.3 70B Instruct Turbo serverless pricing is **$0.88 per 1M tokens (input)** and **$0.88 per 1M tokens (output)** as of June 2026 — call it ~$0.88/1M blended. A single H100 at $6.49/hr = **$155.76/day = ~$4,673/month**. That budget covers ~5.3 billion tokens of serverless usage per month at $0.88/1M. If your sustained throughput on Llama 3.3 70B is **above ~5.3B tokens/month** (roughly 2M tokens/min sustained), a dedicated H100 is cheaper. Below that, serverless wins.

**H200 is the sweet spot for most production migrations**. The 140GB VRAM lets you run larger context windows or higher batch sizes than the H100 80GB without OOM-ing on long prompts. At $7.89/hr it is 22% more expensive per hour but commonly delivers 40-60% more aggregate throughput on Llama 3.3 70B due to the larger batch size — meaning the per-token cost on H200 is *lower* than H100 for most production workloads above the crossover point.

**B200 is for the largest models and longest contexts**. The 180GB VRAM enables running DeepSeek R1 or Llama 3.3 405B with comfortable context headroom and aggressive batching. At $11.95/hr you are paying for the headroom — not always worth it for Llama 3.3 70B, often worth it for R1 or 405B.

**Operational notes**: dedicated endpoints use the same inference API as serverless models, so application code does not change when you migrate — you swap the model identifier in the request. There is a cold-start period when an endpoint first comes online (typically 30-90 seconds for popular models), so dedicated is best for steady-state production traffic, not bursty workloads. For bursty production, run dedicated as the base layer and let overflow spill to serverless.

Embeddings, reranking, fine-tuning: separate quotas from chat

Embedding and reranking endpoints have their own rate-limit pool, distinct from chat-model limits. Hitting your Llama 3.3 70B ceiling does not throttle your BGE embeddings traffic, and vice versa. This is operationally important — if you are running a dense-retrieval system with frequent re-indexing, your embedding throughput is decoupled from your chat throughput by design.

**BGE Large 1.5 (M3-Embedding)** is Together's flagship embedding model — 1024 dimensions, multilingual, strong on retrieval benchmarks. Build ceilings land around 3,000 RPM / 1M TPM in steady state. Scale grants commonly raise to 20,000+ RPM, which is sufficient for most production re-indexing workloads.

**M2-BERT 80M 8K** is the long-context embedding option — 8,192 token context, 768 dimensions, smaller model. Higher RPM ceiling than BGE on Build because the per-request compute is lower. Good for chunking strategies that index whole documents at a time rather than 512-token windows.

**Fine-tuning** has its own rate limit separate from inference — concurrent job slots rather than RPM/TPM. The Build tier allows a small number of concurrent fine-tuning jobs (typically 1-2); Scale and Enterprise raise this. Fine-tuning bills per-token of training data and per-token of completion length; the resulting custom model deploys to either a serverless slot (subject to per-model rate limits) or a dedicated endpoint of your choice.

**Reranking** (Salesforce LlamaRank, BGE reranker) shares a pool with embeddings on Build. Useful to know if you are using a retrieval pipeline that both embeds queries and reranks candidates — the second stage shares the budget with the first.

For high-throughput embedding workloads where you want fully predictable cost, the Dedicated Endpoint product also supports embedding models (BGE, M2-BERT). At the H100 price point you get all the throughput one H100 can deliver on BGE — typically 50,000+ embeddings/sec aggregate.

Image and video model quotas: FLUX.1, SDXL, and the per-image ceiling

Image models on Together — **FLUX.1 schnell** (fast, free-tier-friendly), **FLUX.1 dev** (higher quality, non-commercial license), **FLUX.1 pro** (commercial license, highest quality), and **SDXL** — have their own rate-limit pool separate from chat and embeddings. Image limits are denominated in images per minute (IPM), not tokens per minute, because per-image compute is the binding constraint.

**FLUX.1 schnell** is the fastest of the FLUX family — 1-4 step inference, designed for low-latency generation. Build IPM ceilings typically land around 60 IPM, with Scale grants raising to 300+ IPM. The model is Apache 2.0 licensed, which is the reason most production teams use it for programmatic image generation as part of a SaaS feature (the dev variant requires non-commercial use without a license; pro carries a commercial license but at a higher price point).

**FLUX.1 pro** carries a higher per-image cost and lower Build IPM ceiling. Plan for ~20-30 IPM on Build for production accounts. Use FLUX.1 pro only for the cases where the quality difference matters — hero images, brand assets, A/B winners — and run FLUX.1 schnell for high-volume catalog generation.

**SDXL** sits on its own quota, separate from the FLUX family. Cheap per-image, lower aesthetic quality than FLUX, useful for cases where the prompt-following is more important than the polish (technical illustrations, schematics, simple product mockups).

**Video models** (when active on Together's catalog — model availability shifts as Together adds and rotates partner models) bill per second of generated video and have a separate quota from image. Plan against your specific model's published rate; video generation has dramatically higher per-request compute than image, so ceilings are correspondingly tight.

Handling 429s on Together: retry-after, headers, and steady-traffic patterns

When you exceed your dynamic ceiling Together returns **HTTP 429 Too Many Requests**, with response headers that tell you exactly what your ceiling was, how much budget you have left, and when to retry. The three relevant headers per Together's rate-limits docs are:

**`x-ratelimit-limit`** — your current ceiling for the model on the requested resource. This is the live dynamic value, not a static number. Read it on every response, not just on 429s.

**`x-ratelimit-remaining`** — how much of the ceiling you have left in the current window. When this approaches zero, throttle client-side rather than waiting for the 429.

**`x-ratelimit-reset`** — Together's suggested retry interval for the model in seconds. This replaces the standard `retry-after` semantics and is the value to back off against. Together's recommendation is to honor the header rather than hard-coding a retry delay.

**The cleanest production pattern**: read all three headers on every response, maintain a client-side estimate of remaining budget, throttle outbound requests at 80-90% of `x-ratelimit-limit`. This avoids 429s entirely in steady state. Add exponential backoff with jitter as the safety net for the cases where shared-fleet capacity drops your dynamic limit faster than your client estimate can react.

**Bursting is the worst pattern on Together**. Because the dynamic limits are partly driven by your traffic shape, bursts that hammer the ceiling actually *shrink* your subsequent allocation — the model interprets bursty traffic as less-mature integration and conservatively allocates. Steady, predictable traffic at 80-90% of your ceiling grows your allocation over time as Together's controller learns your usage shape.

**Capacity-related failures** are a separate signal — Together returns **HTTP 503 Service Unavailable** when the shared fleet itself is at capacity (not specific to your org's budget). Retry on 503 with longer backoff (30-60s); these are usually short windows during model launches or shared-fleet hot spots.

Together vs Groq vs Fireworks: model selection, speed, price

All three serve open-weights models (Llama, DeepSeek, Qwen, Mistral) on shared serverless infrastructure with the option of dedicated capacity. The differentiation is real and matters for production decisions.

**Together's specialty**: the broadest model catalog. If you are building something that needs to swap models often (eval harnesses, comparative agents, model-routing layers), Together's catalog covers the most ground — multiple Llama variants, DeepSeek family, Qwen family, Mistral, FLUX image, BGE embeddings, fine-tuning, video models when available. It is the closest thing to a one-vendor solution for open-weights production.

**Groq's specialty**: speed. Groq's LPU inference architecture delivers token throughput multiples higher than GPU-based serverless on the models it serves (Llama 3.3 70B, Llama 3.3 8B, Mixtral, DeepSeek). The trade is catalog breadth — Groq supports a curated subset of models, not Together's full menu. Use Groq when sub-200ms time-to-first-token or 500+ tokens/sec aggregate matters more than model choice. See Groq rate limits for the per-tier ceiling details.

**Fireworks' specialty**: cost optimization at scale and fine-tuning ergonomics. Fireworks generally lands at slightly lower per-token serverless prices than Together on equivalent models, and its fine-tuning + custom-deployment workflow is the most polished of the three. Use Fireworks when steady-state production cost is the binding constraint and you have a fine-tuned model to deploy. See Fireworks AI rate limits for tier details.

**The two-vendor pattern**: many production teams run Together as the primary (catalog breadth, FLUX images, embeddings, fine-tuning) and Groq as the speed fallback for latency-sensitive paths. The serverless APIs are similar enough across all three providers that overflow routing between them is a small implementation cost.

Batch inference: the rate-limit bypass for async workloads

Together's Batch Inference API runs high-volume asynchronous workloads on a separate quota pool from real-time serverless. Submit a JSONL file with up to thousands of requests; Together processes them within a service-level window (typically 24 hours, often faster); you download the results when complete.

**Why batch sidesteps the rate-limit problem**: batch jobs do not consume your real-time RPM/TPM budget. A team running an eval set of 10M prompts on Llama 3.3 70B can submit it as a batch and let it complete in the background while real-time production traffic continues unaffected. The serverless headers do not move while the batch runs.

**Pricing on batch**: discounted vs real-time serverless on most models, typically 25-50% off. The exact discount varies by model and is published per-model in Together's pricing pages.

**When to use it**: eval harnesses, training-set generation, classification at scale, weekly re-embedding of a content corpus, A/B variant generation. Any workload where 'within 24 hours' is an acceptable SLA. The pattern that beats both 'wait for Scale upgrade' and 'spin up dedicated endpoint' for one-off large jobs is 'submit the job as batch overnight'.

**What it does not solve**: latency-sensitive workloads. Interactive agents, user-facing chat, sub-second completion endpoints — these need real-time serverless or dedicated. Batch is purely an async pattern.

Sourcing and live-verify checklist

**The headers are the source of truth.** Together's official position is that there are no fixed per-model rate-limit numbers — limits are dynamic, adjust with usage, and surface through the `x-ratelimit-limit` / `x-ratelimit-remaining` / `x-ratelimit-reset` response headers (Together rate-limits docs). Any per-model number in this guide is a typical observed value from production Build-tier accounts in June 2026, not a published guarantee. Always verify your own headers before sizing production capacity.

**The dedicated endpoint prices are published.** $6.49/hr (H100 80GB), $7.89/hr (H200 140GB), $11.95/hr (B200 180GB), per Together's dedicated endpoints docs. These are stated on the page as of June 2026 and bill per-minute while the endpoint is running.

**Live-verify procedure when you budget**: (1) run a 100-request load test against your target model on Build; (2) capture the three rate-limit headers from the responses; (3) divide observed `x-ratelimit-limit` by your expected production peak — if the ratio is above 2x you are safe on Build, between 1-2x request Scale, below 1x dedicated is on the table; (4) cross-reference Together's pricing page for current per-token rates to validate the crossover math for dedicated.

**If a number in this guide looks wrong against your headers**: trust your headers. Dynamic limits mean the picture shifts with shared-fleet capacity, your traffic shape, and Together's controller tuning. The values here are calibrated to mid-2026 and intended as planning baselines.

**Why the dynamic-limits model is actually preferable for most teams**: a static published table tempts teams to design against the maximum, hit it, then sit at the maximum unable to grow without a contract. Dynamic limits grow with demonstrated steady traffic — teams that build clean integrations (steady throughput, header-aware throttling, predictable shape) often find their effective ceiling expands without ever filing a support ticket. The downside is the up-front design cost: you cannot copy a static number into your capacity plan; you have to instrument the headers.

Step-by-step: moving from Build to Scale or Dedicated

1
Instrument the three rate-limit headers in your client
Read `x-ratelimit-limit`, `x-ratelimit-remaining`, and `x-ratelimit-reset` on every response. Log them with timestamps. You cannot size against Together's dynamic limits without this telemetry — there is no doc page to substitute for it.
2
Throttle client-side at 80-90% of your live ceiling
Use a token-bucket or leaky-bucket library tuned to `x-ratelimit-limit` per-model. Never burst above the ceiling — bursts shrink your subsequent allocation, steady traffic grows it. The goal is to never see a 429 in steady state.
3
When 429s become routine, email support with traffic data
Send Together support a chart of your observed `x-ratelimit-limit` vs your actual traffic for the model(s) you want raised, plus a target ceiling justified by peak load. Scale grants typically come through in 1-3 business days for popular models.
4
Run the crossover math before reserving a dedicated endpoint
Multiply your sustained tokens/month by the serverless per-token price. Compare to $6.49/hr × 720 hr/month = $4,673/month for an H100. If sustained monthly tokens × per-token price > the H100 budget, dedicated is cheaper. The H200 ($7.89/hr = $5,681/month) often wins on per-token cost above the crossover due to larger batch sizes.
5
Migrate hot paths to dedicated, keep cold paths on serverless
Run dedicated as the base layer for your steady production traffic, let overflow spill to serverless for bursts, keep eval and async workloads on the Batch Inference API. This three-layer pattern (dedicated base + serverless burst + batch async) is the cost-optimal shape for most teams at the Scale-or-above level of traffic.

Digital Dashboard Hub

The prompt patterns above work 10x better when they live in a library you actually own — tunable to your niche, exportable to GPT-5, Claude, Gemini, Perplexity, Midjourney, Llama. Stop pasting across 6 tools.

Try DDH's AI Prompt Builder — free 14 days, no card. →

Related calculators

OpenAI Pricing Calculator →GPT-5.5, 5.4, mini, nano — full per-call cost in one input.Claude Pricing Calculator →Opus 4.8, Sonnet 4.6, Haiku 4.5, Fable 5 — input + output combined.Context Window Comparison →Max input length and price per 1M for every current model.

Related prompt tools

Groq rate limits→Fireworks AI rate limits→Embeddings cost calculator→Open-model-tuned prompt generator→

Frequently Asked Questions

What is the difference between Together AI's Build and Scale tier in 2026?

Build is the default tier once you enable billing — dynamic per-model rate limits sized for prototyping, internal tools, and modest production traffic (roughly 600 RPM / 250k TPM observed on Llama 3.3 70B Instruct Turbo). Scale is a support-request upgrade for production accounts with sustained throughput needs — typically 8-10x the Build ceiling on the same model, granted per-model in 1-3 business days when you can show steady utilization of your current allocation. Source: Together's rate-limits documentation and observed production-account behavior, June 2026.

When should I switch from Together AI serverless to a dedicated endpoint?

Run the crossover math. A single H100 80GB at $6.49/hr = ~$4,673/month, which buys ~5.3B tokens of Llama 3.3 70B serverless at $0.88/1M blended. If your sustained monthly token volume on a single model exceeds that, a dedicated endpoint is cheaper and removes rate-limit ceilings entirely. The H200 140GB at $7.89/hr is often the better choice above the crossover point because larger batch sizes deliver 40-60% more aggregate throughput than H100. Source: Together dedicated endpoints docs.

Why did my Llama 3.3 70B call on Together hit a 429?

Three likely causes: (1) you exceeded the dynamic `x-ratelimit-limit` for the model — read the header on every response, throttle client-side at 80-90% of it; (2) you bursted traffic, which both triggers 429s and shrinks your subsequent allocation because Together's controller interprets bursts as less-mature integration; (3) the shared serverless fleet itself is at capacity, in which case you'd see HTTP 503 rather than 429. Honor the `x-ratelimit-reset` header for retry timing. Source: Together rate-limits docs.

Are Together AI embeddings rate-limited in the same pool as chat models?

No — embedding endpoints (BGE Large 1.5, M2-BERT 80M 8K) have their own rate-limit pool separate from chat models. Hitting your Llama 3.3 70B ceiling does not throttle BGE traffic, and vice versa. Reranking endpoints share the pool with embeddings on Build. Fine-tuning has yet another separate quota (concurrent job slots rather than RPM/TPM).

Is Together AI faster than Groq for Llama 3.3 70B inference?

No — Groq is materially faster on the models it serves. Groq's LPU architecture typically delivers 4-6x the tokens/sec of GPU serverless on Llama 3.3 70B. Together's advantage is catalog breadth (more models, FLUX images, embeddings, fine-tuning, batch API), not raw speed. Many production teams run Together as the primary and route latency-sensitive paths to Groq. See our Groq rate limits reference for details.

Is Together AI cheaper than Fireworks AI on equivalent open-weights models?

Roughly equivalent on most models, with Fireworks typically a few percent cheaper at serverless per-token rates and Together more competitive on dedicated-endpoint cost-per-throughput for certain models. The bigger differentiators are catalog (Together leads), fine-tuning ergonomics (Fireworks leads), and image-model support (Together via FLUX.1 family). Pick by which differentiator matches your workload. See our Fireworks AI rate limits reference for details.

How is Together AI's fine-tuning quota structured?

Fine-tuning is rate-limited as concurrent job slots, not RPM/TPM. Build tier accounts get a small number of concurrent slots (typically 1-2), Scale and Enterprise raise this. Jobs bill per-token of training data and per-token of completion-length during training. The resulting custom model deploys to either a serverless slot (subject to per-model serverless rate limits) or a dedicated endpoint of your choice.

What is the Together AI image generation rate limit on FLUX.1?

FLUX.1 schnell typically lands around 60 IPM (images per minute) on Build, with Scale grants raising to 300+ IPM. FLUX.1 dev and FLUX.1 pro have lower Build IPM (~20-30) due to higher per-image compute. SDXL sits on its own separate quota. Image limits are per-organization, shared across all API keys. Use FLUX.1 schnell for high-volume catalog generation (Apache 2.0 license), pro for commercial brand assets, dev only for non-commercial use.

Scale tier raises the ceiling. Dedicated removes it. Prompts decide both bills.

Build tier handles prototyping. Scale handles light production. Dedicated handles real volume. Whichever you ship on, prompt structure decides the per-token bill. Our AI Prompt Generator writes Llama / DeepSeek / Qwen-tuned prompts based on YOUR business + task. 14-day free trial, no card.

Browse all prompt tools →