Single-request tokens per second is the wrong metric for batch workloads. A model that streams at 210 tok/s per request can deliver tens of thousands of tokens per second in aggregate when you fan out across concurrent requests — and that aggregate number is what determines whether you finish a 100M-token job in an hour, a night, or a week. Designing for throughput is a different exercise than designing for latency, and the optimal model, provider, and concurrency level often flip when you switch goals.
Worked example: a Gemini 2.5 Flash deployment running 200 parallel requests, each streaming at the published 210 tok/s median, yields roughly 42,000 tok/s of aggregate output throughput — about 151M tokens per hour. The per-request latency does not change (each user still waits ~1.25s end to end for a 200-token answer); what changes is how much work the API tier processes in a given wall-clock minute. The same Flash tier serving a synchronous chat app might only see 5-20 concurrent requests at peak; the same tier behind a batch classification job might sustain 200-500 concurrent requests for hours.
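The arithmetic in the worked example can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and it assumes every request sustains the same streaming rate with no rate-limit stalls, which real deployments will only approximate.

```python
def aggregate_throughput(concurrent_requests: int, tok_per_s_per_request: float) -> float:
    """Aggregate output tokens per second across all in-flight requests."""
    return concurrent_requests * tok_per_s_per_request


def hours_to_finish(total_tokens: float, tok_per_s: float) -> float:
    """Wall-clock hours to emit total_tokens at a sustained aggregate rate."""
    return total_tokens / tok_per_s / 3600


rate = aggregate_throughput(200, 210)      # 42,000 tok/s
per_hour = rate * 3600                     # 151,200,000 tokens per hour
job_hours = hours_to_finish(100e6, rate)   # about 0.66 hours for a 100M-token job
```

Under these idealized assumptions, the 100M-token job from the opening paragraph finishes in roughly 40 minutes at 200-way concurrency.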
Concurrency sizing for batch jobs is bounded by provider rate limits, not by per-call speed. As of June 2026, the OpenAI tier-5 limit on gpt-5.4-mini is 30,000 RPM and 150M TPM; the Anthropic tier-4 limit on Haiku 4.5 is 4,000 RPM and 400k input TPM with a separate output TPM cap; the Google Vertex tier on Gemini 2.5 Flash sits at 2,000 RPM per project per region with quota increases up to 10,000 RPM on request. The right concurrency for a batch job is the number that saturates whichever of these limits you hit first, usually TPM rather than RPM. A useful rule of thumb: target_concurrency = (TPM_limit / 60) / (avg_tokens_per_request × requests_per_second_per_worker), where the division by 60 converts the per-minute limit into the per-second rate each worker is measured in. For most flagship-tier APIs, 50 to 200 concurrent workers saturate the limits without sustained 429s.
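The sizing rule can be sketched as a small helper. The sketch converts the per-minute limits to per-second rates before dividing, checks both TPM and RPM, and takes whichever binds first. The `headroom` factor and the example limits below are assumptions for illustration, not published quotas.

```python
import math


def target_concurrency(
    tpm_limit: float,
    rpm_limit: float,
    avg_tokens_per_request: float,
    requests_per_second_per_worker: float,
    headroom: float = 0.8,  # assumed safety margin to stay under the cap
) -> int:
    """Number of workers that saturates whichever limit (TPM or RPM) binds first."""
    tokens_per_s_budget = tpm_limit / 60
    requests_per_s_budget = rpm_limit / 60
    by_tpm = tokens_per_s_budget / (avg_tokens_per_request * requests_per_second_per_worker)
    by_rpm = requests_per_s_budget / requests_per_second_per_worker
    return max(1, math.floor(min(by_tpm, by_rpm) * headroom))


# Hypothetical tier: 4M TPM, 4,000 RPM, 1,550 tokens per request,
# and each worker completing one request every 2 seconds.
workers = target_concurrency(4_000_000, 4_000, 1_550, 0.5)  # 68 (TPM-bound)
```

Measuring `requests_per_second_per_worker` on a short pilot run, rather than estimating it from the published streaming speed, tends to give a more reliable input because it includes time to first token and network overhead.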
Diminishing returns set in past roughly 200 concurrent requests on most providers. Above that point, you start hitting TPM caps and seeing rate-limit responses, retry storms, and tail-latency inflation as the provider's internal queues fill. Adding more workers does not increase throughput; it only increases the failure rate. The fix is not more concurrency but multi-region deployment (a US-East and a US-West project each get their own quota), multi-key fanout where the provider's terms allow it, or moving the spillover to a second provider behind a cross-provider router. Production batch pipelines at scale almost always run multi-region, and often multi-provider, for exactly this reason.
Specialty providers (Groq, Cerebras, SambaNova) excel at single-request speed but have lower aggregate throughput than commodity inference clouds because they run on smaller fleets with fewer concurrent slots. Groq's published per-key concurrency limits sit around 30-60 concurrent requests on most models at standard tiers; Cerebras is similar. Their per-request speed is 10-20x faster, but their concurrent slot count is often 5-10x lower. The aggregate number can still win for medium-sized jobs (10M-100M tokens), but the math flips for very large jobs (1B+ tokens), where commodity inference at higher concurrency wins on total wall-clock time and almost always on cost.
The right answer for 'I need to process 1M items overnight' is almost always a Batch API, not synchronous high-concurrency. OpenAI Batch, Anthropic Message Batches, and Google Vertex batch prediction all accept a JSONL file, process it within a 24-hour SLA, and charge 50% of the synchronous rate. For a 1M-item classification job at 1,500 input tokens and 50 output tokens per item, that is 1.55B tokens of work. On gpt-5.4-mini synchronous at $0.75/$4.50 per 1M tokens, the bill is about $1,350; on the Batch API equivalent, it is about $675 and you skip the rate-limit engineering entirely. The 24-hour window is not a constraint when the job is genuinely overnight work.
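The cost comparison above is simple enough to reproduce directly. The following sketch uses the prices and token counts quoted in the paragraph; the function name and the `discount` parameter are illustrative, and the 50% figure is modeled as a flat discount on both input and output rates.

```python
def job_cost_usd(
    items: int,
    input_tokens_per_item: int,
    output_tokens_per_item: int,
    input_price_per_m: float,
    output_price_per_m: float,
    discount: float = 0.0,  # 0.5 models a batch tier billed at half the synchronous rate
) -> float:
    """Total cost in USD given per-1M-token prices for input and output."""
    input_cost = items * input_tokens_per_item / 1e6 * input_price_per_m
    output_cost = items * output_tokens_per_item / 1e6 * output_price_per_m
    return (input_cost + output_cost) * (1 - discount)


sync_cost = job_cost_usd(1_000_000, 1_500, 50, 0.75, 4.50)                 # 1350.0
batch_cost = job_cost_usd(1_000_000, 1_500, 50, 0.75, 4.50, discount=0.5)  # 675.0
```

Input tokens account for $1,125 of the $1,350 synchronous bill here, so for classification-style jobs the prompt length matters far more than the output length.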
The right answer for 'I need to process 1M items in 1 hour' is high-concurrency synchronous, but only with a rate-limit-aware client and multi-region failover. The minimum throughput requirement is roughly 280 items per second sustained, which at 1,550 tokens per item means ~430k tokens per second of aggregate work, or about 26M tokens per minute. That requires saturating multiple TPM tiers across regions or providers. Practical architecture: a queue (SQS, Pub/Sub, or Redis streams) draining into a worker pool of 150-300 concurrent processes per region, each running an exponential-backoff client that re-routes on 429 to a secondary region or provider, with a circuit breaker that pulls a region out of rotation if its error rate exceeds 5%. The cost premium over Batch is 2x on rates plus the engineering time, but you get the result in an hour instead of a day.
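The client side of that architecture might look like the sketch below. Everything here is hypothetical scaffolding rather than any provider's SDK: `send` stands in for whatever function issues the request, `RateLimited` for however a 429 is surfaced, and the window size and minimum sample count on the breaker are assumed values. The breaker is also deliberately simplified: it has no cool-down or half-open state, so a real pipeline would need a way to let a tripped region back into rotation.

```python
import random
import time
from collections import deque
from typing import Callable, Deque, Dict, List


class RateLimited(Exception):
    """Raised by the (hypothetical) send function when the provider returns HTTP 429."""


class CircuitBreaker:
    """Tracks recent outcomes for one region; opens when the error rate is too high."""

    def __init__(self, threshold: float = 0.05, window: int = 100, min_samples: int = 20):
        self.threshold = threshold
        self.min_samples = min_samples
        self.outcomes: Deque[bool] = deque(maxlen=window)  # True means the call failed

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def is_open(self) -> bool:
        if len(self.outcomes) < self.min_samples:
            return False  # not enough data to judge the region yet
        return sum(self.outcomes) / len(self.outcomes) > self.threshold


def call_with_failover(
    send: Callable[[str, dict], str],
    payload: dict,
    regions: List[str],
    breakers: Dict[str, CircuitBreaker],
    max_attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try regions in rotation, backing off exponentially (with jitter) on each 429."""
    for attempt in range(max_attempts):
        healthy = [r for r in regions if not breakers[r].is_open()]
        candidates = healthy or regions  # if every breaker is open, still try something
        region = candidates[attempt % len(candidates)]
        try:
            result = send(region, payload)
        except RateLimited:
            breakers[region].record(True)
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter avoids synchronized retries
            continue
        breakers[region].record(False)
        return result
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Each worker draining the queue would call `call_with_failover` with a shared `breakers` dictionary, so that one region's 429s steer the whole pool toward the secondary. Rotating to the next region on every retry is one possible policy; sticking with the primary for a few attempts before failing over is an equally reasonable choice when the secondary is more expensive.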