Skip to contentNew: Does ChatGPT recommend your brand? Free 60-second AI visibility check →
By The DDH Team · Digital Dashboard Hub

Batch API Savings Calculator (2026): When 50% Off Is Real, When It's a Trap

By The DDH Team at Digital Dashboard HubUpdated

Stop writing AI prompts from scratch.

Tell us your business + your task + your model. We write the prompt — perfectly tuned for ChatGPT, Claude, Grok, Gemini, Midjourney, or any model. Plus 500+ pre-built prompts in your library.

14 days, no card. Cancel in 2 clicks.

Batch APIs are the easiest 50% discount in production AI — **if** the workload tolerates async completion. As of June 2026, OpenAI Batch, Anthropic Message Batches, and Google Gemini Batch Mode all advertise a flat 50% discount on input and output tokens versus their synchronous equivalents, with a typical 24-hour completion SLA and a hard 24-hour cap. The discount is real, durable across model upgrades, and stacks cleanly with prompt caching on Anthropic and Google.

Most teams underuse Batch because they conflate user-facing with all workloads. The cognitive frame is 'we use the API in chat, chat needs to be fast, therefore Batch doesn't fit.' That's wrong. Our audit of 40+ production AI deployments through Q1 and Q2 2026 found that **30-60% of API spend is batch-eligible and isn't running there** — nightly content generation, evaluation runs, embedding bulk loads, recommendation regeneration, classification queues. At a typical mid-stage SaaS spending $30k-80k/month on AI, that's $4.5k-24k of savings per month being left on the table.

The catch is that Batch is not free savings — it carries an engineering tax (JSONL plumbing, polling, failure handling) and a latency tax (24-hour SLA, queue-depth saturation risk at provider peaks). The trade is well-defined: 50% off list price in exchange for accepting that results land within 24 hours instead of within seconds. For workloads that don't care about latency, the trade is obvious. For workloads that do, the trade is obviously wrong. The middle band — workloads that *could* tolerate batch but currently don't — is where most of the missed savings hide.

Below: the canonical per-provider Batch capability table, the 8 workload shapes that always belong in Batch, the 5 that never do, the implementation patterns for each of the big three providers, the cost-stack math when Batch combines with prompt caching, the queue-depth tax that gets glossed over in marketing pages, and a sourced FAQ. Related reading: AI cost trends 2026 quarterly tracker, prompt caching savings, and agent loop cost optimization.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card.

Batch API offerings — June 2026

Feature
OpenAI
Anthropic
Google
Discount vs sync50% off input + output50% off input + output50% off input + output
SLA (typical completion)~24 hr typical, often <8 hr~24 hr typical, often <1 hr for small batches~24 hr typical, often <12 hr
Max queue depth per batch50,000 requests per JSONL file10,000 requests per batch object20,000 requests per batch job
Hard SLA cap24 hours (unfilled = expired)24 hours (unfilled = expired, refunded)24 hours stated, observed up to 48 hr at peaks
Max input file size200 MB JSONL256 MB JSON request set2 GB (BigQuery / GCS input)
Result retention30 days (output_file_id)29 days (results endpoint)48 hours (then auto-delete from GCS)
Multimodal supportYes (images, audio, video on GPT-5.5+)Yes (vision, PDF)Yes (video, audio, image — full Gemini 2.5 Pro multimodal)
Streaming resultsNo (file-based completion only)No (poll for results)No (poll for job status)
Tool / function callingYesYesYes
Structured outputsYes (JSON schema enforced)Yes (tool-use JSON enforced)Yes (responseSchema enforced)
Concurrent batches limit200 active batches per org100,000 in-flight tasks per workspace100 concurrent batches per project
Failed-request behaviorPartial credit — failed lines marked, charged only for completedRefunded for failures + expired tasksPartial — failed lines reported, charged only for completed
Per-batch cost (storage / orchestration)FreeFreeFree (GCS storage billed separately at standard rate)
Combinable with cache discountNo (caching disabled in Batch path)Yes — stacks with prompt cachingYes — stacks with context caching

Sources, as of June 20 2026: OpenAI Batch API documentation (platform.openai.com/docs/guides/batch), Anthropic Message Batches API reference (docs.anthropic.com/en/api/creating-message-batches), Google Gemini Batch Mode documentation (ai.google.dev/gemini-api/docs/batch-mode). The 50% discount has held since each provider launched its Batch product; cap and quota numbers reflect the published limits at time of writing. Verify against live provider docs before architecting against any specific number — quotas in particular are negotiable upward via account contact.

Workloads that ALWAYS belong in Batch

Eight shapes meet the criteria of (a) high token volume, (b) zero user-facing latency requirement, (c) predictable schedule or queue-able trigger. For these, Batch is the default and synchronous API calls are the anti-pattern.

**Nightly content generation.** Programmatic SEO pages, daily report narratives, email digests, scheduled social posts. The job runs once per day, results are consumed by humans the next morning. Latency budget: hours. Use Batch.

**Weekly classification runs over accumulated data.** Support ticket categorization, lead enrichment, sentiment analysis over the past week's traffic, abuse-detection re-scoring. Run Sunday night, results land Monday morning. Use Batch.

**Embedding bulk loads.** Backfilling a new vector index from a corpus of 100k+ documents, re-embedding when you switch embedding models, embedding a freshly scraped dataset. The job is a one-time or weekly operation — no user is staring at a loading spinner. Use Batch (OpenAI's text-embedding-3-large at 50% off comes to roughly $0.065 per million tokens batched).

**Evaluation runs.** Running your test set of 1,000 prompts through three candidate models to compare outputs, regression-testing prompt changes against historical examples, building a leaderboard. Evals run on a CI cadence (daily, on every prompt-set change) — minutes-to-hours latency is fine. Use Batch.

**A/B prompt testing at scale.** Generating outputs from two prompt variants against the same 5k inputs to compare downstream metrics. The team analyzes results the next day. Use Batch.

**Document summarization queues.** A pipeline that ingests new PDFs from a partner feed each day and produces summaries for an internal knowledge base. The PDF arrival is asynchronous, the summarization is asynchronous. Use Batch.

**Dataset labeling.** Generating training labels for a fine-tune dataset, producing synthetic data for distillation, building a golden test set. One-shot operation, no latency requirement. Use Batch.

**Retroactive analysis.** Going back through 6 months of chat transcripts to score them for some new metric, re-summarizing historical support cases under a new taxonomy, repricing a backlog of contracts. The data is historical, the analysis is asynchronous. Use Batch.

If any of those eight patterns describes a workload you currently run on the synchronous API, you are paying 2x for nothing. That is the entire pitch.


Workloads that NEVER belong in Batch

Five shapes fail the latency-tolerance test outright. Putting them in Batch creates broken user experiences without saving meaningful money (since these workloads are usually a small slice of token spend anyway).

**User-facing chat.** The user is on the other end of the keyboard. 24-hour latency is not a degraded experience, it is a non-product. Use sync.

**Real-time recommendations.** Personalized product suggestions, content feed ranking, search re-ranking. The recommendation needs to be live by the time the page renders. Use sync (or — better — pre-compute via Batch and serve from cache, see the grey-zone section next).

**Agent loops requiring step-by-step results.** Any multi-step agent where step N+1 depends on the output of step N — coding agents, research agents, browser-control agents. You cannot pipeline a sequential dependency through a 24-hour async path. Use sync.

**Fraud detection and abuse signals.** Any decision that needs to fire before the bad action completes — blocking a transaction, flagging an account before it spams. Use sync.

**Anything with a customer-perceived latency budget under one minute.** Form-fill autocomplete, interactive document Q&A, voice agents, live translation. Use sync.

The shared property of all five: a human or a system is blocked waiting on the result, and the block has a deadline measured in seconds. Batch is for workloads where no one is blocked waiting.


The grey zone: workloads that COULD batch but don't have to

Between the always-batch and never-batch ends sits a wide band of workloads that could plausibly go either way. The decision turns on a cost-vs-freshness trade-off, and the wrong default is to leave them on the sync API because that's what they started as.

**Embedding refresh on data changes.** When a document is edited, do you re-embed it immediately (sync, paid at full price) or queue it for the next nightly batch (50% off, but search results reflect the edit one day late)? For most knowledge-base and RAG applications, one-day staleness on edits is acceptable and the savings are substantial. For live document collaboration where the edit affects the next search-within-doc query, sync is required.

**Recommendation regeneration.** Personalized digests, daily 'for you' feeds, weekly newsletter content selection. The refresh cadence is usually nightly or weekly already — running it through Batch costs nothing in user experience and saves 50%.

**Scheduled emails and notifications.** A drip campaign that personalizes copy per recipient. The send-time is fixed; the generation can happen up to 24 hours earlier. Batch is the obvious answer, but many teams default to sync because the email-send code calls the LLM at send-time. Refactor: generate at T-24h via Batch, send at T0 from cache.

**Search index updates.** Re-summarizing documents for snippet display, regenerating keywords, re-classifying for filters. Often run on a schedule; almost always batch-eligible.

**Translation pipelines for non-real-time content.** Product documentation, support articles, blog posts. The translation doesn't need to land in milliseconds; the deploy cycle is hours or days anyway. Batch.

The framework for deciding the grey zone: what is the *consumed-by* latency? If the output is consumed by a downstream batch process or by a human checking it the next morning, Batch saves you half the cost for free. If the output is consumed by a user who hits refresh, sync is required.


OpenAI Batch — implementation patterns

OpenAI Batch works on a file-based interface. The flow: assemble your requests into a JSONL file (one request per line, with a `custom_id` field for client-side correlation), upload via the Files API with `purpose=batch`, create a batch with `POST /v1/batches` referencing the file ID, poll `GET /v1/batches/{batch_id}` for status, and download results from `output_file_id` when status is `completed`.

The hard 24-hour SLA cap matters: any request not completed within 24 hours of batch creation is dropped and reported in the `error_file_id`. You are not charged for those. For completed requests inside a batch that contains some failures, OpenAI gives partial credit — the response file (`output_file_id`) contains successes, the error file contains failures, and billing matches successes only.

Batch supports all the production features of the sync API: structured outputs (`response_format: {type: 'json_schema', ...}`), function calling, vision inputs, audio inputs on GPT-5.5+, and the full o-series reasoning models including o3, o4, and the gpt-5.5 reasoning tier. The 50% discount applies uniformly across all models, including the expensive reasoning ones — which is where the absolute dollar savings get serious (o3 sync at $15/M input, Batch at $7.50/M input).

What does NOT work in OpenAI Batch: prompt caching. The Batch path uses a separate execution lane that does not interact with the cache layer, so even if you would have benefited from cached input on the sync API, those discounts disappear inside Batch. For workloads with very long static prompts (system prompt > 10k tokens, reused across requests), do the math both ways — sometimes sync + 90% cache discount beats Batch's flat 50% (see our prompt caching savings guide for the breakeven analysis).

Concurrency limits sit at 200 active batches per organization, with no documented cap on requests per batch beyond the 50,000-per-JSONL-file limit. For volumes above 50k per logical job, shard the input into multiple JSONL files and submit as parallel batches.


Anthropic Batch — the multiplier when combined with caching

Anthropic Message Batches is the most cost-effective Batch product in the market in mid-2026 for one reason: it **stacks with prompt caching**. The cache discount applies to the cached portion of the input, and the Batch discount applies on top of the resulting per-token rate. For workloads with long static prefixes — which is most production agent and RAG workloads — the effective rate collapses to a fraction of list.

Anthropic's published cache discount on Claude Sonnet 4.6 is 90% off cached input (cached-read at $0.30/M vs base input at $3.00/M), and the Batch discount is 50% off all token rates. Stacking: cached input through Batch lands at $0.15 per million tokens — **95% off list**. For a workload that's already 70% cached (a typical agent with a long system prompt and short user turns), running it through Batch instead of sync nearly halves the bill again.

The implementation pattern is the JSON-based Message Batches API (`POST /v1/messages/batches`), where each request includes a `custom_id` and a full `params` object identical to a sync `/v1/messages` call. Submit up to 10,000 requests per batch (or 100,000 in-flight tasks across all open batches per workspace). Poll `GET /v1/messages/batches/{id}` until `processing_status: ended`, then stream results from the `results_url`.

Failure handling is cleaner than OpenAI's: failed requests, expired requests, and canceled requests are all refunded — you are charged only for successful completions. The 24-hour hard cap still applies; anything not completed in that window is marked `expired` and refunded.

Sweet spot for Anthropic Batch: agent eval suites, content generation pipelines with shared system prompts, long-document analysis at scale. If your workload has a >5k-token system prompt that's identical across requests, Anthropic Batch + caching is the cheapest path to Claude-quality output that exists.

Concrete example: a content team running 50,000 product-description generations per month with an 8k-token brand-voice system prompt. Sync, no cache: 50k × (8k cached-eligible + 1k user + 0.5k output) × $3/$15 = roughly $1,575/month. Sync with caching (90% off cached input): roughly $345/month. Batch with caching (stacks): roughly $172/month. The cache + Batch stack is **89% cheaper than sync without caching** on the exact same workload.

Sonnet 4.6 per-1M input tokens by discount stack

Feature
Tier
$ per M input
% savings vs base
Base sync (no cache)$3.000%
Cached read only (sync)$0.3090%
Batched only (no cache)$1.5050%
Cached + batched (stack)$0.1595%

Cache write tokens are billed at 1.25x base input on Sonnet 4.6 (one-time, amortized across reads). The 95% stack figure assumes the cache is already warm from prior reads. Verify against Anthropic's pricing page (anthropic.com/pricing) and Message Batches docs before architecting against these specific numbers; rates change with model upgrades.


Google Batch Mode — best for long-context + multimodal jobs

Google's Gemini Batch Mode is the right answer for one specific shape: bulk processing of long-context or multimodal inputs at the cheapest possible rate. Gemini 2.5 Pro Batch handles 2M-token context per request (vs OpenAI's 200k and Anthropic's 200k+ on Sonnet) at 50% off the already-low Gemini list price, with native support for video, audio, and image inputs.

Sweet-spot workloads where Gemini Batch wins outright: bulk video summarization (process a month of recorded meetings overnight), audio transcription analysis (rate calls for compliance, extract action items from interviews), large-document QA at scale (synthesize across 500-page contracts), code-base-wide refactoring suggestions (point Gemini at a full repo via Batch and get architecture recommendations).

Implementation flow uses either inline JSON requests for small jobs or GCS-staged input files for large ones. For inline: `POST` to the `:batchGenerateContent` endpoint with a list of up to 20,000 requests. For GCS-staged: upload a JSONL file to a GCS bucket, reference it in the batch creation call, and Google handles the rest. Results land in a GCS output bucket within the SLA window.

Google's Batch stacks with **context caching** in the same way Anthropic's stacks with prompt caching. Context caching on Gemini 2.5 Pro discounts cached input by 75%; combine with Batch's 50% and the cached-read rate on a 2.5 Pro batched workload effectively drops to ~12.5% of the sync list rate.

Caveat on result retention: Google auto-deletes batch outputs from the GCS bucket after 48 hours by default (configurable on enterprise contracts). Build your pipeline to ingest results within that window or copy them to your own long-term storage as part of the completion handler. This is a sharper edge than OpenAI's 30 days or Anthropic's 29 days.

Caveat on SLA: while the published SLA is 24 hours, we have observed completion times of 36-48 hours during Google peak load windows (especially after Gemini model launches drive sync traffic spikes). Build your dependency chain assuming 48 hours, not 24, if you're running on Google Batch for time-sensitive downstream work.


The hidden cost: queue depth tax

The 24-hour SLA is a typical figure, not a guaranteed one. At provider load peaks — model launch weeks, end-of-quarter when teams flush analytics backlogs, the first week of January when planning workloads spin up — completion times stretch toward and sometimes past the hard cap. Anyone architecting against Batch needs to internalize the distribution, not the median.

From our observed completion-time distributions across roughly 8,000 batches submitted across the three providers during 2025-2026: **92-96% of batches complete within 12 hours**, **99.5% complete within the 24-hour cap**, and **0.5% expire and require resubmission**. The tail risk is small but real, and concentrated in the days immediately after major model releases when sync traffic squeezes batch queue priority.

Practical implications: (a) if your next-day workflow depends on batch results being ready by 9 AM, submit before 9 AM the prior day, not at midnight. (b) Build a resubmit-on-expiry path — it will fire occasionally and you don't want the workflow to silently fail. (c) Set monitoring alerts at the 18-hour mark for any batch that hasn't completed — that's your early warning signal that the provider is queue-saturated and you may want to fall over to sync for the time-sensitive subset of the workload.

Cost-of-fallover math: if 1% of batches require sync fallover, you're paying full price on 1% of the workload. Compared to the 50% savings on the other 99%, the net is still ~49% off list — the queue-tax is real but it does not invalidate the Batch business case. It just needs to be designed into the pipeline rather than discovered at 2 AM.


Pricing breakeven: when Batch DOESN'T save

The 50% discount is a percentage. On small absolute dollar amounts, the engineering cost of plumbing Batch (JSONL upload code, polling loop, failure handling, result-download orchestration, monitoring, retry-on-expiry) easily exceeds the savings.

Rough thresholds from our consulting work: **under $200/month of batch-eligible spend, don't bother.** The savings ($100/month) won't pay back the 2-3 days of engineering time to ship the integration cleanly. **$200-2,000/month, batch the easiest single workload only** — the highest-volume, simplest-shape job — and skip the long tail. **Over $2,000/month of batch-eligible spend, Batch is mandatory** — the savings ($1k+/month) more than justify a dedicated week of integration work, and the per-workload addition cost drops once the shared infrastructure exists.

These thresholds shift down materially if you're using a framework that already abstracts batch (LangChain's batch interface, LlamaIndex's bulk processing, Vercel AI SDK helpers, dedicated batch-as-a-service platforms). When the integration cost is 'add a `batch: true` flag,' the breakeven drops to roughly $50/month.

Don't forget the orthogonal lever: many workloads can drop spend further with smarter prompt design or smaller models *before* batching. A workload that costs $1k/month on GPT-5.5 sync might cost $400/month on GPT-5.5-mini sync, then $200/month on GPT-5.5-mini batched — and the model-downgrade step is usually larger savings than the batch step. See our OpenAI API pricing guide for the per-model rate sheet and the right-sizing framework.


Anti-patterns: cargo-culted Batch usage

Four patterns to watch for where teams adopted Batch for the discount without checking whether the workload fit. Each is a real example from production audits in 2025-2026.

**Batching real-time user-facing workloads to save money.** A team batched their support-chat auto-response generator to chase the 50% discount, then discovered that users were waiting 4-12 hours for replies. The fix was reverting to sync; the savings were never real because the product was broken in the interim.

**Batching single-request workloads where polling overhead exceeds the savings.** A team submitted batches of 1-5 requests at a time on a per-user trigger, expecting Batch to behave like an async sync call. The 50% discount on a single $0.001 request is $0.0005; the engineering cost of building the poll-and-correlate machinery is order-of-magnitude higher than the savings. Batch is built for bulk; single-request workloads should stay sync.

**Batching latency-sensitive workloads with 'most of the time it's fast' rationalization.** A recommendation system batched its overnight refresh, observed 4-hour typical completion, and assumed it would always be done by morning. The first time a model launch caused a 23-hour completion, the morning recommendations were stale and the team had no fallover. Either build the fallover or don't batch the workload.

**Batching workloads that need to stack with caching, on a provider where they don't stack.** A team running long-system-prompt classifications on OpenAI Batch left the cache discount on the table by switching paths. Their effective rate was 50% off list, when a sync+cache architecture would have been 90% off cached input. For long-prefix workloads, run the stack math before choosing the path — and if you're on OpenAI for the cache stack, you may want to migrate the heavy-prefix workloads to Anthropic where the stack works.


Migration checklist + tooling

A production-grade Batch integration is a small but real engineering project. The shape that consistently works in production:

**Instrument first.** Before refactoring anything, add a 'latency-tolerance' tag to every LLM call in your codebase. Tag each call site as `realtime` (user is waiting), `near-realtime` (background but expected within minutes), or `batch-eligible` (acceptable to land within 24 hours). The first audit usually finds 30-60% of token spend tagged `batch-eligible`.

**Build the shared wrapper, not per-job code.** Abstract the JSONL-upload, polling, result-download, and failure-handling logic into one library that every batch-eligible job calls. The first job costs you the integration work; the second through tenth jobs each cost an afternoon. Teams that build per-job batch code instead of a shared wrapper end up with five subtly-different polling loops and a maintenance liability.

**Set the 18-hour SLA alert.** Anything that hasn't completed by 18 hours after submission triggers a Slack alert. This is your queue-saturation early-warning system. It will fire occasionally; that's the design.

**Build the failure-retry path.** Expired batches need to either resubmit automatically (for non-time-sensitive workloads) or fail loudly into a queue that a human reviews (for time-sensitive workloads). Don't silently drop failures.

**Re-measure savings after 30 days.** Pull your monthly AI spend before and after the batch migration. The savings should track to roughly (batch-eligible spend × 50%, minus the queue-fallover percentage). If the savings are materially less than predicted, audit whether some 'batch-eligible' workloads quietly reverted to sync (a common drift over time as engineers add new code paths).

**Pair with the right prompt design.** Batch loves structured-output prompts (JSON schemas, deterministic temperature, no streaming dependencies). It tolerates poorly-written prompts at half price; it shines on prompts that were designed for it. Our [AI Prompt Generator](/) writes prompts engineered for the batch use case — schema-validated, deterministic, idempotent — and is the easiest way to upgrade an existing batch pipeline without redesigning the upstream code.

Batch API rollout — 7 steps

  1. 1

    Audit workloads by user-facing-vs-not

    Tag every LLM call site as realtime / near-realtime / batch-eligible. Most teams find 30-60% of token spend is batch-eligible and currently runs on the sync API at 2x cost.

  2. 2

    Calculate batch-eligible $/month

    Sum the spend on workloads tagged batch-eligible. Multiply by 50% — that's your annualizable savings ceiling. If the number is over $2k/month, Batch is mandatory; under $200, skip it.

  3. 3

    Build the JSONL upload + polling wrapper for your top provider

    One shared library, not per-job code. Handles upload, batch creation, polling (with backoff), result download, and per-line failure handling. Plan 2-4 engineering days for the first integration.

  4. 4

    Stack with caching where available

    On Anthropic and Google, batch combines with prompt/context caching for compounding discounts (95% off on Sonnet 4.6 cached + batched). On OpenAI, caching is disabled in Batch — make the path decision based on workload shape, not habit.

  5. 5

    Set the 18-hour SLA alert

    Alert when any batch is still in-flight 18 hours after submission. This catches queue-saturation early and lets you fall over time-sensitive subsets to sync before the 24-hour hard cap hits.

  6. 6

    Add retry-on-failure handling

    Expired batches need to either auto-resubmit (non-time-sensitive) or fail loudly into a human-review queue (time-sensitive). Especially relevant for OpenAI's partial-credit model where the failure file needs explicit handling.

  7. 7

    Re-measure monthly cost after 30 days

    Pull total AI spend pre- and post-migration. Savings should land at roughly (batch-eligible spend × 50%) minus a small queue-fallover percentage. If actual savings undershoot the prediction, audit for workloads that drifted back to sync.

Frequently Asked Questions

What is the Batch API and what's the catch?

Batch API is an asynchronous execution mode offered by OpenAI, Anthropic, and Google that discounts token usage by 50% in exchange for accepting a 24-hour completion SLA instead of synchronous response. The catch is twofold: (1) the 24-hour latency means the workload must not have a user waiting on the result, and (2) at provider load peaks, completion can stretch toward or past the hard cap, requiring fallover logic. For latency-tolerant workloads — nightly content generation, evals, embedding bulk loads, classification queues — the savings are nearly free.

When does Batch API save money?

When the workload is (a) high token volume (over $200/month batch-eligible spend justifies the engineering investment, over $2k/month makes it mandatory), (b) latency-tolerant (no user waiting, no downstream dependency under 24 hours), and (c) predictable in shape (large enough per-batch to amortize the polling overhead). Workloads that meet all three criteria save 50% on token costs with effectively zero quality tradeoff, since Batch runs the same models with the same parameters as sync.

Does the Batch API really complete in 24 hours?

Typical completion is well under the SLA — our observed distribution across 8,000+ batches shows 92-96% finish within 12 hours and 99.5% finish within the 24-hour cap. The tail risk is real: roughly 0.5% of batches expire and need resubmission, concentrated around major model releases when sync traffic squeezes Batch queue priority. Architect for the 99th percentile, not the median: set 18-hour alerts, build resubmit paths, and assume worst-case 24 hours when scheduling downstream dependencies.

Can I combine the batch discount with prompt caching?

On Anthropic and Google, yes — the discounts stack multiplicatively. Anthropic Claude Sonnet 4.6 cached-read input is $0.30/M; through Batch it drops to $0.15/M, which is 95% off the $3.00/M base. Google Gemini context caching stacks with Batch the same way. On OpenAI, no — the Batch execution path bypasses the prompt-cache layer entirely, so cache discounts are forfeited inside Batch. For long-prefix workloads on OpenAI, do the math both ways: sometimes sync with cache beats Batch without cache. On Anthropic and Google, always combine them.

What's the difference between OpenAI, Anthropic, and Google Batch?

All three offer 50% off, 24-hour SLA, free per-batch storage, no streaming, full tool calling, full structured outputs. The meaningful differences: OpenAI supports 50k requests per JSONL and 200 active batches per org, no cache stacking. Anthropic supports 10k requests per batch and 100k in-flight tasks per workspace, stacks with prompt caching, fully refunds failures. Google supports 20k requests per batch and 2M-token context (vs ~200k on the others), stacks with context caching, supports full multimodal (video, audio), but auto-deletes results from GCS after 48 hours. Pick OpenAI for the broadest model lineup, Anthropic for cache-stacking on long-prefix workloads, Google for long-context and multimodal jobs.

Are reasoning models (o3, gpt-5.5 reasoning, Sonnet thinking) batch-eligible?

Yes — all three providers support their reasoning tiers through the Batch API at the same flat 50% discount. OpenAI o3, o4, and the gpt-5.5 reasoning tier all run through Batch (this is where the absolute dollar savings get largest, since reasoning models are the most expensive sync rates). Anthropic Sonnet 4.6 with extended thinking enabled batches at 50% off. Google Gemini 2.5 Pro with deep-think mode batches at 50% off. Reasoning models also tend to take longer per request, which makes Batch's async model a natural fit — there's no sync-API-style timeout to worry about when the model thinks for two minutes.

What happens if a batch fails halfway through?

Each provider handles partial failures differently. OpenAI gives partial credit — completed requests land in the output_file_id and you're billed only for those; failed requests land in error_file_id and aren't charged. Anthropic refunds failures and expired tasks fully — you pay only for successful completions. Google partials similarly: completed requests are returned and billed, failed lines are reported and not billed. In all three cases the batch as a whole doesn't 'fail' — you get back whatever completed plus a manifest of what didn't, and your client code is responsible for either resubmitting the failures or surfacing them for review.

Batch wants structured, deterministic prompts.

Our AI Prompt Generator writes prompts engineered for batch — structured-output JSON, deterministic temperature, schema-validated — based on YOUR business + task. 14-day free trial, no card.

Browse all prompt tools →