Skip to contentNew: Does ChatGPT recommend your brand? Free 60-second AI visibility check →
By The DDH Team · Digital Dashboard Hub

Anthropic Message Batches API Limits 2026: 100k Requests, 256MB Files, 24h SLA, 50% Off

By The DDH Team at Digital Dashboard HubUpdated

Stop writing AI prompts from scratch.

Tell us your business + your task + your model. We write the prompt — perfectly tuned for ChatGPT, Claude, Grok, Gemini, Midjourney, or any model. Plus 500+ pre-built prompts in your library.

14 days, no card. Cancel in 2 clicks.

Anthropic's Message Batches API is the asynchronous path for any Claude workload that doesn't need a real-time response. Submit up to **100,000 Messages requests in a single batch**, up to **256MB total payload**, get results back within **24 hours** (most batches finish in **under 1 hour**, per Anthropic's own documentation), at a flat **50% discount on input tokens, output tokens, and prompt-cache writes**. The kicker: batches run on a separate quota pool, so a 100k-request batch does not consume any of your real-time RPM, ITPM, or OTPM budget.

Compared to OpenAI's Batch API, Claude's batch limits are materially more generous on the per-job ceiling: **100,000 requests vs OpenAI's 50,000**, **256MB vs OpenAI's 200MB**, and the 50% discount extends to prompt-cache writes (OpenAI's batch discount applies to standard input/output only). The processing SLA is the same 24 hours on both providers; Anthropic's stated typical completion is under 1 hour, which matches what production teams report.

Below: the canonical limits table, the JSONL request format with the **custom_id** matching pattern, the discount math when you stack prompt caching on top of the batch discount, error handling (individual failures don't fail the batch), and the decision tree for Message Batches vs streaming Messages vs OpenAI Batch. For the broader Claude pricing model see our Claude API cost calculator; for real-time tier limits see Claude API rate limits by tier.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card.

Anthropic Message Batches API limits — June 2026

Feature
Limit
Value
Notes
Max requests per batch**100,000 messages**Hard ceiling per single batch submission. 2x OpenAI's 50,000 cap.
Max file size**256 MB**Total JSONL payload. Exceed and the API returns a 413 `request_too_large` error.
Processing SLA**24 hours**Hard expiry. Anthropic states most batches finish in **less than 1 hour**. Requests not processed by 24h are marked `expired` and not billed.
Results retention**29 days**Measured from batch `created_at`, NOT from `ended_at`. After day 29, batch metadata persists but results are unavailable for download.
Discount**50% off**Applied to input tokens, output tokens, AND prompt-cache writes. Stacks with prompt-cache read discount.
Real-time quota impact**None**Batches consume a separate quota pool. A 100k-request batch does not eat into your Messages API RPM, ITPM, or OTPM.
Min tier**Tier 1**Available from the lowest paid tier. No usage-tier minimum gates batch access on Anthropic the way Tier 5 unlocks higher real-time ceilings.
Cache-write discount**50% off** (stacks)Prompt-cache write tokens billed at 1.25x base rate (5m cache) or 2x base rate (1h cache), then 50% off via batch. Net: write cache at 0.625x or 1.0x base.

Source, as of June 2026: Anthropic's batch-processing documentation (https://docs.anthropic.com/en/docs/build-with-claude/batch-processing) and Message Batches API reference (https://docs.anthropic.com/en/api/creating-message-batches), both fetched 2026-06-20. The 100,000-request and 256MB limits are stated verbatim. The 24-hour SLA, 29-day retention, and 50% discount are stated verbatim. The 'most batches finish in under 1 hour' figure is Anthropic's stated typical performance — actual completion depends on current demand and your request volume. The cache-write discount stacking is derived from Anthropic's stated 'pricing discounts from prompt caching and Message Batches can stack' guidance plus standard cache-write multipliers.

What Message Batches is — and why the separate quota pool changes the math

The Message Batches API is Anthropic's asynchronous bulk processing endpoint, exposed at `POST /v1/messages/batches`. You submit an array of Messages requests as a single payload, each tagged with a developer-supplied `custom_id`. Anthropic processes them in parallel on its own infrastructure, then makes the results available for download as a JSONL file from a `results_url` once the batch ends. Each individual request inside the batch uses the same parameters as a synchronous Messages call — same `model`, same `max_tokens`, same `messages`, same `system`, same `tools`. The only difference is the wrapper and the processing pattern.

The architecturally important detail: **batches run on a separate quota pool from real-time Messages**. This is not the case on every provider — some treat batch and real-time as a shared bucket and just discount the batch path. Anthropic gives batches their own rate-limit pool covering both batch HTTP requests and the in-flight request count inside batches. This means you can run a 100,000-request batch in the background while serving real-time traffic at your full per-tier RPM/ITPM/OTPM ceiling, with no interaction.

The use cases this unlocks at scale: **large-scale evaluations** (run 10,000 test cases through three model variants in a single batch each), **content moderation backlogs** (classify a week's user-generated content overnight at half price), **dataset enrichment** (summarize, tag, or extract structured data from a corpus), **synthetic data generation** (produce training examples at scale), **bulk transformations** (rewrite, translate, or reformat documents in bulk). For any workload where 'I need the answer within 24 hours' is acceptable, the batch path is the right answer — half the price, and your real-time capacity is untouched.


The JSONL request format and the custom_id pattern

Each batch is an array of `requests` objects, where each request has two fields: a `custom_id` string and a `params` object containing standard Messages API parameters. The `custom_id` must be **1–64 characters** matching `^[a-zA-Z0-9_-]{1,64}$` — alphanumeric, hyphens, and underscores only — and must be **unique within the batch**. Anthropic uses it to match results back to your input requests.

Why the custom_id matters: **results may be returned in any order**, and frequently are. A batch with 100,000 requests typically does not return results in submission order — Anthropic's batch processor distributes work across its inference fleet, and the result file is assembled as requests complete. The only reliable way to match a result back to its source request is the `custom_id`. Anthropic's documentation calls this out explicitly: 'always use the `custom_id` field' for matching.

Production custom_id patterns we see most often: (a) **row identifier from your source data** — `customer_12847`, `doc_8a3f9c`, `eval_case_0042`; (b) **batch + index** — `batch20260620_0001` through `batch20260620_0099999`; (c) **content hash** — first 16 chars of a SHA-256 of the prompt, for deduplication across batches. The pattern doesn't matter to Anthropic — what matters is that you can map the custom_id back to the source row in your application without doing a fuzzy match on the prompt text.

The full request payload is JSONL — one JSON object per line, no commas, no surrounding array brackets when submitted via file upload. When submitting inline via the `requests` array parameter, it's a standard JSON array. Either way, the per-request structure is the same: `{ "custom_id": "...", "params": { "model": "claude-sonnet-4-6", "max_tokens": 1024, "messages": [...] } }`.


The 100k-request + 256MB-payload ceilings — what hits first

Anthropic's documentation is explicit: a Message Batch is limited to **either 100,000 message requests OR 256MB in size, whichever is reached first**. In practice, which one binds depends entirely on per-request payload size.

**Small prompts, large request count**: 100,000 short classification requests at 1KB each is ~100MB — well under the 256MB ceiling, so the 100k request count binds first. This is the typical case for content moderation, structured extraction from short snippets, and simple classification.

**Long contexts, small request count**: a batch of long-document summarization tasks at 200KB per request (about 50k input tokens with the system prompt) hits the 256MB ceiling at ~1,280 requests — way under the 100k count limit. For long-context batches, the MB cap is the binding constraint, not the request count.

**The break-even**: 256MB / 100,000 requests = **~2.62KB per request average**. Above 2.62KB per request average payload, the MB ceiling binds first; below, the request-count ceiling binds first. Plan accordingly.

**Workarounds when you hit either ceiling**: split into multiple batches and submit in parallel. Anthropic's batch quota pool generally allows multiple in-flight batches per organization (the exact number scales with your tier — check your per-tier batch ceilings in the Anthropic Console). For a 1M-request workload, that's ten batches of 100k each. For a multi-gigabyte JSONL payload, split by row count to stay under 256MB per batch.


The 24-hour SLA — how to estimate actual completion time

The headline is **24 hours**, but it's a ceiling, not a typical. Anthropic states in its documentation that 'most batches finish in less than 1 hour.' Production teams running standard-size batches (10k-100k requests with normal-length prompts) consistently report completion in the 5-45 minute range. Long-context batches and very large batches lean closer to the 1-hour mark; demand-driven slowdowns can push individual batches to 2-6 hours.

What controls actual completion time: (a) **per-request output length** — a batch where every request generates 8,000 output tokens takes far longer than a batch generating 200; (b) **current demand on Anthropic's batch infrastructure** — explicitly called out in the docs as a slowdown factor during peak periods; (c) **your organization's batch in-flight limits** — higher tiers process more concurrently; (d) **model choice** — Haiku batches finish faster than Opus batches with equivalent payloads.

**Hard expiry at 24 hours**: requests that haven't been processed by the 24-hour mark are marked with status `expired`. You are **not billed for expired requests**. The batch as a whole transitions to status `ended` with the expired count populated in the `request_counts` object.

**Practical planning**: budget the 24-hour SLA into your pipeline as the worst case, but design for the typical 1-hour completion. For nightly batch jobs, submit by 11pm to virtually guarantee results before morning. For evaluation runs feeding into a same-day decision, submit early in the morning and plan to wake to a 1-3 hour result, not a 24-hour result.

**How to know when a batch is done**: poll the `GET /v1/messages/batches/{id}` endpoint for `processing_status`. Three values: `in_progress` (still working), `canceling` (you initiated cancellation), `ended` (terminal state — success, errors, cancellation, or expiry are all reflected in the `request_counts` object). Anthropic recommends polling every 60 seconds for typical workloads.


The 50% discount stacks with prompt caching — the math

The flat 50% discount on Message Batches applies to **input tokens, output tokens, AND prompt-cache write tokens**. This is broader than the OpenAI Batch discount, which covers standard input/output only. The stacking matters when you have shared context across batch requests — a long system prompt, a reference document, a tool schema, an example bank.

**The cache-write math**: standard Claude prompt-cache writes are billed at **1.25x the base input rate for 5-minute cache** and **2x base input rate for 1-hour cache**. Apply the 50% batch discount: 5m cache writes cost **0.625x base input** and 1h cache writes cost **1.0x base input**. For a workload that writes the cache once and reads it across 50k requests, the 1h cache write effectively pays for itself in 2-3 cache reads.

**The cache-read math**: cache reads are billed at **0.1x base input rate** (a 90% discount) at the synchronous price. Apply the 50% batch discount: cache reads in batch cost **0.05x base input** — a 95% effective discount on the cached portion of your prompt. For a Sonnet 4.6 batch with a 50k-token system prompt cached and a 200-token per-request user message, ~99.6% of input tokens get the cache-read discount, meaning effective input cost for that batch is roughly **0.06x base** — a ~94% all-in discount on input.

**Worked example**: 100,000 batch requests on Claude Sonnet 4.6, each with a 30,000-token shared system prompt (cached, 1-hour duration) + 500-token user message + 800-token output. Sonnet 4.6 base prices: $3/1M input, $15/1M output.

**Without batch, without cache**: 100k × (30,500 input × $3/1M + 800 output × $15/1M) = 100k × ($0.0915 + $0.012) = **$10,350**.

**With batch, without cache**: 50% off the above = **$5,175**.

**With batch + 1h cache (write once, read 99,999 times)**: cache write = 30,000 × $3/1M × 2.0 × 0.5 = $0.09 (one-time). Cache reads = 99,999 × 30,000 × $3/1M × 0.1 × 0.5 = $449.99. User message input = 100k × 500 × $3/1M × 0.5 = $75. Output = 100k × 800 × $15/1M × 0.5 = $600. **Total: ~$1,125** — roughly **89% off the no-batch, no-cache baseline**, or roughly **78% off the batch-only path**.

The takeaway: Message Batches alone is 50% off. Stack prompt caching with 1h duration on a workload with shared context, and you're at roughly 25% of the unoptimized synchronous bill on the same workload.


29-day results retention — why this matters for retry strategy

Batch results are downloadable for **29 days after the batch was created** — note: from `created_at`, not from `ended_at`. A batch submitted on June 1 has results available through June 30 regardless of how long processing took. After day 29, the batch metadata persists (you can still query its existence and final counts), but the `results_url` returns no downloadable content.

**Why this matters for retry strategy**: if you process 5% of failed-individual-request rows in a follow-up batch, you have a 29-day window to pull the failed rows from the original results file, build a retry batch, and submit it. After day 29, the original failures are unrecoverable from Anthropic's side — you need your own copy of the input batch to retry.

**Production pattern**: download and persist the full results JSONL to your own object storage (S3, GCS, R2) immediately after each batch ends. Do not rely on the 29-day window for any retrieval — treat it as a safety net, not a primary store. This pattern survives any future reduction in retention, and gives you the ability to retry from your own data warehouse at any time without making any additional API calls.

**29 days is generous compared to peers**. OpenAI's Batch API retains results for 30 days, broadly similar. Provider-level retention is not a long-term archive anywhere; build your own retention from day one.


Error handling — failed individual requests don't fail the batch

Each request in a batch resolves independently into one of four terminal result types: `succeeded`, `errored`, `canceled`, or `expired`. The batch as a whole reaches status `ended` once every request has resolved. **A failed individual request never fails the batch**; the partial results are always available.

**Result types and when each appears**:

**`succeeded`**: the request completed and a Messages-API-shaped response is in the result. Billed at the 50% batch rate.

**`errored`**: the request hit an error — validation error (`invalid_request_error`), authentication, permission, server error, or model overload. The result row contains the standard Anthropic error object with `error.type` and `error.message`. **You are billed for errored requests that reached the model** (most server errors) and not billed for requests rejected before reaching the model (most validation errors).

**`canceled`**: you canceled the batch (via `POST /v1/messages/batches/{id}/cancel`) before this request was sent to the model. **You are not billed** for canceled requests.

**`expired`**: the 24-hour processing window elapsed before this request was sent to the model. **You are not billed** for expired requests. Most often happens during demand spikes on Anthropic's batch infrastructure, where some tail-end requests in a large batch don't get processed in time.

**Retry pattern for errors**: stream the results JSONL, filter for `result.type === 'errored'`, extract the `custom_id` and the original `params`, build a retry batch. Validation errors require fixing your input first; server errors usually succeed on retry; overload errors should be retried with exponential backoff (don't immediately resubmit). Most production teams build a single retry pass and accept any second-pass failures as final.


Message Batches vs streaming Messages API — when to use which

**Use the streaming Messages API when**: the response is user-facing (a chat reply, an agent action, an interactive completion); the workload is bursty and small (a few requests per minute, sporadic); time-to-first-token matters (you want to start rendering before the full response is ready); you need tool use with multi-turn back-and-forth (Anthropic supports batched tool use, but multi-turn flows where each turn depends on the previous tool result are awkward to express as a single batch request).

**Use Message Batches when**: the workload is asynchronous (evaluations, content moderation backlogs, data enrichment, synthetic data generation, bulk classification, bulk translation); cost matters (50% off is meaningful at any volume above a few thousand requests/day); you want to protect real-time capacity (a large background job shouldn't compete with user-facing traffic for ITPM); you have shared context across requests (the prompt-cache write discount stacking is most valuable here).

**Decision tree**: (1) Does the user wait for the response in real time? **Yes** → streaming Messages. **No** → next question. (2) Is the volume above ~1,000 requests/day or are unit economics a constraint? **Yes** → Message Batches. **No** → streaming Messages is fine for the simplicity. (3) Does the workload need shared context across many requests? **Yes** → Message Batches with prompt caching + 1h duration for maximum discount stacking. **No** → Message Batches at the flat 50% discount.

**Mixed-mode is normal**: most production teams use both. Real-time agent and chat flows on streaming Messages; nightly eval runs, monthly content audits, and dataset enrichment on Batches. The two paths share the same model identifiers and prompt formats, so code can be written once and dispatched to either.


Message Batches vs OpenAI Batch API — head-to-head

The two providers' batch products are similar in shape but differ on every operational ceiling. Anthropic's product is more generous on per-batch capacity; both ship a 50% discount; Anthropic's discount applies to more token types.

**Per-batch request cap**: Anthropic **100,000** vs OpenAI **50,000**. Anthropic accepts twice the requests per single batch — meaningful for very large evaluation runs where coordinating multiple batches is operational overhead.

**Per-batch payload cap**: Anthropic **256MB** vs OpenAI **200MB**. ~28% more room on Anthropic for long-context payloads.

**Processing SLA**: both **24 hours**. Anthropic stated typical completion is **under 1 hour**; OpenAI's stated typical is similar (most batches complete in well under 24h). Real-world variance on both providers is demand-driven.

**Discount**: both **50% off** standard input/output. Anthropic extends the 50% to **prompt-cache write tokens**; OpenAI's batch discount is input/output only. For workloads with shared context, this stacking advantage materially favors Anthropic.

**Results retention**: Anthropic **29 days** from creation, OpenAI **~30 days** from creation. Effectively equivalent.

**Partial completion**: both providers let you download partial results for completed requests once the batch ends, regardless of how many individual requests failed. Both isolate per-request failures from batch-level success/failure.

**Quota pool**: Anthropic's batches use a **separate rate-limit pool** that does not consume real-time RPM/ITPM/OTPM. OpenAI's Batch API also uses a separate token quota independent of real-time rate limits — equivalent posture.

**Minimum tier**: Anthropic **Tier 1** (any paid account). OpenAI **Tier 1** for Batch access, with batch-specific token limits scaling by tier. Equivalent posture.

**When to pick which**: pick the batch product that matches the model you want to run. If you're on Claude Sonnet 4.6 or Opus 4.7 for the task quality, Anthropic Batches is the right path. If you're on a GPT-class model for cost or capability reasons, OpenAI Batches. The operational mechanics are similar enough that the model choice should drive the batch-API choice, not vice versa. For a deeper comparison see OpenAI Batch API limits.


Sourcing and live-verify checklist

Every number in this guide is sourced from Anthropic's official documentation as of 2026-06-20: docs.anthropic.com/en/docs/build-with-claude/batch-processing (the user guide) and docs.anthropic.com/en/api/creating-message-batches (the API reference).

**Verbatim from the user guide**: 'A Message Batch is limited to either 100,000 Message requests or 256 MB in size, whichever is reached first.' 'Most batches completing within 1 hour.' 'Batch results are available for 29 days after creation.' 'All usage is charged at 50% of the standard API prices.' 'The pricing discounts from prompt caching and Message Batches can stack.'

**Verbatim from the API reference**: `custom_id` 'Must be 1 to 64 characters and contain only alphanumeric characters, hyphens, and underscores (matching `^[a-zA-Z0-9_-]{1,64}$`).' `processing_status` values: `"in_progress"`, `"canceling"`, `"ended"`. Per-request result types: `succeeded`, `errored`, `canceled`, `expired` with the billing rules noted in the error-handling section.

**Live-verify when you budget**: open the batch-processing docs and confirm the 100k/256MB/24h/29-day/50% numbers have not changed. Anthropic's batch product has been stable on these ceilings since launch, but providers do adjust limits — verify before committing capacity plans.

**Live-check your account's batch ceilings**: the Anthropic Console at console.anthropic.com shows your organization's current batch rate limits (concurrent batch count, in-flight request count). These scale with your usage tier independent of the per-batch ceilings documented here.

**Why this page exists**: Anthropic's docs are excellent but spread across two main URLs (user guide + API reference) plus several adjacent pages (rate limits, prompt caching, supported features). When ChatGPT and Perplexity get asked 'what are the Message Batches limits' they cite whichever fragment surfaces first. This page consolidates the canonical numbers in one URL — dated, sourced, single-source-of-truth — so AI engines have a clean citation target.

Step-by-step: shipping your first Message Batch

  1. 1

    Build the JSONL — one request per line, unique custom_id on each

    Construct your input as a JSONL file (one JSON object per line) or as an inline `requests` array. Each object needs a `custom_id` (1-64 chars, alphanumeric + hyphens + underscores) and a `params` object containing standard Messages parameters: `model`, `max_tokens`, `messages`, plus any optional fields (`system`, `tools`, `tool_choice`, `temperature`). Confirm total payload is under 256MB and request count is under 100,000 before submission.

  2. 2

    Submit to POST /v1/messages/batches

    POST the batch to `https://api.anthropic.com/v1/messages/batches`. Anthropic returns a Message Batch object with an `id` (format `msgbatch_...`), initial `processing_status: "in_progress"`, and an `expires_at` timestamp exactly 24 hours after creation. Store the batch ID and `created_at` in your own database for the 29-day retention window calculation.

  3. 3

    Poll GET /v1/messages/batches/{id} until processing_status is 'ended'

    Poll every 60 seconds for typical workloads (or every 10-15 seconds if you expect sub-5-minute completion). When `processing_status` transitions to `ended`, the `request_counts` object is finalized with counts for `succeeded`, `errored`, `canceled`, `expired`, and `processing`. The `results_url` field becomes populated.

  4. 4

    Stream the results JSONL from results_url

    Download the results file by streaming, not by loading it all into memory — a 100k-request batch can produce a multi-GB results file. For each line, parse the JSON, check `result.type` (`succeeded` / `errored` / `canceled` / `expired`), and dispatch on `custom_id` back to your source row. Results are NOT in submission order — always match on custom_id.

  5. 5

    Persist results to your own storage and build the retry batch for errored rows

    Upload the full results JSONL to S3/GCS/R2 immediately — do not rely on the 29-day retention window for primary storage. Filter for `result.type === 'errored'` and `result.type === 'expired'`, build a retry batch from the original `params`, and resubmit. Expect 90%+ of errored requests to succeed on retry; treat second-pass failures as final and surface to your data team.

Frequently Asked Questions

Does Anthropic's Message Batches API eat into my real-time ITPM or RPM?

No. Batches run on a separate quota pool from the real-time Messages API. A 100,000-request batch does not consume any of your per-tier RPM, ITPM, or OTPM budget. This is one of the key architectural advantages of the batch path — you can run a large background workload without throttling your user-facing traffic. Anthropic does have separate batch-specific rate limits (concurrent batches, in-flight request count) that scale with your usage tier; check the Anthropic Console for your current ceilings.

What is custom_id for in a Message Batch?

It's a developer-supplied string (1-64 chars, alphanumeric + hyphens + underscores, matching `^[a-zA-Z0-9_-]{1,64}$`) that Anthropic uses to match results back to your input requests. Critical because results are returned in non-deterministic order — a batch with 100,000 requests almost never returns results in submission order. Use a meaningful pattern like your source row ID, a content hash, or a `batch_date_index` format so you can map results back to your application data without fuzzy-matching the prompt text.

Can I cancel an Anthropic Message Batch after submission?

Yes. POST to `/v1/messages/batches/{id}/cancel` and the batch transitions to `processing_status: "canceling"`, then to `ended` once cancellation is finalized. Requests that already completed before cancellation are billed at the standard 50% batch rate and their results are available in the partial results file. Requests that hadn't yet been sent to the model are marked `canceled` and are not billed.

Does prompt caching work with Message Batches?

Yes — and the discounts stack. The Message Batches API supports prompt caching, with both cache-write and cache-read tokens further discounted by the 50% batch rate. Cache hit rates in batches range from 30% to 98% depending on traffic patterns (Anthropic's own range), since concurrent batch processing means cache hits are best-effort rather than guaranteed. For batches with shared context, Anthropic recommends the 1-hour cache duration for better hit rates — a 5-minute cache often expires during longer batch processing.

Are vision and tool use supported in Message Batches?

Yes. The Message Batches API supports nearly all features of the synchronous Messages API: vision (image inputs), tool use (including server tools like web search, web fetch, code execution, MCP connectors), extended thinking, structured output via JSON schemas, and prompt caching. A small set of parameters is not supported — primarily `stream` (batch responses are returned as a file, not streamed), and a handful of beta features. The full exclusion list is in Anthropic's batch processing docs.

What happens if some requests in my batch fail?

Individual request failures do not fail the batch. Each request resolves independently into one of four result types: `succeeded`, `errored`, `canceled`, or `expired`. The batch as a whole reaches `processing_status: "ended"` once every request resolves. Partial results are always available — you can download the full results JSONL and see exactly which requests succeeded and which failed. You are billed for `succeeded` and most `errored` requests (those that reached the model); not billed for `canceled` (you cancelled) or `expired` (24h window elapsed).

How do I retry failed rows from a Message Batch?

Stream the results JSONL, filter for `result.type === 'errored'` and `result.type === 'expired'`, extract the `custom_id` from each failed row, look up the original `params` in your source data, and submit a new batch containing just the retry requests. Most production teams run a single retry pass and accept second-pass failures as final. Validation errors require fixing the input first; transient server errors usually succeed on retry; overload errors should be retried with exponential backoff to avoid immediately re-saturating the queue.

Should I use Message Batches or streaming Messages for an evaluation run?

Message Batches, almost always. Evaluation runs are asynchronous by nature — you're benchmarking a model across hundreds or thousands of test cases and waiting for aggregate metrics, not a single real-time response. You get 50% off, your real-time API capacity stays free for production traffic, and a shared system prompt (eval rubric, reference document, scoring guidelines) gets the stacked prompt-cache discount. The only reason to run an eval on streaming Messages is if it's tiny (under ~100 test cases) and you want results in minutes rather than tens of minutes.

Batches halve the bill. Caching halves it again.

Message Batches alone get you 50% off. Stack prompt caching on top and the same workload runs at ~25% of streaming cost. Our AI Prompt Generator writes Claude-tuned prompts with cache-anchor up top + batch-shaped output, based on YOUR business + task. 14-day free trial, no card.

Browse all prompt tools →