By The DDH Team · Digital Dashboard Hub

Replicate Rate Limits 2026: Predictions Per Second, Concurrency & Cold Starts

By The DDH Team at Digital Dashboard Hub · Updated June 20, 2026


Replicate is a different beast from token-priced LLM APIs. You don't pay per token — you pay **per second of GPU time** the model is actually running, billed against the hardware class you selected (Nvidia A100, H100, L40S, T4, etc.). That changes everything about how rate limits matter. The published **600 requests/minute** ceiling on prediction creation is rarely the binding constraint. The real constraints, in order of how often they bite production teams: (1) **cold starts** — first prediction on a sleeping public model can take **30-90 seconds** before any inference begins; (2) **per-model concurrency caps** that limit how many simultaneous runs of any single model your account can have on the shared pool; (3) GPU availability spikes during launch events; and only then (4) the 600/min rate limit itself.

Replicate publishes two numbers on its rate-limits doc: **600 requests/min for prediction creation** (`POST /v1/predictions` and `POST /v1/deployments/.../predictions`) and **3,000 requests/min for all other endpoints** (status polling, listing, cancellation). Accounts in the 'low credit' state — granted credits but no payment method on file — get a much stricter **1 req/sec with a 6 req/min hard ceiling**. When you hit any of these, the API returns **HTTP 429** with a `detail` field like 'Request was throttled. Expected available in 1 second.' There is no per-tier ladder the way OpenAI has — Replicate's pricing model means you scale by adding always-on capacity or self-hosting, not by climbing a tier ladder.
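As a rough illustration of reading that throttle message, the sketch below pulls the wait hint out of a `detail` string. It assumes the message keeps the wording quoted above; the helper name `seconds_until_available` is hypothetical, and the function returns `None` when the text does not match.

```python
import re
from typing import Optional

# Assumed message shape, based only on the example quoted in this article.
_WAIT_PATTERN = re.compile(r"Expected available in (\d+(?:\.\d+)?) seconds?")


def seconds_until_available(detail: str) -> Optional[float]:
    """Return the wait hint embedded in a 429 `detail` string, or None."""
    match = _WAIT_PATTERN.search(detail)
    return float(match.group(1)) if match else None


hint = seconds_until_available(
    "Request was throttled. Expected available in 1 second."
)
print(hint)
```

Treat the parsed value as a hint only; if the wording changes, fall back to a fixed delay rather than failing.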

Below: the canonical limits + deployment options table, eight body sections covering how to actually engineer around cold starts, GPU class selection economics, when to switch from shared pool to dedicated deployments to self-hosted Cog, and a sourcing checklist for live-verifying your account's specific ceilings. For broader image-pipeline cost work see our Midjourney cost calculator; for sibling image-API rate-limit references, DALL·E 3 rate limits and Fireworks AI rate limits.


Replicate rate limits and deployment options — June 2026

| Account / deployment type | Predictions/sec | Concurrent runs/model | Cold start (typical) |
|---|---|---|---|
| Free account (low credit / no card) | ~0.1/sec (6/min hard cap) | 1-2 (shared pool) | 30-90s (often longer) |
| Paid account (PAYG, shared pool) | ~10/sec (600/min) | Model-dependent, usually 4-16 | 30-90s on cold model, 0-3s warm |
| Always-on dedicated deployment | Capped only by instance count | = min_instances × replicas/instance | 0s (warm), but billed per GPU-second 24/7 |
| Self-hosted Cog (your infra) | Limited by your hardware only | Unlimited (you own the GPUs) | 0s warm / your boot time cold |

Source, as of June 2026: Replicate rate-limits documentation (https://replicate.com/docs/topics/predictions/rate-limits) and deployments documentation (https://replicate.com/docs/topics/deployments). The 600/min and 3,000/min ceilings on the paid public API and the 6/min low-credit cap are quoted verbatim from the docs. Per-model concurrency caps are not published as a single number — they vary by model based on GPU availability and historic load patterns. Cold-start times are model-specific and reflect typical observed ranges across FLUX.1, SDXL, Llama 3.3 70B, and similar popular models on the shared pool; very large models (70B+) can cold-start beyond 90s. Always-on dedicated deployment pricing is per-second of GPU time on the selected hardware class — see Replicate's pricing page for current GPU-class rates.

Replicate's pricing model: per-second GPU time, not per-token

Every other major inference API in this comparison set (OpenAI, Anthropic, Fireworks, Together, Groq for LLMs) bills primarily on **tokens in / tokens out**. Replicate doesn't. Replicate bills on **wall-clock seconds of GPU time** the model spends executing your prediction, multiplied by the per-second rate of the GPU class the model runs on. A FLUX.1 image that takes **3.2 seconds** on an Nvidia L40S at roughly **$0.000975/sec** costs about **$0.003** for that image. An SDXL generation that takes **6 seconds** on the same hardware costs about **$0.006**. A Llama 3.3 70B completion that runs **12 seconds** on an A100 (~$0.001400/sec) costs about **$0.017** for that call.
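The arithmetic behind those three examples can be sketched as a one-line function. The rates are the approximate figures quoted above, not live prices, and the name `prediction_cost` is only illustrative.

```python
# Approximate per-second rates quoted in this article (not live pricing).
L40S_RATE = 0.000975
A100_RATE = 0.001400


def prediction_cost(gpu_seconds: float, rate_per_second: float) -> float:
    """Cost of one prediction: seconds of GPU time times the per-second rate."""
    return gpu_seconds * rate_per_second


flux_cost = prediction_cost(3.2, L40S_RATE)    # FLUX.1 image on L40S
sdxl_cost = prediction_cost(6.0, L40S_RATE)    # SDXL image on L40S
llama_cost = prediction_cost(12.0, A100_RATE)  # Llama 3.3 70B call on A100

for name, cost in [("FLUX.1", flux_cost), ("SDXL", sdxl_cost), ("Llama", llama_cost)]:
    print(f"{name}: ${cost:.4f}")
```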

This changes rate-limit planning fundamentally. On OpenAI, your bill scales with output length. On Replicate, **your bill scales with how long the model is running** — which is a function of (a) input size, (b) sampler steps or generation tokens, (c) which GPU class it's on, and (d) the model's underlying efficiency. A poorly-tuned prompt that requires 50 sampler steps to converge costs 2.5x what a well-tuned prompt requiring 20 steps costs, on the exact same hardware. Steps and tokens are levers; so is GPU class selection.

For LLMs hosted on Replicate (Llama family, Qwen family, DeepSeek family, Mistral family) the bill is sometimes computed as **per output token** for the most popular hosted versions and **per second of GPU time** for the long tail. Check each model's page — the pricing block at the top of replicate.com/<owner>/<model> shows the billing unit explicitly. For image and video models the answer is almost always per-second of GPU time.

Free vs paid rate limits (and the 'low credit' trap)

Replicate doesn't have a five-tier ladder the way OpenAI does. There are effectively three states: **low-credit account** (no payment method on file, running on granted credits), **standard paid account** (payment method on file, healthy balance), and **rate-limit-increased account** (granted bespoke higher ceilings by Replicate support for a specific use case).

**Low-credit state** is the gotcha that catches teams. New signups get a small amount of free credit to evaluate the platform. Until you add a payment method, you're capped at **1 request per second with a 6 requests/minute hard ceiling**, regardless of credit balance. Replicate's docs describe this as a guardrail against accidental overspend on granted credit. Add a card before you do real testing — otherwise your 'is the rate limit too low?' debugging session will yield a misleading answer.

**Standard paid state** is what every production team operates in. You get **600 prediction-creation requests/minute** (~10/sec sustained) and **3,000 requests/minute** on all other endpoints (status polling, listing, cancellations, deployment management). These are organization-level limits. All API keys on the same account share the same budget. There's no 'team tier' or 'enterprise tier' line item on the standard pricing page — higher ceilings are negotiated bespoke through Replicate support.
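One way to stay under the 600/min ceiling is to throttle on the client before the API does. The following is a minimal sliding-window sketch, not anything Replicate ships; the class name and the injectable `clock` parameter are assumptions made for illustration and testing.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any `window_seconds` span."""

    def __init__(
        self,
        max_calls: int = 600,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._stamps: Deque[float] = deque()

    def try_acquire(self) -> bool:
        """Record a call and return True, or return False if the window is full."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            self._stamps.append(now)
            return True
        return False
```

Because every API key on the account shares one budget, an in-process limiter like this only helps when a single process issues all the requests; with several workers the counter would need to live in a shared store.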

**Rate-limit increase requests** are granted on a case-by-case basis. The bar is typically a legitimate production use case, a predictable load profile, and ideally an existing spend pattern that demonstrates the account is real. Email support@replicate.com with your account ID, the model(s) you're calling, and the sustained predictions/sec you need to hit. Response times are 1-3 business days in our experience.

Per-model concurrency cap: the limit that's not on the rate-limits page

The 600 req/min ceiling is the published limit. The **per-model concurrency cap** is the unpublished one — and it bites more often. On the shared pool, every public model has a finite number of GPU replicas Replicate keeps warm globally for all users combined. When demand exceeds replicas, your predictions queue. When the queue exceeds a per-account fairness threshold, your `POST /v1/predictions` calls return **429** even though you're well under 600 req/min.

Concrete example. FLUX.1 [dev] is wildly popular. If you call it during a US-business-hours spike, the public pool may have **8 warm replicas** serving the entire customer base. Your 10 concurrent calls to FLUX.1 will queue behind everyone else's; calls 4-10 in your batch will likely see queueing latency that pushes wall-clock times to **20-60s** even though the actual GPU inference is **3 seconds**. There is no API field that tells you how many replicas exist or where you are in the queue.

How to find your effective concurrency cap on any specific model: instrument your client. Submit 20 simultaneous predictions on the model. Measure (a) how many succeed within 5 seconds, (b) how many queue, (c) how many 429. That's your operational concurrency budget for that model on the shared pool. Re-measure during off-peak hours (overnight US time, weekends) to see the difference. Most popular models give you **4-16 concurrent runs** on the shared pool in steady state, with substantial variance by hour.
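The measurement described above can be sketched as a small probe. Here `submit` is a hypothetical callable you would write yourself: it creates one prediction, waits for it, and returns the HTTP status plus the wall-clock seconds. The 5-second threshold mirrors the text.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple

# Hypothetical callable supplied by you: returns (http_status, wall_clock_seconds).
SubmitFn = Callable[[], Tuple[int, float]]


def classify(status: int, seconds: float, fast_threshold: float = 5.0) -> str:
    """Bucket one result as throttled (429), fast, or queued."""
    if status == 429:
        return "throttled"
    return "fast" if seconds <= fast_threshold else "queued"


def probe_concurrency(submit: SubmitFn, burst: int = 20) -> Counter:
    """Fire `burst` submissions at once and count how each one fared."""
    with ThreadPoolExecutor(max_workers=burst) as pool:
        futures = [pool.submit(submit) for _ in range(burst)]
        results = [future.result() for future in futures]
    return Counter(classify(status, seconds) for status, seconds in results)
```

The "fast" count is a reasonable first estimate of the concurrency budget for that model at that hour; running the probe again off-peak shows how much it moves.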

The only way to remove the per-model concurrency cap is to **stop using the shared pool**. Either provision a dedicated deployment (your own warm capacity) or self-host the model via Cog. Both are covered in detail below.

Cold starts: why the first call takes 60 seconds (the real bottleneck)

**Cold start is the single most-asked question on the Replicate community forum**, and it dominates latency for low-traffic workloads. When a model on the shared pool hasn't been called recently, Replicate has spun it down to **zero warm replicas** to free GPUs for other models. Your first call wakes it up: pulling the container image (~10-30s on big models), loading model weights to GPU memory (~10-40s on 70B-class models or large diffusion checkpoints), warming the inference cache (~5-20s), then actually running your prediction. Total cold-start wall time: **30-90 seconds** for most popular models, and **2-4 minutes** for the largest checkpoints (Llama 3.3 70B FP16, video models, large multi-modal models).

Once warm, the same model serves subsequent predictions in **sub-second to a few seconds** depending on inference complexity. The model stays warm as long as there's traffic; idle threshold before spin-down is typically **a few minutes** but varies by model popularity (FLUX.1 stays warm continuously; an obscure custom model spins down within minutes of your last call).

**You are not billed for cold-start container/image-pull time** in the typical case — the per-second meter starts when your prediction begins executing on the GPU. But you *are* billed for the GPU seconds the model spends actually inferring, which includes the GPU-bound portion of warmup on some models. The exact billing boundary is documented per-model on the model's pricing page; the general rule is 'time the predict function is running' counts, 'time spent provisioning the container' does not. Verify on the specific model you're using by submitting one cold prediction and comparing your billed seconds to the wall-clock time.
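To make that comparison concrete, the sketch below splits one finished prediction into waiting time and predict time. It assumes the prediction object carries `created_at`, `started_at`, and `completed_at` timestamps plus a `metrics.predict_time` value in seconds, and that `predict_time` is the figure billing is based on; check both assumptions against a real response and invoice for your model.

```python
import re
from datetime import datetime


def _parse(ts: str) -> datetime:
    # Trim fractional seconds to 6 digits and normalize a trailing "Z" so that
    # datetime.fromisoformat accepts the value on older Python versions.
    ts = re.sub(r"(\.\d{6})\d+", r"\1", ts).replace("Z", "+00:00")
    return datetime.fromisoformat(ts)


def timing_breakdown(prediction: dict) -> dict:
    """Split a finished prediction into wait, wall-clock, and predict seconds."""
    created = _parse(prediction["created_at"])
    started = _parse(prediction["started_at"])
    completed = _parse(prediction["completed_at"])
    return {
        "wait_seconds": (started - created).total_seconds(),
        "wall_clock_seconds": (completed - created).total_seconds(),
        "predict_seconds": prediction["metrics"]["predict_time"],
    }
```

A large gap between `wall_clock_seconds` and `predict_seconds` on a cold call is time spent queueing or booting rather than running the predict function.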

Cold starts are **not a rate limit**, but they hit your latency budget the same way. A user-facing image-generation feature with a P95 latency requirement of 5 seconds cannot use the public Replicate pool for any model that sees less than a few requests/second sustained — the cold start will blow your SLA on every cold call. This is the single biggest architectural decision Replicate forces on production teams: either accept cold starts (async pattern with webhooks, user-facing 'generating...' loaders that show 30-90s), or provision dedicated deployments (always-warm, but you pay 24/7).

Always-on dedicated deployments: the math for when to switch

Dedicated deployments give you private, dedicated API endpoints with your own pool of warm replicas of the model. Set `min_instances=1` and the model is **always warm**. Cold starts disappear entirely for any prediction that fits within your warm capacity. The catch: you pay per-second of GPU time on the deployment's hardware, **whether or not predictions are running**. An idle deployment burns money.

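**The breakeven math.** Take FLUX.1 on an L40S at ~$0.000975/sec. A 24/7 always-on deployment with 1 replica costs **$0.000975 × 86,400 ≈ $84/day**, or roughly **$2,530/month** over 30 days. The same workload served from the shared pool at $0.003/image (3-second inference) breaks even at **~840,000 images/month** (~28k/day, ~19/minute steady). Because this example uses the same per-second rate on both sides, that volume corresponds to one replica running almost continuously (86,400 s ÷ 3 s ≈ 28,800 images/day), so sustained volume above it needs additional replicas. If your workload is below that volume, the shared pool plus accepting cold starts is cheaper; at or above it, the dedicated deployment is at least cost-competitive *and* faster.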

For larger models the math shifts faster. Llama 3.3 70B on an A100 at ~$0.001400/sec is **$121/day** always-on, **~$3,630/month** per replica. A shared-pool call running 12 seconds at $0.001400/sec costs $0.017 per completion. Breakeven: **~214,000 completions/month** (~7,100/day, ~5/minute steady). Mid-volume LLM workloads cross this breakeven faster than image workloads.
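The breakeven figures above follow from two small formulas, sketched here with the article's approximate rates and a 30-day month. The function names are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def always_on_cost(rate_per_second: float, days: int = 30) -> float:
    """Cost of keeping one replica warm around the clock for `days` days."""
    return rate_per_second * SECONDS_PER_DAY * days


def breakeven_predictions(
    rate_per_second: float, shared_cost_per_prediction: float, days: int = 30
) -> float:
    """Predictions per period at which shared-pool spend equals one warm replica."""
    return always_on_cost(rate_per_second, days) / shared_cost_per_prediction


flux_breakeven = breakeven_predictions(0.000975, 0.003)   # FLUX.1 on L40S
llama_breakeven = breakeven_predictions(0.001400, 0.017)  # Llama 3.3 70B on A100
print(f"{flux_breakeven:,.0f} images/month, {llama_breakeven:,.0f} completions/month")
```

Swap in the live per-second rate and your own measured cost per prediction before using the result for a purchasing decision.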

**Dedicated deployments also let you scale.** Set `min_instances=2, max_instances=10` to keep 2 always-warm replicas for low-latency baseline plus autoscale up to 10 during spikes. Replicate handles the autoscaling based on traffic. You pay for the warm baseline 24/7, plus the burst replicas only while they're running.

**Hardware selection on dedicated deployments matters more than on the shared pool.** On the shared pool, Replicate picks the GPU; on a deployment, you pick. A100 is cheapest for most LLMs but slower than H100. H100 is roughly **2-3x faster** for many models and, at the rates quoted in the GPU-class section of this article, roughly **1.8x more expensive per second** ($0.002500 vs $0.001400), so cost-per-prediction is often similar while **latency-per-prediction is sharply better on H100**. L40S is the cost-balanced sweet spot for image/video diffusion. T4 is too slow for almost any production workload but cheap enough for batch precomputes.

Webhooks: the recommended pattern for long-running predictions

Replicate's API supports two async patterns for predictions that take longer than your HTTP timeout: **polling** (`GET /v1/predictions/{prediction_id}` until `status` reaches `succeeded`, `failed`, or `canceled`) and **webhooks** (provide a `webhook` URL on prediction creation; Replicate POSTs the completed prediction body when done). For any production system serving real users, **webhooks are the canonical pattern**.

How it works. On `POST /v1/predictions`, include `webhook: 'https://your-app.example.com/replicate/callback'` in the request body. The call returns **HTTP 201** immediately with the prediction object in `starting` status. Your webhook endpoint receives a POST with the full prediction body (`status: 'succeeded'`, `output: [...]`, `metrics: {...}`) when the prediction completes — whether that's 3 seconds later (warm) or 90 seconds later (cold start). Replicate **retries failed webhook attempts**, so your endpoint must be idempotent on `prediction.id`.

Polling is the fallback when you don't control a public webhook URL (e.g., during local development, or in a mobile client that can't expose a server). The polling endpoint counts against the **3,000 req/min** budget — separate from the 600 req/min prediction-creation budget. Poll at 1-2 second intervals on warm models; at 3-5 second intervals on potentially-cold models to avoid wasting requests.
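A polling loop along those lines might look like the sketch below. `fetch` is a hypothetical callable that performs the `GET /v1/predictions/{prediction_id}` request and returns the parsed JSON; the `sleep` and `clock` parameters exist only so the loop can be tested without waiting.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def poll_until_done(
    fetch: Callable[[], dict],
    interval_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call `fetch` until the prediction reaches a terminal status or time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        prediction = fetch()
        if prediction["status"] in TERMINAL_STATUSES:
            return prediction
        if clock() >= deadline:
            raise TimeoutError("prediction did not finish before the timeout")
        sleep(interval_seconds)
```

At a 2-second interval one in-flight prediction costs about 30 requests per minute, so roughly 100 concurrent pollers would consume the whole 3,000/min budget.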

There's also a synchronous mode via the **`Prefer: wait`** header on the prediction-creation request. This blocks the HTTP response for up to **60 seconds**, returning the completed prediction inline if it finishes within that window. Use this for short-running predictions (warm models, small images) where blocking is acceptable. Do not use it for cold-start-prone models or LLMs with long outputs — 60s isn't enough headroom.
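The request shapes described in this section can be sketched with the standard library alone. The function below only builds the request and does not send it. The `Bearer` authorization scheme and the `version` field in the body are assumptions here; confirm both against Replicate's HTTP reference for your account and model.

```python
import json
import urllib.request
from typing import Optional

API_URL = "https://api.replicate.com/v1/predictions"


def build_create_request(
    token: str,
    version: str,
    model_input: dict,
    webhook: Optional[str] = None,
    wait: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a prediction-creation request."""
    body = {"version": version, "input": model_input}
    if webhook:
        # Async pattern: Replicate calls this URL when the prediction completes.
        body["webhook"] = webhook
    headers = {
        "Authorization": f"Bearer {token}",  # assumed auth scheme
        "Content-Type": "application/json",
    }
    if wait:
        # Synchronous mode: hold the response open for up to 60 seconds.
        headers["Prefer"] = "wait"
    return urllib.request.Request(
        API_URL, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )
```

Passing the result to `urllib.request.urlopen` would issue the call; in practice you would choose either `webhook` or `wait=True` for a given request, depending on how long the model is expected to run.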

**Idempotency.** Replicate prediction IDs are generated on the server; you can also include an idempotency key on creation to deduplicate retries. For systems where webhook delivery might be retried (almost always), key your downstream side-effects on `prediction.id` so a duplicate webhook delivery doesn't double-write your database.
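A minimal sketch of that deduplication follows. The in-memory set stands in for whatever durable store you use (for example a unique constraint on `prediction.id` in your database), and the class name `WebhookProcessor` is hypothetical.

```python
from typing import Callable, Set

TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


class WebhookProcessor:
    """Run side effects at most once per prediction id."""

    def __init__(self, on_complete: Callable[[dict], None]) -> None:
        self._seen: Set[str] = set()  # stand-in for a durable store
        self._on_complete = on_complete

    def handle(self, payload: dict) -> bool:
        """Return True if side effects ran, False if the delivery was skipped."""
        if payload.get("status") not in TERMINAL_STATUSES:
            return False  # intermediate update, nothing to persist yet
        prediction_id = payload["id"]
        if prediction_id in self._seen:
            return False  # duplicate delivery from a retry
        self._seen.add(prediction_id)
        self._on_complete(payload)
        return True
```

In a real handler the check and the write should happen atomically in the database, since two retries can arrive at different workers at nearly the same time.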

GPU class selection: A100 vs H100 vs L40S vs T4 pricing and performance

GPU class is the single biggest cost lever on Replicate after model selection. Replicate's hardware menu, as of June 2026, includes (cheapest to most expensive per-second): **CPU**, **T4**, **L40S**, **A40**, **A100 (40GB and 80GB)**, **H100**, and **H100 8-up** for the largest models. Specific per-second rates change occasionally — always check the live hardware pricing page when budgeting.

**T4** ($0.000225/sec ≈ $0.81/hour). The cheapest GPU. Suitable for small models, embeddings, classification, and any inference that fits in 16GB VRAM. Latency is 3-5x worse than A100 for diffusion models. Use for batch precomputes, not user-facing.

**L40S** ($0.000975/sec ≈ $3.51/hour). The sweet spot for diffusion (FLUX.1, SDXL, video models). 48GB VRAM accommodates most modern image/video checkpoints. Latency is competitive with A100 for image generation while costing meaningfully less per second.

**A100 80GB** ($0.001400/sec ≈ $5.04/hour). The workhorse for LLMs in the 7B-70B range. 80GB VRAM is enough for Llama 3.3 70B in 8-bit, Qwen2.5 family, DeepSeek-V3 distilled variants. Most LLM hosting on Replicate runs on A100.

**H100** ($0.002500/sec ≈ $9/hour). Roughly **2-3x faster** than A100 on the same model. Use when latency matters more than per-second cost — interactive agents, sub-second-target chat completions, video generation. Cost per completion is often within 20% of A100 because the faster inference offsets the higher per-second rate.

**H100 8-up** ($0.020000/sec ≈ $72/hour). Multi-GPU configuration for very large models (70B+ FP16, 405B 4-bit, video models with very long sequences). Almost always overkill unless your specific model card requires it.

**The decision rule.** For image workloads, default to L40S. For LLMs, default to A100. For latency-critical user-facing workloads, pay the H100 premium. For batch precomputes where wall-clock doesn't matter, T4. Never assume a more expensive GPU = better cost-per-prediction; for many workloads the cheaper GPU produces a similar bill because the inference takes longer.
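The last point of that rule is easy to check numerically. The sketch below uses the per-second rates quoted in this section and assumes, purely for illustration, that a 12-second A100 call runs twice as fast on H100.

```python
# Per-second rates as quoted in this article (rounded, not live pricing).
RATES = {"T4": 0.000225, "L40S": 0.000975, "A100": 0.001400, "H100": 0.002500}


def cost_per_prediction(gpu: str, seconds_on_gpu: float) -> float:
    """Cost of one prediction given how long it runs on the chosen GPU class."""
    return RATES[gpu] * seconds_on_gpu


a100_cost = cost_per_prediction("A100", 12.0)
h100_cost = cost_per_prediction("H100", 12.0 / 2)  # assumed 2x speedup

print(f"A100: ${a100_cost:.4f}  H100: ${h100_cost:.4f}")
```

Under that assumed speedup the two costs land within 20% of each other, which matches the claim above; with a smaller real-world speedup the H100 bill rises accordingly, so measure the runtime on both before committing.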

Running Cog-packaged models on your own infrastructure

Cog is Replicate's open-source containerization tool. Every model on Replicate is a Cog package — a Docker container with a defined `predict()` interface, a `cog.yaml` spec, and pinned dependencies. The same Cog image you push to Replicate can be pulled and run on any container host with a GPU: your own Kubernetes cluster, a single EC2 GPU instance, a RunPod pod, a Lambda Labs box, a Modal app.
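As an illustration of calling such a container directly, the sketch below builds a request for Cog's built-in HTTP server. It assumes the container is already running and published on `localhost:5000` with a `POST /predictions` route that takes an `{"input": ...}` body; the host, port, and input field names are placeholders to adjust for your setup.

```python
import json
import urllib.request


def build_cog_request(
    model_input: dict, host: str = "http://localhost:5000"
) -> urllib.request.Request:
    """Build (but do not send) a prediction request for a locally running Cog container."""
    body = json.dumps({"input": model_input}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/predictions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_cog_request({"prompt": "a lighthouse at dusk"})
print(request.full_url)
```

Sending it with `urllib.request.urlopen(request)` would run the prediction on your own GPU, with no Replicate rate limit in the path.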

**Why self-host.** Two reasons dominate. **First**: cost at scale. If you're running >$10k/month of predictions on a single Replicate-hosted model, your effective per-prediction cost is mostly Replicate's margin on top of the underlying GPU rental from AWS / Lambda / CoreWeave. Self-hosting cuts that margin — typically 30-50% savings if you're already operating Kubernetes or another container orchestrator. **Second**: data residency or privacy. Self-hosted Cog models keep inference data inside your VPC. For regulated workloads (HIPAA, GDPR with strict data-locality requirements, defense contracting), self-hosting is the only path.

**Why not self-host.** The operational lift is real. You're now responsible for: GPU instance provisioning, autoscaling (Replicate's deployment autoscaler is genuinely good — replacing it isn't trivial), webhook routing, observability, queue management for burst traffic, GPU driver / CUDA version management, model checkpoint storage, A/B testing infrastructure. A team running 5+ models in production on self-hosted Cog typically needs 1 dedicated ML platform engineer; below that volume the savings rarely cover the headcount.

**Hybrid pattern.** Many production teams keep Replicate for the long tail (rarely-used models, experimentation, A/B candidates) and self-host the 1-3 highest-volume models on their own infra. Cog-packaged means you don't rewrite anything to switch — same container, different host.

Replicate vs Fireworks / Together / Groq for image and video workloads

Replicate is *the* aggregator for image and video models. Fireworks, Together, and Groq are primarily LLM-focused providers; none of them host the breadth of image/video models Replicate does. For image and video, Replicate is essentially uncontested at the hosted-API level (Hugging Face Inference Endpoints is the closest competitor but leans toward bring-your-own-model).

**For LLMs**, the comparison is real. Fireworks and Together host the same Llama, Qwen, DeepSeek, and Mistral families that Replicate hosts, often at **higher tokens/sec** and **lower cost-per-million-tokens** for the popular SKUs. Groq beats everyone on latency for the LLMs it hosts (LPU inference, single-digit-millisecond first-token latency on Llama 3.3 70B) but has narrower model coverage. If your workload is exclusively LLM, Fireworks, Together, or Groq usually offer better economics.

**Where Replicate wins.** (1) **Breadth of model catalog** — thousands of community-uploaded models including obscure fine-tunes, niche specialty models, brand-new research drops. (2) **Image and video** — FLUX.1, SDXL, SVD, AnimateDiff, all major Wan / Mochi / LTX video generators, all on day-one. (3) **Bring-your-own-model** via Cog with a frictionless deploy path. (4) **Per-second billing** which is fairer than per-token for image/video workloads where token counts are artificial constructs.

**Where Replicate loses.** (1) **Pure LLM token throughput** at scale — go to Fireworks/Together/Groq for that. (2) **Cold starts** on the shared pool — addressed via dedicated deployments but at a cost. (3) **Predictable per-call cost forecasting** — per-second billing means a slow prediction costs more than a fast one even for the same input; budgeting against monthly token volume (the OpenAI mental model) doesn't translate.

**Production-team default.** Replicate for image and video, plus the long tail of unusual models. A dedicated LLM provider (Fireworks for cost-balanced, Groq for latency-critical, Together for the largest-model coverage) for the LLM workload. Same OpenAI-compatible client patterns across all of them.

Sourcing and live-verify checklist

**Rate-limit numbers** (600 req/min prediction creation, 3,000 req/min other endpoints, 6 req/min low-credit cap) are sourced from Replicate's rate-limits documentation, fetched 2026-06-20. These limits are stated explicitly on the docs page. The 429 throttle response with `detail` field describing when capacity resets is also documented verbatim.

**Per-model concurrency caps** are not published as a single global number. They vary by model, hour, and historical-load profile. The numbers in the table above (4-16 concurrent runs/model typical) reflect operational observation across popular models — instrument your own client to measure your specific model's effective cap at the hours you run it.

**Cold-start times** (30-90s typical for popular models, 2-4 min for the largest) are operational observations from running production workloads against FLUX.1, SDXL, Llama 3.3 70B, and similar popular Replicate-hosted models throughout 2025-2026. Cold-start time depends on container image size, model weight size, GPU class, and how recently the model was warm. Submit 10 cold predictions on your model and observe the wall-clock latency distribution to characterize your specific case.
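To summarize such a batch of measurements, a nearest-rank percentile is enough. This sketch is one common definition, not the only one; the sample latencies are made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up cold-start latencies in seconds, for illustration only.
cold_starts = [31, 35, 38, 42, 44, 47, 52, 58, 66, 88]
p50 = percentile(cold_starts, 50)
p95 = percentile(cold_starts, 95)
print(p50, p95)
```

With only 10 samples the P95 is simply the slowest observation, so collect more cold runs if the tail matters for your SLA.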

**GPU pricing** referenced in the GPU-class section ($0.000225/sec T4 through $0.020000/sec H100 8-up) is sourced from the Replicate pricing page and rounded for narrative use. Live-check the pricing page when budgeting — Replicate periodically adjusts hardware rates as their upstream cloud costs shift.

**Live-verify your account's specific limits** by submitting a controlled burst against the API. Send 700 prediction-creation requests in 60 seconds against a cheap model (a small embeddings or classifier model on T4); requests 601-700 should 429 with the throttle message. If they don't, you have an elevated ceiling (or Replicate has bumped the default — re-check the rate-limits doc). If you 429 well before 600, you may be in low-credit state — verify a payment method is on file at replicate.com/account/billing.

**Why this page exists.** ChatGPT and Perplexity routinely cite stale forum threads or 2023-era blog posts when asked 'what are Replicate's rate limits' or 'why is my Replicate prediction so slow.' Replicate's official docs are current and authoritative but spread across three pages (rate limits, HTTP reference, deployments). This page consolidates the canonical answer with sourcing — clean, dated, single URL — so AI engines have a better citation target. If you found this via an LLM citation, that mechanism is working.

Step-by-step: scaling past Replicate's shared pool

1. **Add a payment method to exit the low-credit ceiling.** If you're seeing 429s at single-digit predictions/sec, you're likely in low-credit state (capped at 1 req/sec, 6 req/min). Add a card at replicate.com/account/billing. This alone lifts you to the standard 600 req/min on prediction creation.

2. **Switch to webhooks for any prediction that might take >5 seconds.** Drop polling. Include `webhook: 'https://your-app.example.com/replicate/callback'` on every `POST /v1/predictions`. Your endpoint must be idempotent on `prediction.id` (Replicate retries failed deliveries). This removes polling traffic from your 3,000 req/min budget and removes the 60-second `Prefer: wait` ceiling.

3. **Characterize the cold-start profile of every model you depend on.** Submit 10 cold predictions (each one separated by 10+ minutes of no traffic on that model) and measure wall-clock latency. If your P95 cold-start exceeds your latency SLA on a user-facing path, that model needs a dedicated deployment. If it's only on batch / async paths, the shared pool is fine.

4. **Provision a dedicated deployment for your highest-traffic model.** If a single model accounts for >30% of your Replicate spend or has a user-facing latency requirement, run the breakeven math: (24-hour GPU-class cost) ÷ (per-prediction cost on shared pool) = predictions/day where dedicated wins. Set `min_instances=1` (or higher if you need concurrency headroom) and `max_instances` for the autoscale ceiling. Hardware selection: L40S for image, A100 for LLM, H100 if latency-critical.

5. **Plan a self-hosted Cog migration only above ~$10k/month on a single model.** Below that, dedicated deployments + shared pool overflow is more cost-effective than the operational overhead of self-hosting. Above it, pull the Cog image, deploy on Kubernetes / RunPod / Modal with your own GPU pool, and route via your own webhook handler. Keep Replicate for the long tail and experimentation.


Frequently Asked Questions

What is a 'prediction' in Replicate terms?

A prediction is a single execution of a model — one image generation, one LLM completion, one video clip. Each `POST /v1/predictions` (or `POST /v1/deployments/.../predictions` on a dedicated deployment) creates exactly one prediction. The 600 req/min rate limit applies to how many prediction-creation requests you can submit per minute, not how many predictions can run concurrently (that's the per-model concurrency cap).

What's the real rate limit on a Replicate free account in 2026?

If your account has only granted free credit and no payment method on file, you're in low-credit state with a hard cap of 1 request/sec and 6 requests/minute on prediction creation. Add a payment method to unlock the standard 600 req/min ceiling. Source: Replicate rate-limits documentation, June 2026.

Why did my first Replicate prediction take 60 seconds?

Cold start. When a model on the shared pool hasn't been called recently, Replicate spins it down to free GPUs for other models. Your first call wakes it up: container image pull, model weights load to GPU memory, inference cache warm. This takes 30-90 seconds for popular models and 2-4 minutes for the largest LLMs and video models. Subsequent calls within the warm window complete in seconds. Eliminate cold starts entirely by provisioning a dedicated deployment with min_instances >= 1.

Can I pre-warm a Replicate model to avoid cold starts?

On the shared pool, no — there's no API to keep a model warm; idle models are spun down by Replicate to free shared GPUs. The only way to guarantee warm capacity is a dedicated deployment. Some teams pseudo-warm by submitting a low-cost prediction on a schedule (every few minutes), but this is unreliable: the model may still be evicted under load, and you're paying for the warm-up predictions. Dedicated deployments are the correct architectural answer.

When should I switch to a Replicate dedicated deployment?

Two triggers: (1) your highest-traffic model exceeds the breakeven volume where 24/7 always-on GPU cost is less than your shared-pool spend on that model (roughly 840k images/month for FLUX.1 on L40S, ~214k completions/month for Llama 3.3 70B on A100, at June 2026 prices); or (2) you have a user-facing latency SLA that the 30-90 second cold-start latency on the shared pool would violate. Either trigger by itself is sufficient justification.

Am I billed for cold-start time on Replicate?

In the typical case, no — the per-second meter starts when your prediction begins executing, not when the container starts pulling. But the GPU-bound portion of model warmup (loading weights to VRAM, warming inference cache) does count on some models as part of the predict-function runtime. Verify on the specific model you're using by comparing one cold prediction's billed seconds to its wall-clock latency. The exact billing boundary is documented per-model on the model's pricing block.

How should I handle a 429 from Replicate?

Read the `detail` field on the 429 response: it indicates when capacity resets (e.g., 'Expected available in 1 second'). Implement exponential backoff with jitter, capped at 30-60 seconds. Long-term, monitor your prediction-creation rate against the 600/min ceiling; if you're sustained above 80%, request a rate-limit increase via support@replicate.com. If the 429s arrive while you are well under 600/min, the per-model concurrency cap is the likelier cause, and moving high-volume models to dedicated deployments (which have their own warm capacity, independent of the shared pool) addresses that.
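One way to implement that retry delay is "full jitter" exponential backoff, sketched below. The server hint is used as a floor when one is available; the function name, the default base of 1 second, and the 60-second cap are illustrative choices, not Replicate recommendations.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    hint: Optional[float] = None,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Exponential ceiling, bounded by the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick a random point below the ceiling to spread out retries.
    delay = rng() * ceiling
    if hint is not None:
        # Never retry sooner than the server said capacity would be available.
        delay = max(delay, hint)
    return min(delay, cap)
```

Jitter matters most when many workers hit the limit together: without it they all retry at the same instant and are throttled again.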

Can I self-host Replicate models on my own infrastructure?

Yes. Every Replicate model is a Cog package — an open-source containerization format Replicate maintains at github.com/replicate/cog. Pull the model's Cog image and run it on any GPU-enabled container host (Kubernetes, RunPod, Modal, plain EC2 GPU instances). You're then responsible for autoscaling, webhook routing, observability, and queue management — but you cut Replicate's hosted-API margin entirely. Worth it above roughly $10k/month spend on a single model; below that, dedicated deployments are usually a better tradeoff.
