Replicate's pricing model: per-second GPU time, not per-token
Every other major inference API in this comparison set (OpenAI, Anthropic, Fireworks, Together, Groq for LLMs) bills primarily on **tokens in / tokens out**. Replicate doesn't. Replicate bills on **wall-clock seconds of GPU time** the model spends executing your prediction, multiplied by the per-second rate of the GPU class the model runs on. A FLUX.1 image that takes **3.2 seconds** on an NVIDIA L40S at roughly **$0.000975/sec** costs about **$0.003**. An SDXL generation that takes **6 seconds** on the same hardware costs about **$0.006**. A Llama 3.3 70B completion that runs **12 seconds** on an A100 (~$0.001400/sec) costs about **$0.017**.
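The arithmetic behind those three figures is a single multiplication. The sketch below reproduces it in Python. The rates are the approximate ones quoted above, and the names `GPU_RATE_PER_SEC` and `prediction_cost` are illustrative, not part of any Replicate client library.

```python
# Approximate per-second rates quoted in the text; check the model page
# for current values before relying on them.
GPU_RATE_PER_SEC = {
    "l40s": 0.000975,
    "a100": 0.001400,
}


def prediction_cost(seconds: float, gpu: str) -> float:
    """Estimated cost in USD: run time multiplied by the GPU class's rate."""
    return seconds * GPU_RATE_PER_SEC[gpu]


# The three examples from the text: (label, run time in seconds, GPU class).
examples = [
    ("FLUX.1 image", 3.2, "l40s"),
    ("SDXL image", 6.0, "l40s"),
    ("Llama 3.3 70B completion", 12.0, "a100"),
]

for label, seconds, gpu in examples:
    print(f"{label}: about ${prediction_cost(seconds, gpu):.3f}")
```

Feeding measured run times from your own predictions into `prediction_cost` gives a per-call estimate you can sum into a budget.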
This changes rate-limit planning fundamentally. On OpenAI, your bill scales with the number of tokens in and out. On Replicate, **your bill scales with how long the model runs**, which is a function of (a) input size, (b) sampler steps or generated tokens, (c) the GPU class it runs on, and (d) the model's underlying efficiency. Because run time grows roughly linearly with step count, a poorly tuned prompt that needs 50 sampler steps to converge costs about 2.5x what a well-tuned prompt needing 20 steps costs on the same hardware. Steps and tokens are levers; so is GPU class selection.
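To see how the step count acts as a cost lever, the sketch below scales a measured cost to a different number of sampler steps. It assumes run time is linear in step count and ignores fixed overhead such as model setup or image decoding, so treat the result as a rough upper-level estimate. The function name `scaled_cost` and the 2-second baseline are hypothetical, chosen only for illustration.

```python
def scaled_cost(base_cost: float, base_steps: int, new_steps: int) -> float:
    """Scale a measured cost to a new step count.

    Assumes run time (and therefore cost) is linear in sampler steps,
    with no fixed per-prediction overhead.
    """
    if base_steps <= 0:
        raise ValueError("base_steps must be positive")
    return base_cost * new_steps / base_steps


# Hypothetical baseline: a 20-step run measured at 2.0 s on an L40S,
# using the approximate $0.000975/sec rate quoted earlier.
cost_20_steps = 2.0 * 0.000975
cost_50_steps = scaled_cost(cost_20_steps, base_steps=20, new_steps=50)

print(f"20 steps: ${cost_20_steps:.5f}")
print(f"50 steps: ${cost_50_steps:.5f}")
print(f"ratio: {cost_50_steps / cost_20_steps:.1f}x")
```

In practice the ratio will be somewhat below 2.5x when fixed overhead is a meaningful share of run time, so measuring a few real predictions is the safer basis for planning.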
For LLMs hosted on Replicate (the Llama, Qwen, DeepSeek, and Mistral families), the billing unit varies: the most popular hosted versions are sometimes billed **per output token**, while the long tail is billed **per second of GPU time**. Check each model's page: the pricing block at the top of replicate.com/<owner>/<model> shows the billing unit explicitly. Image and video models are almost always billed per second of GPU time.
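Since the billing unit differs per model, a cost estimator needs to branch on it. The following is a minimal sketch of that idea, not a representation of Replicate's API: the `Pricing` class and `estimate_cost` function are hypothetical, and the per-token rate in the example is a placeholder rather than a real price. You would fill in the unit and rate by reading the model page.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Pricing:
    """Billing unit and rate, as read manually from a model's page."""
    unit: Literal["output_token", "gpu_second"]
    rate_usd: float  # USD per output token, or USD per GPU-second


def estimate_cost(pricing: Pricing, *, output_tokens: int = 0,
                  gpu_seconds: float = 0.0) -> float:
    """Estimate one prediction's cost under either billing unit."""
    if pricing.unit == "output_token":
        return output_tokens * pricing.rate_usd
    if pricing.unit == "gpu_second":
        return gpu_seconds * pricing.rate_usd
    raise ValueError(f"unknown billing unit: {pricing.unit!r}")


# Per-second example uses the approximate A100 rate quoted earlier.
per_second = Pricing(unit="gpu_second", rate_usd=0.001400)
# PLACEHOLDER rate for illustration only; not a real Replicate price.
per_token = Pricing(unit="output_token", rate_usd=1e-6)

print(estimate_cost(per_second, gpu_seconds=12.0))
print(estimate_cost(per_token, output_tokens=500))
```

Keeping the unit explicit in the data makes it harder to mix up the two models, for example by multiplying a per-token rate by seconds.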