By The DDH Team · Digital Dashboard Hub

Fine-Tuning Cost by Model (2026): Per-Token Pricing Across 20+ Models

Fine-tuning costs in 2026 vary by 30x across models and platforms — from $0.40 per 1M training tokens (Llama 4 8B LoRA on Together) to $120+ per 1M training tokens (GPT-5 full fine-tune on OpenAI). The math is straightforward once you separate training cost, inference markup, and deployment hourly cost. This page collects per-token rates and per-hour serving floors for every major hosted fine-tunable model in June 2026, with a worked example for each. Use the table below to estimate your specific spend, and the calculator below for live math.

By DDH Research Team at Digital Dashboard Hub·Updated June 21, 2026

Browse all 40+ free prompt tools

Fine-tuning a model in 2026 has three cost components: (1) the training run itself, charged per training token; (2) the per-token inference markup on the fine-tuned model versus the base model; and (3) any deployment hourly cost — the per-hour floor for keeping the fine-tuned model warm and ready to serve. Different vendors structure these three costs differently, and the combination can move 5-10x for the same training run depending on platform choice.

This page collects June 2026 prices for every major hosted fine-tunable model and platform, with a worked example for a standardized training run (5,000 examples × 1,500 tokens average × 3 epochs = 22.5M training tokens) so you can compare apples to apples. Source links go to each vendor's live pricing page for verification.

Estimate inference markup with our LoRA training cost on H100 and synthetic data inputs with synthetic data cost per 1K examples. For method choice, see LoRA vs QLoRA vs full fine-tuning cost and DPO vs RLHF vs ORPO 2026.

Digital Dashboard Hub

Calculator told you what GPT-5 / Claude / Gemini costs. DDH's AI Prompt Builder writes prompts cheap-by-construction — cache-anchored prefix, batch-ready, output capped — so the same task runs at a fraction of the price the calc shows.

Start free 14-day trial — AICHAT30 = 30% off Pro for 3 months. →

Fine-tuning training cost per 1M tokens by model and platform, June 2026

Feature	Model + Platform	Method	Per 1M training tokens
GPT-5 (OpenAI)	SFT	$125 / 1M	$2,812
GPT-5 mini (OpenAI)	SFT	$25 / 1M	$562
GPT-5 nano (OpenAI)	SFT	$12 / 1M	$270
GPT-5 mini DPO (OpenAI)	DPO	$45 / 1M	$1,012
GPT-5 mini RFT (OpenAI)	RFT	$80 / 1M + grader calls	$1,800 + grader
GPT-4o (OpenAI)	SFT	$25 / 1M	$562
GPT-4o-mini (OpenAI)	SFT	$3 / 1M	$67
Claude Haiku 4.5 (Bedrock)	SFT	$45 / 1M	$1,012
Claude Sonnet 4.6 (Bedrock)	SFT	$90 / 1M	$2,025
Gemini 2.5 Flash (Vertex)	SFT	$8 / 1M (free tier covers first 1M tokens monthly)	$180 (less if within free tier)
Gemini 2.5 Pro preview (Vertex)	SFT	$40 / 1M	$900
Llama 4 8B LoRA (Together)	LoRA	$0.40 / 1M	$9
Llama 4 70B LoRA (Together)	LoRA	$1.20 / 1M	$27
Llama 4 70B full (Together)	Full	$4.00 / 1M	$90
Llama 4 405B LoRA (Together)	LoRA	$6.00 / 1M	$135
Llama 4 70B LoRA (Fireworks)	LoRA	$1.80 / 1M	$41
Llama 4 70B DPO (Together)	DPO	$2.40 / 1M	$54
Mistral 8x22B LoRA (Together)	LoRA	$1.00 / 1M	$23
Qwen 2.5 32B LoRA (Together)	LoRA	$0.80 / 1M	$18
DeepSeek-V3 LoRA (Together)	LoRA	$1.50 / 1M	$34

Sources as of June 2026: OpenAI fine-tuning pricing (https://openai.com/api/pricing), Anthropic via AWS Bedrock pricing (https://aws.amazon.com/bedrock/pricing/), Google Vertex AI pricing (https://cloud.google.com/vertex-ai/pricing), Together AI pricing (https://together.ai/pricing), Fireworks AI pricing (https://fireworks.ai/pricing). Worked example assumes 5,000 examples × 1,500 tokens average × 3 epochs = 22.5M training tokens. Per-token rates are subject to vendor changes — verify before procurement. Deployment hourly costs (Bedrock provisioned throughput, Vertex endpoint hours) are additional and not included in worked example.

Why the cost range is 30x

The 30x span between the cheapest fine-tune (Llama 4 8B LoRA on Together at $9 for a typical job) and the most expensive (GPT-5 full SFT at $2,812 for the same job) reflects three different sources of cost difference.

**Model size**: bigger models are more expensive to train because forward/backward passes touch more parameters and weights consume more GPU time. GPT-5 is several hundred billion parameters; Llama 4 8B is 8 billion. The 30-100x size ratio drives 5-15x of the cost difference.

**Method efficiency**: LoRA touches ~0.1-1% of the parameters during training while full fine-tuning touches all of them. The 20-30x compute efficiency gain of LoRA over full fine-tuning is the largest single cost lever in fine-tuning. See our LoRA vs QLoRA vs full fine-tuning cost deep-dive.

**Platform margin and infrastructure**: hosted platforms charge a margin on top of raw GPU cost. Together AI runs at the lowest margin (closest-to-cost per-token rates), Fireworks at moderate margin (premium for the serving stack), OpenAI and Anthropic at the highest margin (frontier-model premium, brand value, and tight integration). Self-hosting on raw GPU hours (Lambda, CoreWeave, RunPod) is the cheapest of all but adds engineering cost.

Inference markup — the cost that runs forever

Training cost is one-time. Inference cost compounds with traffic and often exceeds the training cost within weeks of going live.

**OpenAI fine-tuned models** are priced at approximately 2x base model input and 1.5x base model output. For a GPT-5 mini fine-tune, base model is $0.40/1M input + $1.60/1M output; fine-tune is approximately $0.80/1M input + $2.40/1M output. At 100M tokens/month inference (split 30/70 input/output), monthly inference cost on the fine-tune is approximately $192 versus $124 on the base — a $68/month premium that adds up.

**Anthropic fine-tuned models** on Bedrock charge approximately 1.5x base model price for inference, plus the provisioned throughput hourly cost. Provisioned throughput for Sonnet 4.6 runs $15-25/hour for one model unit, so always-on serving adds $11K-18K/month minimum independent of traffic.

**Google Vertex fine-tuned models** charge approximately 1.0-1.2x base model price for inference plus the endpoint hourly cost. A Gemini 2.5 Flash endpoint runs at the published Vertex AI prediction rate per hour — small but non-zero, typically $0.10-0.50/hour depending on machine type.

**Together / Fireworks fine-tuned models** charge per token at rates close to or matching the base model — no inference markup. Together Llama 4 70B is $0.88/1M in/out for both base and fine-tunes. Fireworks similar. This is one of the largest cost advantages of open-weight platforms versus closed-source frontier vendors for high-traffic production deployments.

Deployment hourly cost — the hidden floor

Bedrock provisioned throughput and Vertex AI endpoint hours are charged whether or not traffic flows. This deployment floor is the single biggest cost trap in fine-tuning workflows.

**AWS Bedrock provisioned throughput** for Claude Haiku 4.5 fine-tune: approximately $8-12/hour per model unit. For Claude Sonnet 4.6 fine-tune: approximately $15-25/hour per model unit. Always-on for one month: ~$5,800-18,000 depending on model. Yes, that is per month of *idle* serving capacity. For low-traffic production deployments, this floor can dominate total spend by 10-50x over the training cost itself.

**Vertex AI endpoint hours** for Gemini 2.5 Flash fine-tune endpoints: approximately $0.10-0.50/hour depending on machine type (n1-standard, n2, accelerated). Always-on for one month: $73-365. Lower than Bedrock but still non-trivial.

**OpenAI fine-tuned models** have no per-hour serving floor — they auto-deploy and you pay only per inference token. This is the largest cost advantage of OpenAI fine-tuning for low- to medium-traffic production deployments.

**Together / Fireworks serverless inference** has no per-hour serving floor on the standard serverless tier — pay only per token. Dedicated endpoints (for high-traffic or low-latency requirements) reintroduce a per-hour cost.

**Replicate** charges per second of GPU runtime. Cold starts (30-60 seconds) add cost if the model is not warm; always-on warm pools reintroduce hourly floors.

Total cost of ownership — the worked example

Putting all three components together for a realistic workload: 5,000-example fine-tune (22.5M training tokens), deployed for 6 months, serving 10M inference tokens/month.

**OpenAI GPT-5 mini SFT**: $562 training + (10M × 6 months × $0.80/1M input + $2.40/1M output split 30/70) = $562 + ~$1,440 inference = ~$2,000 total over 6 months. No hourly floor.

**Anthropic Claude Haiku 4.5 SFT (Bedrock)**: $1,012 training + provisioned throughput $10/hour × 24 × 30 × 6 = $43,200 hourly floor + ~$3,150 inference (at Bedrock per-token prices) = ~$47,400 total over 6 months. The hourly floor dominates.

**Google Gemini 2.5 Flash SFT (Vertex)**: $0-180 training (depending on free tier) + endpoint hours $0.25/hour × 24 × 30 × 6 = $1,080 + ~$210 inference = ~$1,470 total over 6 months.

**Together AI Llama 4 70B LoRA**: $27 training + (60M total × $0.88/1M) ~ $53 inference = ~$80 total over 6 months. No hourly floor on serverless.

**The takeaway**: for low-traffic production deployments (10M tokens/month is light traffic), open-weight Llama 4 70B LoRA on Together is dramatically the cheapest total cost. Anthropic via Bedrock is dramatically the most expensive because of the provisioned throughput floor. OpenAI's auto-deployed fine-tunes are in the middle. Run the math at *your* traffic volume — the crossover where Bedrock becomes competitive is around 1B tokens/month.

Calculator inputs that matter

When estimating fine-tune cost for your workload, the inputs that drive most of the variance are:

**Training tokens** = (number of examples) × (average tokens per example) × (number of epochs). Token count includes both prompt and completion. For chat-format data, tokens are typed including system messages and role separators. Underestimating this is the most common error.

**Method choice** = SFT vs LoRA vs full fine-tuning vs DPO. Multiplier is approximately 1.0 (LoRA baseline), 3-5x (full fine-tuning), 1.5-2x (DPO).

**Projected monthly inference tokens** = (requests per second) × (seconds in a month) × (average tokens per request). Include both input and output tokens.

**Deployment duration** = how long the fine-tune will be in production. Multi-year deployments accumulate inference and hourly costs that can dwarf training cost.

**Hourly serving floor** = does the platform have a per-hour deployment cost? Bedrock yes ($8-25/hour per model unit), Vertex yes ($0.10-0.50/hour), OpenAI no, Together/Fireworks no on serverless.

**Inference markup** = ratio of fine-tuned model inference price to base model inference price. OpenAI ~2x in / 1.5x out; Anthropic ~1.5x; Vertex ~1.0-1.2x; Together/Fireworks ~1.0x.

Practical cost-control playbook

Five practical levers to reduce total fine-tune cost without sacrificing quality.

**Lever 1: Use LoRA instead of full fine-tuning unless quality forces otherwise.** LoRA is 20-30x cheaper than full fine-tuning and matches quality within 1-3 percentage points on most tasks. Default to LoRA; promote to full only when LoRA quality plateaus below your target.

**Lever 2: Pick the cheapest model that meets your quality bar.** GPT-5 mini fine-tunes are often within 5 percentage points of GPT-5 full fine-tunes at 4x lower cost. Gemini 2.5 Flash fine-tunes are within 8-15 percentage points of Gemini 2.5 Pro at a fraction of the cost. Always run a smaller-model fine-tune as a baseline before committing to the most expensive option.

**Lever 3: Avoid Bedrock provisioned throughput for low-traffic deployments.** If you are serving Claude Sonnet 4.6 fine-tune at less than 100 RPS, the provisioned throughput hourly cost will dominate. Either commit to high traffic, use batch inference (where supported), or reconsider Anthropic via Bedrock as the right platform.

**Lever 4: Use validation loss to stop training early.** Default 3 epochs is reasonable but watch the validation loss curve — if it plateaus or rises after epoch 1, additional epochs are wasted training tokens. Most platforms allow early stopping.

**Lever 5: Iterate on small samples first.** Run a 500-example baseline at $5-15 cost on any platform to validate the workflow and rough quality direction before committing to a 50,000-example $500-2,000 production run.

When fine-tuning is not the right spend

The cheapest fine-tune cost is the one you do not pay because you found a better lever. Three patterns where teams should not fine-tune:

**Prompt engineering has not been exhausted.** A well-engineered system prompt with 3-5 in-context examples often closes 70-80% of the quality gap that fine-tuning would close. Always exhaust prompt engineering first — it is free, fast, and iterable in seconds rather than hours.

**The real problem is factual recall, not behavior.** Fine-tuning shifts a model's behavior distribution; it does not give the model new facts. For up-to-date knowledge, factual grounding, or document-specific Q&A, RAG (Retrieval-Augmented Generation) is the right tool. See when to fine-tune vs RAG vs prompt engineer.

**You have fewer than 200 examples.** Below 200 examples, fine-tuning rarely produces measurable quality improvement on most tasks. Spend the training budget on either getting more examples or on prompt engineering.

Estimating fine-tune cost for your workload

1
Count your training tokens accurately
Training tokens = examples × average tokens per example × epochs. For chat-format data, tokens include system messages, role separators, and any special tokens — not just visible prose. Use the vendor's tokenizer to count (OpenAI tiktoken, Anthropic anthropic-tokenizer, Hugging Face tokenizers). Underestimating training tokens is the most common cost-estimation error.
2
Project monthly inference traffic
Inference cost = monthly requests × average tokens per request × per-token rate. Be honest about traffic projections. Most teams overestimate by 5-10x early in product life; account for ramp curves rather than steady-state launch numbers. The crossover where deployment hourly cost becomes cheaper than per-token serving depends entirely on this number.
3
Account for deployment hourly cost on Bedrock/Vertex
If you are using Anthropic Claude via Bedrock or Google Gemini via Vertex, calculate provisioned throughput or endpoint hours separately from training and per-token inference. Always-on for one month is $5,800-18,000 on Bedrock for typical Claude fine-tunes, $73-365 on Vertex for typical Gemini endpoints. This number dominates total cost for low-traffic production deployments.
4
Compare TCO across platforms before committing
For a 6-month deployment at projected traffic, compute total cost (training + inference + hourly serving) for at least 2-3 candidate platforms. Differences of 5-50x in 6-month TCO are common across platforms for the same workload — particularly the open-weight platforms (Together, Fireworks) vs closed-source frontier vendors (OpenAI, Anthropic via Bedrock) at low traffic.
5
Pick the cheapest model that meets your quality bar
Run a baseline fine-tune on the smallest model in the family (GPT-5 nano, Llama 4 8B, Gemini 2.5 Flash) and measure quality against your held-out eval set. Only promote to larger models if quality is below target. Most production fine-tunes leave significant cost savings on the table by jumping straight to the largest model without testing whether a smaller one suffices.

Digital Dashboard Hub

The numbers above are the ceiling — what the model bills before you structure the prompt for it. DDH writes cache-anchored, model-tuned prompts so your real bill lives near the floor.

Try DDH's AI Prompt Builder — free 14 days, no card. AICHAT30 = 30% off Pro. →

Continue your research on adjacent topics — calculators, rate limits, head-to-head comparisons, and guides.

Related prompt tools

LoRA training cost on H100→Synthetic data cost per 1K examples→LoRA vs QLoRA vs full fine-tuning cost→OpenAI vs Anthropic vs Google fine-tuning→Fine-tune ROI real numbers 2026→

Use the data programmatically

Every page on this site is also exposed as a free, CORS-open JSON endpoint. No auth, no rate limit (fair-use, please cache). License is CC-BY-4.0 — link back to attribution.canonicalUrl in the response.

Endpoint: https://aipromptshub.co/api/calc/fine-tuning-cost-by-model-2026

curl

curl -s 'https://aipromptshub.co/api/calc/fine-tuning-cost-by-model-2026' | jq .

Python

import requests

r = requests.get("https://aipromptshub.co/api/calc/fine-tuning-cost-by-model-2026", timeout=10)
r.raise_for_status()
data = r.json()
print(data["title"])
for source in data.get("sources", []):
    print("source:", source)

JavaScript / Node

// Node 20+ / modern browser
const res = await fetch("https://aipromptshub.co/api/calc/fine-tuning-cost-by-model-2026");
if (!res.ok) throw new Error("HTTP " + res.status);
const fine_tuning_cost_by_model_2026 = await res.json();
console.log(fine_tuning_cost_by_model_2026.title);
for (const source of fine_tuning_cost_by_model_2026.sources ?? []) {
  console.log("source:", source);
}

Spec: /api/openapi.yaml · Docs: /api/docs

Frequently Asked Questions

What is the cheapest model to fine-tune in 2026?

Llama 4 8B LoRA on Together AI at approximately $0.40 per 1M training tokens — a typical 22.5M-token job costs ~$9. For frontier-model fine-tunes specifically, GPT-4o-mini SFT on OpenAI is the cheapest at ~$3/1M tokens (~$67 for a typical job). Gemini 2.5 Flash on Vertex is essentially free for small jobs that fit within the monthly free token allowance.

Why is Anthropic fine-tuning so much more expensive than OpenAI or Google?

Two reasons. First, per-token training rates are higher (~$45/1M for Haiku 4.5, ~$90/1M for Sonnet 4.6 vs $25/1M for GPT-5 mini). Second, Anthropic fine-tuning via Bedrock requires provisioned throughput for serving, which adds a $5,800-18,000/month deployment floor independent of traffic. For low-traffic production deployments, the deployment floor dominates total cost. Anthropic via Vertex AI offers a similar provisioned-throughput model on Vertex endpoints.

Does fine-tuned inference cost more than base model inference?

Yes on closed-source platforms (OpenAI fine-tunes cost ~2x base for input, 1.5x for output; Anthropic ~1.5x base on Bedrock; Vertex ~1.0-1.2x base). No on open-weight platforms (Together and Fireworks charge the same per-token rate for base and fine-tuned models). This is one of the largest TCO advantages of open-weight platforms at production scale.

How do I avoid the Bedrock provisioned throughput cost trap?

Three options. (1) Use OpenAI or Google Vertex Gemini fine-tunes instead — neither has the same magnitude of hourly serving floor. (2) Commit to high traffic (100+ RPS sustained) where provisioned throughput per-token-equivalent cost becomes competitive. (3) Use Bedrock batch inference where supported — large async batches do not require always-on provisioned throughput. For most teams, option 1 (skip Bedrock for fine-tunes) is the cleanest answer.

What's the cheapest way to fine-tune Llama 4 70B?

Together AI Llama 4 70B LoRA at approximately $1.20 per 1M training tokens — a typical 22.5M-token job costs ~$27. Self-hosting on rented H100s (Lambda, CoreWeave, RunPod) at ~$2.50/hour spot can be slightly cheaper at scale (~$15-20 per job for the same workload) but adds significant engineering cost. For most teams, Together AI's hosted LoRA is the practical answer.

Are there free fine-tuning options?

Google Vertex AI Gemini 2.5 Flash SFT includes a substantial monthly free training token allowance — small experiments (a few hundred thousand tokens) often train at zero cost. Together AI, Fireworks AI, and Replicate offer $5-10 credit at signup, enough for several small baseline jobs. OpenAI and Anthropic do not offer free training credits — usage-based from token #1.

How does DPO cost compare to SFT cost?

DPO is approximately 1.5-2x SFT cost on the same model and data. The reason is that DPO needs to compute forward passes through both the trained model and the reference model on each training step, roughly doubling the forward-pass compute. ORPO is cheaper than DPO (1.2-1.5x SFT) because it does not need the reference model. RLHF is dramatically more expensive (5-10x SFT) because of the reward model training and PPO rollouts. See DPO vs RLHF vs ORPO 2026.

Should I fine-tune at all, or just write a better prompt?

Almost always write a better prompt first. Prompt engineering with 3-5 in-context examples typically closes 70-80% of the quality gap that fine-tuning would close, at zero training cost and seconds of iteration time. Only fine-tune when prompt engineering plateaus on a specific behavior pattern that examples can teach better than instructions, and the per-token inference cost savings of a cheaper-after-tune model justify the training spend. See when to fine-tune vs RAG vs prompt engineer.

You priced the fine-tune. Now write the prompts that make the trained model pay off.

Fine-tuning is expensive. A great system prompt is free. Get both right: AI Prompt Generator writes production-ready prompts tuned to your specific model + use case, then your fine-tune deepens the behaviors the prompt already establishes. 14-day free trial.

Browse all prompt tools →