Skip to contentNew: Does ChatGPT recommend your brand? Free 60-second AI visibility check →
By The DDH Team · Digital Dashboard Hub

Fine-Tuning Cost Calculator 2026: Train + Serve Pricing Across Every Provider

By The DDH Team at Digital Dashboard HubUpdated

Stop writing AI prompts from scratch.

Tell us your business + your task + your model. We write the prompt — perfectly tuned for ChatGPT, Claude, Grok, Gemini, Midjourney, or any model. Plus 500+ pre-built prompts in your library.

14 days, no card. Cancel in 2 clicks.

Fine-tuning has two cost lines: training (paid once to produce the custom model) and served inference (paid every time you call the model afterward, usually at a markup over the base model rate). In 2026, training rates run $0.50-$25 per 1M training tokens depending on model size, while served-inference rates run 1.5-3x the base model rate on most providers. A few providers also charge a per-day hosting fee for keeping your custom model warm.

Fine-tuning makes economic sense when: you have enough volume for the inference markup to amortize prompt-engineering savings, the task benefits from style or format control that prompts cannot achieve cleanly, or you are running on a smaller cheaper base model that needs to match a larger model's quality on a specific task. Below is the full price table and worked $ math for each canonical case. Quick-estimate base inference cost with our AI prompt cost calculator, or grab the free 2026 fine-tuning cheat sheet PDF.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card.

Fine-tuning training & served-inference pricing — June 2026

Feature
Training $/1M
Served input $/1M
Served output $/1M
Base inference $/1M (in/out)
Hosting
OpenAI gpt-5.4$25.00$3.75$22.50$2.50 / $15.00Included
OpenAI gpt-5.4-mini$8.00$1.13$6.75$0.75 / $4.50Included
OpenAI gpt-5.4-nano$2.50$0.30$1.88$0.20 / $1.25Included
OpenAI gpt-4.1-mini$4.00$0.60$2.40$0.40 / $1.60Included
OpenAI gpt-4.1-nano$1.50$0.15$0.60$0.10 / $0.40Included
Anthropic Claude Haiku 4.5 (Bedrock)$10.00$1.50$7.50$1.00 / $5.00$0.0001/sec hosting after training
Google Gemini 2.5 Flash$3.00$0.30$2.50$0.30 / $2.50Free hosting
Google Gemini 2.5 Flash-Lite$1.50$0.10$0.40$0.10 / $0.40Free hosting
Mistral Small fine-tune$1.00$0.30$0.90$0.30 / $0.90$2/month per fine-tune
Mistral Medium fine-tune$4.00$2.10$6.30$2.10 / $6.30$4/month per fine-tune
Together AI Llama 3.3-70B$0.90$0.88$0.88$0.88 / $0.88Free hosting
Together AI Llama 4 Scout$2.50$1.30$1.30$1.30 / $1.30Free hosting
Cohere Command R7B fine-tune$3.00$0.50$1.50$0.50 / $1.50Free hosting

Sources, as of June 2026: OpenAI fine-tuning (https://platform.openai.com/docs/guides/fine-tuning), Anthropic + AWS Bedrock fine-tuning (https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization.html), Google Vertex AI fine-tuning (https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-models), Mistral fine-tuning (https://docs.mistral.ai/capabilities/finetuning/), Together AI (https://docs.together.ai/docs/fine-tuning-overview), Cohere (https://docs.cohere.com/docs/fine-tuning). Training rates are quoted per 1M training tokens (sum of input + output tokens across the dataset, multiplied by epoch count). Served-inference markup over base varies by provider — Mistral, Google, and Together charge near-parity with base; OpenAI charges 1.5x on input, 1.5x on output.

How fine-tuning is billed

Two billing lines, sometimes three. Training is metered per 1M training tokens, where 'training tokens' equals (input + output tokens in your dataset) × (number of training epochs). A 100k-example dataset with 1,000 tokens per example and 3 training epochs bills 300M training tokens.

Served inference is metered per 1M input and 1M output tokens, just like the base model — but at a markup. OpenAI charges 1.5x base input and 1.5x base output for fine-tuned model inference. Google, Mistral, and Together typically charge at or near base parity. Anthropic via Bedrock applies a 1.5x markup similar to OpenAI.

Hosting fees apply on a few providers. Mistral charges a flat monthly per-fine-tune fee ($2-4/month). Anthropic via Bedrock charges per-second for the deployed model unit (typically $0.0001/sec or about $260/month at 24/7 uptime). OpenAI, Google, and Together include hosting in the inference price.

The full formula:

``` training_cost = (dataset_tokens × epochs / 1,000,000) × training_price serve_cost = (monthly_input_tokens / 1,000,000) × ft_input_price + (monthly_output_tokens / 1,000,000) × ft_output_price hosting = per-day or per-month fee (if applicable) total_monthly = serve_cost + hosting + (training_cost / amortization_months) ```


Worked example 1: training cost across the lineup

Reference dataset: 10,000 examples, 800 tokens per example (prompt + completion), 3 epochs = 24M training tokens.

OpenAI gpt-5.4-mini: 24 × $8 = $192. OpenAI gpt-5.4: 24 × $25 = $600. OpenAI gpt-4.1-nano: 24 × $1.50 = $36. Google Gemini 2.5 Flash: 24 × $3 = $72. Mistral Small: 24 × $1 = $24. Together Llama 3.3-70B: 24 × $0.90 = $21.60. Together Llama 4 Scout: 24 × $2.50 = $60. Anthropic Claude Haiku 4.5 (Bedrock): 24 × $10 = $240.

Training cost is small relative to typical inference bills at production scale. For a workload that runs $5,000/month on inference, a $192 training cost amortizes in days. The decision rarely turns on training cost; it turns on whether served inference is cheaper than base + prompt-engineering, and whether quality improves enough to justify the operational complexity.

Open-source fine-tuning on Together is the price leader at $0.90/1M for Llama 3.3-70B. If you can run that quality bar, training a 24M-token dataset for $21.60 is essentially free at production scale.


Worked example 2: monthly served-inference cost

Reference monthly workload: 100k API calls × 1,000 input + 500 output tokens = 100M input + 50M output tokens.

Base gpt-5.4-mini: 100 × $0.75 + 50 × $4.50 = $75 + $225 = $300/month. Fine-tuned gpt-5.4-mini: 100 × $1.13 + 50 × $6.75 = $113 + $337.50 = $450.50/month. The fine-tuned markup costs +$150.50/month.

For fine-tuning to be net cheaper than base + prompt-engineering, the fine-tuned model needs to either eliminate enough prompt tokens to recoup the markup, or replace a more expensive base model. Concretely: if fine-tuning gpt-5.4-mini lets you stop using gpt-5.5 ($5/$30), you save 100 × ($5 - $1.13) + 50 × ($30 - $6.75) = $387 + $1,162.50 = $1,549.50/month over base gpt-5.5. Even after the $150.50 markup over base gpt-5.4-mini, that's a $1,400+/month net win.

Open-source via Together at near-parity inference: 100 × $0.88 + 50 × $0.88 = $132/month. Substantially cheaper than fine-tuned OpenAI mid-tier at $450, though you trade off ecosystem features and operational simplicity.


When fine-tuning is worth the operational overhead

Five canonical cases where fine-tuning pays. First, classification or extraction tasks where a fine-tuned small model matches a base mid-tier model — typical 2026 case: fine-tuning gpt-5.4-nano on 5,000 labeled examples to match gpt-5.4-mini quality on a specific extraction task. Inference cost drops 3x.

Second, style or voice consistency that few-shot prompts cannot fully capture — fine-tuning a small model on 1,000 examples of brand voice produces tighter on-brand output than even a 10-shot prompt on a base model.

Third, output-format strictness. JSON schema adherence, custom DSL, deterministic field ordering — fine-tuning produces more reliable structured output than schema-guided prompting on most tasks.

Fourth, prompt-token reduction at high volume. A fine-tuned model with the instructions baked into weights can serve the same task with a 50-token prompt that a base model needs 1,500 tokens for. At 10M calls/month, the savings dwarf the inference markup.

Fifth, domain-specific knowledge that grounding cannot solve cleanly — fine-tuning on a corpus of internal Slack conversations or company-specific terminology, where retrieval misses the long tail.

Anti-cases: tasks where a top-tier base model already hits the quality bar (the markup never pays back), tasks with very low volume (training cost dominates), tasks where the underlying data changes weekly (you would have to retrain constantly), and tasks where output diversity matters (fine-tuning narrows variance).


Open-source vs proprietary fine-tuning

Proprietary (OpenAI, Anthropic, Google, Mistral) gives you ease of use — upload a JSONL file, wait an hour, get a custom model. No GPU provisioning, no scaling decisions. The trade-off is the markup over base inference rates and the lack of weight portability.

Open-source on Together, Modal, RunPod, or self-hosted gives you near-parity inference cost (you pay roughly the same as base inference, since you control the deployment) and full portability — you own the LoRA adapter or full weights and can move providers. The trade-off is operational complexity and the need to manage your own evals, deployments, and scaling.

For a typical 1-5M call/month production workload, proprietary fine-tuning usually nets out cheaper at the engineering-cost level when you include operations. For 10M+ call/month workloads, the inference markup starts to exceed the operations cost; open-source becomes the cost leader.

Hybrid pattern that works well in 2026: use proprietary fine-tuning to ship fast, switch to open-source on Together once volume crosses the threshold where ops cost amortizes. The migration is straightforward when both sides train on the same JSONL format.


Hidden costs: evals, drift, and retraining

Beyond training and inference, three operational costs catch teams off guard.

Eval cost. Fine-tuned models need a continuous quality bar. The standard pattern is a held-out test set of 100-1,000 labeled examples, scored every time you ship a new version. If you grade with an LLM-as-judge using gpt-5.5, that is 100-1,000 LLM calls per evaluation pass at $0.02/call = $2-$20. Multiply by version count and weekly cadence.

Drift cost. The world changes. A model fine-tuned in January on customer-support tickets will degrade as new product features ship, terminology evolves, and ticket patterns shift. Plan for a retraining pass every 60-90 days, which means training cost is annualized — multiply your $192 training number by 4-6 retrainings per year.

Version-management cost. You will have multiple fine-tuned models in production simultaneously (current, candidate, rollback). On providers with per-month hosting fees this multiplies the bill; on providers with included hosting it is free. Factor this in when picking a provider.

Bottom line: total cost of ownership for a fine-tuned model is 1.5-3x the raw training + inference math when you include ops. Worth it when the savings or quality lift justifies it; expensive when it does not.


LoRA vs full fine-tuning in 2026 — cost, quality, and portability tradeoffs

Almost every fine-tune in 2026 is either a LoRA (Low-Rank Adaptation) or a full fine-tune, and the choice drives a 5-20x cost gap before you even pick a provider. LoRA freezes the base model's weights and trains a small adapter — typically 1-5% of the parameter count — that slots in at attention and projection layers. Full fine-tuning updates every weight in the base model and produces a self-contained custom checkpoint. Both produce a model you can serve; the costs, quality ceilings, and operational shapes look very different.

On training cost the gap is large. A LoRA adapter for Llama 3.3-70B trains in roughly 3-5 GPU-hours on an H100 cluster for a 24M-token job; on Together's managed LoRA endpoint that comes out to about $21.60 (24 × $0.90/1M) — the same number we used in the worked example above, because Together's headline rate is the LoRA rate. A full fine-tune of the same 70B model on the same 24M tokens runs roughly 35-60 H100-hours on a self-managed RunPod or Modal cluster. At RunPod's ~$2.49/hr for an 80GB H100 SXM in June 2026, that's $87-$150 in pure GPU rental, plus orchestration overhead and a few failed runs you should budget for, landing real-world full-fine-tune cost at $200-$300. The 10x gap between $22 LoRA and $200+ full fine-tune is the headline number to remember.

Quality differences are smaller than the cost gap suggests. Across published benchmarks in 2026 — MMLU-Pro, GSM8K, HumanEval, and most classification tasks — full fine-tuning beats a well-tuned LoRA by 1-3 percentage points. That gap widens when the task demands a large style or format shift from the base model's pretraining distribution: heavy SQL-only output, a non-English low-resource language, a domain-specific DSL, or a strict house-style rewrite can push the gap to 5-8 points. For most production classification, extraction, and assistant-style workloads, the LoRA quality penalty is inside the noise of your eval harness, and you would not see it in production unless you specifically measured for it.

Provider exposure differs sharply. OpenAI, Anthropic, and Google price by training-token rate and never tell you which method they use under the hood — internal leaks and inference-latency profiling suggest OpenAI runs LoRA-style adapters for gpt-4.1-nano and gpt-5.4-mini fine-tunes and full fine-tunes only for the flagship tier, but they neither confirm nor expose the choice. You pay the published rate and get a model id. Open-source platforms expose the choice explicitly. Together AI lists separate LoRA and full-fine-tune SKUs — Llama 3.3-70B LoRA at $0.90/1M training is the headline; full fine-tuning the same base lists at roughly $5.40/1M, a 6x premium. Modal and RunPod let you rent the GPUs and run either path with frameworks like Unsloth, Axolotl, or torchtune; you eat the orchestration cost but get full control.

Portability is where LoRA's structural advantage shows up. A 70B LoRA adapter weighs 50-500MB depending on rank (typically rank 16-64 in 2026 production setups) — small enough to version in object storage, swap at request time, and A/B test five variants from one loaded base model on a single GPU. vLLM and SGLang both support multi-LoRA serving in 2026, letting you keep ten adapters hot per base model and route requests by tenant, task, or experiment. A full fine-tune of a 70B model produces 140GB of float-16 weights; you need a separate deployment per variant, each consuming its own GPU memory, and A/B tests cost N times as much as single-model serving.

The portability story also matters when the base model gets deprecated. Llama 3.1 was state-of-the-art 18 months before this guide; it is now superseded by 3.3 and Llama 4 Scout. A LoRA trained against 3.1 can usually be re-trained against 3.3 in a few hours on the same dataset — your data pipeline, eval set, and hyperparameter sweep all carry over. A full fine-tune is welded to its base; the only path to a newer base is a full retraining cycle. For teams running on a 6-12 month base-model refresh cadence, LoRA cuts the recurring retrain cost by 5-10x.

When full fine-tuning is still the right call: workloads where the 1-3 point quality gap translates into measurable revenue or risk (high-volume classification where 1% accuracy moves a P&L line, safety-critical filtering, regulated extraction with hard-coded format requirements), tasks with very large training corpora (>100M tokens) where LoRA's low-rank decomposition starts losing information, and single-tenant high-volume serving where the per-GPU-memory overhead of a full model is amortized across millions of calls per day. In those cases the $200 vs $22 gap is irrelevant — it amortizes in hours of inference savings.

One more cost line that matters: inference-time overhead. A LoRA adapter adds 1-3% latency over base-model inference when served through vLLM's optimized multi-LoRA path in 2026 — effectively free at production scale. A full fine-tune has zero inference overhead by definition, but takes a separate GPU slot. On a single H100 you can serve a base Llama 3.3-70B with ten LoRA adapters loaded at ~$2.49/hr; serving ten full fine-tunes of the same base requires ten separate deployments at roughly $25/hr in GPU rental alone. For multi-tenant SaaS workloads where each customer gets a custom adapter, this cost gap compounds — LoRA can keep per-tenant cost in the cents while full fine-tunes price the same architecture out of viability below the enterprise tier.

Bottom-line rule for 2026: default to LoRA. Train it on Together at $22 per 24M-token pass, ship it behind a multi-adapter vLLM endpoint, run a held-out eval, and only escalate to a full fine-tune if the quality gap shows up in your business metric. The default catches 80% of production use cases at one-tenth the cost; the escalation path is open if you need it.


Five-step decision flow for whether to fine-tune

Step 1: estimate base-model cost on your current workload using our GPT vs Claude vs Gemini cost calculator. Numbers below $500/month rarely justify the operational overhead of a fine-tune; numbers above $5,000/month often do.

Step 2: try prompt engineering first. Few-shot examples, structured output schemas, chain-of-thought prompting, and a fresh look at the system prompt usually close 60-80% of the gap between base and fine-tuned quality at zero ops cost.

Step 3: if prompt engineering plateaus below your quality bar, build a 500-1,000 example labeled dataset. Use a stronger base model (gpt-5.5 or Sonnet 4.6) to bootstrap labels; spot-check 10-20% of them by hand.

Step 4: train a small fine-tune ($20-$200) on a small base model (gpt-5.4-nano, gpt-5.4-mini, Gemini 2.5 Flash, or Llama 3.3-70B via Together). Compare against base mid-tier on your held-out test set.

Step 5: if the fine-tuned small model matches base mid-tier on quality, ship it — you have likely just cut inference cost 3-5x. If it does not, either the base mid-tier model is the right answer, or the gap is in the data (more examples, better labels) rather than the technique.

Frequently Asked Questions

What is the cheapest fine-tunable model in 2026?

Together AI Llama 3.3-70B at $0.90/1M training and $0.88/1M near-parity inference is the cheapest hosted fine-tune option among major providers. OpenAI gpt-4.1-nano at $1.50/1M training is the cheapest proprietary option.

Does fine-tuning save money on inference?

Not directly — most providers charge 1.5x base for fine-tuned inference. Fine-tuning saves money when it lets you drop to a cheaper base tier (e.g., from gpt-5.5 to fine-tuned gpt-5.4-mini) or eliminates a long instruction prompt. Otherwise it costs more per call, not less.

What is the training-token formula?

training_tokens = (sum of input + output tokens across your dataset) × epoch_count. A 10k-example dataset with 800 tokens per example and 3 epochs = 24M training tokens. Multiply by the training $/1M rate.

Should I fine-tune or use prompt engineering?

Try prompt engineering first. Few-shot examples, structured-output schemas, and a tightened system prompt usually close 60-80% of the gap to fine-tuning at zero ops cost. Fine-tune only when prompt engineering plateaus below your quality bar.

How often do I need to retrain?

Plan a retraining pass every 60-90 days for most production workloads. Underlying data drifts (product changes, terminology, customer behavior) and the model needs to be re-aligned. Budget for 4-6 retraining cycles per year.

Can I fine-tune Claude?

Yes — Anthropic offers fine-tuning for Claude Haiku 4.5 through AWS Bedrock. Training rate is roughly $10/1M training tokens with a 1.5x markup on served inference. Confirm against AWS Bedrock model customization docs.

Can I fine-tune GPT-5.5?

Not as of June 2026. OpenAI's flagship fine-tunable models in 2026 are gpt-5.4 ($25/1M), gpt-5.4-mini ($8/1M), and gpt-5.4-nano ($2.50/1M). Confirm on OpenAI's fine-tuning page for the current list.

Is open-source fine-tuning cheaper than proprietary?

Usually yes on the raw inference bill — Together AI charges near-parity vs base inference, while OpenAI marks up 1.5x. Operationally, open-source costs more in engineering time, deploy management, and eval infrastructure. For >10M call/month workloads, open-source typically wins net-of-ops.

What is the cost difference between LoRA and full fine-tuning?

Typically 5-20x in training cost. A 24M-token LoRA fine-tune of Llama 3.3-70B on Together AI runs about $22 (24 × $0.90/1M). A full fine-tune of the same base on RunPod or Modal runs $200-$300 in GPU rental (35-60 H100-hours at ~$2.49/hr plus orchestration overhead). Quality typically differs by only 1-3 points on standard benchmarks, so LoRA is the right default unless that gap moves a real business metric.

Do OpenAI and Anthropic use LoRA under the hood?

They don't disclose it. Inference-latency profiling and intermittent leaks suggest OpenAI uses LoRA-style adapters for fine-tunes of smaller models like gpt-4.1-nano and gpt-5.4-mini, while reserving full fine-tuning for the flagship tier. Anthropic and Google don't expose the method either. You pay the published training-token rate and get back a model id — the method is abstracted away. If you need explicit control over LoRA vs full, use open-source providers like Together AI, Modal, or RunPod, which expose the choice as separate SKUs.

Can I A/B test multiple LoRA adapters from one base model?

Yes — that's one of LoRA's structural advantages. A 70B LoRA adapter weighs 50-500MB (rank 16-64 in typical 2026 setups), small enough to keep ten adapters hot per base on a single GPU. vLLM and SGLang both support multi-LoRA serving in 2026, letting you route requests by tenant, task, or experiment without spinning up a deployment per variant. Full fine-tunes produce multi-GB checkpoints (140GB for a 70B at fp16) that require a separate deployment per variant — A/B tests cost N times as much as single-model serving.

Get the 2026 fine-tuning cheat sheet

One-page PDF with every fine-tunable model's training rate, served-inference rate, and hosting fee — free, no signup gate.

Browse all prompt tools →