Almost every fine-tune in 2026 is either a LoRA (Low-Rank Adaptation) or a full fine-tune, and the choice drives a 5-20x cost gap before you even pick a provider. LoRA freezes the base model's weights and trains a small adapter, typically well under 1% of the parameter count at the ranks used in production, that slots in at attention and projection layers. Full fine-tuning updates every weight in the base model and produces a self-contained custom checkpoint. Both produce a model you can serve; the costs, quality ceilings, and operational shapes look very different.
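To make the adapter idea concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. It assumes the standard LoRA formulation (a frozen weight W plus a scaled low-rank product of two small matrices B and A). The function names, toy matrices, and "trained" values are illustrative and not taken from any particular library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x).

    W is the frozen base weight (d_out x d_in).
    A (rank x d_in) and B (d_out x rank) are the only trained matrices.
    """
    base = matvec(W, x)                 # frozen path, never updated
    update = matvec(B, matvec(A, x))    # low-rank adapter path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def trainable_fraction(d_out, d_in, rank):
    """Adapter parameters as a fraction of the layer's frozen parameters."""
    return rank * (d_in + d_out) / (d_out * d_in)


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (toy 2x2 identity)
A = [[1.0, 1.0]]               # rank-1 down-projection
B_init = [[0.0], [0.0]]        # B starts at zero, so the adapter is a no-op
B_trained = [[1.0], [2.0]]     # hypothetical values after training
x = [3.0, 4.0]

y_init = lora_forward(W, A, B_init, x, alpha=2, rank=1)
y_trained = lora_forward(W, A, B_trained, x, alpha=2, rank=1)
fraction = trainable_fraction(4096, 4096, 16)
```

With B initialised to zero the layer reproduces the base model exactly, and a rank-16 adapter on a 4096 x 4096 projection trains under 1% of that layer's parameters.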
On training cost the gap is large. A LoRA adapter for Llama 3.3-70B trains in roughly 3-5 GPU-hours on an H100 cluster for a 24M-token job; on Together's managed LoRA endpoint that comes out to about $21.60 (24 × $0.90/1M) — the same number we used in the worked example above, because Together's headline rate is the LoRA rate. A full fine-tune of the same 70B model on the same 24M tokens runs roughly 35-60 H100-hours on a self-managed RunPod or Modal cluster. At RunPod's ~$2.49/hr for an 80GB H100 SXM in June 2026, that's $87-$150 in pure GPU rental, plus orchestration overhead and a few failed runs you should budget for, landing real-world full-fine-tune cost at $200-$300. The 10x gap between $22 LoRA and $200+ full fine-tune is the headline number to remember.
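The arithmetic behind those figures is simple enough to keep in a script and rerun when rates change. The sketch below only reproduces the numbers quoted in this section; the function names are illustrative, and the rates are the article's June 2026 figures rather than live prices.

```python
def managed_token_cost(training_tokens, usd_per_million_tokens):
    """Cost of a managed fine-tune billed per training token."""
    return training_tokens / 1_000_000 * usd_per_million_tokens


def gpu_rental_cost(gpu_hours, usd_per_gpu_hour):
    """Raw GPU rental cost for a self-managed run (no overhead, no retries)."""
    return gpu_hours * usd_per_gpu_hour


# Managed LoRA: 24M tokens at $0.90 per 1M training tokens.
lora_cost = managed_token_cost(24_000_000, 0.90)

# Self-managed full fine-tune: 35-60 H100-hours at ~$2.49/hr.
full_ft_low = gpu_rental_cost(35, 2.49)
full_ft_high = gpu_rental_cost(60, 2.49)

# Ratio of the budgeted real-world range ($200-$300) to the LoRA cost.
ratio_low = 200 / lora_cost
ratio_high = 300 / lora_cost

print(f"LoRA: ${lora_cost:.2f}")
print(f"Full fine-tune GPU rental: ${full_ft_low:.2f}-${full_ft_high:.2f}")
print(f"Budgeted gap: {ratio_low:.1f}x-{ratio_high:.1f}x")
```

The budgeted range works out to roughly 9x-14x the LoRA cost, which is where the 10x headline comes from.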
Quality differences are smaller than the cost gap suggests. Across published benchmarks in 2026 — MMLU-Pro, GSM8K, HumanEval, and most classification tasks — full fine-tuning beats a well-tuned LoRA by 1-3 percentage points. That gap widens when the task demands a large style or format shift from the base model's pretraining distribution: heavy SQL-only output, a non-English low-resource language, a domain-specific DSL, or a strict house-style rewrite can push the gap to 5-8 points. For most production classification, extraction, and assistant-style workloads, the LoRA quality penalty is inside the noise of your eval harness, and you would not see it in production unless you specifically measured for it.
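One way to check whether a gap is "inside the noise" is to compare it with the sampling error of the eval set. The sketch below assumes the two models are scored on independent samples of equal size and uses a normal approximation to the difference of two accuracies. The accuracies and eval sizes are hypothetical, and a paired test on the same examples would usually be tighter.

```python
import math


def gap_noise_halfwidth(acc_a, acc_b, n_examples, z=1.96):
    """Approximate 95% half-width for the difference of two accuracies."""
    var_a = acc_a * (1 - acc_a) / n_examples
    var_b = acc_b * (1 - acc_b) / n_examples
    return z * math.sqrt(var_a + var_b)


def gap_is_detectable(acc_a, acc_b, n_examples):
    """True if the observed gap exceeds the approximate noise band."""
    return abs(acc_a - acc_b) > gap_noise_halfwidth(acc_a, acc_b, n_examples)


# Hypothetical scores: LoRA at 85%, full fine-tune at 87% (a 2-point gap).
halfwidth_1k = gap_noise_halfwidth(0.85, 0.87, 1_000)
detectable_1k = gap_is_detectable(0.85, 0.87, 1_000)
detectable_10k = gap_is_detectable(0.85, 0.87, 10_000)
```

Under these assumptions a 1,000-example eval has a noise band of about 3 points, so a 2-point gap is not distinguishable from zero; at 10,000 examples it is.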
Provider exposure differs sharply. OpenAI, Anthropic, and Google price by training-token rate and never tell you which method they use under the hood — internal leaks and inference-latency profiling suggest OpenAI runs LoRA-style adapters for gpt-4.1-nano and gpt-5.4-mini fine-tunes and full fine-tunes only for the flagship tier, but they neither confirm nor expose the choice. You pay the published rate and get a model id. Open-source platforms expose the choice explicitly. Together AI lists separate LoRA and full-fine-tune SKUs — Llama 3.3-70B LoRA at $0.90/1M training is the headline; full fine-tuning the same base lists at roughly $5.40/1M, a 6x premium. Modal and RunPod let you rent the GPUs and run either path with frameworks like Unsloth, Axolotl, or torchtune; you eat the orchestration cost but get full control.
Portability is where LoRA's structural advantage shows up. A 70B LoRA adapter weighs 50-500MB depending on rank (typically rank 16-64 in 2026 production setups) and on which layers it targets. That is small enough to version in object storage, swap at request time, and A/B test five variants from one loaded copy of the base model. vLLM and SGLang both support multi-LoRA serving in 2026, letting you keep ten adapters hot per base model and route requests by tenant, task, or experiment. A full fine-tune of a 70B model produces 140GB of 16-bit weights (70B parameters at 2 bytes each); you need a separate deployment per variant, each consuming its own GPU memory, so an A/B test across N variants costs N times as much as single-model serving.
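The size figures can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes Llama-3-70B-style attention shapes (hidden size 8192, 80 layers, grouped-query attention with 1024-wide key/value projections) and 16-bit adapter weights; treat those shapes and the choice of target modules as assumptions rather than a spec.

```python
BYTES_PER_PARAM_16BIT = 2

# Assumed (d_in, d_out) per attention projection for a Llama-3-70B-style model.
ASSUMED_PROJECTIONS = {
    "q_proj": (8192, 8192),
    "k_proj": (8192, 1024),
    "v_proj": (8192, 1024),
    "o_proj": (8192, 8192),
}
ASSUMED_NUM_LAYERS = 80


def adapter_params(rank, target_modules, num_layers=ASSUMED_NUM_LAYERS):
    """LoRA adds rank * (d_in + d_out) parameters per adapted projection."""
    per_layer = sum(rank * sum(ASSUMED_PROJECTIONS[name]) for name in target_modules)
    return per_layer * num_layers


def size_mb(num_params, bytes_per_param=BYTES_PER_PARAM_16BIT):
    """Storage size in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


# Low end: rank 16 on the query and value projections only.
small_adapter_mb = size_mb(adapter_params(16, ["q_proj", "v_proj"]))

# High end: rank 64 on all four attention projections.
large_adapter_mb = size_mb(
    adapter_params(64, ["q_proj", "k_proj", "v_proj", "o_proj"])
)

# Full checkpoint: 70B parameters at 2 bytes each.
full_checkpoint_gb = size_mb(70e9) / 1000
```

Under these assumptions the adapter lands between roughly 65MB and 525MB, in line with the 50-500MB range quoted above, against 140GB for the full checkpoint.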
The portability story also matters when the base model gets deprecated. Llama 3.1 was state-of-the-art 18 months before this guide; it is now superseded by 3.3 and Llama 4 Scout. Neither a LoRA adapter nor a full fine-tune transfers to a new base, so both have to be retrained, and the data pipeline and eval set carry over either way. The difference is the bill: a LoRA trained against 3.1 can usually be re-trained against 3.3 in a few hours on the same dataset, while a full fine-tune has to repeat the entire full-training cycle. For teams running on a 6-12 month base-model refresh cadence, LoRA cuts the recurring retrain cost by 5-10x.
When full fine-tuning is still the right call: workloads where the 1-3 point quality gap translates into measurable revenue or risk (high-volume classification where 1% accuracy moves a P&L line, safety-critical filtering, regulated extraction with hard-coded format requirements), tasks with very large training corpora (>100M tokens) where LoRA's low-rank decomposition starts losing information, and single-tenant high-volume serving where the per-GPU-memory overhead of a full model is amortized across millions of calls per day. In those cases the $200 vs $22 gap is irrelevant — it amortizes in hours of inference savings.
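Those escalation criteria can be written down as a small checklist function. The sketch below is a deliberate simplification: it treats LoRA as the default and any single criterion as sufficient to recommend a full fine-tune. The `Workload` fields and the function name are hypothetical; only the 100M-token threshold comes from the text.

```python
from dataclasses import dataclass

LARGE_CORPUS_TOKENS = 100_000_000  # the ">100M tokens" threshold from the text


@dataclass
class Workload:
    training_tokens: int
    quality_gap_moves_business_metric: bool  # e.g. 1% accuracy moves a P&L line
    single_tenant_high_volume: bool          # one model, millions of calls per day


def recommend_method(workload: Workload) -> str:
    """Return "full" if any escalation criterion applies, else "lora"."""
    if workload.quality_gap_moves_business_metric:
        return "full"
    if workload.training_tokens > LARGE_CORPUS_TOKENS:
        return "full"
    if workload.single_tenant_high_volume:
        return "full"
    return "lora"


typical = recommend_method(Workload(24_000_000, False, False))
large_corpus = recommend_method(Workload(150_000_000, False, False))
revenue_critical = recommend_method(Workload(24_000_000, True, False))
```

In practice the first criterion should be backed by a held-out eval rather than a guess, since it is the one that is easiest to assert and hardest to verify.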
One more cost line that matters: inference-time overhead. A LoRA adapter adds 1-3% latency over base-model inference when served through vLLM's optimized multi-LoRA path in 2026, which is effectively free at production scale. A full fine-tune has zero inference overhead by definition, but takes a separate GPU slot. On a single 80GB H100 you can serve a quantized base Llama 3.3-70B (the 140GB of 16-bit weights would not fit unquantized) with ten LoRA adapters loaded at ~$2.49/hr; serving ten full fine-tunes of the same base, quantized the same way, requires ten separate deployments at roughly $25/hr in GPU rental alone. For multi-tenant SaaS workloads where each customer gets a custom adapter, this cost gap compounds: LoRA can keep per-tenant cost in the cents while full fine-tunes price the same architecture out of viability below the enterprise tier.
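The serving comparison reduces to how many variants share a GPU. The sketch below follows this section's simplification of one H100 per deployment at ~$2.49/hr; the function name and the `variants_per_gpu` parameter are illustrative.

```python
import math


def hourly_serving_cost(num_variants, usd_per_gpu_hour, variants_per_gpu):
    """GPU rental per hour when each GPU hosts up to variants_per_gpu variants."""
    gpus_needed = math.ceil(num_variants / variants_per_gpu)
    return gpus_needed * usd_per_gpu_hour


# Ten LoRA adapters sharing one base model on one GPU.
lora_hourly = hourly_serving_cost(10, 2.49, variants_per_gpu=10)

# Ten full fine-tunes, one deployment each.
full_hourly = hourly_serving_cost(10, 2.49, variants_per_gpu=1)

# Per-tenant hourly cost if each of the ten variants belongs to one tenant.
lora_per_tenant = lora_hourly / 10
full_per_tenant = full_hourly / 10

print(f"LoRA: ${lora_hourly:.2f}/hr total, ${lora_per_tenant:.3f}/hr per tenant")
print(f"Full: ${full_hourly:.2f}/hr total, ${full_per_tenant:.2f}/hr per tenant")
```

The per-tenant figure for full fine-tunes stays at the full GPU price no matter how many tenants you add, while the LoRA figure keeps falling until the adapters saturate the GPU.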
Bottom-line rule for 2026: default to LoRA. Train it on Together at $22 per 24M-token pass, ship it behind a multi-adapter vLLM endpoint, run a held-out eval, and only escalate to a full fine-tune if the quality gap shows up in your business metric. The default catches 80% of production use cases at one-tenth the cost; the escalation path is open if you need it.