By The DDH Team · Digital Dashboard Hub

LoRA Training Cost on H100 (2026): Real GPU-Hour Math

H100 80GB hourly rates in June 2026 run roughly $1.80-3.20/hour spot and $3.20-5.20/hour on-demand across the specialized GPU clouds (Lambda Labs, CoreWeave, RunPod, and Vast.ai). LoRA fine-tuning on these GPUs converts cleanly to dollar cost: a typical Llama 4 70B LoRA run with 5,000 examples is budgeted at 6-8 H100-hours and costs $15-36 depending on spot vs on-demand. This page collects throughput numbers, per-cloud H100 rates, and per-job dollar cost for the open-weight models most commonly LoRA'd in 2026.


LoRA training on H100s is the cost-efficient sweet spot for fine-tuning open-weight models in 2026. Compared to hosted platforms (Together AI, Fireworks AI), self-hosting on rented H100s can be 30-50% cheaper but requires running the training pipeline yourself. The math is straightforward once you know two things: (1) the H100 hourly rate at the cloud and tier you can access, and (2) the effective tokens-per-second throughput on your model and rank configuration.

Below: H100 hourly rates from the four major GPU clouds, throughput numbers from axolotl + DeepSpeed Zero-2 on Llama 4 70B and other common targets, and per-job dollar cost for standardized training runs. For full method context, see LoRA vs QLoRA vs full fine-tuning cost. For hosted platform alternatives, see Together vs Fireworks vs Replicate.

H100 80GB GPU hourly rates by cloud, June 2026

| Cloud | Spot ($/hour, per GPU) | On-demand ($/hour, per GPU) | Commit/reserved discount |
| --- | --- | --- | --- |
| Lambda Labs | $2.49 (1x), $2.89 (8x cluster) | $3.79 (1x), $4.49 (8x cluster) | Yes: multi-month reservations available |
| CoreWeave | $2.79 (1x), $3.20 (8x cluster) | $4.39 (1x), $5.20 (8x cluster) | Yes: 1Y/3Y commit discounts 20-40% |
| RunPod | $2.39 (1x community), $2.99 (1x secure) | $3.99 (1x community), $4.69 (1x secure) | Yes: savings plans available |
| Vast.ai | $1.80-2.40 (marketplace varies) | $3.20-4.50 (marketplace varies) | Limited: host-specific deals |
| AWS p5 (H100) | $5.99 (spot, regional variation) | $12.30 (on-demand, base price) | Yes: Savings Plans 30-50% |
| GCP A3 (H100) | $5.40 (spot, regional variation) | $10.80 (on-demand, base price) | Yes: committed-use discounts |
| Azure NDv5 (H100) | $6.10 (spot, regional variation) | $12.40 (on-demand, base price) | Yes: reserved instances 30-50% |

Sources as of June 2026: Lambda Labs pricing (https://lambdalabs.com/service/gpu-cloud/pricing), CoreWeave pricing (https://www.coreweave.com/pricing), RunPod pricing (https://www.runpod.io/pricing), Vast.ai marketplace (https://vast.ai/), AWS p5 pricing (https://aws.amazon.com/ec2/instance-types/p5/), GCP A3 pricing (https://cloud.google.com/compute/gpus-pricing), Azure NDv5 pricing (https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/). Prices fluctuate; figures cited are mid-range for June 2026. Hyperscaler prices (AWS, GCP, Azure) are 2-3x specialized-GPU-cloud prices but include enterprise-grade network, IAM, and compliance features that the specialized clouds do not.
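
To see how the spot discount varies by provider, the rates in the table can be compared directly. The sketch below uses the single-GPU (1x) figures from the table; the `RATES` dictionary and its layout are illustrative assumptions, and live prices will differ from these June 2026 figures.

```python
# Spot vs on-demand discount per cloud, using the 1x H100 rates quoted above.
# RATES is an illustrative structure, not an API exposed by any provider.
RATES = {
    "Lambda Labs": (2.49, 3.79),
    "CoreWeave": (2.79, 4.39),
    "RunPod (community)": (2.39, 3.99),
    "AWS p5": (5.99, 12.30),
    "GCP A3": (5.40, 10.80),
    "Azure NDv5": (6.10, 12.40),
}


def spot_discount(spot: float, on_demand: float) -> float:
    """Fraction saved by choosing spot over on-demand."""
    return 1.0 - spot / on_demand


discounts = {cloud: spot_discount(s, od) for cloud, (s, od) in RATES.items()}

# Print clouds from smallest to largest discount.
for cloud, d in sorted(discounts.items(), key=lambda kv: kv[1]):
    print(f"{cloud:20s} {d:5.1%}")
```

On these figures the discount runs from about 34% (Lambda Labs) to about 51% (AWS p5).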

Throughput math — tokens per second on H100

Translating H100 hours to dollar cost requires knowing your effective tokens-per-second throughput. The number depends on model, sequence length, batch size, LoRA rank, and the training framework.

**Llama 4 70B LoRA (rank 16, 8K context, batch 4)** on 2x H100 80GB with axolotl + DeepSpeed ZeRO-2 achieves approximately 3,000-5,000 effective tokens/second/GPU, so 6,000-10,000 tokens/second total across the 2 GPUs. At that rate a 22.5M-token job needs roughly 0.6-1.0 hours of wall-clock compute, or about 1.3-2.1 H100-hours across both GPUs. The 6-8 H100-hour figure used for the cost numbers on this page is a more conservative budget than that raw number: it leaves room for warmup, checkpointing, validation runs, and runs that land below the throughput range above.

**Llama 4 70B QLoRA (rank 16, 4-bit base, 8K context, batch 2)** on 1x H100 80GB achieves approximately 2,000-3,000 effective tokens/second (slower due to 4-bit dequant overhead in forward pass, but only one GPU is used). A 22.5M-token job completes in approximately 2-3 hours wall-clock = 2-3 H100-hours. Add overhead = 3-4 H100-hours total.

**Llama 4 8B LoRA (rank 16, 8K context, batch 16)** on 1x H100 80GB achieves approximately 25,000-40,000 effective tokens/second. A 22.5M-token job completes in approximately 9-15 minutes wall-clock, or about 0.15-0.25 H100-hours. Including overhead, budget 0.5-1 H100-hour.

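**Llama 4 405B LoRA (rank 16, 4K context, batch 1)** on 8x H100 80GB cluster achieves approximately 800-1,200 effective tokens/second/GPU, so 6,400-9,600 tokens/second across the cluster. At that rate a 22.5M-token job needs roughly 0.65-1.0 hours of wall-clock compute, or about 5-8 H100-hours across the 8 GPUs. The 40-60 H100-hour budget used for the cost numbers on this page is well above that raw number, so treat it as an upper bound and measure throughput on your own cluster before committing to a budget.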

**Mixtral 8x22B LoRA** behaves similarly to Llama 4 70B in throughput: ~3,500-5,000 tokens/sec on 2x H100 80GB.

**Qwen 2.5 32B LoRA** sits between Llama 4 8B and 70B — ~8,000-15,000 tokens/sec on 1x H100 80GB.

**DeepSeek-V3 LoRA** is roughly 70B-class throughput — ~3,000-5,000 tokens/sec on 2x H100 80GB.
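
The conversion from a throughput figure to billed GPU time can be sketched as follows. This is a minimal illustration, not a benchmark: the function and class names are hypothetical, the 25% default overhead is simply the midpoint of the 20-30% range mentioned above, and the per-GPU throughput is assumed to already include any multi-GPU scaling losses.

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600


@dataclass
class RunEstimate:
    wall_clock_hours: float
    h100_hours: float


def estimate_run(total_tokens: float, tokens_per_sec_per_gpu: float,
                 num_gpus: int = 1, overhead: float = 0.25) -> RunEstimate:
    """Convert a throughput figure into wall-clock hours and billed H100-hours.

    Assumes overhead (warmup, checkpointing, validation) scales compute time
    by a constant factor.
    """
    compute_hours = total_tokens / (tokens_per_sec_per_gpu * num_gpus) / SECONDS_PER_HOUR
    wall_clock = compute_hours * (1.0 + overhead)
    return RunEstimate(wall_clock_hours=wall_clock, h100_hours=wall_clock * num_gpus)


# QLoRA example from above: 2,000-3,000 tokens/sec on one H100, 22.5M tokens.
slow = estimate_run(22.5e6, 2000, num_gpus=1)
fast = estimate_run(22.5e6, 3000, num_gpus=1)
print(f"{fast.h100_hours:.1f}-{slow.h100_hours:.1f} H100-hours including overhead")
```

Because billed GPU-hours are wall-clock hours times GPU count, adding GPUs only lowers the bill if per-GPU throughput holds up.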


Per-job dollar cost at the standardized workload

Standardized workload: 5,000 examples × 1,500 tokens average × 3 epochs = 22.5M training tokens. Dollar cost = H100-hours × hourly rate.
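
As a quick check of the arithmetic, the workload size and the dollar-cost formula can be written out directly. The function names below are illustrative; the inputs are the standardized workload and a 6-8 H100-hour budget at a $2.50/hour spot rate.

```python
def training_tokens(examples: int, avg_tokens_per_example: int, epochs: int) -> int:
    """Total tokens processed during training."""
    return examples * avg_tokens_per_example * epochs


def job_cost(h100_hours: float, rate_per_gpu_hour: float) -> float:
    """Dollar cost = H100-hours x hourly rate per GPU."""
    return h100_hours * rate_per_gpu_hour


tokens = training_tokens(5_000, 1_500, 3)
low = job_cost(6, 2.50)
high = job_cost(8, 2.50)
print(f"{tokens:,} tokens; ${low:.2f}-${high:.2f} at $2.50/hour")
# prints: 22,500,000 tokens; $15.00-$20.00 at $2.50/hour
```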

**Llama 4 70B LoRA**: 6-8 H100-hours. At $2.50/hr spot (Lambda/RunPod): $15-20. At $4.50/hr on-demand: $27-36. At $5.99/hr spot AWS p5: $36-48.

**Llama 4 70B QLoRA**: 3-4 H100-hours (single GPU). At $2.50/hr spot: $8-10. At $4.50/hr on-demand: $14-18.

**Llama 4 8B LoRA**: 0.5-1 H100-hour. At $2.50/hr spot: $1.25-2.50. At $4.50/hr on-demand: $2.25-4.50.

**Llama 4 405B LoRA**: 40-60 H100-hours (8x cluster). At $2.89/hr cluster spot: $115-174. At $4.49/hr cluster on-demand: $180-270.

**Mixtral 8x22B LoRA**: 6-8 H100-hours. At $2.50/hr spot: $15-20.

**Qwen 2.5 32B LoRA**: 2-3 H100-hours. At $2.50/hr spot: $5-7.50.

**Comparison to hosted platforms**: Together AI charges $1.20/1M tokens for Llama 4 70B LoRA, which is $27 for the same job. That is $7-12 more than the cheapest self-hosted spot run ($15-20), level with the low end of self-hosted on-demand ($27-36), and below AWS p5 spot ($36-48). Self-hosting on spot therefore saves roughly 25-45% on training cost, but it adds engineering effort: cluster setup, environment configuration, monitoring, and recovery from spot interruption.
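
The hosted-versus-self-hosted comparison reduces to a per-token price against a per-hour price. A minimal sketch, using the figures quoted above as inputs (the function names are illustrative):

```python
def hosted_cost(total_tokens: int, price_per_million_tokens: float) -> float:
    """Cost of a hosted fine-tune billed per million training tokens."""
    return total_tokens / 1_000_000 * price_per_million_tokens


def savings_fraction(self_hosted: float, hosted: float) -> float:
    """Positive when self-hosting is cheaper, negative when it costs more."""
    return 1.0 - self_hosted / hosted


hosted = hosted_cost(22_500_000, 1.20)
for self_hosted in (15.0, 20.0, 36.0):
    delta = savings_fraction(self_hosted, hosted)
    print(f"self-hosted ${self_hosted:.0f} vs hosted ${hosted:.0f}: {delta:+.0%}")
```

A negative result means the self-hosted configuration costs more than the hosted platform, which is the case for the upper end of on-demand pricing.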


Spot vs on-demand vs reserved — the right choice

H100 capacity comes in three tiers and the right choice depends on workload urgency and engineering tolerance.

**Spot pricing** is roughly 35-50% cheaper than on-demand at the rates listed above but can be reclaimed by the cloud at any time, usually with 30-60 seconds notice. Per-hour interruption probability is typically under 5% on Lambda/CoreWeave, so LoRA jobs that complete in 1-4 hours usually finish uninterrupted; even so, at 5% per hour the chance of at least one interruption over 4 hours is about 19%. For long jobs (8+ hours), the probability compounds further and you should checkpoint frequently. With aggressive checkpointing (every 30 minutes), spot interruption is recoverable with minor cost overhead.
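
How interruption risk compounds with job length can be sketched under a simple model. The assumptions here are mine, not measured provider behavior: each hour is treated as an independent trial with the same interruption rate, and each interruption is assumed to lose half a checkpoint interval of work on average.

```python
def p_interrupted(per_hour_rate: float, hours: float) -> float:
    """Probability of at least one interruption, assuming independent hours."""
    return 1.0 - (1.0 - per_hour_rate) ** hours


def expected_rework_hours(per_hour_rate: float, hours: float,
                          checkpoint_interval_hours: float) -> float:
    """Expected compute lost to interruptions.

    Assumes about per_hour_rate * hours interruptions on average, each losing
    half a checkpoint interval of work.
    """
    return per_hour_rate * hours * checkpoint_interval_hours / 2.0


for hours in (1, 4, 8, 24):
    risk = p_interrupted(0.05, hours)
    rework = expected_rework_hours(0.05, hours, 0.5)
    print(f"{hours:>2} h job: {risk:.0%} chance of interruption, ~{rework:.2f} h rework")
```

With 30-minute checkpoints the expected rework stays small even when an interruption is likely, which is why checkpoint interval matters more than the raw interruption rate.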

**On-demand pricing** is the standard rate with no interruption risk. The right choice for production training jobs where wall-clock time matters or for clusters > 2 GPUs where coordinating recovery from partial interruption is complex.

**Reserved or committed-use pricing** offers 20-50% discounts on multi-month commitments. The right choice for teams running continuous training workloads (model versioning, A/B testing, periodic retraining). Lambda Labs offers month-long reservations at meaningful discounts; CoreWeave 1Y commitments hit 30-40% off list.

**Marketplace pricing (Vast.ai)** is the cheapest at $1.80-2.40/hour for H100 but variable — quality and reliability depend on the individual host. Acceptable for experimental runs and low-stakes training; not recommended for production deployments where consistency matters.


Multi-GPU scaling efficiency

Most LoRA jobs above 70B parameters benefit from multi-GPU scaling. Efficiency is not perfect — there is overhead from gradient communication between GPUs.

**2 GPU scaling** (DeepSpeed Zero-2 or FSDP) on Llama 4 70B LoRA achieves approximately 80-90% scaling efficiency. So 2x H100 produces roughly 1.6-1.8x the throughput of 1x H100.

**4 GPU scaling** achieves approximately 70-80% efficiency on a single node with NVLink (~2.8-3.2x speedup vs single GPU). Across multiple nodes via InfiniBand or 100GbE, efficiency drops to 50-70%.

**8 GPU scaling** on a single node (8x H100 with NVSwitch) achieves approximately 60-70% efficiency on LoRA (~4.8-5.6x speedup vs single GPU). Full fine-tuning scales better at 8 GPUs because gradient computation dominates communication; LoRA's much smaller gradient footprint makes the multi-GPU overhead relatively larger.

**The cost implication**: for LoRA on Llama 4 70B, the cheapest configuration is often 1-2 GPUs. Going to 4+ GPUs reduces wall-clock time but increases GPU-hours total because of scaling inefficiency. Pick GPU count by wall-clock requirement, not by minimizing dollar cost.
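
The trade-off between wall-clock time and billed GPU-hours follows directly from the efficiency figures. The sketch below takes the midpoint of each single-node range quoted above (0.85, 0.75, 0.65) as an assumption; substitute measured values for your own setup.

```python
# Midpoints of the single-node efficiency ranges quoted above (assumptions).
EFFICIENCY = {1: 1.00, 2: 0.85, 4: 0.75, 8: 0.65}


def speedup(num_gpus: int, efficiency: float) -> float:
    """Throughput relative to a single GPU."""
    return num_gpus * efficiency


def relative_gpu_hours(efficiency: float) -> float:
    """Billed GPU-hours relative to a perfectly efficient run."""
    return 1.0 / efficiency


for n, eff in EFFICIENCY.items():
    print(f"{n} GPU(s): {speedup(n, eff):.1f}x faster, "
          f"{relative_gpu_hours(eff):.2f}x the GPU-hours")
```

At 65% efficiency an 8-GPU run finishes about 5x sooner but bills roughly 1.5x the GPU-hours of an ideal run.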


QLoRA on a single H100 — the canonical cheapest configuration

If wall-clock time is not critical and you want the lowest possible dollar cost, QLoRA on a single H100 80GB is the answer for models up to 70B parameters.

**Llama 4 70B QLoRA on 1x H100 80GB** completes a typical 22.5M-token job in 3-4 H100-hours, which is $8-10 at $2.50/hour spot and up to about $14 at $3.50/hour. Compared to LoRA on 2x H100 ($15-20 at $2.50/hour spot), QLoRA on 1x H100 is ~50% cheaper.

**The trade-off**: QLoRA quality is typically 1-2 percentage points below full-precision LoRA on standard benchmarks because of the 4-bit quantization of the base model. For most production use cases, this gap is acceptable given the cost savings. For quality-critical workloads, run LoRA on 2 GPUs.

**Provisioning**: QLoRA on a single H100 80GB has ~30 GB of VRAM headroom after model + adapter + optimizer + activations, enough for batch size 2-4 at 8K context. For larger batch sizes or longer context, either move to 2 GPUs (full LoRA) or stay on QLoRA at smaller batch.


When to self-host vs use a hosted platform

The cost difference between self-hosted H100 training and hosted platforms (Together, Fireworks, Replicate) is real but not large. The decision is more about engineering complexity than pure dollar cost.

**Self-host on rented H100s** when: you have or can build the training infrastructure (axolotl, accelerate, DeepSpeed setup); you are running 10+ training jobs/month and the per-job savings compound; you need configurations (DPO, ORPO, continued pre-training) that hosted platforms do not support; or you want full control over the training environment for reproducibility or compliance.

**Use a hosted platform** when: you are running 1-5 training jobs/month and engineering setup time exceeds the per-job savings; you need standard SFT or LoRA configurations that hosted platforms cover well; you do not have GPU access via cloud accounts; or you want a faster path from data to trained model (typical hosted-platform UX is 30-60 minutes from jsonl upload to deployed model).

**Cost crossover math**: typical hosted-platform markup is 30-50% over self-hosted spot. If you save $10/job on self-hosting and you run 20 jobs/month, that is $200/month in savings. If self-hosting adds 4 hours/month of engineer time at $100/hour engineering cost, the engineering cost ($400) exceeds the savings ($200), so hosted wins. Break-even under these assumptions is 40 jobs/month ($400 / $10 per job); for higher-volume training programs (50+ jobs/month), self-hosting starts to win on TCO.
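
The crossover can be recomputed for your own numbers with a few lines. The function names are illustrative, and the inputs in the example are the ones used in the paragraph above.

```python
def breakeven_jobs_per_month(savings_per_job: float,
                             engineer_hours_per_month: float,
                             engineer_rate_per_hour: float) -> float:
    """Jobs per month at which self-hosting savings equal engineering cost."""
    return engineer_hours_per_month * engineer_rate_per_hour / savings_per_job


def monthly_net_savings(jobs_per_month: int, savings_per_job: float,
                        engineer_hours_per_month: float,
                        engineer_rate_per_hour: float) -> float:
    """Positive when self-hosting wins, negative when hosted wins."""
    engineering = engineer_hours_per_month * engineer_rate_per_hour
    return jobs_per_month * savings_per_job - engineering


print(breakeven_jobs_per_month(10, 4, 100))   # 40.0
print(monthly_net_savings(20, 10, 4, 100))    # -200
print(monthly_net_savings(50, 10, 4, 100))    # 100
```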


Practical setup checklist for cheap H100 LoRA

If you decide to self-host, here is the practical setup that produces the cheapest reliable LoRA training runs in 2026.

**Cloud**: Lambda Labs spot or RunPod community spot for one-off jobs; CoreWeave or Lambda Labs reserved for continuous workloads. Avoid hyperscalers (AWS p5, GCP A3) unless you specifically need enterprise networking — they are 2-3x more expensive than specialized GPU clouds.

**Framework**: axolotl (https://github.com/OpenAccess-AI-Collective/axolotl) is the most popular and well-supported framework in 2026 for LoRA on Llama 4 and similar models. unsloth is faster (~30% throughput uplift on single GPU) but supports fewer models. LLaMA-Factory is a good alternative with broad model support and a config-driven workflow.

**Storage**: cloud-native storage (S3, GCS, R2) for training data and checkpoints — cheap and fast enough for typical LoRA workflows. Avoid local-disk-only setups where job failure loses checkpoints.

**Monitoring**: Weights & Biases (https://wandb.ai/) free tier covers most personal training; paid tiers for team workflows. wandb tracks loss curves, eval metrics, and GPU utilization — useful for debugging slow training jobs and validating that your throughput numbers match expectations.

**Checkpointing**: save adapter + optimizer state every 30 minutes for spot-resilient training. Keep at least the last 3 checkpoints for rollback. axolotl handles this with a few config flags.
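
Checkpoint intervals in trainer configs are usually expressed in steps rather than minutes, so a 30-minute target has to be converted using your measured time per step. A minimal sketch of that conversion (the helper name and the 18-second step time are hypothetical; measure your own):

```python
def save_steps_for_interval(seconds_per_step: float,
                            interval_minutes: float = 30.0) -> int:
    """Number of training steps that fit in one checkpoint interval.

    Rounds down so checkpoints are never further apart than the target,
    and never returns less than 1.
    """
    steps = int(interval_minutes * 60 // seconds_per_step)
    return max(1, steps)


# Hypothetical measurement: 18 seconds per optimizer step.
print(save_steps_for_interval(18.0))  # 100
```

The resulting value is what you would put in a step-based setting such as `save_steps`; re-measure after changing batch size or sequence length, since both change the step time.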

Running a cheap LoRA fine-tune on rented H100s

  1. Pick the right cloud for your job duration

    For one-off jobs that complete in under 8 hours, Lambda Labs or RunPod spot is the cheapest. For multi-day continuous workloads, CoreWeave reserved or Lambda Labs month-long reservations win on rate. Avoid AWS p5, GCP A3, and Azure NDv5 unless you specifically need enterprise networking — they are 2-3x more expensive than specialized GPU clouds.

  2. Pick the right GPU configuration for your model size

    Llama 4 8B: 1x H100 80GB. Llama 4 70B: 2x H100 80GB for LoRA, 1x H100 80GB for QLoRA. Llama 4 405B: 8x H100 80GB cluster. Mixtral 8x22B and DeepSeek-V3: 2x H100 80GB. Going to more GPUs than needed reduces wall-clock time but increases total GPU-hours because of scaling inefficiency.

  3. Set up the training framework

    axolotl is the standard for LoRA on most open-weight models in 2026. Clone the repo, install dependencies, and write a YAML config with your base model, dataset path, LoRA rank (16 is the standard default), target modules (Q/V/K/O for attention; add MLP layers for higher quality at marginal extra cost), and learning rate (~1e-4 for LoRA). See fine-tune Llama 4 with axolotl for a complete config.

  4. Configure aggressive checkpointing for spot resilience

    Save adapter weights and optimizer state every 30 minutes. axolotl exposes `save_steps` and `save_total_limit` config options for this; `save_steps` counts training steps, so convert the 30-minute target into a step count from your measured time per step. Use cloud-native storage (S3, GCS, R2) for checkpoint targets so spot interruption recovery is fast. The cost of frequent checkpointing is small (a few seconds per save); the cost of losing 4 hours of training to spot interruption without a checkpoint is significant.

  5. Validate throughput on the first 100 steps

    After kicking off training, check the throughput (tokens/sec on the wandb dashboard or stdout) in the first 100 steps. For Llama 4 70B LoRA on 2x H100, expect 6,000-10,000 tokens/sec total. If you are getting 2,000-3,000, something is misconfigured (wrong precision, wrong batch size, gradient accumulation set wrong) and the training will cost 3-4x more than necessary. Stop and debug before continuing.
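
The throughput check in the last step can be done by hand from the step counter and a stopwatch if no dashboard is available. The sketch below is an illustration with hypothetical helper names; it assumes every sequence is filled to the full sequence length (for example with sample packing), so without packing the result is an upper bound on real tokens/sec.

```python
def tokens_per_step(micro_batch_size: int, grad_accum_steps: int,
                    sequence_len: int, num_gpus: int) -> int:
    """Tokens consumed per optimizer step, assuming full-length sequences."""
    return micro_batch_size * grad_accum_steps * sequence_len * num_gpus


def measured_throughput(steps: int, elapsed_seconds: float,
                        tokens_each_step: int) -> float:
    """Total tokens/sec across all GPUs over the measured window."""
    return steps * tokens_each_step / elapsed_seconds


def check_throughput(tokens_per_sec: float, expected_low: float,
                     expected_high: float) -> str:
    if tokens_per_sec < expected_low:
        return "below range: stop and debug"
    if tokens_per_sec > expected_high:
        return "above range: re-check the token count"
    return "within range"


# Hypothetical run: batch 4, no accumulation, 8K context, 2 GPUs, 100 steps in 800 s.
per_step = tokens_per_step(4, 1, 8192, 2)
rate = measured_throughput(100, 800.0, per_step)
print(rate, check_throughput(rate, 6_000, 10_000))
```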

Use the data programmatically

Every page on this site is also exposed as a free, CORS-open JSON endpoint. No auth, no rate limit (fair-use, please cache). License is CC-BY-4.0 — link back to attribution.canonicalUrl in the response.

Endpoint: https://aipromptshub.co/api/calc/lora-training-cost-on-h100
curl

```shell
curl -s 'https://aipromptshub.co/api/calc/lora-training-cost-on-h100' | jq .
```

Python

```python
import requests

r = requests.get("https://aipromptshub.co/api/calc/lora-training-cost-on-h100", timeout=10)
r.raise_for_status()
data = r.json()
print(data["title"])
for source in data.get("sources", []):
    print("source:", source)
```

JavaScript / Node

```javascript
// Node 20+ / modern browser
const res = await fetch("https://aipromptshub.co/api/calc/lora-training-cost-on-h100");
if (!res.ok) throw new Error("HTTP " + res.status);
const lora_training_cost_on_h100 = await res.json();
console.log(lora_training_cost_on_h100.title);
for (const source of lora_training_cost_on_h100.sources ?? []) {
  console.log("source:", source);
}
```

Spec: /api/openapi.yaml · Docs: /api/docs

Frequently Asked Questions

What's the cheapest H100 cloud in 2026?

Vast.ai marketplace can hit $1.80-2.40/hour spot but with variable host quality. For reliable production-grade H100s, Lambda Labs (~$2.49 spot) and RunPod community (~$2.39 spot) are the cheapest tier. CoreWeave runs slightly higher ($2.79 spot) but has better networking for multi-GPU clusters. Hyperscalers (AWS, GCP, Azure) are 2-3x more expensive but include enterprise features.

How many H100-hours does a typical Llama 4 70B LoRA fine-tune take?

6-8 H100-hours for a 22.5M-token job (5,000 examples × 1,500 tokens × 3 epochs) using axolotl + DeepSpeed Zero-2 on 2x H100 80GB. QLoRA on a single H100 80GB takes 3-4 H100-hours for the same job. Smaller jobs (1,000 examples) finish in 1-2 H100-hours; larger jobs (50,000 examples) take 60-80 H100-hours.

Should I use spot or on-demand H100s for LoRA training?

Spot for LoRA jobs that finish in under 8 hours. Spot interruption probability is typically under 5% per hour on Lambda Labs or CoreWeave for H100 80GBs, and aggressive checkpointing (every 30 minutes) makes interruption recoverable with minor cost overhead. For long jobs (10+ hours) or production training pipelines where wall-clock matters, use on-demand.

Can I run LoRA on A100s instead of H100s to save money?

Yes, with a caveat. A100 80GB hourly rates are roughly half of H100 ($1.20-1.80 spot vs $2.50-3.50). Throughput on LoRA is roughly 50-70% of H100 (A100 has less peak FLOPS, but the LoRA workload is often memory-bound rather than compute-bound). Dividing the price ratio (about 0.5) by the throughput ratio (0.5-0.7) puts the net cost per LoRA job on A100 at roughly 70-100% of the H100 cost. For non-urgent training, A100s save money only when your workload lands near the top of that throughput range.

What's the cheapest way to fine-tune Llama 4 70B?

QLoRA on a single H100 80GB at spot pricing — approximately $8-15 for a typical 22.5M-token job. The quality trade-off is 1-2 percentage points below full-precision LoRA on standard benchmarks. If quality matters more than dollar cost, LoRA on 2x H100 at ~$15-20 is the next tier. Hosted platforms (Together AI at ~$27) are slightly more expensive but eliminate engineering setup time.

How do I checkpoint a LoRA training run for spot resilience?

Configure axolotl with `save_steps: 100` (or whatever step count matches your target interval) and `save_total_limit: 3` to keep only the most recent 3 checkpoints. Point `output_dir` at storage that survives instance termination, such as a mounted bucket or a directory you sync to S3, GCS, or R2. Restart the job after spot interruption with `--resume_from_checkpoint` pointing to the latest saved checkpoint. The optimizer state must also be saved (default in axolotl) to resume training cleanly.

Does the framework matter much? axolotl vs unsloth vs LLaMA-Factory?

Yes, somewhat. axolotl has the broadest model support and the most active community. unsloth is approximately 30% faster on single-GPU training for supported models but has a narrower model catalog. LLaMA-Factory is a strong third option with broad model support and clean YAML configs. For Llama 4 70B specifically, axolotl is the safest default. For Llama 4 8B on a single GPU where you want maximum throughput, unsloth wins.

How does self-hosted LoRA cost compare to OpenAI fine-tuning?

Dramatically cheaper for open-weight models. OpenAI GPT-5 mini SFT is ~$562 for a 22.5M-token job; self-hosted Llama 4 70B LoRA on rented H100s is ~$15-20. The trade-off is that GPT-5 mini and Llama 4 70B are different models with different capability profiles, so the comparison is not apples-to-apples. For workloads where Llama 4 70B's quality is sufficient, self-hosted LoRA is roughly 28-37x cheaper than equivalent OpenAI fine-tuning ($562 divided by $20 and by $15).
