Throughput math — tokens per second on H100
Translating a job's token count into H100-hours, and from there into dollar cost, requires knowing your effective tokens-per-second throughput. That number depends on the model, sequence length, batch size, LoRA rank, and training framework.
**Llama 4 70B LoRA (rank 16, 8K context, batch 4)** on 2x H100 80GB with axolotl + DeepSpeed ZeRO-2 achieves approximately 3,000-5,000 effective tokens/second/GPU, so 6,000-10,000 tokens/second total across the 2 GPUs. A 22.5M-token job completes in approximately 0.6-1 hour wall-clock; multiplied by 2 GPUs, that is roughly 1.25-2.1 H100-hours. Adding 20-30% overhead for warmup, checkpointing, and validation runs brings the total to roughly 1.5-2.7 H100-hours.
**Llama 4 70B QLoRA (rank 16, 4-bit base, 8K context, batch 2)** on 1x H100 80GB achieves approximately 2,000-3,000 effective tokens/second (slower due to 4-bit dequantization overhead in the forward pass, but only one GPU is used). A 22.5M-token job completes in approximately 2-3 hours wall-clock = 2-3 H100-hours. With 20-30% overhead, that is roughly 2.5-4 H100-hours total.
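**Llama 4 8B LoRA (rank 16, 8K context, batch 16)** on 1x H100 80GB achieves approximately 25,000-40,000 effective tokens/second. A 22.5M-token job completes in approximately 10-15 minutes wall-clock = roughly 0.15-0.25 H100-hours. Including overhead, budget 0.5-1 H100-hour: on a job this short, fixed costs such as model loading and checkpointing likely outweigh the 20-30% rule of thumb.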
**Llama 4 405B LoRA (rank 16, 4K context, batch 1)** on an 8x H100 80GB cluster achieves approximately 800-1,200 effective tokens/second/GPU, or 6,400-9,600 tokens/second across the cluster. A 22.5M-token job completes in approximately 0.65-1 hour wall-clock; multiplied by 8 GPUs, that is roughly 5-8 H100-hours. Including 20-30% overhead, roughly 6-10 H100-hours.
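The arithmetic behind these estimates can be sketched in a few lines of Python. This is a minimal illustration, not a benchmark tool: the function name `estimate_h100_hours`, the `Estimate` container, and the default `overhead=0.25` (a midpoint of the 20-30% range) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    wall_clock_hours: float  # elapsed time for the training pass itself
    h100_hours: float        # billable GPU-hours, including overhead


def estimate_h100_hours(
    total_tokens: float,
    tokens_per_sec_per_gpu: float,
    num_gpus: int,
    overhead: float = 0.25,  # assumption: midpoint of the 20-30% range
) -> Estimate:
    """Convert a token budget and per-GPU throughput into H100-hours."""
    if tokens_per_sec_per_gpu <= 0 or num_gpus <= 0:
        raise ValueError("throughput and GPU count must be positive")
    aggregate_tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    wall_clock_hours = total_tokens / aggregate_tokens_per_sec / 3600
    # Each GPU is billed for the full wall-clock duration, so multiply once.
    gpu_hours = wall_clock_hours * num_gpus
    return Estimate(wall_clock_hours, gpu_hours * (1 + overhead))


if __name__ == "__main__":
    # 22.5M tokens on 2 GPUs at the slow and fast ends of a throughput range.
    for rate in (3_000, 5_000):
        est = estimate_h100_hours(22.5e6, rate, num_gpus=2)
        print(f"{rate} tok/s/GPU: {est.wall_clock_hours:.2f} h wall-clock, "
              f"{est.h100_hours:.2f} H100-hours")
```

Note that for a fixed per-GPU throughput, the GPU count cancels out of the H100-hours figure: adding GPUs shortens wall-clock time but does not reduce billable hours. In practice per-GPU throughput usually drops somewhat as GPUs are added, so measure it for your own configuration.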
**Mixtral 8x22B LoRA** behaves similarly to Llama 4 70B in throughput: ~3,500-5,000 tokens/sec on 2x H100 80GB.
**Qwen 2.5 32B LoRA** sits between Llama 4 8B and 70B — ~8,000-15,000 tokens/sec on 1x H100 80GB.
**DeepSeek-V3 LoRA** is roughly 70B-class throughput — ~3,000-5,000 tokens/sec on 2x H100 80GB.