What each method actually does — the mechanical difference
These three methods are commonly grouped as "PEFT and full fine-tuning", but their mechanical differences matter for both cost and quality.
**Full fine-tuning** updates every parameter via backpropagation. For a 70B model, that means 70B parameters × 2 bytes (16 bits) = 140 GB just for the weights, plus another 280-420 GB for the Adam optimizer state (2-3x the weights), plus gradients and activations. Even with mixed-precision training (bf16) and FSDP sharding the weights, gradients, and optimizer state across GPUs, a 70B full fine-tune requires at least 8x H100 80GB and often 16x. The result is a complete updated model that can replace the base model in serving: there is no adapter merging step, just a normal forward pass at base-model latency.
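The arithmetic above can be reproduced with a few lines of Python. This is a rough sketch, not a profiler result: it assumes bf16 weights and gradients, treats the optimizer state as a simple multiple of the weight memory (the 2-3x figure from the text), and ignores activations entirely. The function name `full_finetune_memory_gb` is made up for this illustration.

```python
# Back-of-the-envelope memory estimate for full fine-tuning with Adam.
# Assumptions: bf16 weights and gradients (2 bytes each), optimizer state
# expressed as a multiple of the weight memory, activations not counted.

GB = 1e9  # decimal gigabytes, matching the figures in the text


def full_finetune_memory_gb(n_params, bytes_per_weight=2, optimizer_multiple=2.0):
    """Return (weights, gradients, optimizer_state) in GB."""
    weights = n_params * bytes_per_weight / GB
    gradients = weights  # one gradient per weight, same dtype
    optimizer_state = optimizer_multiple * weights
    return weights, gradients, optimizer_state


for multiple in (2.0, 3.0):
    w, g, o = full_finetune_memory_gb(70e9, optimizer_multiple=multiple)
    print(
        f"optimizer {multiple:.0f}x: weights={w:.0f} GB, grads={g:.0f} GB, "
        f"optimizer={o:.0f} GB, total={w + g + o:.0f} GB"
    )
```

Under these assumptions the total is 560 GB at the 2x end and 700 GB at the 3x end, before any activations. Eight 80 GB GPUs provide 640 GB in aggregate, which is why 8 GPUs is a floor and 16 is common.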
**LoRA** freezes the entire base model and adds a small low-rank update to specific layers (typically the attention Q/V projections, sometimes K/O and the MLP). The adapter matrices form a rank-r decomposition: instead of learning a full d × d weight update, you learn a d × r matrix and an r × d matrix whose product approximates it. For Llama 3 70B with rank-16 LoRA on attention layers, the adapter is approximately 30-80 MB of new parameters versus 140 GB for the base. Training cost drops dramatically, mostly in memory, because: (1) frozen parameters need no weight gradients; (2) optimizer state exists only for the small adapter; (3) activations of frozen layers can be checkpointed more aggressively.
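To make the adapter size concrete, here is a small parameter count for rank-16 LoRA on the Q and V projections. The shapes are assumptions chosen to resemble a 70B Llama-style model with grouped-query attention (hidden size 8192, 80 layers, a 1024-wide V projection); check them against the actual model config before relying on the numbers. `lora_params` is a helper invented for this sketch.

```python
# Count LoRA adapter parameters for Q and V projections.
# All shapes below are assumptions for a 70B Llama-style model.

def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


HIDDEN = 8192   # model width (assumed)
KV_DIM = 1024   # V projection output width under grouped-query attention (assumed)
LAYERS = 80     # transformer blocks (assumed)
RANK = 16

# Adapter parameters per layer: one pair on Q, one pair on V.
per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = per_layer * LAYERS
adapter_mb = total * 2 / 1e6  # bf16 storage, 2 bytes per parameter

# Parameters a full (non-low-rank) update of the same two projections would need.
full_update = (HIDDEN * HIDDEN + HIDDEN * KV_DIM) * LAYERS

print(f"adapter parameters: {total:,}")
print(f"adapter size in bf16: {adapter_mb:.1f} MB")
print(f"full update of the same layers: {full_update:,} parameters")
```

With these shapes the adapter comes to roughly 32.8M parameters, or about 66 MB in bf16, which sits inside the 30-80 MB range quoted above. Adding K and O projections or the MLP to the targets would push the number higher.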
**QLoRA** combines LoRA with 4-bit base-model quantization (NF4, the 4-bit NormalFloat type introduced with QLoRA and implemented in bitsandbytes). The base weights are loaded in 4-bit and frozen. They are never updated during training and are dequantized on the fly for each matrix multiplication, so the precision loss affects only the computation that passes through them. The LoRA adapter on top is trained at fp16/bf16 precision, so the trained parameters themselves are not quantized. The net effect: a 70B model that previously needed 140 GB just for fp16 weights now needs about 35 GB for 4-bit weights, small enough that a single H100 80GB can hold the base model plus adapter, optimizer state, and activations for a typical fine-tune run.
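The "about 35 GB" figure is 70B parameters at 4 bits each. Block-wise quantization also stores scaling constants, which adds a little on top. The sketch below estimates that overhead, assuming a block size of 64 with one fp32 constant per block, and optionally double quantization (8-bit constants with a second-level fp32 constant per 256 blocks), as described in the QLoRA paper. It also assumes every parameter is quantized, which overstates the savings slightly because some tensors are typically kept in higher precision. The function names are hypothetical.

```python
# Estimate storage for a 4-bit quantized base model, including the
# per-block quantization constants. Block sizes are assumptions taken
# from the QLoRA paper's description, not read from any library.

def nf4_bits_per_param(block_size=64, double_quant=True, dq_block_size=256):
    """Effective bits per parameter: 4 data bits plus constant overhead."""
    bits = 4.0
    if double_quant:
        # 8-bit constant per block, plus an fp32 constant per group of blocks.
        bits += 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        # One fp32 constant per block.
        bits += 32 / block_size
    return bits


def base_weights_gb(n_params, bits_per_param):
    """Weight storage in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9


n = 70e9
print(f"bf16:                {base_weights_gb(n, 16):.1f} GB")
print(f"4-bit, plain consts: {base_weights_gb(n, nf4_bits_per_param(double_quant=False)):.1f} GB")
print(f"4-bit, double quant: {base_weights_gb(n, nf4_bits_per_param()):.1f} GB")
```

Under these assumptions the quantized base lands in the mid-to-high 30s of GB, leaving roughly half of an 80 GB card for the adapter, its optimizer state, and activations.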
**The key insight**: LoRA reduces the number of trainable parameters. 4-bit quantization reduces the memory footprint of the frozen base model, so larger models become trainable on fewer GPUs. The two optimizations are orthogonal, and QLoRA combines both.
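One way to see the orthogonality is that, in the Hugging Face stack, the two choices live in separate config objects. The fragment below is a configuration sketch only: it assumes `torch`, `transformers`, `peft`, and `bitsandbytes` are installed, and the hyperparameter values and the `q_proj`/`v_proj` module names are illustrative assumptions that depend on the model architecture.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Knob 1: how the frozen base model is stored (memory footprint).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # request NF4 explicitly
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used after dequantization
    bnb_4bit_use_double_quant=True,
)

# Knob 2: which parameters are trainable (low-rank adapters).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                          # illustrative value
    target_modules=["q_proj", "v_proj"],    # names vary by architecture
    lora_dropout=0.05,                      # illustrative value
    task_type="CAUSAL_LM",
)
```

Passing `quant_config` as `quantization_config` to `AutoModelForCausalLM.from_pretrained` and then wrapping the model with `peft.get_peft_model(model, lora_config)` gives a QLoRA-style setup. Omitting the quantization config and keeping the LoRA config gives plain LoRA on a 16-bit base.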