Llama 3.1 model variants: what you actually need to run each one
Meta released the Llama 3.1 family in July 2024 in three sizes: 8B, 70B, and 405B parameters. Each targets a different cost-quality tier, and each has hard GPU VRAM requirements that determine which EC2 instances can serve it. The official model card on HuggingFace documents the context window (128k tokens for all three variants) and supported quantization formats.
The Llama 3.1 8B model in full FP16 precision requires approximately 16 GB of VRAM for its weights, which means a single A10G (24 GB on g5.xlarge) can serve it with about 8 GB of headroom for KV cache. In practice most teams run 8B in Q4_K_M quantization (roughly 5 GB), which leaves room for 4–5 concurrent request contexts on a single A10G and costs only $1.006/hr on-demand, making it one of the cheapest production-grade open inference setups on AWS.
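The 16 GB figure and the KV-cache headroom can be checked with back-of-envelope arithmetic. The sketch below is an estimate, not a measurement: it assumes the 8B architecture values from the published model config (32 layers, 8 KV heads, head dimension 128), a 16-bit KV cache, and it ignores activations and runtime overhead. The function names are illustrative only.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed Llama 3.1 8B architecture values (from the published config).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
A10G_VRAM_GB = 24

fp16_weights_gb = weight_gb(8, 16)                               # 16.0
per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)  # 131072 bytes
headroom_gb = A10G_VRAM_GB - fp16_weights_gb                     # 8.0

# Upper bound on cached tokens across all in-flight requests.
max_cached_tokens = int(headroom_gb * 1e9 // per_token)

print(f"FP16 weights: {fp16_weights_gb:.1f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens that fit in headroom: {max_cached_tokens}")
```

Under these assumptions the headroom holds roughly 61,000 cached tokens in total, so a single request cannot use the full 128k context at FP16 on this card.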
The Llama 3.1 70B model requires approximately 140 GB in BF16 for the weights alone, so no single GPU can hold it. You need either 2× A100 80GB (160 GB total, which leaves only about 20 GB for KV cache) to run at 16-bit precision, or 4× A10G (96 GB total), which is only feasible with 4-bit quantization (about 40–43 GB for Q4_K_M). The practical sweet spot is the g5.12xlarge (4× A10G) at $5.672/hr running 70B in Q4_K_M: good quality, usable throughput (~80–120 tokens/second at batch 8), and roughly 5× cheaper than a p4d.24xlarge.
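The hourly price and throughput range quoted above translate directly into a cost per million generated tokens. This is a minimal sketch that assumes the instance is fully utilized for the whole hour, which real traffic rarely achieves; the function name is illustrative.

```python
def usd_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


G5_12XLARGE_USD_PER_HOUR = 5.672  # on-demand price quoted in the text

for tps in (80, 100, 120):
    cost = usd_per_million_tokens(G5_12XLARGE_USD_PER_HOUR, tps)
    print(f"{tps} tok/s -> ${cost:.2f} per million tokens")

# 80 tok/s -> $19.69 per million tokens
# 100 tok/s -> $15.76 per million tokens
# 120 tok/s -> $13.13 per million tokens
```

At lower utilization the effective cost rises proportionally: an instance that is busy half the time costs twice as much per token.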
The Llama 3.1 405B model is in a class of its own. Full BF16 weights take ~810 GB of VRAM, which exceeds any single 8-GPU node built from 80 GB cards: 8× A100 80GB and 8× H100 80GB both total 640 GB. On one node you therefore need a lower-precision format: either Meta's FP8 checkpoint (~405 GB) on a p5.48xlarge (8× H100 SXM, 640 GB) with tensor parallelism, or 4-bit quantization (roughly 200–250 GB) on a p4d.24xlarge (8× A100 40GB, 320 GB). Running BF16 means spanning two nodes. The p5.48xlarge runs $98.32/hr on-demand. At that price, the break-even against GPT-5 Pro API calls is genuinely high: only teams processing millions of tokens per hour reach it.
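The break-even point can be estimated by dividing the instance's hourly cost by the API's per-token price. The API prices in this sketch are hypothetical placeholders, not quoted rates, and the calculation ignores engineering time, idle capacity, and whether the hardware can actually sustain the required throughput.

```python
def break_even_tokens_per_hour(instance_usd_per_hour: float,
                               api_usd_per_million_tokens: float) -> float:
    """Tokens per hour at which self-hosting costs the same as the API."""
    return instance_usd_per_hour / api_usd_per_million_tokens * 1_000_000


P5_48XLARGE_USD_PER_HOUR = 98.32  # on-demand price quoted in the text

# Hypothetical blended API prices in USD per million tokens (assumptions).
HYPOTHETICAL_API_PRICES = (5.0, 10.0, 20.0)

for price in HYPOTHETICAL_API_PRICES:
    tokens = break_even_tokens_per_hour(P5_48XLARGE_USD_PER_HOUR, price)
    print(f"${price:.2f}/M tokens -> break-even at "
          f"{tokens / 1e6:.2f}M tokens/hour ({tokens / 3600:.0f} tok/s)")
```

With a placeholder price of $10 per million tokens, break-even lands near 9.8 million tokens per hour, or about 2,700 tokens per second sustained around the clock.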