What Llama 4 actually is (and why MoE matters)
Llama 4 is the first Llama generation to use a mixture-of-experts (MoE) architecture. In a dense model such as Llama 3, every token passes through every parameter: a 70B model uses all 70B parameters for every token. In an MoE model, a router sends each token through only a small subset of 'experts', so Llama 4 Scout uses 17B active parameters per token despite having 109B total parameters across 16 experts.
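To make the routing idea concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not Meta's implementation: the function names (`route_token`, `moe_layer`), the toy "experts" that merely scale their input, and the choice of `top_k=1` are all assumptions made for illustration, and real MoE layers operate on tensors with learned router weights.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k=1):
    """Return (expert_index, weight) pairs for the top_k experts of one token.

    Weights are renormalized so the selected experts' weights sum to 1.
    """
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, experts, router_scores, top_k=1):
    """Run only the selected experts and return their weighted sum."""
    selected = route_token(router_scores, top_k)
    output = sum(weight * experts[i](token) for i, weight in selected)
    return output, [i for i, _ in selected]


# Hypothetical toy setup: 16 "experts", each of which just scales its input.
NUM_EXPERTS = 16
experts = [lambda x, scale=i + 1: x * scale for i in range(NUM_EXPERTS)]

# Stand-in for learned router logits; a real router computes these per token.
rng = random.Random(0)
router_scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

output, used = moe_layer(2.0, experts, router_scores, top_k=1)
print(f"experts run: {len(used)} of {NUM_EXPERTS}, indices {used}")
```

The point of the sketch is that the other 15 expert functions are never called for this token, which is why per-token compute tracks the active parameter count rather than the total.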
The benefit: training and inference compute per token scale with active parameters, not total parameters, so Scout's inference cost is closer to that of a dense 17B model than a dense 109B model. The trade-off: all 109B parameters must still be loaded into GPU memory at inference time (or paged in from system memory at a latency cost), which limits self-hosting to higher-VRAM setups.
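The compute-versus-memory split can be checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate under stated assumptions: it counts weights only (no KV cache, activations, or framework overhead), uses the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, and the helper names and the list of precisions are illustrative choices rather than anything specified by Meta.

```python
def weight_memory_gb(params_billions, bytes_per_param):
    """Approximate memory for weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


def forward_flops_per_token(active_params_billions):
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter."""
    return 2 * active_params_billions * 1e9


# Parameter counts for Llama 4 Scout, taken from the text above.
TOTAL_B = 109   # every expert must be resident in memory
ACTIVE_B = 17   # parameters actually used for a single token

# Memory depends on TOTAL parameters; precisions here are assumptions.
for label, bytes_per_param in [("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    gb = weight_memory_gb(TOTAL_B, bytes_per_param)
    print(f"{label}: ~{gb:.1f} GB of weights")

# Compute depends on ACTIVE parameters.
moe_flops = forward_flops_per_token(ACTIVE_B)
dense_flops = forward_flops_per_token(TOTAL_B)
print(f"per-token compute vs. a dense 109B model: {moe_flops / dense_flops:.0%}")
```

Under these assumptions, the weights alone come to roughly 218 GB at 16-bit precision, while per-token compute is about 16% of what a dense 109B model would need.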
Scout and Maverick share the same 17B active parameter count but differ in expert count and routing. Scout routes each token among 16 experts, so its routing is more concentrated; Maverick routes among 128 experts (400B total parameters), so its routing is more diverse and each expert specializes more narrowly. Maverick generally outperforms Scout on complex tasks at slightly higher inference cost.
Behemoth is a different beast: roughly 288B active parameters (out of nearly 2T total) across 16 experts, positioned as the teacher model for the Llama 4 distillation pipeline rather than for direct production use. As of June 2026, Behemoth remains in training; Meta has not announced a release timeline.