Model card · Verified against Meta + provider docs · 2026-06-20

Llama 4: Full Spec Sheet (June 2026)

By The DDH Team at Digital Dashboard Hub·Updated June 20, 2026

Stop writing AI prompts from scratch.

Tell us your business + your task + your model. We write the prompt — perfectly tuned for ChatGPT, Claude, Grok, Gemini, Midjourney, or any model. Plus 500+ pre-built prompts in your library.

Llama 4 is Meta's fourth-generation open-weight model family, released April 2025. It is the first Llama generation to be natively multimodal (text + image input) and the first to use a mixture-of-experts (MoE) architecture, where each token is routed to a small subset of experts rather than activating the full parameter count. Three variants ship under the Llama 4 umbrella: Scout (smallest, 10M-token context), Maverick (mid-tier, 1M-token context), and Behemoth (largest, in training as of June 2026).

Headline architecture: Scout has 17B active parameters across 16 experts (109B total), 10M-token context. Maverick has 17B active parameters across 128 experts (400B total), 1M-token context. Behemoth has ~288B active parameters across 16 experts (~2T total) and is positioned as the teacher model for the Llama 4 distillation pipeline; it is not yet released for direct use as of June 2026.

Llama 4 is open-weight under the Llama 4 Community License. You can download the weights from Hugging Face or llama.com and run them yourself (vLLM, llama.cpp, TensorRT-LLM, Together's runtime), or call them via hosted APIs. Hosted pricing varies dramatically by provider: Together AI runs Scout at ~$0.18/M input + $0.59/M output and Maverick at ~$0.27/M input + $0.85/M output; Groq, Fireworks, and Cerebras offer different rate structures, some with higher tokens/sec at the cost of higher per-token pricing.

Below: full spec table per variant, hosted-provider pricing snapshot, when Llama 4 is the right call vs frontier closed models, the minimal API request, and 8 FAQs. Sibling pages: Llama 4 cost calculator · Gemini 2.5 Flash spec sheet · GPT-5 mini spec sheet. Write a Llama-tuned prompt free with our ChatGPT prompt generator.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card. →

Llama 4 family — spec snapshot (June 2026)

Feature	Scout	Maverick	Behemoth
Provider	Meta	Meta	Meta
Model ID (Hugging Face)	meta-llama/Llama-4-Scout-17B-16E-Instruct	meta-llama/Llama-4-Maverick-17B-128E-Instruct	(in training, June 2026)
Released	April 2025	April 2025	Not yet released
Architecture	MoE 17B active / 16 experts / 109B total	MoE 17B active / 128 experts / 400B total	MoE ~288B active / 16 experts / ~2T total
Context window	10,000,000 tokens	1,000,000 tokens	TBD
Max output tokens	~16K (provider-dependent)	~16K (provider-dependent)	TBD
Modalities (input)	Text, image	Text, image	Text, image
Modalities (output)	Text	Text	Text
Open weight (Llama 4 Community License)
Hosted: Together AI input/output (per 1M)	$0.18 / $0.59	$0.27 / $0.85	—
Hosted: Groq input/output (per 1M)	~$0.11 / $0.34	~$0.20 / $0.60	—
Function calling (provider-supported)
Structured outputs (provider-supported)
Knowledge cutoff	August 2024	August 2024	TBD

Sources verified 2026-06-20: Meta Llama 4 announcement (https://ai.meta.com/blog/llama-4-multimodal-intelligence/), Hugging Face model cards (https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct, https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct), Together AI pricing (https://www.together.ai/pricing), Groq pricing (https://groq.com/pricing). Hosted-provider prices change frequently and vary by deployment (serverless vs dedicated, region, batch). Always check the live provider page before budgeting. Function calling and structured outputs are provider-feature-dependent — supported on Together AI and Fireworks, partially supported elsewhere.

What Llama 4 actually is (and why MoE matters)

Llama 4 is the first Llama generation to use a mixture-of-experts (MoE) architecture. In a dense model (Llama 3, GPT-4, Claude 3), every token passes through every parameter — a 70B model uses 70B parameters for every token. In an MoE model, each token is routed through a small subset of 'experts'; Llama 4 Scout has 17B active parameters per token despite 109B total parameters across 16 experts.

The benefit: training and inference compute scale with active parameters, not total parameters. Scout's inference cost is closer to a dense 17B model than to a dense 109B model. The trade-off: memory footprint at inference time still requires loading all 109B parameters into GPU memory (or paging from system memory at a latency cost), which constrains self-hosting to higher-VRAM setups.

Scout and Maverick share the same 17B active parameter count but differ in expert count and routing. Scout's 16 experts route more concentrated; Maverick's 128 experts route more diverse, with each expert specializing more narrowly. Maverick generally outperforms Scout on complex tasks at slightly higher inference cost.

Behemoth is a different beast — ~288B active parameters across 16 experts, positioned as the teacher model for the Llama 4 distillation pipeline rather than for direct production use. As of June 2026, Behemoth remains in training; Meta has not announced a release timeline.

Pricing math: hosted-provider economics on Llama 4

Llama 4 is open-weight, so pricing varies by where you run it. Hosted serverless APIs from Together AI, Groq, Fireworks, Cerebras, and others charge per-token rates that vary widely by provider, model variant, and deployment mode.

Together AI snapshot (2026-06-20): Scout serverless at $0.18/M input + $0.59/M output. Maverick serverless at $0.27/M input + $0.85/M output. A representative 1,000-in / 500-out call on Scout: `0.001 × $0.18 + 0.0005 × $0.59 = $0.00018 + $0.000295 = $0.000475`. About 0.05¢ per call — among the cheapest frontier-quality calls on any provider.

Groq snapshot (2026-06-20): Scout at ~$0.11/M input + $0.34/M output, with token-per-second output rates often 5-10× higher than other providers (Groq's LPU inference hardware is the differentiator). For latency-sensitive workloads, Groq's higher per-token price often nets cheaper end-to-end at scale because of throughput.

Self-hosting math: a single H100 80GB can serve Scout at modest QPS; an 8×H100 node serves Maverick comfortably. AWS on-demand cost for an 8×H100 p5.48xlarge is ~$98/hour, so per-token break-even vs Together hosting is at sustained ~50-100 calls/sec depending on context length. Most teams under 1M calls/day are economically better served by hosted APIs; above that, self-hosting starts to pencil. Worked $ across providers: Llama 4 cost calculator.

Scout's 10M-token context: the open-weight differentiator

Llama 4 Scout's 10,000,000-token context window is the longest in any open-weight model and the longest in production from any provider. Gemini 2.5 Pro is 1M (closed-weight). Claude Sonnet 4.6 has 1M beta (closed-weight). GPT-5 caps at 400K. Scout fits an entire mid-sized codebase, multiple full-length books, or a year of meeting transcripts in a single call.

Recall across 10M tokens is meaningfully weaker than Gemini 2.5 Pro's recall across 1M — needle-in-haystack benchmarks show degradation past ~2M tokens on Scout. For the strict 'must answer from anywhere in the input' use case, Gemini 2.5 Pro at 1M is the more reliable pick.

Where Scout's 10M genuinely matters: workloads where you'd rather pass the whole document and let the model find the relevant section than build a RAG pipeline. Scout's price ($0.18/M input on Together) makes 10M-context calls economically tractable — a 5M-input call costs $0.90 vs ~$15 on Gemini 2.5 Pro's >200K tier.

Native multimodal: text + image, output is text

Llama 4 is the first Llama generation with native vision built into the base architecture (Llama 3.2 added vision as a separate adapter; Llama 4 has it as a first-class input modality). Pass images alongside text in a multi-content message; the model reasons across both.

Image token accounting depends on the provider implementation but generally follows the standard pattern of ~256-1024 tokens per image at typical resolutions. Together AI and Fireworks document the exact tokenizer behavior on their model pages.

Output is text only. Llama 4 does not natively generate images, audio, or video. For multimodal generation, pair Llama 4 (vision input + text reasoning) with a dedicated generation model (Stable Diffusion / Flux for images, ElevenLabs / Cartesia for audio).

Function calling and structured outputs (provider-dependent)

Llama 4 supports tool use natively — the base model is trained on tool-use traces and exposes a `<|tool_calls_section_begin|>` token sequence to indicate a tool call. The user-facing API surface depends on the hosted provider's implementation.

Together AI exposes OpenAI-compatible function calling (`tools` parameter with JSON Schema) for both Scout and Maverick. Fireworks AI does the same. Groq supports tool use on selected models with somewhat different parameter shape — check provider docs.

Structured outputs (JSON Schema-guaranteed output) are similarly provider-dependent. Together AI and Fireworks support `response_format: {type: 'json_object'}` and `{type: 'json_schema', json_schema: {...}}`. For maximum portability across providers, define a tool whose input schema matches your desired output schema and force-call it — that pattern works across every provider that supports tool use.

When to pick Llama 4 vs frontier closed models

**Pick Llama 4 (hosted)** when: cost is the dominant constraint and you need frontier-tier quality, you need 10M context (Scout — no closed-weight model has this), you need portable compatibility (the same prompt should work across multiple providers without rewrite), or you want the option to self-host later as scale grows.

**Pick Llama 4 (self-hosted)** when: data residency / sovereignty / on-premise requirements rule out hosted APIs, you have GPU capacity to amortize, or you've measured >$50K/month hosted bills and the break-even math works.

**Pick frontier closed models (GPT-5, Claude Opus, Gemini 2.5 Pro)** over Llama 4 when: you need top-tier reasoning quality on hard tasks (Llama 4 is competitive on benchmarks but typically a half-generation behind the frontier closed models on the hardest tasks), you need the closed-model tooling ecosystems (Responses API, Anthropic's prompt caching mechanics, Gemini's built-in tools), or your team prefers a single-vendor support relationship over multi-provider portability.

Verified sources and how to re-check the numbers

Every number on this page was verified against Meta's announcement, Hugging Face model cards, and hosted-provider pricing pages on 2026-06-20. Sources: ai.meta.com/blog/llama-4-multimodal-intelligence for the family overview; Hugging Face model cards for context, modalities, and license; together.ai/pricing for Together AI hosting; groq.com/pricing for Groq.

Hosted-provider pricing moves frequently and varies by deployment (serverless vs dedicated, region, reserved capacity). The snapshot in this guide is a point-in-time reference; re-verify the live provider page before committing budget.

Methodology: when a number could not be cross-confirmed against an official source on the verification date, it was omitted from this card rather than guessed.

Call Llama 4 on a hosted provider in 5 steps

1
Pick a hosted provider
Together AI (broadest model menu, OpenAI-compatible API), Groq (highest tokens/sec, smaller menu), Fireworks (good middle-tier balance), Cerebras (highest throughput on selected models). For most production use cases, start on Together AI for portability.
2
Get an API key
Together: api.together.xyz → Settings → API Keys. Groq: console.groq.com → API Keys. Each provider's billing is independent — there is no Meta-wide billing relationship for hosted Llama.
3
Use the OpenAI SDK (or provider SDK)
Together is OpenAI-compatible: `client = OpenAI(base_url='https://api.together.xyz/v1', api_key=os.environ['TOGETHER_API_KEY'])`. Groq is OpenAI-compatible: `base_url='https://api.groq.com/openai/v1'`. Existing OpenAI code works with a base-URL change.
4
Send a minimal Scout call
Python on Together: `r = client.chat.completions.create(model='meta-llama/Llama-4-Scout-17B-16E-Instruct', messages=[{'role': 'user', 'content': 'Hello'}]); print(r.choices[0].message.content)`. Replace model ID with Maverick for the larger variant.
→ Open the ChatGPT prompt generator
5
Add structured outputs (where supported)
On Together AI: pass `response_format={'type': 'json_schema', 'json_schema': {...}}`. For maximum portability: define a tool, force-call it. Tool-use-as-structured-output works across every Llama 4 hosting provider.

Digital Dashboard Hub

The prompt patterns above work 10x better when they live in a library you actually own — tunable to your niche, exportable to GPT-5, Claude, Gemini, Perplexity, Midjourney, Llama. Stop pasting across 6 tools.

Try DDH's AI Prompt Builder — free 14 days, no card. →

Related calculators

OpenAI Pricing Calculator →GPT-5.5, 5.4, mini, nano — full per-call cost in one input.Claude Pricing Calculator →Opus 4.8, Sonnet 4.6, Haiku 4.5, Fable 5 — input + output combined.Context Window Comparison →Max input length and price per 1M for every current model.

Related prompt tools

Prompt generator (Llama-tuned)→Gemini 2.5 Flash spec sheet→GPT-5 mini spec sheet→Llama 4 cost calculator→Groq vs Cerebras vs Together inference→

Frequently Asked Questions

How much does Llama 4 cost in 2026?

Open-weight, so cost depends on where you run it. Together AI snapshot (2026-06-20): Scout at $0.18/M input + $0.59/M output; Maverick at $0.27/M input + $0.85/M output. Groq at ~$0.11/$0.34 (Scout) and ~$0.20/$0.60 (Maverick). Self-hosting on an 8×H100 node breaks even with hosted APIs around 50-100 calls/sec sustained. Source: together.ai/pricing and groq.com/pricing, verified 2026-06-20.

What is the difference between Llama 4 Scout, Maverick, and Behemoth?

Scout: 17B active / 16 experts / 109B total, 10M context — the smallest, fastest, cheapest. Maverick: 17B active / 128 experts / 400B total, 1M context — better quality on complex tasks. Behemoth: ~288B active / 16 experts / ~2T total — Meta's teacher model for the Llama 4 distillation pipeline, in training as of June 2026, not yet released for direct use.

What is Llama 4 Scout's context window?

10,000,000 tokens — the longest in any open-weight model and the longest in production from any provider. Recall across 10M is weaker than Gemini 2.5 Pro's 1M but the price ($0.18/M input on Together) makes 10M-context calls economically tractable for use cases where RAG isn't worth building.

Is Llama 4 actually multimodal?

Yes — text + image input is native in the base architecture. Pass images alongside text in a multi-content message. Image token accounting follows the standard ~256-1024 tokens per image at typical resolutions; exact behavior depends on the hosted provider's tokenizer. Output is text only — Llama 4 does not natively generate images, audio, or video.

Can I download and run Llama 4 myself?

Yes. Llama 4 is open-weight under the Llama 4 Community License. Download from Hugging Face or llama.com, run with vLLM (recommended for production), llama.cpp (CPU + GPU), or TensorRT-LLM (best NVIDIA performance). Scout fits on a single 80GB GPU; Maverick comfortably fits on an 8×H100 node. Self-hosting break-even vs hosted APIs is around 50-100 sustained calls/sec.

Does Llama 4 support function calling and structured outputs?

Yes, but the API surface depends on the hosted provider. Together AI and Fireworks expose OpenAI-compatible `tools` (function calling) and `response_format: {type: 'json_schema', ...}` (structured outputs) for both Scout and Maverick. Groq supports tool use with slightly different parameter shapes. Self-hosted Llama 4 supports tool use via the base model's native `<|tool_calls_section_begin|>` token sequence — implementation depends on your serving framework.

What is Llama 4's knowledge cutoff?

August 2024 per Meta's model cards. For anything after that, augment with retrieval or a web-search tool call.

What is the Llama 4 Community License?

An open-weight license that allows commercial use with some restrictions: organizations with >700M monthly active users must request a separate commercial license from Meta. For most teams, the license is functionally equivalent to a permissive open-source license. Read the full text on Meta's licensing page before committing to commercial deployment at scale.

Open-weight frontier is here. Write prompts that travel.

Our AI Prompt Generator writes Llama-tuned prompts (system+user clean split, tool-use-portable across Together, Groq, Fireworks, self-host) based on YOUR business + task. 14-day free trial of DDH Pro, no card.

Browse all prompt tools →