Where the alignment tax actually shows up on your invoice
**Component 1: refusal output tokens.** When a safety-tuned model refuses, modern frontier models produce a structured refusal: brief acknowledgment + reason + alternative phrasing. Typical refusal length: 100-300 output tokens. If 5-15% of your traffic triggers borderline refusals (varies widely; consumer apps higher, B2B SaaS lower), refusal tokens add 1-3% to your output bill.
**Component 2: disclaimer + safety wrap.** Even on helpful responses, modern instruction-tuned models often add 30-80 tokens of context-setting and safety wrap around the substantive answer. On long-form helpful responses (1000+ output tokens), this is ~3% overhead. On short structured outputs (<200 tokens), can be ~10-20% overhead. The fix is system-prompt design that explicitly asks for unwrapped output for structured cases.
**Component 3: system-prompt boilerplate.** Safety-aware system prompts often include policy text, instruction-hierarchy directives, refusal-style preferences, and brand voice. 200-500 input tokens of boilerplate. At Sonnet 4.6 input rate of $3/1M tokens, 500 boilerplate tokens × 1M calls = $1.50. **With prompt caching enabled (Anthropic's `cache_control` blocks, OpenAI's `prompt_cache_key`, Gemini's implicit caching), the cached prefix bills at 90% discount or 0%** — making the system-prompt component effectively free at high volume. Discipline matters: any prompt with stable safety boilerplate should be cache-anchored.
**Component 4: separate classifier API calls.** Many teams run input-side classifier checks before passing to the main model — Rebuff for prompt injection, Lakera Guard, NVIDIA NeMo Guardrails, or a cheap LLM classifier (Haiku 4.5, GPT-5-mini). Adds one cheap API call per prompt. At Haiku 4.5 rate of $1/$5 per 1M, a 100-input-token classifier check costs ~$0.0001 per call = $100 per 1M prompts. Order of magnitude cheaper than the primary call.
**Total picture.** For a typical B2B SaaS workload at $1000/M-prompts on Sonnet 4.6, total alignment tax is $50-$150 (uncached) or $20-$80 (with proper caching). Material but not enormous. For consumer chat workloads with higher refusal rates, the tax can reach $200/M-prompts on a $1000 base.