The honest cost stack of self-hosting
The naive self-host pitch goes: "H200 rents for $3.50/hour at Lambda, that's ~$2,520/month, the model is free, we save tens of thousands vs the API bill." Every variable in that sentence is wrong or incomplete.
**GPU rental is a small slice of the loaded bill once you scale to production.** Yes, a single H200 at Lambda Labs runs $3.50-3.99/hour on reserved 1-year terms; CoreWeave is $4.10-4.50/hour; RunPod's secure-cloud H200 sits at $3.99/hour. On-demand without commitment runs 30-50% higher. Spot is theoretically available, but production inference cannot tolerate preemption: model loading takes 60-180 seconds, and a spot termination during a request blows the user-facing latency budget. Most production self-host operations end up on reserved 1-year terms, which come to ~$2,800-3,200/month per H200 all-in.
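The hourly-to-monthly conversion behind these figures is simple enough to sanity-check in a few lines. This is a minimal sketch, assuming a 720-hour month (30 days × 24 hours) and applying the 30-50% on-demand markup to the reserved rate; the function names are illustrative.

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, the convention behind "$3.50/hour is ~$2,520/month"


def monthly_cost(hourly_rate: float, hours: int = HOURS_PER_MONTH) -> float:
    """Convert an hourly GPU rate into a monthly figure for an always-on instance."""
    return round(hourly_rate * hours, 2)


def on_demand_range(reserved_rate: float,
                    markup_low: float = 0.30,
                    markup_high: float = 0.50) -> tuple[float, float]:
    """Monthly on-demand cost, assuming it runs 30-50% above the reserved rate."""
    return (
        monthly_cost(reserved_rate * (1 + markup_low)),
        monthly_cost(reserved_rate * (1 + markup_high)),
    )


if __name__ == "__main__":
    for rate in (3.50, 3.99, 4.10, 4.50):
        print(f"${rate:.2f}/hour reserved -> ${monthly_cost(rate):,.0f}/month")
    low, high = on_demand_range(3.50)
    print(f"$3.50/hour on-demand equivalent -> ${low:,.0f}-${high:,.0f}/month")
```

At $3.50/hour this reproduces the $2,520/month from the naive pitch; the same instance on-demand comes out at roughly $3,300-3,800/month.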
**Inference server operations** is a real engineering surface. vLLM, SGLang, TGI, and LMDeploy each have their own config tax — quantization formats, KV-cache sizing, continuous-batching parameters, tensor-parallel layouts. A senior engineer spends 2-4 weeks getting the first deployment production-ready and another 1-2 days per model upgrade. That's not in your hourly GPU bill.
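KV-cache sizing is a good example of that config tax, because it decides how many tokens can be in flight at once. The sketch below is back-of-envelope arithmetic only. The architecture numbers (80 layers, 8 KV heads, head dimension 128) are an assumption borrowed from a typical dense 70B model, as are the FP8 weights, the 141 GB of H200 memory, and the 0.90 memory fraction. It ignores activation memory and fragmentation, so real capacity will be lower.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_bytes: float, mem_fraction: float,
                      weight_bytes: float, per_token_bytes: int) -> int:
    """Tokens that fit in the memory left over after the weights are loaded."""
    budget = gpu_bytes * mem_fraction - weight_bytes
    if budget <= 0:
        return 0
    return int(budget // per_token_bytes)


# Assumed values, not taken from any specific model card.
PER_TOKEN = kv_bytes_per_token(num_layers=80, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)  # FP16 cache
TOKENS = max_cached_tokens(
    gpu_bytes=141e9,      # H200 memory
    mem_fraction=0.90,    # share of GPU memory handed to the server
    weight_bytes=70e9,    # about 1 byte per parameter at FP8 for 70B params
    per_token_bytes=PER_TOKEN,
)

if __name__ == "__main__":
    print(f"KV cache per token: {PER_TOKEN / 1024:.0f} KiB")
    print(f"Approximate token capacity: {TOKENS:,}")
```

Under these assumptions the cache costs 320 KiB per token and the card holds on the order of 170k cached tokens, which has to be split across every concurrent request. Changing the quantization format or the memory fraction moves that number substantially, which is why each of these knobs needs tuning per deployment.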
**Load balancer, autoscaler, monitoring stack.** You need a request router (NGINX, Envoy, or an ingress controller), Prometheus + Grafana for inference metrics (TTFT, throughput, KV cache occupancy, GPU memory), and structured logs into something queryable (Loki, Datadog, ClickHouse). That's another $500-2,000/month in tooling fees, plus the engineer-time to wire it up.
**The DevOps headcount line.** Production self-hosted inference is not a side project. Realistically you allocate **1.0-1.5 FTE** of platform/ML-infra engineering to keep one or two models healthy in production: on-call coverage, model upgrades, security patching, eval pipeline maintenance, cost monitoring. At fully loaded US compensation of $200-250k for senior platform engineers, that's $200-375k/year amortized across whatever workload you're running.
**Eval and observability infrastructure.** You need a regression eval suite to know when a model upgrade silently breaks production behavior. Building that suite costs $20-50k in engineering time; running it on every deployment costs another $5-10k/month in synthetic-traffic API calls (typically routed through a held-out closed-source model as the judge). This is the second-biggest hidden cost after headcount.
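To make "regression eval suite" concrete, here is a minimal sketch of a deployment gate. It assumes each eval case already has a judge score between 0 and 1 for the current production model and for the candidate. The `EvalResult` type, the thresholds, and the function names are hypothetical; a real suite would also need the judge calls, sampling, and storage around it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    case_id: str
    score: float  # judge score in [0, 1]; higher is better


def find_regressions(baseline: list[EvalResult],
                     candidate: list[EvalResult],
                     max_drop: float = 0.05) -> list[str]:
    """Case ids whose score fell by more than max_drop, or that the candidate skipped."""
    cand_scores = {r.case_id: r.score for r in candidate}
    regressed = []
    for result in baseline:
        cand_score = cand_scores.get(result.case_id)
        if cand_score is None or result.score - cand_score > max_drop:
            regressed.append(result.case_id)
    return sorted(regressed)


def passes_gate(baseline: list[EvalResult],
                candidate: list[EvalResult],
                max_drop: float = 0.05,
                max_regressed_fraction: float = 0.02) -> tuple[bool, list[str]]:
    """Allow the deploy only if few enough cases regressed."""
    regressed = find_regressions(baseline, candidate, max_drop)
    allowed = len(regressed) <= max_regressed_fraction * len(baseline)
    return allowed, regressed


if __name__ == "__main__":
    prod = [EvalResult("refund-policy", 0.90), EvalResult("sql-gen", 0.80)]
    new = [EvalResult("refund-policy", 0.70), EvalResult("sql-gen", 0.78)]
    ok, failed = passes_gate(prod, new)
    print("deploy allowed:", ok, "| regressed cases:", failed)
```

The per-case comparison matters more than the average: an upgrade can raise the mean score while silently breaking a handful of production-critical prompts.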
Monthly self-hosting cost breakdown — Llama 4 Maverick 70B on 1×H200
| Component | Monthly $ |
|---|---|
| 1× H200 (1-year reserved, Lambda Labs) | $2,520-2,900 |
| Egress + inter-AZ networking | $200-500 |
| Object storage (model artifacts, KV cache snapshots) | $100-250 |
| Monitoring stack (Prometheus + Grafana + log retention) | $400-900 |
| Eval pipeline + synthetic-traffic judge calls | $3,000-8,000 |
| DevOps FTE allocation (1.0-1.5 FTE × $200k loaded ÷ 12) | $16,500-25,000 |
| Contingency (utilization waste, cold-start drag, incidents) | $2,000-5,000 |
| **Total loaded monthly cost** | **$24,700-42,500** |
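
The total row can be re-derived from the line items, which is worth doing whenever you swap in your own numbers. A minimal sketch, with the (low, high) pairs copied from the table and the dictionary keys chosen for illustration:

```python
# (low, high) monthly USD for each row of the breakdown table
LINE_ITEMS = {
    "gpu_h200_reserved": (2_520, 2_900),
    "egress_networking": (200, 500),
    "object_storage": (100, 250),
    "monitoring_stack": (400, 900),
    "eval_pipeline": (3_000, 8_000),
    "devops_fte": (16_500, 25_000),
    "contingency": (2_000, 5_000),
}


def total_range(items: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low ends and the high ends separately."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


def share_of_total(items: dict[str, tuple[int, int]], key: str) -> tuple[float, float]:
    """Fraction of the total taken by one line item, at the low and high ends."""
    low, high = total_range(items)
    return items[key][0] / low, items[key][1] / high


if __name__ == "__main__":
    low, high = total_range(LINE_ITEMS)
    print(f"Total: ${low:,}-${high:,}")
    for key in ("gpu_h200_reserved", "devops_fte"):
        lo_share, hi_share = share_of_total(LINE_ITEMS, key)
        print(f"{key}: {lo_share:.0%} (low end), {hi_share:.0%} (high end)")
```

The sums come to $24,720 and $42,550, matching the rounded total. With these inputs the GPU is roughly 7-10% of the bill and headcount roughly 59-67%.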
DevOps headcount dominates the loaded cost. If you can amortize the same 1.0-1.5 FTE across 3-5 hosted models simultaneously (typical for a mature ML-infra team), the per-model loaded cost drops to roughly $9-15k/month — which is where the breakeven math starts to actually work for sub-billion-token workloads. Solo-engineer or part-time self-hosting setups almost never beat the API alternative once incident-response time is honestly accounted for.
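
The amortization can be sketched the same way. Which rows are actually shared across models is an assumption here: this version treats headcount, the monitoring stack, and the eval pipeline as shared, and GPU, networking, storage, and contingency as per-model. Your split may differ, and the result is sensitive to it.

```python
# (low, high) monthly USD, copied from the breakdown table.
# Assumption: these rows are paid once per hosted model.
PER_MODEL = {
    "gpu_h200_reserved": (2_520, 2_900),
    "egress_networking": (200, 500),
    "object_storage": (100, 250),
    "contingency": (2_000, 5_000),
}
# Assumption: these rows are paid once and split across all hosted models.
SHARED = {
    "monitoring_stack": (400, 900),
    "eval_pipeline": (3_000, 8_000),
    "devops_fte": (16_500, 25_000),
}


def per_model_cost(n_models: int) -> tuple[int, int]:
    """Loaded monthly cost per model when shared rows are split n_models ways."""
    if n_models < 1:
        raise ValueError("n_models must be at least 1")
    low = (sum(lo for lo, _ in PER_MODEL.values())
           + sum(lo for lo, _ in SHARED.values()) / n_models)
    high = (sum(hi for _, hi in PER_MODEL.values())
            + sum(hi for _, hi in SHARED.values()) / n_models)
    return round(low), round(high)


if __name__ == "__main__":
    for n in (1, 3, 5):
        low, high = per_model_cost(n)
        print(f"{n} model(s): ${low:,}-${high:,} per model per month")
```

Under this split, five models come out at about $8.8-15.4k per model, in line with the $9-15k figure, while three models land nearer $11.5-20k. If only the headcount row is shared, the per-model cost stays noticeably higher.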