Technique 1 — LLMLingua (token-level perplexity compression)
**Mechanic:** A small language model (typically 350M-1.3B parameters) scores each token in the prompt by perplexity, that is, by how surprising the token is given the text before it. Low-perplexity tokens (predictable, redundant) are dropped; high-perplexity tokens (information-carrying) are preserved. The compressed prompt is then sent to the target frontier model.
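The scoring-and-dropping step can be illustrated with a toy sketch. This is not the LLMLingua implementation: as a loudly labeled assumption, a smoothed bigram counter stands in for the small language model, tokens are whitespace-split words, and the names `train_bigram`, `surprisal`, and `compress` are hypothetical helpers invented for this example. It only shows the core idea of ranking tokens by per-token surprisal and keeping the most surprising ones in their original order.

```python
import math
from collections import Counter

def train_bigram(corpus_tokens):
    """Count unigrams and bigrams; a toy stand-in for the small scoring LM."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return unigrams, bigrams

def surprisal(prev, tok, unigrams, bigrams, vocab_size):
    """Per-token surprisal -log2 P(tok | prev), with add-one smoothing."""
    p = (bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab_size)
    return -math.log2(p)

def compress(tokens, unigrams, bigrams, keep_ratio):
    """Keep the highest-surprisal tokens, preserving their original order."""
    vocab_size = len(unigrams) + 1  # +1 slot for unseen tokens
    scores = []
    prev = "<s>"  # assumed start-of-sequence marker
    for i, tok in enumerate(tokens):
        scores.append((surprisal(prev, tok, unigrams, bigrams, vocab_size), i))
        prev = tok
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank by surprisal (descending), then restore document order.
    keep = sorted(i for _, i in sorted(scores, reverse=True)[:n_keep])
    return [tokens[i] for i in keep]

# Toy reference text that the stand-in "model" has seen.
corpus = "the cat sat on the mat the dog sat on the rug".split()
unigrams, bigrams = train_bigram(corpus)

prompt = "the cat sat on the quantum mat".split()
compressed = compress(prompt, unigrams, bigrams, keep_ratio=0.5)
print(" ".join(compressed))
```

In this toy run the predictable continuation ("sat on the") scores lowest and is dropped, while the unseen word "quantum" scores highest and is kept. A real deployment would replace the bigram counter with an actual small language model's token log-probabilities.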
**Compression ratio:** 2-20× depending on prompt content. Verbose retrieval contexts compress most (10-20×); already-tight prompts compress less (2-5×).
**Quality retention:** Per the LLMLingua paper (Jiang et al., 2023), 90-99% of original task performance is retained at 4× compression on most benchmarks. Degradation begins past 10× compression on tasks that require fine-grained detail from the source documents.
**Production cost:** Small-model inference (~$0.05-0.20 per 1M tokens compressed) plus the frontier-model cost of the compressed prompt. Net savings: typically 60-85% of the original prompt input cost. Reference implementation: Microsoft Research LLMLingua at microsoft.com/en-us/research and the LLMLingua GitHub repository at github.com/microsoft/LLMLingua.
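The arithmetic behind the net-savings figure can be sketched as follows. The function name `net_input_savings` is hypothetical, and the $3.00 per 1M input tokens frontier price is an assumption chosen only for illustration; the $0.10 compressor price sits inside the range quoted above.

```python
def net_input_savings(prompt_tokens, compression_ratio,
                      frontier_price_per_m, compressor_price_per_m):
    """Fraction of the original prompt input cost saved after compression."""
    millions = prompt_tokens / 1_000_000
    original_cost = millions * frontier_price_per_m
    # The small model reads the full, uncompressed prompt.
    compressor_cost = millions * compressor_price_per_m
    # The frontier model reads only the compressed prompt.
    compressed_cost = (millions / compression_ratio) * frontier_price_per_m
    return 1 - (compressor_cost + compressed_cost) / original_cost

# Assumed prices: $3.00/1M tokens (frontier, illustrative), $0.10/1M (compressor).
savings = net_input_savings(
    prompt_tokens=1_000_000,
    compression_ratio=4,
    frontier_price_per_m=3.00,
    compressor_price_per_m=0.10,
)
print(f"{savings:.1%}")  # 71.7%
```

Under these assumed prices, 4× compression saves about 72% of input cost: the frontier model is billed for a quarter of the tokens, and the compressor adds roughly 3% of the original bill. The cheaper the frontier model is relative to the compressor, the smaller the net saving.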
**Best for:** RAG workflows with long retrieved contexts, multi-document synthesis, and any workflow whose input prompt has grown organically over time and probably contains redundancy.