Blog
Honest, research-backed writing on prompt engineering, LLM workflows, and agent design — built around what actually ships in production.
The 7 Canonical LLM Agent Design Patterns and When Each One Wins
Most LLM agent failures are pattern mismatches: the workload needs ReAct but the team built single-shot, or vice versa. Here are the 7 canonical patterns (ReAct, Plan-Execute, Reflection, Multi-Agent, Routing, Tool Use, Augmented Generation) and when each one wins.
Agent Memory Architectures 2026: Short-Term, Long-Term, Semantic, Episodic — Which One When
Stateless LLM calls don't remember anything. Real agents need memory architectures: short-term (within-session), long-term (across-session), semantic (facts), episodic (events). The 2026 patterns, when each wins, and a tool comparison (Mem0, Letta/MemGPT, Zep, LangMem).
Chain-of-Thought Variants in 2026: Zero-Shot, Few-Shot, Self-Consistency, Tree of Thoughts — and When Each Wins
Chain-of-thought prompting comes in at least 5 variants with substantially different cost-quality profiles. Zero-shot CoT vs. few-shot CoT vs. self-consistency vs. Tree of Thoughts vs. step-back prompting — when each one wins.
Context Window Economics in 2026: When Long Context Pays Off (and When It's Wasted Tokens)
Frontier models offer 200K to 1M token context windows. Most production teams pay for tokens they don't need, or hit recall degradation past the sweet spot. Here's when long context actually helps, the recall-vs-position math, and the cost-quality tradeoff at each tier.
Embeddings ROI 2026: When Vector Search Actually Pays Back vs. Keyword Search
Vector search via embeddings is the default 2026 retrieval choice. But it's not always the right one: keyword/BM25 search beats embeddings for many workloads. The honest ROI math, hybrid retrieval patterns, and a tool comparison (Pinecone, Weaviate, pgvector, Qdrant, Elasticsearch).
LLM Eval Set Construction 2026: Building the Quality Baseline That Catches Regression Before Users Do
An eval set is the prerequisite for canary deploys, quality monitoring, and prompt versioning: 50-500 representative examples with expected output shapes. Here's how to build one from scratch, what to include, and the tools (Braintrust, LangSmith, Promptfoo, OpenAI Evals).
LLM Evals and Grading: Building Production-Grade Evaluation Infrastructure
Most teams ship LLM workloads with no systematic evaluation — vibes-based testing, regressions discovered in production. Real eval infrastructure (rubrics, LLM-as-judge, golden datasets, A/B testing) is the difference between shipping and guessing. Here's the canonical stack.
Function Calling vs. Structured Output: Which One for Production LLM Apps?
Function calling and structured output solve overlapping problems with different mechanics. Function calling = model decides whether to call. Structured output = model must conform to schema. Here's when each wins, the cost-quality differences, and the patterns that combine them.
LLM Caching Strategies 2026: Prompt Cache vs. KV Cache vs. Semantic Cache
Three different layers of caching for production LLM systems: prompt caching (provider-side), KV cache (inference-engine-side), and semantic cache (application-side). Each cuts cost and latency differently. Here's when each wins.
LLM Cost Engineering 2026: Token Economics + The 7 Levers That Cut Production Spend 60-90%
LLM bills hit $50K/month at scale before teams notice. The 7 cost levers: model right-sizing, prompt caching, structured output, retrieval-not-context, batching, semantic cache, model cascade. Math for each, plus a tool comparison.
Multi-Agent Orchestration 2026: When to Use Agents vs. Workflows (AutoGen, CrewAI, LangGraph, Swarm, Anthropic)
Agents and workflows look similar but have different failure modes. Agents = LLM picks next step. Workflows = code picks next step. Here's the decision framework and the production patterns across AutoGen, CrewAI, LangGraph, OpenAI Swarm, and Anthropic's agent guide.
Prompt Compression in 2026: LLMLingua, RECOMP, AutoCompressors, Selective Context Compared
Long prompts cost 20-50x more per query than necessary. Prompt compression techniques (LLMLingua, RECOMP, AutoCompressors, Selective Context) cut input tokens 60-80% while preserving most output quality. Here's the comparison and when each wins.
Prompt Injection Defense: 5 Strategies That Actually Work in Production
Prompt injection is the LLM equivalent of SQL injection — and most production systems still don't defend against it properly. Here are the 5 canonical defenses (input sanitization, instruction hierarchies, output validation, sandboxing, OWASP LLM Top 10 alignment) with their real effectiveness.
Prompt Versioning + Canary Deploys 2026: The Production-Grade Prompt Release Workflow
Treating prompts as code: versioning, canary rollouts, A/B comparison, rollback. The 2026 patterns for shipping prompt changes to production without quality regression, and the tools (LangSmith, PromptLayer, Helicone, Braintrust) that support it.
Streaming LLM UX 2026: Token-by-Token Patterns, SSE, WebSockets, and the AI SDK Stack
Non-streaming LLM UX waits 5-30 seconds for a complete response. Streaming UX returns the first token in 200-800ms. The 2026 patterns: Server-Sent Events, WebSockets, Vercel AI SDK streamUI, and the production decisions for each.
Structured Output Schema Design 2026: Production Patterns + The 5 Schemas That Break LLMs
Getting an LLM to reliably emit valid JSON requires more than 'respond in JSON'. The 2026 patterns: response_format, JSON Schema, Zod/Pydantic validation, and schema design that minimizes hallucination. Plus the 5 anti-patterns.
System Prompts for RAG vs. No-RAG: Divergent Patterns for the Two Dominant Workflows
RAG and no-RAG workflows have substantially different system prompt requirements. RAG needs grounding, citation, and abstention discipline; no-RAG needs reasoning scaffolds and verification prompts. The 2026 patterns, plus the 3 anti-patterns.
Tool Use, Function Calling, MCP: The Production LLM Integration Stack (2026)
Tool use is how LLMs touch your databases, APIs, and filesystem. Function calling, the Model Context Protocol (MCP), and provider-specific tool patterns — when each wins, the failure modes, and how to architect production systems that don't blow up.
The Hallucination Risk Score: a 6-Factor Metric That Predicts Which Prompts Will Hallucinate
Hallucination isn't random — it's predictable from the prompt's structure. The 6-factor Hallucination Risk Score (specificity gap, citation invitation, recency, niche depth, claim type, output length) predicts hallucination rates before you ship. Here's the formula.
Multi-Shot vs. Zero-Shot Prompting: When Examples Actually Help (and When They're Wasted Tokens)
Few-shot examples lift output quality dramatically on some tasks and waste tokens on others. The pattern is predictable from task type. Here's when 2–5 examples are mandatory, when they're a tax, and the curve of diminishing returns.
The 7-Point Prompt Grading Rubric: Turn Prompt Iteration From Vibe to Measurable Comparison
Most prompt iteration happens by gut feel. The 7-point grading rubric — specificity, constraints, audience definition, format, role clarity, examples, success criteria — turns subjective review into a measurable comparison. Here's the rubric and how to apply it.
RAG vs. Fine-Tuning: When Each One Actually Wins (the Decision Matrix Engineers Need)
RAG and fine-tuning solve different problems — but engineering teams treat them as alternatives. RAG wins for fresh-data needs; fine-tuning wins for behavior-shaping. Here's the decision matrix with cost math and 7 worked scenarios.
System Prompts vs. User Prompts: When Each One Actually Moves the Needle
System prompts and user prompts both shape LLM output, but they steer different aspects of behavior. Here's the honest breakdown — what each one actually controls, when system prompts are overrated, and the 6 patterns that ship in production.
Token Cost by Model in 2026: the 30x Pricing Variance Most Engineering Teams Don't Calibrate Against
Frontier model pricing varies 30x across providers and tiers. A workload that costs $850/month on one model can cost $28/month on another with comparable quality. Here's the real per-million-token math, with quality-adjusted cost-per-task for 6 production use cases.