By The DDH Team · Digital Dashboard Hub

Caching LLM Responses in Redis: A Full Tutorial

A practical, code-first tutorial for caching LLM responses in Redis. Covers exact-match caching, semantic caching with embeddings, cache-key hashing, TTL strategy, and invalidation patterns — with real code in Python and Node, and real 2026 prices for GPT-5, Claude Sonnet 4.6, and Gemini 2.5 Pro.


At current 2026 API prices (GPT-5 at $2.50/1M input tokens, Claude Sonnet 4.6 at $3.00/1M input tokens, Gemini 2.5 Pro at $1.25/1M input tokens), a mid-sized app making 500k LLM calls per month at roughly 1,000 input tokens per call spends $1,250–$1,500 per month on GPT-5 or Claude Sonnet 4.6 input tokens alone, before any optimization. Add in multi-turn context and tool definitions and that bill climbs fast. A Redis caching layer in front of those calls costs roughly $15–$60/month on Redis Cloud or Upstash and can cut your effective LLM spend by 60-90% on workloads with repeated or near-repeated prompts.

This tutorial builds two caching patterns from scratch: (1) an exact-match cache keyed on a SHA-256 hash of the normalized prompt, and (2) a semantic cache that uses embeddings to match prompts that are semantically equivalent but worded differently. We also cover TTL strategy, cache invalidation, and a hybrid routing pattern that combines both. Before you implement any of this, plug your current monthly token volume into our AI Prompt Cost Calculator to get a baseline number — it makes the ROI of each section concrete.

For the broader cost-optimization picture beyond caching, see the AI Cost Optimization Checklist 2026. For prompt-level caching at the provider API layer (a different mechanism), see How to Use Prompt Caching to Cut Costs.


Redis caching patterns vs typical LLM workload fit

| Cache type | Typical hit rate | Best for | Added infra cost |
| --- | --- | --- | --- |
| Exact-match (SHA-256 key) | 20-60% | FAQ bots, templated reports, classification at scale | $0 (pure Redis GET/SET) |
| Semantic cache (embedding similarity) | 40-80% | Conversational assistants, paraphrased queries | $0.02/1M tokens (text-embedding-3-small) |
| Hybrid (exact first, semantic fallback) | 50-85% | General-purpose assistants with mixed traffic | Minimal (embedding only fires on exact miss) |
| Provider prompt caching (Anthropic / OpenAI) | 70-90% on stable prefix | Agent loops, RAG with stable context | $0 (built into provider pricing) |

Hit rate estimates based on production workloads with >100k calls/month. Prices as of June 2026 from openai.com/pricing and anthropic.com/pricing.

Why Cache at the Application Layer at All?

Provider-side prompt caching (from Anthropic and OpenAI) caches the stable prefix of a prompt within a session window — 5-10 minutes for OpenAI, up to 1 hour for Anthropic. That is excellent for agent loops and multi-turn conversations, but it does nothing for the same question asked by two different users in different sessions, or the same report generated on Tuesday and again on Thursday.

Application-layer Redis caching sits upstream of the provider API entirely. If you have already seen this exact query (or a semantically equivalent one) and stored the answer, you never make the API call at all. Your cost for that call is $0 — plus a fraction of a cent for the Redis GET. At 60% cache hit rate on a $1,500/month workload, that is $900/month saved. The Redis instance costs $15. The math is straightforward.

The tradeoff is staleness. Cached LLM responses reflect the model's knowledge at cache-write time. For factual queries about current events, prices, or rapidly changing data, short TTLs or cache invalidation are critical. For stable workloads — FAQ answering, structured document extraction, code review, classification — stale responses are rarely a problem and you can run TTLs of 24-72 hours safely. We cover TTL strategy in section 5.


Prerequisites: Redis Setup and Client Libraries

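You need a Redis instance with Redis Search enabled for the semantic cache section. The semantic cache examples also store entries as JSON documents, so JSON support must be available as well. For local dev, the easiest path is Docker: `docker run -p 6379:6379 redis/redis-stack:latest`. For production, Redis Cloud starts at $0 (30MB free tier) and Upstash has a serverless plan at $0.20/100k commands. Confirm that the plan you pick includes search and JSON support before building the semantic cache on it.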

Python dependencies: `pip install redis openai anthropic numpy`. Node dependencies: `npm install redis openai @anthropic-ai/sdk` (the Node example in this tutorial imports `createClient` from the `redis` package). For the semantic cache, you also need vector search support, which the `redis-py` client exposes via `redis.commands.search`. The examples in this tutorial use Python 3.11+ and Node 20+.

Set your environment variables before running any example: `REDIS_URL`, `OPENAI_API_KEY`, and optionally `ANTHROPIC_API_KEY`. Never hardcode API keys in source files. A `.env` file with python-dotenv or Node's `dotenv` package handles this cleanly in development; use secrets management (AWS Secrets Manager, Doppler, or your platform's native secrets) in production.


Pattern 1: Exact-Match Cache with Cache-Key Hashing

The simplest and most cost-effective pattern: hash the normalized prompt into a deterministic key, check Redis for an existing response, and only call the LLM on a miss. The hash ensures the key is always a fixed length regardless of how long the prompt is, and normalization (lowercasing, stripping extra whitespace, sorting serialized parameters) prevents near-identical prompts from missing the cache.

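Python implementation:

```python
import hashlib
import json
import os

import redis
from openai import OpenAI

r = redis.from_url(os.environ["REDIS_URL"])
client = OpenAI()
TTL_SECONDS = 86400  # 24 hours


def _normalize_text(text: str) -> str:
    # Collapse runs of whitespace and lowercase, for key generation only.
    # Drop the .lower() if your prompts are case-sensitive (e.g. code).
    return " ".join(text.split()).lower()


def normalize_prompt(messages: list[dict], model: str, params: dict | None = None) -> str:
    """Produce a canonical string for cache-key generation."""
    canonical = [
        {**m, "content": _normalize_text(m["content"])}
        if isinstance(m.get("content"), str) else m
        for m in messages
    ]
    # Sampling parameters change the output, so they belong in the key too.
    payload = {"model": model, "messages": canonical, "params": params or {}}
    return json.dumps(payload, sort_keys=True, ensure_ascii=False, default=str)


def cache_key(normalized: str) -> str:
    return "llm:v1:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cached_chat(messages: list[dict], model: str = "gpt-5", **kwargs) -> str:
    key = cache_key(normalize_prompt(messages, model, kwargs))

    # Cache hit
    hit = r.get(key)
    if hit is not None:
        return hit.decode("utf-8")

    # Cache miss: call the API with the original, un-normalized messages
    response = client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )
    content = response.choices[0].message.content or ""

    # Store with TTL
    r.setex(key, TTL_SECONDS, content)
    return content
```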

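Node.js equivalent:

```javascript
import { createHash } from 'crypto';
import { createClient } from 'redis';
import OpenAI from 'openai';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();
const openai = new OpenAI();
const TTL_SECONDS = 86400;

// Recursively sort object keys so equal payloads serialize identically.
// (An array passed as JSON.stringify's second argument is a key allow-list
// applied at every depth, which would strip `role` and `content`.)
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.keys(value).sort().map((k) => [k, canonicalize(value[k])])
    );
  }
  return value;
}

// Collapse whitespace and lowercase string content, for key generation only.
function normalizePrompt(messages, model, options = {}) {
  const canonical = messages.map((m) =>
    typeof m.content === 'string'
      ? { ...m, content: m.content.trim().replace(/\s+/g, ' ').toLowerCase() }
      : m
  );
  return JSON.stringify(canonicalize({ model, messages: canonical, options }));
}

function cacheKey(normalized) {
  const hash = createHash('sha256').update(normalized).digest('hex');
  return `llm:v1:${hash}`;
}

async function cachedChat(messages, model = 'gpt-5', options = {}) {
  const key = cacheKey(normalizePrompt(messages, model, options));

  const hit = await redis.get(key);
  if (hit !== null) return hit;

  const response = await openai.chat.completions.create({ model, messages, ...options });
  const content = response.choices[0].message.content ?? '';

  await redis.setEx(key, TTL_SECONDS, content);
  return content;
}
```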

The `llm:v1:` namespace prefix lets you bust all cached responses at once by flushing keys matching `llm:v1:*` — useful when you change your system prompt or upgrade the model and want all users to get fresh responses. Increment the version (`v2`, `v3`) rather than flushing to allow gradual rollout: old keys expire naturally while new calls write to the new namespace.


Pattern 2: Semantic Cache with Embeddings and Redis Search

Exact-match caching misses queries that are semantically identical but worded differently: "What is the capital of France?" and "Which city is the capital of France?" produce different SHA-256 hashes and both hit the API. A semantic cache stores the embedding of each query alongside the cached response, then uses cosine similarity to find existing answers for new queries that are close enough.
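To make the matching step concrete, here is a minimal pure-Python sketch of cosine similarity. The 3-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions. Note that a Redis index configured with the COSINE metric reports a distance, which is 1 minus this similarity.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example only.
query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]   # same direction, different length
orthogonal = [0.0, 1.0, 0.0]       # unrelated direction

sim_same = cosine_similarity(query, same_direction)
sim_orth = cosine_similarity(query, orthogonal)
distance_same = 1.0 - sim_same     # what a COSINE-metric index would report
```

Vector length does not matter, only direction, which is why two differently worded questions with the same meaning can land close together.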

The cheapest way to generate embeddings is OpenAI's text-embedding-3-small at $0.02/1M tokens — a 50-token query costs $0.000001. Even at 1M queries/month the embedding cost is $1. The text-embedding-3-large model costs $0.13/1M tokens and produces better representations for nuanced queries; for most FAQ-style caching, text-embedding-3-small is sufficient. See our Embedding Cost Calculator 2026 for a full breakdown by volume.

Redis setup for vector search (run once at startup):

```python
import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

DIM = 1536  # text-embedding-3-small output dimension
INDEX_NAME = "semantic_cache"


def create_index(r: redis.Redis):
    try:
        # Raises ResponseError if the index does not exist yet
        r.ft(INDEX_NAME).info()
    except redis.exceptions.ResponseError:
        schema = (
            TextField("$.prompt", as_name="prompt"),
            TextField("$.response", as_name="response"),
            VectorField(
                "$.embedding",
                "HNSW",
                {"TYPE": "FLOAT32", "DIM": DIM, "DISTANCE_METRIC": "COSINE"},
                as_name="embedding",
            ),
        )
        # Index every JSON document whose key starts with "scache:"
        r.ft(INDEX_NAME).create_index(
            schema,
            definition=IndexDefinition(
                prefix=["scache:"], index_type=IndexType.JSON
            ),
        )
```

Semantic cache lookup and write:

```python
import time
import uuid

import numpy as np
import redis
from openai import OpenAI
from redis.commands.search.query import Query

client = OpenAI()
SIMILARITY_THRESHOLD = 0.92  # cosine similarity; tune per workload
TTL_SECONDS = 43200  # 12 hours for semantic cache
# INDEX_NAME is defined in the index-setup snippet.


def embed(text: str) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return resp.data[0].embedding


def semantic_cache_lookup(r: redis.Redis, prompt: str):
    vec = np.array(embed(prompt), dtype=np.float32).tobytes()
    # Nearest-neighbour query: return the single closest stored prompt
    q = (
        Query("(*)=>[KNN 1 @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("prompt", "response", "score")
        .dialect(2)
    )
    results = r.ft(INDEX_NAME).search(q, query_params={"vec": vec})
    if results.total == 0:
        return None, None
    top = results.docs[0]
    similarity = 1 - float(top.score)  # COSINE distance -> similarity
    if similarity >= SIMILARITY_THRESHOLD:
        return top.response, similarity
    return None, None


def semantic_cache_store(r: redis.Redis, prompt: str, response: str):
    embedding = embed(prompt)
    # A random suffix avoids key collisions when two writes land in the same millisecond
    key = f"scache:{uuid.uuid4().hex}"
    r.json().set(key, "$", {
        "prompt": prompt,
        "response": response,
        "embedding": embedding,
        "ts": int(time.time()),
    })
    r.expire(key, TTL_SECONDS)


def semantic_cached_chat(r: redis.Redis, prompt: str, model: str = "gpt-5") -> str:
    cached, similarity = semantic_cache_lookup(r, prompt)
    if cached:
        print(f"Semantic cache hit (similarity={similarity:.3f})")
        return cached

    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(model=model, messages=messages)
    content = response.choices[0].message.content or ""
    semantic_cache_store(r, prompt, content)
    return content
```

The `SIMILARITY_THRESHOLD` of 0.92 is a good starting point. Too low and you return wrong answers; too high and you miss legitimate cache hits. Tune it by logging similarity scores for a few thousand real queries and plotting the distribution — most apps converge on 0.90-0.95. For safety-critical or factual applications, prefer exact-match caching or set the threshold higher (0.97+).
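The tuning step can be sketched as a small offline sweep. This assumes you have logged each lookup's similarity score and labeled, for example by manual review, whether the cached answer would have been correct; the `samples` data and the `sweep_thresholds` helper are hypothetical.

```python
def sweep_thresholds(samples, thresholds):
    """samples: list of (similarity, is_correct) pairs from logged lookups.

    For each threshold, report the share of lookups that would be served
    from cache and the share of those served answers that were wrong.
    """
    report = []
    for t in thresholds:
        served = [ok for sim, ok in samples if sim >= t]
        hit_rate = len(served) / len(samples) if samples else 0.0
        wrong = sum(1 for ok in served if not ok)
        false_positive_rate = wrong / len(served) if served else 0.0
        report.append((t, hit_rate, false_positive_rate))
    return report


# Invented sample log: (similarity, was the cached answer acceptable?)
samples = [
    (0.99, True), (0.95, True), (0.93, True),
    (0.91, False), (0.89, False), (0.80, False),
]
report = sweep_thresholds(samples, [0.90, 0.92, 0.95])
for threshold, hit_rate, fp_rate in report:
    print(f"threshold={threshold:.2f} hit_rate={hit_rate:.2f} false_positives={fp_rate:.2f}")
```

In this toy log, moving from 0.90 to 0.92 removes the one wrong answer at the cost of a lower hit rate, which is the tradeoff to look for in real data.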


TTL Strategy: How Long Should You Cache LLM Responses?

TTL is the most consequential design decision in your caching layer. Too short and you miss most savings; too long and users see stale answers. The right TTL is a function of how quickly the underlying facts change and how serious a stale answer is. Here is a practical tiered framework: (1) Static content — FAQ answers, product descriptions, documentation summaries: TTL 7 days. (2) Semi-stable — code review, classification, summarization of stable documents: TTL 24 hours. (3) Dynamic — anything touching current prices, news, or user-specific context: TTL 15-60 minutes, or skip caching entirely.
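The tiers above can be encoded as a small lookup so that each prompt template declares its tier once. The tier names and the `ttl_for` helper are hypothetical; the dynamic tier uses 15 minutes, the low end of the stated range.

```python
from typing import Optional

# TTLs in seconds, following the three tiers described above.
TTL_BY_TIER = {
    "static": 7 * 24 * 3600,    # FAQ answers, product descriptions: 7 days
    "semi_stable": 24 * 3600,   # code review, classification: 24 hours
    "dynamic": 15 * 60,         # prices, news, user context: 15 minutes
}


def ttl_for(tier: str) -> Optional[int]:
    """Return the TTL for a tier, or None to signal 'do not cache'."""
    return TTL_BY_TIER.get(tier)


faq_ttl = ttl_for("static")
unknown_ttl = ttl_for("realtime")  # not a known tier, so caching is skipped
```

Returning `None` for unknown tiers makes skipping the cache the default for anything that has not been explicitly classified.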

For multi-model deployments, namespace your cache by model. A GPT-5 response and a Claude Sonnet 4.6 response to the same prompt are different artifacts — store them under different key prefixes (`llm:gpt5:v1:`, `llm:claude-sonnet-46:v1:`) and give them independent TTLs. This also means a model upgrade doesn't silently return responses from the old model: bump the version in the key prefix and old responses expire naturally.
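A minimal sketch of model-namespaced keys follows. The slug and version tables are assumptions made for this example; only the resulting prefixes (`llm:gpt5:v1:`, `llm:claude-sonnet-46:v1:`) come from the text above.

```python
import hashlib

# Hypothetical mapping from API model names to key-safe slugs.
MODEL_SLUGS = {
    "gpt-5": "gpt5",
    "claude-sonnet-4-6": "claude-sonnet-46",
}
# Per-model cache version; bump one entry when that model or its prompt changes.
MODEL_VERSIONS = {"gpt5": "v1", "claude-sonnet-46": "v1"}


def model_cache_key(model: str, normalized: str) -> str:
    """Build a key of the form llm:<model-slug>:<version>:<sha256>."""
    slug = MODEL_SLUGS[model]
    version = MODEL_VERSIONS[slug]
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"llm:{slug}:{version}:{digest}"


gpt_key = model_cache_key("gpt-5", "what is your refund policy?")
claude_key = model_cache_key("claude-sonnet-4-6", "what is your refund policy?")
```

Because the version is stored per model, upgrading one model invalidates only that model's entries.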

Redis supports per-key TTLs natively via `SETEX` (Python `r.setex(key, ttl, value)`) or `EXPIRE`. You can also implement a rolling TTL — reset the TTL every time a key is read, so frequently-accessed responses stay warm indefinitely while rarely-accessed ones expire. In Python: after `r.get(key)`, call `r.expire(key, TTL_SECONDS)` immediately. This is effectively a least-recently-used eviction policy at the application layer, complementing Redis's built-in `allkeys-lru` eviction policy for memory management.
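The rolling TTL described above can be sketched as a small helper. The function name is hypothetical, and `r` is assumed to be a redis-py client or any object with matching `get` and `expire` methods.

```python
def get_with_rolling_ttl(r, key: str, ttl_seconds: int):
    """Read a key and, on a hit, push its expiry out by ttl_seconds."""
    value = r.get(key)
    if value is not None:
        # Only refresh keys that exist; a miss must not create anything.
        r.expire(key, ttl_seconds)
    return value
```

The read and the refresh are two separate commands here. On Redis 6.2 or later, the `GETEX` command (`r.getex(key, ex=ttl_seconds)` in redis-py) should do both in one round trip.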

As the Redis eviction documentation describes, set `maxmemory-policy allkeys-lru` in your Redis config so that when memory is full, Redis evicts the least-recently-used keys automatically rather than returning errors. Pair this with a `maxmemory` cap appropriate for your instance size.


Cache Invalidation: Busting Stale Responses

Cache invalidation in an LLM caching layer has two common triggers: (1) you changed your system prompt or model configuration, making old cached responses incorrect relative to the current behavior; and (2) the underlying facts changed (e.g., a product's price changed, a policy was updated) and cached answers are now factually stale.

For trigger (1), use versioned key namespaces. Every time your system prompt, model, or prompt template changes, increment the cache version. Old keys are in `llm:v1:*`; new calls write to `llm:v2:*`. Old keys expire per their TTL without a manual flush, while users immediately get responses that reflect the current system prompt. Python helper:

```python
import hashlib

CACHE_VERSION = "v2"  # bump this when prompt/model changes


def cache_key(normalized: str) -> str:
    return f"llm:{CACHE_VERSION}:" + hashlib.sha256(normalized.encode()).hexdigest()
```

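For trigger (2), fact-based invalidation, you need explicit key invalidation on data change events. If you're caching product Q&A and a price changes in your database, publish an invalidation event that deletes the relevant cached keys. Redis Pub/Sub or a simple webhook handler works for this. Because a bare SHA-256 key carries no trace of the prompt, pattern matching only works if the product ID is written into the key itself, for example `llm:v1:product_<id>:<hash>`:

```python
import hashlib

import redis


def product_cache_key(product_id: str, normalized: str, version: str = "v1") -> str:
    """Key for product-related entries: the ID sits in the key so SCAN can find it."""
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"llm:{version}:product_{product_id}:{digest}"


def invalidate_product_cache(r: redis.Redis, product_id: str):
    """Delete all cached LLM responses stored under this product's key prefix."""
    pattern = f"llm:*:product_{product_id}:*"
    cursor = 0
    while True:
        # SCAN walks the keyspace incrementally; cursor 0 means the walk is done
        cursor, keys = r.scan(cursor, match=pattern, count=100)
        if keys:
            r.delete(*keys)
        if cursor == 0:
            break
```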

Avoid `KEYS *` in production — it blocks the Redis event loop for large datasets. Always use `SCAN` with a cursor as shown above. For large-scale invalidation (thousands of keys), use Redis pipelines to batch the deletes: `pipe = r.pipeline(); [pipe.delete(k) for k in keys]; pipe.execute()`.
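Combining the two pieces of advice above, here is a sketch of a batched delete that scans with a cursor and sends deletes through a pipeline. The helper names and the batch size of 500 are assumptions; `r` is expected to be a redis-py client, whose `scan_iter` wraps the `SCAN` cursor loop.

```python
def _flush_batch(r, keys) -> int:
    """Send one pipeline of DEL commands and return how many keys were removed."""
    pipe = r.pipeline()
    for k in keys:
        pipe.delete(k)
    return sum(pipe.execute())


def delete_by_pattern(r, pattern: str, batch_size: int = 500) -> int:
    """Delete every key matching pattern, one pipeline round trip per batch."""
    deleted = 0
    batch = []
    for key in r.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += _flush_batch(r, batch)
            batch = []
    if batch:  # flush the final partial batch
        deleted += _flush_batch(r, batch)
    return deleted
```

A call such as `delete_by_pattern(r, "llm:v1:*")` would clear one version namespace without blocking the server the way `KEYS` would.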


Hybrid Pattern: Exact First, Semantic Fallback

The highest hit rates come from combining both patterns: check the exact-match cache first (one Redis GET, ~0.1ms), then fall back to semantic search only on a miss. This way you pay for embedding generation only when needed, and exact-match queries — which are typically a large fraction of production traffic on FAQ or templated-query workloads — are served at near-zero cost.

```python
import anthropic

# Reuses normalize_prompt, cache_key, TTL_SECONDS, semantic_cache_lookup
# and semantic_cache_store from the earlier snippets.


def hybrid_cached_chat(
    r: redis.Redis,
    messages: list[dict],
    model: str = "claude-sonnet-4-6",
    **kwargs,
) -> str:
    # Step 1: exact-match check
    norm = normalize_prompt(messages, model)
    key = cache_key(norm)
    hit = r.get(key)
    if hit:
        return hit.decode()

    # Step 2: semantic check (only fires on exact miss)
    prompt_text = messages[-1]["content"] if messages else ""
    semantic_hit, _ = semantic_cache_lookup(r, prompt_text)
    if semantic_hit:
        # Backfill exact-match cache so future identical queries skip embedding
        r.setex(key, TTL_SECONDS, semantic_hit)
        return semantic_hit

    # Step 3: LLM call
    ac = anthropic.Anthropic()
    response = ac.messages.create(
        model=model,
        max_tokens=1024,
        messages=messages,
        **kwargs,
    )
    content = response.content[0].text

    # Store in both caches
    r.setex(key, TTL_SECONDS, content)
    semantic_cache_store(r, prompt_text, content)
    return content
```

This example targets Claude Sonnet 4.6 (anthropic.com lists it at $3.00/1M input, $15.00/1M output as of June 2026). The same pattern works identically for GPT-5 (swap in the OpenAI client) or Gemini 2.5 Pro (swap in the Google Generative AI client). For more on choosing the right model for cost vs. quality, see LLM Caching Strategies: Prompt KV and Semantic.


Cost Modeling: What Redis Caching Actually Saves

Let's run the numbers on a real workload: a customer support bot handling 200k questions per month. Average prompt: 800 input tokens (system prompt + user message), 300 output tokens. Model: GPT-5 at $2.50/1M input + $10.00/1M output. Without caching: 200k × (800 × $2.50 + 300 × $10.00) / 1M = 200k × ($0.002 + $0.003) = **$1,000/month**.

Add Redis caching with a 65% hit rate (realistic for a support bot where many users ask the same 50-100 core questions). Cache hits serve 130k requests at ~$0 LLM cost. Miss traffic: 70k requests × $0.005 = **$350/month**. Redis Cloud at this scale: ~$25/month. Total: **$375/month** vs $1,000 — a **62.5% reduction, saving $625/month or $7,500/year**.
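The same arithmetic as a small function, so you can plug in your own volumes. The function and parameter names are hypothetical; the prices and the 65% hit rate are the ones used in this example.

```python
def monthly_llm_cost(
    calls: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    hit_rate: float = 0.0,
    redis_cost: float = 0.0,
) -> float:
    """Monthly spend in dollars: API cost for cache misses plus the Redis bill."""
    per_call = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    misses = calls * (1.0 - hit_rate)
    return misses * per_call + redis_cost


# Support bot from the text: 200k calls, 800 input and 300 output tokens each.
baseline = monthly_llm_cost(200_000, 800, 300, 2.50, 10.00)
cached = monthly_llm_cost(200_000, 800, 300, 2.50, 10.00, hit_rate=0.65, redis_cost=25.0)
savings = baseline - cached
print(f"baseline=${baseline:,.0f} cached=${cached:,.0f} savings=${savings:,.0f}")
```

This treats every miss as a full-price call and every hit as free, which matches the simplification used in the text.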

Now add the semantic layer. Without semantic caching, questions like "how do I reset my password" and "what's the process to reset my password" both miss the exact-match cache. With semantic caching at a 0.92 threshold, they share a single cached response. This typically pushes hit rate from 65% to 75-80% on conversational workloads, saving another $100-150/month on this example (10-15% of the $1,000 uncached bill). Total embedding cost at $0.02/1M tokens for 200k × 50-token prompts = $0.20/month. Negligible.

For personalized or context-heavy prompts (user history injected into each message), hit rates drop significantly — exact-match to near 0%, semantic to 10-30%. In those cases, focus on provider-side prompt caching for the stable portion of the prompt (system message, static context) and use Redis only for the cacheable subsets. See also: Agent Loop Cost Optimization Guide for patterns that apply when the dynamic context is minimal.


Production Considerations: Monitoring, Metrics, and Pitfalls

Track four metrics from day one: cache hit rate, cache write rate, average Redis latency (target <5ms), and LLM call rate. A drop in hit rate usually means either your prompts got more dynamic (check for user-specific data leaking into messages that should be static) or your TTLs expired a large batch of keys simultaneously — stagger expiry with jitter: `TTL_SECONDS + random.randint(-3600, 3600)`.
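A minimal in-process sketch of the counters and the jittered TTL follows. The `CacheMetrics` class and `jittered_ttl` helper are hypothetical; in production you would more likely export these counts to your metrics system than keep them in memory.

```python
import random
from dataclasses import dataclass


@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0   # each miss is also one LLM call
    writes: int = 0

    def record_lookup(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def jittered_ttl(base_ttl: int, spread: int = 3600) -> int:
    """Base TTL plus or minus up to `spread` seconds, never below 1 second."""
    return max(1, base_ttl + random.randint(-spread, spread))


metrics = CacheMetrics()
for was_hit in (True, True, False, True):
    metrics.record_lookup(was_hit)
```

Spreading expiry over a two-hour window keeps a batch of keys written together from all expiring in the same minute.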

Security: never cache prompts or responses that contain PII, credentials, or user-specific sensitive data. The cache key is a SHA-256 hash, so the prompt itself is not stored in the key, but the response is stored in plaintext in Redis, and the standalone semantic cache shown earlier stores the prompt text alongside it. Use Redis AUTH (`requirepass`) and TLS in transit for all production deployments. For GDPR compliance, implement a per-user cache namespace that you can wipe on deletion request: prefix keys with a hashed user ID and flush that prefix on account deletion.
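The per-user namespace can be sketched as follows. This example uses a keyed hash (HMAC-SHA-256) rather than a bare hash so that user IDs cannot be recovered by hashing guesses; that choice, the `llm:u:` prefix, and the function names are assumptions added for illustration.

```python
import hashlib
import hmac


def user_namespace(user_id: str, secret: bytes) -> str:
    """Stable, non-reversible key prefix for one user's cache entries."""
    digest = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"llm:u:{digest[:32]}"


def user_cache_key(user_id: str, secret: bytes, normalized: str, version: str = "v1") -> str:
    prompt_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{user_namespace(user_id, secret)}:{version}:{prompt_hash}"


def user_wipe_pattern(user_id: str, secret: bytes) -> str:
    """SCAN pattern matching every cached entry for this user."""
    return user_namespace(user_id, secret) + ":*"


secret = b"example-secret-load-from-your-secrets-manager"
key = user_cache_key("user-123", secret, "summarize my last invoice")
pattern = user_wipe_pattern("user-123", secret)
```

On an account deletion request, scanning for `pattern` and deleting the matches removes that user's cached entries without touching anyone else's.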

Memory sizing: at roughly 4 bytes of English text per token, a 300-1,000 token response serializes to about 1.2-4 KB, and a short answer of around 250 tokens comes to about 1 KB. At 1M cached responses averaging 1 KB: 1GB of Redis memory. Redis Cloud's 1GB plan starts at $23/month. At 10M cached responses: 10GB, ~$170/month. The economics still favor caching vs. API costs at essentially any scale. For the semantic cache, each entry additionally stores a 1,536-dimension embedding, about 6KB as FLOAT32 (1,536 × 4 bytes) for text-embedding-3-small. Factor this into your memory budget when using the semantic layer.
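A back-of-the-envelope estimator for the payload sizes above. The function name is hypothetical, and the result is a lower bound because it ignores Redis per-key overhead and the vector index's own memory.

```python
def estimate_cache_bytes(
    entries: int,
    avg_response_bytes: int,
    embedding_dim: int = 0,
    bytes_per_float: int = 4,  # FLOAT32
) -> int:
    """Approximate payload bytes: response text plus optional embedding per entry."""
    per_entry = avg_response_bytes + embedding_dim * bytes_per_float
    return entries * per_entry


exact_only = estimate_cache_bytes(1_000_000, 1_000)
with_embeddings = estimate_cache_bytes(1_000_000, 1_000, embedding_dim=1536)
print(f"exact only: {exact_only / 1e9:.1f} GB, with embeddings: {with_embeddings / 1e9:.1f} GB")
```

With 1 KB responses, the embedding dominates the per-entry size, so the semantic layer is what drives the memory budget.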

One common pitfall: caching streaming responses. If you use SSE streaming for a better user experience, your response arrives token by token and you need to buffer the full response before storing it. Buffer into a list, join, then write to Redis after the stream closes. Do not try to cache mid-stream. Another pitfall: caching tool-use responses in multi-step agent loops. Cache the final consolidated answer, not intermediate tool-call outputs — those are session-specific and should not be shared across users. For more on reducing costs in agent loops specifically, see How to Reduce Token Usage in Prompts.
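The buffer-then-write approach can be sketched as a generator that forwards chunks to the caller while collecting them. The `stream_and_cache` name and the `store` callback are hypothetical; `chunks` stands in for the text deltas yielded by your provider's streaming API.

```python
from typing import Callable, Iterable, Iterator


def stream_and_cache(chunks: Iterable[str], store: Callable[[str], None]) -> Iterator[str]:
    """Yield each chunk to the caller, then store the joined text once the stream ends."""
    buffer = []
    for chunk in chunks:
        buffer.append(chunk)
        yield chunk
    # Reached only when the stream finished normally. If the consumer stops
    # early or the stream raises, nothing is stored.
    store("".join(buffer))


saved = []
received = list(stream_and_cache(["Hel", "lo, ", "world"], saved.append))
```

In real use, `store` would be something like `lambda text: r.setex(key, TTL_SECONDS, text)`, so a partial response from a dropped connection never reaches the cache.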


Semantic Cache with Llama 3.x and Local Embeddings

If you are running a self-hosted Llama 3.x model (Meta Llama 3.1 8B, 70B, or 405B via Ollama, vLLM, or Together AI), the same Redis caching patterns apply — the LLM client is just different. For embeddings, you can use a local embedding model (e.g., `nomic-embed-text` via Ollama at zero per-query cost) rather than paying OpenAI's text-embedding-3-small rate. The embedding dimension varies by model — `nomic-embed-text` is 768-dim vs text-embedding-3-small's 1536-dim, so update `DIM` in your Redis index definition accordingly.

```python
import ollama


def embed_local(text: str) -> list[float]:
    resp = ollama.embed(model="nomic-embed-text", input=text)
    return resp["embeddings"][0]

# Drop-in replacement for the embed() function in the semantic cache above
```

The quality of local embeddings is generally comparable to text-embedding-3-small for English-language FAQ-style matching. For multilingual workloads or highly technical text (legal, medical, code-heavy), text-embedding-3-large ($0.13/1M tokens from openai.com/pricing) or a fine-tuned domain-specific model will outperform local general-purpose embeddings. Measure similarity score distributions on your actual data before committing to an embedding model — the right choice depends heavily on your content.


Putting It Together: Full Redis Cache Class

Here is an `LLMCache` class that combines all patterns (exact-match, semantic, TTL with jitter, versioning, and basic hit/miss counters for metric logging) as a starting point for production use:

```python
import hashlib
import json
import random
import uuid

import numpy as np
import redis
from openai import OpenAI
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


class LLMCache:
    VERSION = "v1"
    DIM = 1536
    INDEX = "llm_semantic_cache"
    DEFAULT_TTL = 86400
    SIMILARITY_THRESHOLD = 0.92

    def __init__(self, redis_url: str, openai_client: OpenAI):
        self.r = redis.from_url(redis_url)
        self.oai = openai_client
        # In-process counters; export these to your metrics system
        self.stats = {"exact_hits": 0, "semantic_hits": 0, "misses": 0}
        self._ensure_index()

    def _ensure_index(self):
        try:
            self.r.ft(self.INDEX).info()
        except Exception:
            schema = (
                TextField("$.response", as_name="response"),
                VectorField(
                    "$.embedding",
                    "HNSW",
                    {"TYPE": "FLOAT32", "DIM": self.DIM, "DISTANCE_METRIC": "COSINE"},
                    as_name="embedding",
                ),
            )
            self.r.ft(self.INDEX).create_index(
                schema,
                definition=IndexDefinition(prefix=["sc:"], index_type=IndexType.JSON),
            )

    def _exact_key(self, messages, model):
        norm = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return f"llm:{self.VERSION}:" + hashlib.sha256(norm.encode()).hexdigest()

    def _embed(self, text: str) -> list[float]:
        return self.oai.embeddings.create(
            model="text-embedding-3-small", input=text
        ).data[0].embedding

    def _ttl_with_jitter(self, base_ttl: int) -> int:
        return base_ttl + random.randint(-base_ttl // 10, base_ttl // 10)

    def get(self, messages: list[dict], model: str) -> str | None:
        # Exact-match check
        key = self._exact_key(messages, model)
        hit = self.r.get(key)
        if hit is not None:
            self.r.expire(key, self._ttl_with_jitter(self.DEFAULT_TTL))  # rolling TTL
            self.stats["exact_hits"] += 1
            return hit.decode()

        # Semantic check
        prompt = messages[-1]["content"] if messages else ""
        vec = np.array(self._embed(prompt), dtype=np.float32).tobytes()
        q = (
            Query("(*)=>[KNN 1 @embedding $vec AS score]")
            .sort_by("score")
            .return_fields("response", "score")
            .dialect(2)
        )
        results = self.r.ft(self.INDEX).search(q, query_params={"vec": vec})
        if results.total:
            top = results.docs[0]
            if (1 - float(top.score)) >= self.SIMILARITY_THRESHOLD:
                # Backfill exact key
                self.r.setex(key, self._ttl_with_jitter(self.DEFAULT_TTL), top.response)
                self.stats["semantic_hits"] += 1
                return top.response

        self.stats["misses"] += 1
        return None

    def set(self, messages: list[dict], model: str, response: str):
        key = self._exact_key(messages, model)
        ttl = self._ttl_with_jitter(self.DEFAULT_TTL)
        self.r.setex(key, ttl, response)

        prompt = messages[-1]["content"] if messages else ""
        embedding = self._embed(prompt)
        # Random suffix avoids collisions between concurrent writers.
        # Note: semantic entries are not scoped by model or version here.
        sc_key = f"sc:{uuid.uuid4().hex}"
        self.r.json().set(sc_key, "$", {"response": response, "embedding": embedding})
        self.r.expire(sc_key, ttl)

    def invalidate_version(self, new_version: str):
        # Only the exact-match namespace changes; semantic entries age out via TTL
        self.VERSION = new_version
```

Usage is simple: instantiate once at application startup, then wrap every LLM call with `cache.get()` / `cache.set()`. The class handles exact-match lookup, semantic fallback, rolling TTL refresh, and backfilling the exact-match cache on semantic hits — all transparently. Wire it into any Python web framework (FastAPI, Flask, Django) by adding it to your app's dependency injection or request context.
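The wrapping step might look like the sketch below. `cached_completion` and the `call_llm` callback are hypothetical names; `cache` is any object exposing the `get(messages, model)` and `set(messages, model, response)` methods described above.

```python
from typing import Callable


def cached_completion(
    cache,
    messages: list,
    model: str,
    call_llm: Callable[[list, str], str],
) -> str:
    """Return a cached response if one exists, otherwise call the LLM and store the result."""
    cached = cache.get(messages, model)
    if cached is not None:
        return cached
    response = call_llm(messages, model)
    cache.set(messages, model, response)
    return response
```

Passing the LLM call in as a function keeps the wrapper independent of any one provider SDK, so the same helper can sit in front of OpenAI, Anthropic, or a local model.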


What Not to Cache: Avoiding the Common Mistakes

Not every LLM call benefits from caching. Avoid caching: (1) responses that depend on current time ("what should I do today?", "what is today's weather?"); (2) responses that depend on user-specific state that changes frequently (account balance, recent orders, live inventory); (3) creative or generative tasks where users expect a fresh response each time (story generation, brainstorming); (4) moderation or safety classification where your policy may have changed since the cached response was written.

A practical filter: add a `cacheable: bool` flag to your prompt templates. Default to `True` for FAQ, classification, summarization, and structured extraction. Default to `False` for anything real-time, user-personalized, or creative. This is a cheap annotation step that prevents the most common category of caching bugs — returning a confidently wrong cached answer because the underlying reality changed.
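One way to carry the flag is on the template object itself. The `PromptTemplate` dataclass and the two example templates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    name: str
    template: str
    cacheable: bool = True  # set False for real-time, personalized, or creative prompts


FAQ = PromptTemplate("faq", "Answer the customer question: {question}")
BRAINSTORM = PromptTemplate(
    "brainstorm", "Suggest fresh ideas for: {topic}", cacheable=False
)


def should_use_cache(template: PromptTemplate) -> bool:
    """Gate every cache read and write on the template's flag."""
    return template.cacheable


prompt = FAQ.template.format(question="How do I reset my password?")
```

Checking `should_use_cache` before both the lookup and the write keeps non-cacheable prompts out of Redis entirely.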

Finally, measure your actual cache hit rate in production before reporting savings. Developers consistently overestimate hit rates in pre-launch projections. Instrument with counters (`cache_hits`, `cache_misses`, `semantic_hits`) from day one. If your hit rate is below 20% after a week of production traffic, your prompts are too dynamic for Redis caching to move the needle — shift focus to provider prompt caching for the stable system-prompt portion and the token reduction techniques that don't depend on repeated queries.


Frequently Asked Questions

Does Redis caching work with streaming LLM responses?

Yes, but you need to buffer the full response before storing it. Accumulate streamed tokens into a string, then call `cache.set()` after the stream closes. Do not write to Redis mid-stream. Return cached responses synchronously (no streaming needed — they are already complete).

What similarity threshold should I use for semantic caching?

Start at 0.92 (cosine similarity). Log the similarity scores for real traffic and plot the distribution after a few thousand requests. If you see high false-positive rates (wrong answers returned), raise to 0.95-0.97. If hit rate is too low, try 0.88-0.90. The right value depends on how variable your query phrasing is.

Can I cache Claude Sonnet 4.6 and GPT-5 responses in the same Redis instance?

Yes — use model-namespaced keys: `llm:claude-sonnet-46:v1:` and `llm:gpt5:v1:`. Never share cache entries between models since their responses differ. Keep separate semantic cache indexes or add model as a metadata field and filter in your vector query.

How much does a Redis instance cost for LLM response caching?

Redis Cloud's free tier gives you 30MB (enough for ~30k short LLM responses). Their paid plans start at $7/month for 100MB. Upstash serverless charges $0.20 per 100k commands with no idle costs. At 1M cached responses averaging 1KB each, you need ~1GB of Redis memory, which costs $23-30/month on either platform. Compare to the LLM API costs you are eliminating.

What embedding model should I use for semantic caching?

text-embedding-3-small (OpenAI, $0.02/1M tokens, 1536-dim) is the best cost/quality balance for most English-language workloads. text-embedding-3-large ($0.13/1M tokens) adds quality for technical or multilingual content. For self-hosted setups, nomic-embed-text via Ollama is free and competitive for English FAQ-style content.

Should I use Redis caching or provider prompt caching?

Both, at different layers. Provider prompt caching (Anthropic, OpenAI) reduces cost on the stable prefix within a session — it is automatic and requires no infrastructure. Redis caching eliminates the API call entirely for repeated queries across sessions and users. They are complementary: use provider caching for your system prompt and tools, and Redis caching for complete query-response pairs that you know will repeat.

How do I handle cache invalidation when my model changes?

Increment the version in your cache key prefix (from v1 to v2). Old keys in the v1 namespace expire per their TTL. New calls write to v2. No manual flush required, no downtime. For immediate invalidation (e.g., a safety-critical prompt change), flush the old namespace: `redis-cli --scan --pattern 'llm:v1:*' | xargs redis-cli del`.

Know your baseline before you optimize.

Paste your monthly token volume into our AI Prompt Cost Calculator to get the exact line-item bill across GPT-5, Claude Sonnet 4.6, Gemini 2.5 Pro, and every major model. Then you will know exactly how much a 65% Redis cache hit rate saves you. [Calculate now →](/blog/ai-prompt-cost-calculator)
