The Complete Guide to Prompt Caching: Cut LLM Costs by 90%
Master prompt caching to cut LLM costs by up to 90% and reduce latency by up to 75%. By the AdaL CLI team.
What You'll Learn
- Why LLMs recompute the same tokens repeatedly (and waste money)
- The mathematical foundation: KV cache and attention mechanism
- How to structure prompts for maximum cache hit rates
- Provider-specific strategies (Anthropic, OpenAI)
- Real-world cost savings and performance benchmarks
The Problem: Paying for the Same Tokens Over and Over
If you're building with LLMs, you've likely noticed a pattern: your application sends similar context repeatedly across requests.
Example: Coding assistant
Request 1:
System prompt: "You are an expert Python developer..." (2000 tokens)
User query: "Write a function to parse JSON" (10 tokens)
Request 2:
System prompt: "You are an expert Python developer..." (same 2000 tokens!)
User query: "Write a function to validate email" (10 tokens)
Without caching, you're paying full price for those 2000 system prompt tokens on every single request.
For an application handling 1000 requests/day, that's 2 million tokens of redundant processing.
With prompt caching:
- First request: Pay full price (cache write)
- Subsequent requests: Pay 10% of price (cache read)
- Result: up to 90% lower cost on cached tokens + up to 75% faster responses
How LLMs Generate Text (Simplified)
The Math Formula
The attention mechanism in transformers follows this mathematical formula:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- $Q = XW_Q$ (Query matrix: embeddings × learned query weights)
- $K = XW_K$ (Key matrix: embeddings × learned key weights)
- $V = XW_V$ (Value matrix: embeddings × learned value weights)
- $X$ is the input embeddings matrix (n tokens × d dimensions)
- $W_Q, W_K, W_V$ are learned weight matrices (d × d dimensions)
- $d_k$ is the dimension of the key vectors (used for scaling)
- $\text{softmax}$ converts scores to probabilities that sum to 1
Step-by-step breakdown:
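1. Compute Q, K, V: $Q = XW_Q$, $K = XW_K$, $V = XW_V$
2. Calculate attention scores: $S = \frac{QK^T}{\sqrt{d_k}}$
3. Mask future tokens (for autoregressive generation): set $S_{ij} = -\infty$ for $j > i$
4. Apply softmax to get attention weights: $A = \text{softmax}(S)$, applied row by row
5. Mix values using attention weights: $\text{Output} = AV$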
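The five steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the random weights, the sizes `n = 4` and `d = 8`, and the function name `causal_attention` are all assumptions made for the example.

```python
import numpy as np

def causal_attention(X, W_Q, W_K, W_V):
    """Single-head causal self-attention over n token embeddings X (n x d)."""
    # Step 1: project the embeddings into queries, keys, and values
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Step 2: scaled dot-product scores, shape (n, n)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Step 3: mask future tokens (column j > row i) with -inf
    n = X.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Step 4: row-wise softmax (subtract the row max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Step 5: mix the value vectors using the attention weights
    return weights @ V, weights

# Illustrative sizes and random weights (assumptions, not real model values)
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = causal_attention(X, W_Q, W_K, W_V)
```

Each row of `weights` sums to 1 and is zero above the diagonal, so the first token can only attend to itself.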
Without Caching
For a 4-token prompt, we calculate:
Token 1: Q[1], K[1], V[1]
Token 2: Q[1], K[1], V[1], Q[2], K[2], V[2] ← Recalculating 1!
Token 3: Q[1], K[1], V[1], Q[2], K[2], V[2], Q[3], K[3], V[3] ← Recalculating 1 and 2!
Token 4: Q[1], K[1], V[1], Q[2], K[2], V[2], Q[3], K[3], V[3], Q[4], K[4], V[4] ← Recalculating 1, 2, 3!
This is O(n²) computation—doubling the tokens quadruples the work.
With KV Caching
We cache K and V from previous tokens and combine them with newly computed values.
Mathematical formula:
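At step $t$, instead of computing everything: $K_{1:t} = X_{1:t} W_K$, $V_{1:t} = X_{1:t} W_V$
We compute only the new token: $Q_t = X_t W_Q$, $K_t = X_t W_K$, $V_t = X_t W_V$
And append to cache: $K_{1:t} = [K_{1:t-1};\ K_t]$, $V_{1:t} = [V_{1:t-1};\ V_t]$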
Why this works: Each $K_i = X_i W_K$ depends only on fixed input embedding $X_i$, so once computed, it never changes → cache it!
Complexity:
- Without caching: $O(t^2 \cdot d^2)$ total for $t$ tokens
- With caching: $O(t \cdot d^2)$ total for $t$ tokens
- Speedup: factor of $t$
What gets cached:
- K (Key): Information about what each token represents
- V (Value): The actual data to use from each token
- NOT cached: Q (Query) - changes each step based on what we're currently generating
Now it's O(n) computation—doubling the tokens only doubles the work.
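To make the caching idea concrete, here is a small NumPy sketch of a single attention layer decoding one token at a time. It assumes one head, random weights, and a hypothetical `KVCache` class invented for this example; the point is only that the cached path gives the same output as recomputing K and V for every token.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_full(X, W_Q, W_K, W_V):
    """No cache: re-project every token, then return the last token's output."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q[-1] @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

class KVCache:
    """Hypothetical helper: stores K and V rows for tokens seen so far."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, x_t, W_Q, W_K, W_V):
        q_t = x_t @ W_Q                           # Q is computed fresh each step
        self.K = np.vstack([self.K, x_t @ W_K])   # append only the new K row
        self.V = np.vstack([self.V, x_t @ W_V])   # append only the new V row
        scores = q_t @ self.K.T / np.sqrt(self.K.shape[-1])
        return softmax(scores) @ self.V

# Illustrative sizes and random weights (assumptions)
rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

cache = KVCache(d)
matches = []
for t in range(len(X)):
    cached_out = cache.step(X[t], W_Q, W_K, W_V)
    full_out = attend_full(X[: t + 1], W_Q, W_K, W_V)
    matches.append(np.allclose(cached_out, full_out))
```

The cached path performs three projections per step, while the uncached path performs three projections per token seen so far, which is where the factor-of-$t$ difference in projection work comes from.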
Storage Requirements and Cost Implications
Memory Footprint
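Per-token KV cache size formula:
$$\text{bytes per token} = 2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{bytes per value}$$
The factor of 2 accounts for storing both K and V.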
Example: 7B model (32 layers, 32 KV heads, 128 head_dim, float16 = 2 bytes):
- Per token: ~0.5 MB
- 1K context: ~512 MB per request
- 100 concurrent requests: ~50 GB just for KV cache
Larger models (e.g., 70B):
- Per token: ~2.5-5 MB
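- 200k tokens: ~500 GB at 2.5 MB per token (200,000 × 2.5 MB) for a full-precision, full-head cache; grouped-query attention, which shares KV heads across query heads, reduces this substantially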
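The 7B figures above can be reproduced with a few lines of arithmetic. This sketch assumes one key vector and one value vector per layer and KV head; the function name `kv_bytes_per_token` is made up for the example.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes needed for a single token."""
    # Factor 2: one K vector and one V vector per layer and KV head
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

MIB = 1024 ** 2
GIB = 1024 ** 3

# 7B example from the text: 32 layers, 32 KV heads, head_dim 128, float16
per_token = kv_bytes_per_token(32, 32, 128)
per_request = per_token * 1024        # 1K-token context
fleet = per_request * 100             # 100 concurrent requests

print(f"per token:    {per_token / MIB:.2f} MiB")
print(f"1K context:   {per_request / MIB:.0f} MiB")
print(f"100 requests: {fleet / GIB:.0f} GiB")
```

Changing `num_kv_heads` is a quick way to see how much grouped-query attention shrinks the footprint for a given model shape.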
Why Caching Costs Providers More
Memory is expensive:
- GPU VRAM costs significantly more than computation time
- Each cached prompt occupies memory for 5-10 minutes (cache TTL)
- Providers must reserve memory even when GPUs are idle
- Scales with concurrent users and context length
Provider pricing reflects this:
Anthropic:
- Uncached tokens: $3.00/M input tokens
- Cache writes: $3.75/M tokens (+25% premium)
- Cache reads: $0.30/M tokens (90% discount)
- Why? Cache writes reserve memory for 5-10 minutes
OpenAI:
- Automatic caching (no control)
- ~50% cache hit rate
- Cost included in standard pricing
- Why? Amortizes storage cost across all users
The tradeoff:
- Computation: Cheap to repeat, providers can schedule efficiently
- Storage: Expensive to hold, providers must reserve capacity upfront
- Caching trades computation cost (one-time) for storage cost (duration-based)
The Performance Win
Example: 200k token prompt
Without caching (first request):
- Process all 200k tokens
- Time to first token: ~8 seconds
- Cost: $0.60 (200k tokens × $3/M)
With caching (subsequent requests):
- Reuse cached K/V for 200k tokens
- Process only new tokens (e.g., 100 tokens)
- Time to first token: ~2 seconds (75% faster!)
- Cost: $0.06 (200k cached tokens × $0.30/M) (90% cheaper!)
Provider Differences
OpenAI:
- Automatic (you don't control it)
- ~50% cache hit rate
- Free (built into pricing)
Anthropic:
- Explicit control via API
- ~100% cache hit rate when you ask for it
- Costs 25% more to cache, but saves 90% on cache hits
- Cache lasts 5-10 minutes
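A quick way to sanity-check the "25% more to write, 90% less to read" trade-off is to compute when explicit caching breaks even. The sketch below uses the per-million-token prices quoted earlier ($3.00 base, $3.75 write, $0.30 read) and assumes every request arrives before the cache expires; the function names are invented for the example.

```python
def prefix_cost(prefix_tokens, n_requests, base=3.00, write=3.75, read=0.30):
    """Return (uncached, cached) dollar cost of the shared prefix over n_requests >= 1."""
    per_million = prefix_tokens / 1_000_000
    uncached = n_requests * per_million * base
    # One cache write on the first request, cache reads on all later ones
    cached = per_million * write + (n_requests - 1) * per_million * read
    return uncached, cached

def break_even_requests(prefix_tokens, limit=1000):
    """Smallest request count at which caching is cheaper, or None within limit."""
    for n in range(1, limit + 1):
        uncached, cached = prefix_cost(prefix_tokens, n)
        if cached < uncached:
            return n
    return None

for n in (1, 2, 10):
    uncached, cached = prefix_cost(200_000, n)
    print(f"{n:>2} requests: uncached ${uncached:.2f}, cached ${cached:.2f}")
print("break-even at request", break_even_requests(200_000))
```

Under these assumptions the write premium is recovered by the second request, so explicit caching pays off whenever a prefix is reused at least once within the TTL.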
How to Structure Prompts for Maximum Cache Hits
The key to effective prompt caching is putting static content at the beginning and dynamic content at the end.
The Golden Rule: Static First, Dynamic Last
✅ Optimal Structure (maximizes cache hits):
┌────────────────────────────────────────┐
│ Static system prompt (5000 tokens) │ ← Forms cacheable prefix
│ Tool definitions (3000 tokens) │ ← Extends cacheable prefix
│ Project context (2000 tokens) │ ← Still cacheable
│ Conversation history (1000 tokens) │ ← Partially cacheable
│ Current user query (100 tokens) │ ← Dynamic, not cached
└────────────────────────────────────────┘
Total cacheable: 10,000+ tokens
❌ Bad Structure (breaks caching):
┌────────────────────────────────────────┐
│ Current query (100 tokens) │ ← Dynamic first!
│ System prompt (5000 tokens) │ ← Can't cache (no prefix match)
└────────────────────────────────────────┘
Total cacheable: 0 tokens
Why this matters: KV pairs for token at position $i$ depend ONLY on tokens $1$ through $i$ (causal attention). Same prefix → identical KV cache, regardless of what follows.
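The effect of ordering can be illustrated by measuring how many leading tokens two requests share. This is a toy sketch: the token lists are stand-ins rather than real tokenizer output, and `shared_prefix_len` is a hypothetical helper, not a provider API.

```python
def shared_prefix_len(a, b):
    """Number of leading items that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Stand-in tokens (assumption): 5000 identical system-prompt tokens
system = ["sys"] * 5000
query_1 = ["parse", "json"]
query_2 = ["validate", "email"]

# Static first: the whole system prompt is a reusable prefix
static_first = shared_prefix_len(system + query_1, system + query_2)

# Dynamic first: the very first token differs, so nothing is reusable
dynamic_first = shared_prefix_len(query_1 + system, query_2 + system)

print(static_first, dynamic_first)
```

Only the shared leading run can be served from cache, which is why a single dynamic token at the front wipes out the benefit for everything after it.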
Provider-Specific Strategies
Anthropic (Explicit Cache Control)
```json
{
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "You are an expert Python developer...",
        "cache_control": {"type": "ephemeral"}
      },
      {
        "type": "text",
        "text": "Write a function to parse JSON"
      }
    ]
  }]
}
```
Three-Tier Strategy:
```
Tier 1: System instructions (static)
→ Mark with cache_control
→ ~5000 tokens
→ 5-min TTL
Tier 2: Tool definitions (semi-static)
→ Mark with cache_control
→ ~3000 tokens
→ 5-min TTL
Tier 3: Dynamic content (never cached)
→ No cache_control
→ User query, timestamps, etc.
```
Requirements:
- Minimum: 1024 tokens per cached block
- Maximum: 4 cache breakpoints
- TTL: 5 minutes (ephemeral)
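As a rough sketch of the three-tier strategy, the helper below builds a request body as a plain dictionary, marking the system block and the last tool definition with `cache_control` and leaving the user query unmarked. The function name `build_request`, the model id, and the example tool are placeholders invented here; the sketch only constructs the payload and makes no network call.

```python
import copy
import json

EPHEMERAL = {"type": "ephemeral"}

def build_request(model, system_text, tools, user_query, max_tokens=1024):
    """Build a request body with cache breakpoints on the static tiers."""
    tools = copy.deepcopy(tools)          # avoid mutating the caller's list
    if tools:
        # Tier 2: a breakpoint on the last tool covers all tool definitions
        tools[-1]["cache_control"] = dict(EPHEMERAL)
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        # Tier 1: static system instructions, marked for caching
        "system": [{
            "type": "text",
            "text": system_text,
            "cache_control": dict(EPHEMERAL),
        }],
        # Tier 3: dynamic content, no cache_control
        "messages": [{"role": "user", "content": user_query}],
    }
    if tools:
        payload["tools"] = tools
    return payload

# Placeholder values for illustration only
example_tools = [{
    "name": "read_file",
    "description": "Read a file from the workspace",
    "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}},
}]
payload = build_request(
    "example-model-id",
    "You are an expert Python developer...",
    example_tools,
    "Write a function to parse JSON",
)
print(json.dumps(payload, indent=2))
```

This uses two of the four available breakpoints, leaving room to mark project context or conversation history if those turn out to be stable across requests.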
OpenAI (Automatic Caching)
```json
{
  "messages": [{
    "role": "user",
    "content": "System instructions...\n\nUser query..."
  }]
}
```
How it works:
- Provider detects common prefixes automatically
- Caches prefixes ≥ 1024 tokens
- Returns `cached_tokens` in usage metrics
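One way to watch automatic caching at work is to read `cached_tokens` from the usage block of each response. The sketch below operates on a plain dictionary shaped like the usage object, with `cached_tokens` nested under `prompt_tokens_details`; the sample numbers are made up for illustration, and `cached_fraction` is a hypothetical helper.

```python
def cached_fraction(usage: dict) -> float:
    """Share of prompt tokens that were served from the cache."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0) or 0
    return cached / prompt_tokens if prompt_tokens else 0.0

# Made-up usage block for illustration
sample_usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 50,
    "prompt_tokens_details": {"cached_tokens": 1920},
}

print(f"{cached_fraction(sample_usage):.1%} of the prompt was cached")
```

Logging this value per request makes it easy to spot a prompt change that silently broke the shared prefix.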
No configuration needed, but structure still matters!
Even with automatic caching, ordering affects hit rates:
```python
# Comments sit outside the string literals so they are not sent as prompt text.

# Good: Static content first
prompt = "\n".join([
    system_instructions,   # 5000 tokens - cached
    tool_definitions,      # 3000 tokens - cached
    conversation_history,  # 2000 tokens - partially cached
    current_query,         # 100 tokens - not cached
])

# Bad: Dynamic first breaks prefix matching
prompt = "\n".join([
    current_query,         # 100 tokens - dynamic
    system_instructions,   # Can't cache!
])
```
Practical Implementation
1. Build prompts from static to dynamic:
```python
def build_prompt(system, tools, history, query):
    """Order matters: Static → Dynamic"""
    # format_tools and format_history are application-specific helpers (not shown)
    return "\n\n".join([
        system,                   # Layer 1: Static (always cached)
        format_tools(tools),      # Layer 2: Semi-static
        format_history(history),  # Layer 3: Grows over time
        query,                    # Layer 4: Always unique
    ])
```
2. Check minimum size:
```python
def is_cacheable(content: str) -> bool:
    """
    Rough heuristic: 1 token ≈ 4 characters
    Minimum: 1024 tokens ≈ 4096 characters
    """
    return len(content) >= 4096
```
3. Monitor performance:
```python
def track_cache_metrics(response):
    """Track hit rate and savings"""
    # Either field may be missing or None when no caching occurred, so default to 0
    writes = getattr(response.usage, "cache_creation_input_tokens", 0) or 0
    reads = getattr(response.usage, "cache_read_input_tokens", 0) or 0
    if writes > 0:
        print(f"Cache write: {writes} tokens")
    if reads > 0:
        print(f"Cache hit: {reads} tokens saved")
    hit_rate = reads / (reads + writes) if (reads + writes) > 0 else 0
    print(f"Hit rate: {hit_rate:.1%}")
```
Edge Cases and Best Practices
Cache Expiration (5-10 min TTL):
- First request after expiration = cache miss
- Subsequent requests = cache hits
- Don't over-optimize for this!
Prompt Variations:
```python
from datetime import datetime

# Bad: Mixing static and dynamic
system = f"You are a helpful assistant. Time: {datetime.now()}"
# Cache invalidated every request!

# Good: Keep static content pure and send dynamic values after it
system = "You are a helpful assistant"
context = f"Session started: {datetime.now()}"
```
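A cheap guard against accidental invalidation is to fingerprint the static prefix and check that it stays constant across requests. The sketch below uses a SHA-256 hash for this; the helper name `prefix_fingerprint` and the two fixed timestamps are assumptions chosen to keep the example deterministic.

```python
import hashlib
from datetime import datetime

def prefix_fingerprint(prefix: str) -> str:
    """Short hash of the prefix; any change means the cached prefix no longer matches."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]

# Two request times five seconds apart (illustrative values)
t1 = datetime(2025, 1, 1, 12, 0, 0)
t2 = datetime(2025, 1, 1, 12, 0, 5)

# Timestamp baked into the system prompt: the prefix differs on every request
bad = [prefix_fingerprint(f"You are a helpful assistant. Time: {t}") for t in (t1, t2)]

# Static system prompt: the timestamp travels in a later, uncached block
good = [prefix_fingerprint("You are a helpful assistant") for _ in (t1, t2)]

print("timestamped prefix stable:", bad[0] == bad[1])
print("static prefix stable:     ", good[0] == good[1])
```

Logging the fingerprint next to the cache metrics makes it straightforward to tell a TTL expiry apart from a prompt change when the hit rate drops.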
Very Long Prompts:
- KV cache memory grows linearly with length
- Keep static content focused and relevant
- Remove outdated examples
High Concurrency:
- Multiple requests with same prefix share cache
- Provider handles this transparently
- No special handling needed
When Caching Provides Maximum Value
High-impact scenarios:
1. High request volume: 1000+ requests/day with similar prompts
2. Large static context: System prompts > 2000 tokens
3. Multi-turn conversations: Repeated context across turns
4. Tool-heavy applications: Many tool definitions
5. Agent workflows: Sequential steps sharing context
Cost Impact Example
Coding Assistant: 10,000 requests/day
Configuration:
- Static system prompt: 8,000 tokens
- Average user query: 200 tokens
- Cache hit rate: 90%
Without caching:
Total: 10,000 × 8,200 = 82M tokens/day
Cost (Anthropic): 82M × $3/M = $246/day
With caching:
Cache writes: 1,000 × 8,000 @ $3.75/M = $30
Cache reads: 9,000 × 8,000 @ $0.30/M = $21.60
Uncached: 10,000 × 200 @ $3/M = $6
Total: $57.60/day
Savings: about 77% cost reduction ($246 → $57.60 per day) + 40-50% faster responses
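The arithmetic in this example is easy to reproduce and adapt to other workloads. The sketch below assumes the same per-million-token prices ($3.00 base, $3.75 cache write, $0.30 cache read) and treats every cache miss as a full cache write; `daily_cost` is a name made up for the example.

```python
def daily_cost(requests, static_tokens, query_tokens, hit_rate,
               base=3.00, write=3.75, read=0.30):
    """Return (cost_without_caching, cost_with_caching) in dollars per day."""
    hits = round(requests * hit_rate)
    misses = requests - hits              # each miss re-writes the cache
    million = 1_000_000
    without = requests * (static_tokens + query_tokens) * base / million
    with_cache = (
        misses * static_tokens * write    # cache writes
        + hits * static_tokens * read     # cache reads
        + requests * query_tokens * base  # uncached user queries
    ) / million
    return without, with_cache

without, with_cache = daily_cost(10_000, 8_000, 200, hit_rate=0.90)
savings = 1 - with_cache / without
print(f"without caching: ${without:.2f}/day")
print(f"with caching:    ${with_cache:.2f}/day")
print(f"savings:         {savings:.1%}")
```

Lowering `hit_rate` shows how quickly the savings erode when requests are spaced further apart than the cache TTL.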
When Caching May Not Help
- Unique prompts every request
- Very short prompts (<1024 tokens)
- Highly dynamic context
- Low request volume (<100 requests/day)
Why KV Caching Works: The Mathematical Property
Core Principle: Block Matrix Multiplication
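KV caching is enabled by block matrix multiplication - the property that lets us partition matrices into blocks and multiply them block-wise:
$$\begin{bmatrix} A \\ B \end{bmatrix} \begin{bmatrix} C & D \end{bmatrix} = \begin{bmatrix} AC & AD \\ BC & BD \end{bmatrix}$$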
Decomposition of Attention Computation
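Without caching (step $t$):
$$\text{Attention}_t = \text{softmax}\left(\frac{Q_{1:t} K_{1:t}^T}{\sqrt{d_k}}\right) V_{1:t}$$
Where $Q_{1:t} = X_{1:t} W_Q$, $K_{1:t} = X_{1:t} W_K$, $V_{1:t} = X_{1:t} W_V$ (all tokens up to position $t$)
With caching (step $t$):
Using block matrix multiplication, we partition:
$$Q_{1:t} = \begin{bmatrix} Q_{1:t-1} \\ Q_t \end{bmatrix}, \quad K_{1:t} = \begin{bmatrix} K_{1:t-1} \\ K_t \end{bmatrix}$$
This expands to:
$$Q_{1:t} K_{1:t}^T = \begin{bmatrix} Q_{1:t-1} K_{1:t-1}^T & Q_{1:t-1} K_t^T \\ Q_t K_{1:t-1}^T & Q_t K_t^T \end{bmatrix}$$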
The key observation:
The resulting matrix has 4 blocks:
1. Top-left: $Q_{1:t-1} K_{1:t-1}^T$ (already computed in previous steps - not needed)
2. Top-right: $Q_{1:t-1} K_t^T$ (not needed - past tokens can't attend to future tokens due to causal attention constraint)
3. Bottom-left: $Q_t K_{1:t-1}^T$ ← **Uses cached $K_{1:t-1}$ (new token attends to past)**
4. Bottom-right: $Q_t K_t^T$ ← **Uses new $K_t$ (new token attends to itself)**
Note: In autoregressive generation, token $i$ can only attend to tokens $\leq i$. Past tokens' representations are frozen and don't get updated when new tokens arrive - they never look at future tokens.
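Only the bottom row is needed for generating token $t+1$:
$$Q_t K_{1:t}^T = \begin{bmatrix} Q_t K_{1:t-1}^T & Q_t K_t^T \end{bmatrix}$$
Expanding as dot products:
$$Q_t K_{1:t}^T = \begin{bmatrix} Q_t \cdot K_1^T & Q_t \cdot K_2^T & \cdots & Q_t \cdot K_t^T \end{bmatrix}$$
Each score $Q_t \cdot K_i^T$ depends only on: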
1. $Q_t$ (just computed)
2. $K_i$ (computed in step $i$ and **cached**)
Since $K_i = X_i W_K$ depends ONLY on input embedding $X_i$ (fixed once processed) → cache all $K_i$ values!
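The block decomposition can be checked numerically. This NumPy sketch uses random matrices with arbitrary sizes (`t = 5`, `d = 4`, both assumptions) and confirms that assembling the four blocks reproduces the full score matrix, and that the bottom row needs only the newest query plus the stored keys.

```python
import numpy as np

rng = np.random.default_rng(1)
t, d = 5, 4                        # illustrative sizes
Q = rng.normal(size=(t, d))
K = rng.normal(size=(t, d))

full = Q @ K.T                     # full t x t score matrix (before scaling and softmax)

# Partition into past rows (1..t-1) and the newest row (t)
Q_past, q_t = Q[:-1], Q[-1:]
K_past, k_t = K[:-1], K[-1:]

blocks = np.block([
    [Q_past @ K_past.T, Q_past @ k_t.T],   # top-left, top-right
    [q_t @ K_past.T,    q_t @ k_t.T],      # bottom-left, bottom-right
])

# The only part needed at step t: new query against cached keys plus the new key
bottom_row = np.hstack([q_t @ K_past.T, q_t @ k_t.T])

print(np.allclose(blocks, full), np.allclose(bottom_row, full[-1:]))
```

Here `Q_past` is needed only to build the top blocks for the comparison; the bottom row itself never touches it, which matches the observation that past queries do not need to be stored.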
Why Q Is Not Cached
$Q_t$ represents "what am I looking for **right now**" - it depends on the current generation position and changes every step. In contrast, $K_i$ and $V_i$ represent "what information exists in the context" for token $i$, which doesn't change once that token is processed.
Key Takeaways
1. What's cached: K and V matrices (intermediate calculations in attention)
2. Why it works: Linearity of matrix multiplication allows decomposition into cached (static) and fresh (dynamic) parts
3. Performance gain: 75% faster, 90% cheaper for cached tokens
4. Practical impact: Makes long conversations and repeated prompts much more efficient
AdaL Usage
We cache three parts of our prompt:
1. System prompt (~14k tokens) - the agent's identity and instructions
2. Project context (~8k tokens) - tools, AGENTS.md, working directory
3. Chat history (variable) - all previous conversation turns
This means the more you chat in a session, the more efficient each subsequent turn becomes!
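A back-of-the-envelope sketch shows why longer sessions benefit more. It uses the approximate sizes above (14k system prompt, 8k project context) and assumes, purely for illustration, that each turn adds 500 tokens of new user and assistant text and that the previous prefix is still cached; `reusable_share_by_turn` is a hypothetical helper, not part of AdaL.

```python
def reusable_share_by_turn(system_tokens, context_tokens, tokens_per_turn, turns):
    """Fraction of each request's input that repeats the previous request's prefix."""
    shares = []
    history = 0
    for _ in range(turns):
        prefix = system_tokens + context_tokens + history
        shares.append(prefix / (prefix + tokens_per_turn))
        history += tokens_per_turn     # this turn becomes cached history next time
    return shares

shares = reusable_share_by_turn(14_000, 8_000, tokens_per_turn=500, turns=5)
for turn, share in enumerate(shares, start=1):
    print(f"turn {turn}: {share:.1%} of input tokens are a reusable prefix")
```

The reusable share grows with every turn because the history keeps joining the cached prefix while the amount of new text per turn stays roughly constant.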
References
Full technical explanation: `wip_docs/kv_cache_technical_explanation.md`
Original article: [ngrok.com blog post on prompt caching](https://ngrok.com/blog-post/prompt-caching-with-llms-explained)
AdaL implementation: `deep_research/src/deep_research/prompt_caching.py`
https://sankalp.bearblog.dev/how-prompt-caching-works/


