Prompt Caching from First Principles
Why you are paying 10x too much for LLM tokens and the beautiful math that fixes it
On the Summation app, every LLM request starts with the same system prompt plus a client-specific semantic layer: table schemas, column descriptions, business rules. This prefix is thousands of tokens long and does not change for hours. The user’s actual question is tens of tokens.
We were paying full price to reprocess that identical prefix on every request. Cost looked upside down. Time to first token was dominated by redundant work.
Prompt caching fixed this. Time to first token dropped considerably for cached prompts, and total input token costs fell by roughly 10x.
The official documentation goes something like this: providers store the intermediate computation for an identical prefix and reuse it across requests, so you skip the expensive prefill and only pay full price for the new suffix. But most explanations stop at “it makes identical prefixes cheaper” and never explain the mechanism. Without understanding the mechanism, you can’t debug cache misses, reason about why whitespace breaks hits, or understand why temperature doesn’t affect caching.
I wanted to understand the mechanism properly, so I read through several resources, built my own mental model, and wrote it all down here. This post traces a prompt through the transformer pipeline, stage by stage, to pinpoint what can be reused, what cannot, and what design choices give you reliable cache hits.
The assembly line
When you send text to an LLM, it passes through four stages. Think of it as a factory floor where raw material enters one end and a predicted token exits the other.
Each stage transforms the data into something richer. Text becomes numbers. Numbers become meaning. Meaning becomes understanding. Understanding becomes prediction.
Let’s trace a real prompt, "May the force", through each stage.
Stage 1: Tokens
LLMs don’t read text. They can’t. They’re mathematical functions that operate on numbers. So the first step is converting your text into a sequence of integers.
This is what the tokenizer does. It chops your text into chunks called tokens, but not in an obvious way. It doesn’t split on spaces or characters. It uses subword segmentation, a scheme learned during training that balances vocabulary size against sequence length.
"May the force" becomes [7108, 262, 3081]. Three integers. The string “May” at the start of the text maps to 7108. Always. No matter what you set the temperature to. It’s a deterministic lookup table, nothing more. (One nuance: surrounding characters do matter to the tokenizer. “May” at the start of text and “ May” after a space are different tokens, which is exactly why a stray whitespace change can break cache hits.)
The same text always produces the same token IDs. This is pure lookup, no randomness, no model weights involved. Remember this. It’s the foundation that makes caching possible.
Some tokens map neatly to words. Others map to fragments. “understanding” might become ["under", "standing"]. The tokenizer doesn’t care about meaning. It cares about compression.
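A toy greedy longest-match tokenizer makes the determinism concrete. The vocabulary below is made up (real BPE vocabularies have tens of thousands of learned merges), but the key property is the same: pure lookup, no randomness.

```python
# Toy subword tokenizer: greedy longest-match against a tiny, hypothetical
# vocabulary. Real tokenizers learn merges from data; the determinism is real.
VOCAB = {"May": 7108, " the": 262, " force": 3081, " for": 1337, "ce": 344}

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that starts at position i
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenize("May the force"))   # [7108, 262, 3081]
```

Run it twice, run it a thousand times: the same text always yields the same IDs.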
Stage 2: Embeddings
Now we have three integers: [7108, 262, 3081]. But these are just IDs. They don’t carry meaning by themselves. The number 7108 doesn’t know that “May” is a modal verb, or that it often starts a request or a wish.
The embedding stage fixes this. It maps each token ID into a high-dimensional vector space, where distance and direction encode relationships the model learned during training.
The animation above shows two dimensions. In reality, modern LLMs use thousands. GPT-3 used 12,288 dimensions, and current frontier models are believed to be in a similar range or larger. These aren’t arbitrary. Each dimension captures some learned aspect of meaning. We can’t name them (they emerge from training), but the model uses them to encode everything from grammar to factual knowledge.
After embedding, each token becomes a dense vector:
Token 1 ("May"): [0.12, -0.84, 0.33, 0.07, ...] ← 3,072 numbers
Token 2 ("the"): [0.56, 0.21, -0.17, 0.91, ...]
Token 3 ("force"): [-0.03, 0.67, 0.44, -0.22, ...]
Stack these row by row and you get a matrix. 3 rows (one per token) by 3,072 columns (one per dimension). Let’s call this matrix E.
There’s one more trick here: positional encoding. The model adds a position vector to each embedding. Token “the” at position 2 gets a different final embedding than “the” at position 5.
This is how the model knows that “dog bites man” and “man bites dog” mean different things. Same tokens, different positions, different embeddings.
Like tokenization, embedding is deterministic. The same token at the same position always produces the same embedding vector. Same weights, same input = same output. No exceptions.
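A minimal NumPy sketch of this stage. The embedding table here is random (a stand-in for learned weights, fixed once training ends) and sinusoidal positions stand in for whatever position scheme a real model uses, but the determinism property holds regardless:

```python
import numpy as np

d_model, vocab_size = 8, 50_000   # real models: thousands of dimensions

# A random table stands in for the learned embedding weights
rng = np.random.default_rng(0)
token_embeddings = rng.standard_normal((vocab_size, d_model))

def positional_encoding(pos: int) -> np.ndarray:
    # Sinusoidal positions, one common scheme among several
    i = np.arange(d_model // 2)
    angles = pos / (10_000 ** (2 * i / d_model))
    return np.concatenate([np.sin(angles), np.cos(angles)])

def embed(token_id: int, pos: int) -> np.ndarray:
    return token_embeddings[token_id] + positional_encoding(pos)

# Same token at the same position: identical vector, every time
assert np.array_equal(embed(262, 2), embed(262, 2))
# Same token at a different position: different vector
assert not np.array_equal(embed(262, 2), embed(262, 5))
```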
Stage 3: Attention
This is the heart of a transformer. This is where the model goes from “I have three independent word-meanings” to “I understand this phrase.”
Here’s the fundamental question attention answers:
For each token, how much should it care about every other token?
“May” by itself is vague. Is it a month? A permission verb? A wish?
But “May” followed by “the” followed by “force”? Now the model knows this is a wish. Attention is the mechanism that wires these relationships together.
The Q, K, V framework
The model learns three projections, traditionally called:
- Q (Query): what this token is looking for
- K (Key): what this token offers / what kind of information it contains
- V (Value): the content that gets mixed into the result
I like to think of it as a library.
You walk in with a question (your Query). Every book on the shelf has a label on its spine (its Key) describing what information it contains. When your question matches a label, when the Query aligns with a Key, you pull that book off the shelf and read its contents (its Value).
The model learns three weight matrices during training, WQ, WK, and WV, that transform embeddings into these three roles:
Q = E × W_Q ← "What am I looking for?"
K = E × W_K ← "What do I contain?"
V = E × W_V ← "Here's my actual content."
That is a useful way to understand the mechanism. But here is the more precise version:
This happens in every transformer layer, not just once at the beginning.
In each layer, the model takes that layer’s current hidden state (which is already context-enriched from previous layers) and forms that layer’s Q, K, and V.
So the model is not computing one global Q, K, V and calling it a day. It is doing this repeatedly, layer after layer, refining the representation each time.
The attention calculation
Now the elegant part. To decide how much token i should attend to token j, the model compares the Query of token i with the Key of token j.
That comparison is a dot product. Conceptually:
- high match -> pay more attention
- low match -> pay less attention
For a whole sequence, this becomes a matrix of scores:
Attention Scores = Q × K^T
For our three tokens, this produces a 3×3 matrix of scores (see below).
But there’s a catch. During generation, a token shouldn’t be able to peek at tokens that come after it. Those haven’t been generated yet. So we apply a causal mask: set every score above the diagonal to negative infinity, so softmax gives those positions zero weight. Token 1 can only see itself. Token 2 can see tokens 1 and 2. Token 3 can see all three.
This has a subtle but important consequence. Because token 2 never sees token 3, its K and V values can’t possibly depend on token 3. Adding more tokens to the end of the sequence doesn’t change what earlier tokens computed. Keep this in the back of your mind. It’s going to matter a lot soon.
After masking, we normalize each row with softmax (so each row sums to 1.0), giving us a proper probability distribution of how much attention to give each previous token.
Finally, we multiply these attention weights by V, the actual content:
Output = softmax(Q × K^T / √d_k) × V
(Full implementations scale the scores by 1/√d_k, where d_k is the key dimension, before the softmax. It keeps the dot products in a well-behaved range and doesn’t change the story here.)
Each token’s output is now a weighted blend of the values of the tokens it attended to. “May” is no longer just a vague modal verb. It has been enriched with information from the tokens it can see.
This entire process happens once per attention layer, and modern models stack dozens of them. GPT-3 has 96 layers; GPT-4 is reported to have roughly 120. Each layer refines the representations, building increasingly abstract understanding.
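The whole calculation fits in a few lines of NumPy. The weights below are random stand-ins and this is a single attention head (real models use many), but it demonstrates both the causal mask and the consequence noted earlier: appending a token doesn't change the outputs for earlier tokens.

```python
import numpy as np

def causal_attention(E, W_Q, W_K, W_V):
    """Single-head causal attention over embeddings E (n_tokens x d_model)."""
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])          # scaled dot products
    mask = np.triu(np.ones_like(scores), k=1)       # 1s above the diagonal
    scores = np.where(mask == 1, -np.inf, scores)   # no peeking at the future
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)            # softmax: rows sum to 1
    return w @ V                                    # blend values by attention

rng = np.random.default_rng(0)
d = 8                                               # toy dimension
E = rng.standard_normal((3, d))                     # our three tokens
W_Q, W_K, W_V = [rng.standard_normal((d, d)) for _ in range(3)]

out3 = causal_attention(E, W_Q, W_K, W_V)

# Append a 4th token: the first three output rows are unchanged, because
# the causal mask means those positions never see the new token.
E4 = np.vstack([E, rng.standard_normal(d)])
out4 = causal_attention(E4, W_Q, W_K, W_V)
assert np.allclose(out3, out4[:3])
```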
Why not just compare embeddings directly? Because the model needs different projections for different jobs.
- Query projection asks: what information do I need right now?
- Key projection asks: what kind of information does this position contain?
- Value projection asks: what actual content should be mixed if selected?
Splitting these roles gives the model much more expressive power than trying to do everything with one representation.
Let’s tally up the math
For our tiny 3-token prompt, even this simplified walkthrough already includes:
- tokenization
- embedding + position handling
- Q, K, V projections (in each layer)
- attention score computation
- masking + softmax
- weighted value mixing
- and later, output projection + sampling
And that is just for a toy sequence.
Real models repeat this across many layers, with additional feed-forward transformations between attention blocks.
You do not need every implementation detail to understand prompt caching.
But you do need to feel this part:
Processing a prompt is a lot of expensive, structured math.
So if a large prefix repeats across requests, redoing all of that work from scratch every time is a very real cost and latency problem, not a theoretical one.
But we’ve only processed the input so far. How does the model turn all of this into an actual word?
Stage 4: Prediction
After the final attention layer, each token position holds a rich, context-aware embedding. But we only care about the last position. That’s where the model’s prediction lives.
The model takes that final embedding and multiplies it by one more weight matrix, W_out, which projects it from 3,072 dimensions to the full vocabulary size (50,000+ tokens). This gives us a raw score, called a logit, for every word the model knows.
logits = final_embedding × W_out ← one score per vocabulary word
probs = softmax(logits) ← convert to probabilities
Softmax turns these raw scores into probabilities that sum to 1.0. The word with the highest probability is the model’s best guess. For "May the force", it might assign 0.87 to “be”, 0.06 to “always”, 0.03 to “remain”, and so on.
This is where temperature enters the picture. Temperature scales the logits before softmax. Low temperature (0.1) makes the distribution spiky, so the model almost always picks the top word. High temperature (1.5) flattens it, making surprising choices more likely. But temperature only touches this final sampling step. Everything upstream (tokens, embeddings, K, V, attention) is already done.
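The temperature step itself is two lines. The logits below are made up for “be”, “always”, and “remain”, but the behavior is general:

```python
import numpy as np

def sample_probs(logits, temperature=1.0):
    scaled = logits / temperature        # the only thing temperature touches
    exp = np.exp(scaled - scaled.max())  # numerically stable softmax
    return exp / exp.sum()

# Hypothetical logits for "be", "always", "remain"
logits = np.array([5.0, 2.3, 1.6])

print(sample_probs(logits, 0.1))   # spiky: the top word almost always wins
print(sample_probs(logits, 1.5))   # flatter: surprising picks more likely
```

Notice that nothing here reads or writes K or V. Temperature reshapes a distribution that attention already produced.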
Watch the temperature slider. As it sweeps from focused (0.1) to creative (2.0), the probability bars reshape dramatically. But none of this affects K or V. They were computed in Stage 3 and are long done by now.
The model samples from this distribution and produces a single token: “be”.
Now what? We need to generate the rest of the response.
The generation loop
LLMs generate text one token at a time. After running our 3-token prompt through all the stages, the model predicts “be”. Great. But to get the next word, it appends “be” to the input, making it "May the force be", and runs the entire pipeline again.
Look at the orange boxes. Every single step, the model recomputes K and V for the tokens it already processed in the previous step. “May” has been through the pipeline three times by step 3. Its K and V values haven’t changed. They can’t change. But the model dutifully recomputes them anyway.
By the time you’ve generated 100 output tokens, the model has computed K and V for your original prompt 100 times. For a 2,000-token system prompt, that’s 200,000 redundant K,V row computations.
This is breathtakingly wasteful.
Two very different bottlenecks
Inference engineers split this loop into two phases with very different performance profiles.
Prefill processes your entire input prompt in one pass. All tokens are known upfront, so the GPU crunches them in parallel across thousands of CUDA cores. This phase is compute-bound. The bottleneck is raw arithmetic throughput.
Decode generates tokens one at a time. Each step loads the entire KV history from GPU memory just to produce a single token. This phase is memory-bound. The GPU spends most of its time waiting for data to arrive from VRAM, not doing math.
Why does this matter? Because it tells you where the savings are. KV caching (coming up next) eliminates redundant computation during decode. Prompt caching (further down) eliminates redundant prefill across requests entirely. Two different optimizations attacking two different bottlenecks.
The aha moment: KV caching
Here’s the insight. Once you see it, you can’t unsee it.
The K and V values for a token depend only on that token’s embedding and the model’s weights. They don’t change when new tokens are added to the sequence.
When the model is generating text, it appends one token at a time. At each step, it needs to attend over everything that came before.
Without caching, that means the model keeps redoing work for earlier tokens again and again.
But causal attention gives us a huge shortcut.
A token is only allowed to look left (earlier tokens), never right (future tokens). So once the model has already computed the relevant attention state for earlier tokens, appending a new token later cannot retroactively change that earlier work.
That is exactly what KV caching exploits.
The model stores the K and V tensors (for each layer) for the tokens it has already processed. On the next generation step, it computes K and V only for the new token, appends them to the cache, and reuses the rest.
The important thing to notice in the animation is not the exact matrix sizes. It is the pattern:
- Without KV cache: the model keeps recomputing old rows
- With KV cache: the model computes just one new row and reuses history
That is why generation becomes practical at all. Modern LLMs are not “smart enough to avoid waste” by accident. They are engineered to avoid this exact waste.
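You can verify the shortcut numerically. With random stand-in weights, the K matrix computed from scratch for four tokens equals the cached three rows plus one freshly computed row:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_K = rng.standard_normal((d, d))

embeddings = rng.standard_normal((4, d))   # 3 prompt tokens + 1 new token

# Without cache: recompute K for the whole sequence every step
k_full = embeddings @ W_K

# With cache: keep the first 3 rows, compute only the new row
k_cached = embeddings[:3] @ W_K            # done on the previous step
k_new_row = embeddings[3] @ W_K            # the only work this step
k_incremental = np.vstack([k_cached, k_new_row])

assert np.allclose(k_full, k_incremental)  # identical result, 1/4 the work
```

The same holds for V, at every layer.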
Wait, why not cache Q too?
This is a great question, and the answer reveals what each tensor is doing.
- K and V are the stored history (what earlier tokens have, and what content they offer)
- Q is the question being asked right now
During decode, the model only needs the Query for the newest token at each layer. There is not much to save by caching Q because there is only one new Q row per step anyway.
So in practice, we cache the large reusable history (K and V), not the fresh per-step question (Q).
Prompt caching: taking it across requests
Everything above happens within a single API call. Every modern LLM provider already does KV caching internally during generation. You don’t opt into it. You don’t pay extra for it. It’s just how inference works. Without it, generating a 500-token response would reprocess the entire input 500 times.
So KV caching saves redundant work within a request. But what about across requests?
Think about my app from the beginning of this post. Every API call sends the same 2,000-token system prompt. Each call computes K and V for those 2,000 tokens from scratch, then throws them away when the request ends. The next request computes the exact same K and V matrices again. That’s the waste prompt caching eliminates.
It stores K and V matrices from one request and reuses them in future requests that share the same prefix.
The waste scales brutally. For a model like Llama 7B (32 layers, 32 attention heads, 128 dimensions per head), storing K and V for a single token takes roughly 0.5 MB of GPU memory. A 2,000-token system prompt eats about 1 GB per request. Run 100 concurrent requests with the same system prompt and you’re holding 100 identical copies. 100 GB of GPU VRAM for the same matrices, computed 100 times.
Here’s a concrete example. You’re building a customer support bot. Every request starts with the same 2,000-token system prompt (your instructions, persona, formatting rules). Then comes the user’s question, which varies each time:
Request 1: [system prompt] + "How do I reset my password?"
Request 2: [system prompt] + "What are your business hours?"
Request 3: [system prompt] + "Can I get a refund?"
The system prompt produces identical K and V matrices across all three requests. Same tokens, same positions, same model weights. The math is the same every time.
With prompt caching, the provider stores these K and V matrices after the first request and reuses them for subsequent requests.
The numbers
The impact is significant on both cost and latency:
| Metric | Without Caching | With Caching |
|---|---|---|
| Tokens processed | 2,050 (all computed) | 50 computed + 2,000 cached |
| Cost per cached token | full price | 10x cheaper (Anthropic); 2x cheaper (most OpenAI models) |
| Time to first token | ~800ms | ~120ms |
| Latency reduction | baseline | up to 85% (Anthropic’s published number) |
How providers handle it
OpenAI takes the automatic approach. Their infrastructure hashes your prompt prefix and routes requests to servers that likely have matching cached KV blocks. No API changes needed. The catch: routing is probabilistic, so you might land on a server without your cache. In practice, you see roughly 50% cache hit rates when sending identical prompts sequentially. Cached data persists for 5-10 minutes of inactivity.
Anthropic gives you explicit control. You place cache_control breakpoints in your message array, marking exactly where caching should apply. Their system guarantees lookup at those breakpoints. The result: near-100% hit rates when configured properly. Cached data persists for 5 minutes by default, with each hit resetting the TTL. More deterministic, more predictable.
Anthropic charges roughly 90% less for cache reads, plus a 25% premium on cache writes (the first time a prefix is cached). OpenAI absorbs the write cost and discounts cached input tokens by 50% on most models. Either way, the economics are compelling if your prompts share stable prefixes.
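A back-of-envelope calculation for the support-bot example. The per-million-token prices below are illustrative, following the Anthropic-style structure (25% write premium, 10x cheaper reads); check your provider's current price sheet before relying on them:

```python
# Illustrative $/million-token prices (not any provider's actual rates)
PRICE_INPUT = 3.00
PRICE_CACHE_WRITE = 3.75   # 25% premium over uncached input
PRICE_CACHE_READ = 0.30    # 10x cheaper than uncached input

prefix, suffix, requests = 2_000, 50, 1_000   # support-bot shape

uncached = requests * (prefix + suffix) * PRICE_INPUT / 1e6
cached = (prefix * PRICE_CACHE_WRITE                    # write once
          + (requests - 1) * prefix * PRICE_CACHE_READ  # read ever after
          + requests * suffix * PRICE_INPUT) / 1e6      # suffixes: full price

print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
# uncached: $6.15  cached: $0.76
```

At this shape the cached path is roughly 8x cheaper, and the gap widens as the prefix-to-suffix ratio grows.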
What doesn’t affect the cache
Remember Stage 4? Temperature, top_p, and top_k all operate after attention, during sampling. They control how the model picks from the probability distribution, but they don’t touch K or V.
So the same cached K,V matrices remain valid regardless of your sampling settings. You can hit the cache with temperature=0 one request and temperature=1.5 the next. The cached K and V don’t care.
Changing any part of the cached prefix, even a single token, invalidates the cache. Tokens are position-dependent: changing token 5 means every token from position 5 onward has different K and V values. The cache is all-or-nothing for a given prefix. This is why you should put stable content (system prompts, instructions) first, and variable content (user messages) last.
The GPU memory problem
But wait. If we’re storing K,V matrices across requests, where do they live? GPU memory. And GPU memory is expensive and finite.
The naive approach: pre-allocate a contiguous chunk of GPU memory for each request, sized for the maximum possible sequence length. Model supports 4,096 tokens? Allocate K,V space for 4,096 tokens. Even if the request only uses 200.
This is internal fragmentation. You reserved a hotel room for a week but checked out after one night. The room sits empty, but nobody else can book it.
It gets worse. Three requests arrive needing 50, 100, and 50 tokens. They each get contiguous blocks of GPU memory. The middle request finishes first, freeing a 100-token gap flanked by the two 50-token requests. A new 150-token request arrives. There’s enough total free space, but it’s split across non-adjacent gaps.
This is external fragmentation. Enough total free space, scattered into unusable chunks. The same problem that plagued operating systems in the 1960s.
From OS pages to KV blocks
Operating systems solved this decades ago with paging. Instead of allocating contiguous memory, break it into fixed-size pages scattered anywhere in physical memory. A page table maps logical addresses to physical locations.
Paged attention (from the vLLM project) applies the same idea to KV caches. The system maintains a pool of fixed-size blocks, each holding K,V data for a small number of tokens (typically 16). Blocks can live anywhere in GPU memory. A block table maps each request’s logical token positions to physical blocks.
A 200-token request doesn’t need one contiguous slab. It needs 13 blocks (⌈200/16⌉), scattered wherever there’s room. When the request finishes, those 13 blocks return to the pool individually. No gaps. No fragmentation.
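The block pool idea can be sketched in a few lines. `BLOCK`, `allocate`, and `release` here are toy stand-ins, not vLLM's actual API:

```python
import math

BLOCK = 16   # tokens per KV block (a typical block size)

def blocks_needed(n_tokens: int) -> int:
    return math.ceil(n_tokens / BLOCK)

# A pool of fixed-size blocks; any request takes any free blocks, anywhere
free = set(range(1024))

def allocate(n_tokens: int) -> list[int]:
    return [free.pop() for _ in range(blocks_needed(n_tokens))]

def release(blocks: list[int]) -> None:
    free.update(blocks)   # blocks return to the pool individually: no gaps

print(blocks_needed(200))   # 13
```

Because every allocation is a set of interchangeable fixed-size blocks, there is nothing contiguous to fragment.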
Block hashing: the cache lookup
Now for the clever part. How does the system know if a new request can reuse blocks from a previous one?
It hashes each block’s tokens. But not just the tokens in that block. The hash of block n includes the hash of block n-1, forming a chain.
Your system prompt is 48 tokens long, split into 3 blocks of 16. Request A arrives, computes K,V for all 48 tokens, and stores them in 3 physical blocks indexed by hash. Each hash chains the previous block’s hash as input, so hash(block 1) includes hash(block 0).
Now Request B arrives with the same 48-token prefix plus a different user question. The system computes block hashes and walks them sequentially: HIT, HIT, HIT, MISS. Three cached blocks reused, one new block computed. 75% of prefill eliminated.
The walk stops at the first miss. If block 2 were a miss, block 3’s hash (built on block 2’s hash) would also be wrong. Everything downstream of a miss is automatically invalid. This is find_longest_cache_hit() in vLLM’s codebase: walk from block 0, return the last hit.
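The chained hashing and the longest-prefix walk can be sketched like this. `block_hashes` and `longest_cache_hit` are simplified stand-ins for vLLM's internals, hashing token lists with SHA-256:

```python
import hashlib

BLOCK = 16

def block_hashes(tokens):
    """Hash each full block, chaining in the parent block's hash."""
    hashes, parent = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hashlib.sha256(parent + str(tokens[i:i + BLOCK]).encode()).digest()
        hashes.append(h)
        parent = h
    return hashes

def longest_cache_hit(hashes, cache):
    hits = 0
    for h in hashes:          # walk from block 0, stop at the first miss
        if h not in cache:
            break
        hits += 1
    return hits

prefix = list(range(48))                 # 48-token shared system prompt
cache = set(block_hashes(prefix))        # Request A populated the cache

request_b = prefix + [999] * 16          # same prefix + a different question
print(longest_cache_hit(block_hashes(request_b), cache))   # 3
```

Change a single token in block 0 and every downstream hash changes with it: the walk returns zero hits.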
Why chain parent hashes?
Why include the parent’s hash? Why not just hash each block’s 16 tokens independently?
Because of causal attention. Token 20 (in block 1) attends to tokens 0-19. Its K and V values depend on the tokens before it. If you had different tokens in block 0 but identical tokens in block 1, the K,V values for block 1 would be different despite containing the same token IDs. The context changed.
By chaining parent hashes, matching block 2’s hash guarantees blocks 0 and 1 are also identical. One comparison validates the whole history. You don’t need to check each block separately.
This is why caching is inherently prefix-shaped. You can’t pluck out tokens 500-600 from the middle of a prompt and cache them independently. Their K and V values depend on tokens 0-499 (causal mask). Without knowing the full prefix, the cached K,V data is meaningless. Caching works from the start of the sequence forward, or not at all.
Sharing blocks across requests
What happens when Requests A and B are both active and sharing blocks 0-2?
Each shared block maintains a reference count. When Request A created the blocks, ref counts were 1. When Request B reuses them, counts bump to 2. When Request A finishes, counts drop to 1. The blocks stay alive because Request B still needs them.
When the last request using a block finishes, the ref count hits 0. But the block doesn’t get freed immediately. It stays in the cache with an LRU (least recently used) timestamp. If a new request arrives with a matching hash before the block gets evicted, it gets a second life. This is why providers mention TTLs (5-10 minutes): blocks survive in the cache for a window, betting that similar requests will arrive soon.
Concurrent reads are safe because cached K,V data is read-only. The values for a cached prefix never change. There’s nothing to synchronize. Two requests reading the same physical block simultaneously is no different from two processes reading the same file.
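The bookkeeping above can be sketched with a couple of dictionaries (a toy stand-in, not any engine's real API):

```python
from collections import defaultdict

ref_count = defaultdict(int)
last_used = {}    # block -> timestamp, for LRU eviction decisions

def acquire(block, now):
    ref_count[block] += 1
    last_used[block] = now

def release(block, now):
    ref_count[block] -= 1
    last_used[block] = now   # block stays cached, awaiting reuse or eviction

def evictable(block):
    return ref_count[block] == 0   # only unreferenced blocks may be evicted

acquire("block0", 0); acquire("block0", 1)   # Requests A and B share block0
release("block0", 2)                         # A finishes; B still holds it
assert not evictable("block0")
release("block0", 3)                         # B finishes; now an LRU candidate
assert evictable("block0")
```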
Practical tips for cache hits
Understanding the theory is one thing. Getting consistent cache hits in production is another.
Put stable content first. System prompts, persona definitions, formatting rules, tool schemas. These form the prefix of every request. Variable content (user messages, tool results) goes at the end. The cache matches from the start of your token sequence, so any change early on breaks the hash chain and invalidates everything downstream.
Keep context append-only. Building a multi-turn conversation? Append new messages. Don’t truncate the middle or reorganize earlier turns. Truncating breaks the prefix match. Appending preserves it.
Serialize deterministically. If your prompt includes JSON (tool definitions, structured data), use sort_keys=True or equivalent. {"name": "Alice", "age": 30} and {"age": 30, "name": "Alice"} are semantically identical but produce different tokens. Different tokens, different block hashes, cache miss.
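A quick standard-library demonstration of that pitfall:

```python
import json

a = {"name": "Alice", "age": 30}
b = {"age": 30, "name": "Alice"}

# Default serialization preserves insertion order: different byte strings,
# hence different tokens, different block hashes, and a cache miss.
assert json.dumps(a) != json.dumps(b)

# Sorted keys make serialization deterministic: identical prefixes.
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
```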
Pin your tool definitions. Many frameworks regenerate tool schemas per request with slightly different ordering or whitespace. Each variation produces different tokens and breaks the hash chain. Treat tool definitions as static prefix content.
The complete picture
Let’s zoom all the way out.
- Text → Tokens: deterministic lookup. Same text = same tokens. Always.
- Tokens → Embeddings: deterministic transformation. Same token at same position = same embedding. Always.
- Embeddings → K, V: deterministic matrix multiplication. Same embedding × same weights = same K, V. Always.
- K, V + new Q → Attention → Output: the only computation that must happen for each new token.
- Output → Logits → Sampling: temperature and other sampling parameters live here, completely downstream of K and V.
Prompt caching works because steps 1-3 are deterministic. Same input, same position, same model = same K and V. Every time. No exceptions. The provider stores these intermediate results and lets you skip the expensive computation. Step 5 is irrelevant to caching entirely.
You pay 1/10th the cost because you’re paying for cache storage and lookup, not GPU cycles.
Here is the whole generation loop as pseudocode. Note that the cache is keyed by the token prefix that produced the stored K and V, and gets updated every step; on a hit, only the new token’s rows are computed:

```
prompt = "May the force";
tokens = tokenize(prompt);
kv_cache = {};                       // keyed by token prefix
while (true) {
    last_embedding = embed(tokens[tokens.length - 1]);
    prefix = tokens.slice(0, -1);
    if (prefix in kv_cache) {
        // Cache hit: reuse stored K,V, compute rows only for the new token
        [k, v] = kv_cache[prefix];
        k = append(k, last_embedding * W_K);
        v = append(v, last_embedding * W_V);
    } else {
        // Cache miss: compute K,V for all tokens
        embeddings = embed(tokens);
        k = embeddings * W_K;
        v = embeddings * W_V;
    }
    kv_cache[tokens] = [k, v];       // store for the next step
    // Attention: only the new token's Q, but all K and V
    q_new = last_embedding * W_Q;
    attention = softmax(q_new * transpose(k));
    output = attention * v;
    // Stage 4: project to vocabulary and sample
    logits = output * W_out;
    next_token = sample(softmax(logits));
    if (next_token === END) break;
    tokens.push(next_token);
}
```
The Feynman check
Feynman had a test: if you can’t explain something simply, you don’t understand it. So here’s yours. If you’ve followed this post, you should be able to answer these without scrolling back up:
- Why does the same prompt always produce the same K and V matrices? Because K = E × W_K and V = E × W_V, where E is determined by the token identity and position (both fixed for a given prompt), and W_K, W_V are fixed model parameters. Deterministic inputs × deterministic weights = deterministic output.
- Why can you change temperature without invalidating the cache? Temperature operates in Stage 4 (sampling), which is completely downstream of attention. K and V are computed in Stage 3. By the time temperature touches anything, K and V are already done.
- Why is Anthropic’s explicit caching more reliable than OpenAI’s automatic approach? OpenAI distributes requests across servers and the cache is server-local. You might hit a server that doesn’t have your cached KV. Anthropic lets you explicitly mark cache breakpoints and guarantees lookup at them, giving you near-100% hit rates.
- What happens if you change one word in your system prompt? The token at that position changes, which changes its embedding, which changes its K and V. And because block hashes chain their parents, the hash for that block and every block after it changes too. The entire cache from that point onward is invalidated. You start fresh.
- Why does prompt caching work across different users? Because the cache key is the content (the token sequence and its block hashes), not the requester. Two users sending identical system prompts produce identical tokens, identical block hashes, and identical K,V matrices. The math doesn’t know or care who triggered it. Same input = same output = same cache entry.
If you can answer all five, you don’t just understand prompt caching. You understand it well enough to predict its behavior in situations nobody has explained to you. That’s the difference between memorizing facts and building mental models.
If you want to go deeper, I recommend Sebastian Raschka’s Build a Large Language Model (From Scratch), Andrej Karpathy’s Neural Networks: Zero to Hero series, and the Transformer Explainer interactive visualization from Georgia Tech. For the implementation side (paged attention, vLLM internals, block-level hashing), Sankalp’s How Prompt Caching Works is excellent.