L / 005

Wait, They’re Caching LLM Outputs?

If you've ever watched an AI generate a paragraph, you might assume the hard part is the "being intelligent" bit. And sure, that's impressive. But a surprisingly big chunk of the engineering challenge is something far more mundane: not doing the same work twice.

Welcome to the world of caching in large language models, where the real enemy isn't complexity. It's redundancy.

The Traditional Approach: One Word at a Time

Most language models you've interacted with (think ChatGPT, Claude, or Llama) are what's called autoregressive. They generate text one word (technically, one "token") at a time, always left to right. Each new word needs to consider every word that came before it. The first word is easy. The hundredth word has to think about ninety-nine predecessors. The thousandth word? You can see where this is going.

Without any tricks, the computational cost spirals upward fast. Every new token means re-processing the entire conversation from scratch, like a student who rereads an entire textbook every time they turn a page. Technically functional, but not exactly efficient.
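To put rough numbers on that "spirals upward fast" claim, here's a toy cost counter (not a real model, just arithmetic): it tallies how many token positions get processed if every step re-reads the whole context, versus computing only the newest token.

```python
# Toy illustration: total token positions processed during generation.

def naive_decode_cost(num_tokens: int) -> int:
    """If step t re-reads all t tokens so far: 1 + 2 + ... + n."""
    return sum(t for t in range(1, num_tokens + 1))

def cached_decode_cost(num_tokens: int) -> int:
    """If each step only computes its own new token."""
    return num_tokens

print(naive_decode_cost(1000))   # 500500 positions of work
print(cached_decode_cost(1000))  # 1000 positions of work
```

Half a million units of work versus a thousand, for the same thousand-token response. That gap is exactly what the trick in the next section exists to close.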

The Classic Fix: Writing Notes in the Margins

The solution is something called a KV cache, which stands for Key-Value cache, and it works on a beautifully simple observation. When a language model generates the word "blue" and then moves on to generate the next word, the word "blue" doesn't retroactively change. The past is settled. So why recalculate it?

Instead, the model stores the mathematical representations of every previous word (the "Keys" and "Values" from the attention mechanism, if you want to impress someone at a dinner party) and just looks them up when needed. Each new word only requires fresh computation for itself, then it simply checks its notes for everything that came before.

It's the difference between recalculating your entire tax history every April and just updating this year's numbers. The result is the same; the process is dramatically less painful.
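In code, the idea is small enough to fit in a single function. This is a deliberately minimal single-head attention sketch (not any particular model's implementation; the weights and sizes are made up): each step computes the Key and Value for just the newest token, appends them to the cache, and looks up everything else.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                      # toy embedding size
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # one entry per past token

def attend(x_new: np.ndarray) -> np.ndarray:
    """Process one new token vector, reusing cached keys/values."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)   # fresh compute: this token only
    v_cache.append(x_new @ Wv)
    K = np.stack(k_cache)        # everything else: looked up, not redone
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V           # attention output for the new token

for _ in range(5):               # five decoding steps
    out = attend(rng.standard_normal(d))

print(len(k_cache))  # 5 cached keys, one per generated token
```

Note what makes this legal: nothing ever mutates an existing cache entry. The cache only grows, which is precisely the assumption diffusion models are about to break.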

The New Kid: Diffusion LLMs

Here's where things get interesting. A new breed of language models has been making waves recently, and they work in a completely different way. Models like LLaDA, Dream, and Google's Gemini Diffusion don't write text left to right. Instead, they borrow an idea from the world of AI image generation and apply it to text.

The concept is surprisingly intuitive. Imagine you have a complete sentence, but every word has been replaced with a blank. The model's job is to look at all those blanks and gradually fill them in, not one at a time from left to right, but all at once, refining its guesses across multiple passes. It's less like writing a sentence word by word and more like developing a photograph: the whole picture comes into focus simultaneously.

In technical terms, the model starts with a fully "masked" sequence (essentially all blanks) and iteratively "unmasks" tokens over several denoising steps, each time getting a little closer to coherent text. On each pass, the model predicts what every blank should be, commits the answers it's most confident about, re-masks the ones it's unsure of, and tries again.
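The unmask-commit-retry loop can be simulated in a few lines. This is a toy stand-in (the "predictor" and its confidence scores are fake, and real models re-mask low-confidence commitments rather than always keeping them), but the control flow mirrors the description above: every pass, guess all the blanks, keep only the confident guesses.

```python
import random

random.seed(0)
target = "the sky is a deep shade of blue".split()
sequence = ["[MASK]"] * len(target)   # start fully masked
THRESHOLD = 0.5

def predict(position: int) -> tuple[str, float]:
    """Stand-in for the model: a guess plus a fake confidence score."""
    return target[position], random.random()

steps = 0
while "[MASK]" in sequence:
    steps += 1
    for i, token in enumerate(sequence):
        if token == "[MASK]":
            guess, confidence = predict(i)
            if confidence >= THRESHOLD:   # commit confident guesses;
                sequence[i] = guess        # the rest stay masked

print(steps, "denoising steps:", " ".join(sequence))
```

The key structural point: the inner loop touches every still-masked position on every pass, in parallel in a real implementation. No left-to-right ordering anywhere.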

The appeal is obvious. Because the model can work on multiple tokens in parallel instead of waiting for each one to finish before starting the next, it has the potential to be significantly faster. Closed-source models like Mercury have reportedly generated over a thousand tokens per second, which is five to ten times faster than traditional approaches of comparable size.

The Caching Problem: Everything Changes All the Time

This is where things get awkward. Remember how the KV cache works so elegantly for traditional models? It relies on one critical assumption: the past doesn't change. Once a word is generated, it's locked in, and all the cached computations remain perfectly valid.

Diffusion LLMs shatter that assumption entirely. Because the model is updating every token position simultaneously across each denoising step, nothing is ever truly "settled" until the very end. The representation of position 5 at step 10 is different from what it was at step 9, and so is the representation of every other position. The entire input shifts globally with each pass.

That means any cached Keys and Values from the previous step are, mathematically speaking, wrong. The model uses bidirectional attention (it looks at the full sequence context, not just what came before), so when anything changes, everything's affected. Trying to reuse cached attention data from the last step would be like navigating with yesterday's weather map. The atmosphere has moved on.

The result? Without caching, diffusion LLMs have to recompute the full attention across the entire sequence at every single denoising step. That's expensive, and it's a big reason why the theoretical speed advantage of these models has been hard to realize in practice.

The Clever Workarounds

Researchers haven't given up, of course. Several clever strategies have emerged to bring caching back into the picture, each with a different philosophy.

The approximate approach takes advantage of a subtle observation: even though the input technically changes at every step, it doesn't change very much between adjacent steps. Most tokens that were confident at step 9 are still confident at step 10. Techniques like dLLM-Cache and Fast-dLLM exploit this stability by caching the Keys and Values for the parts of the sequence that aren't changing much (the prompt tokens and the already-confident response tokens) and only recomputing for the tokens that are still actively being refined. It's not mathematically exact, but it's close enough to save a substantial amount of work without visibly hurting quality.
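The bookkeeping behind that idea is easy to sketch. The snippet below is in the spirit of dLLM-Cache / Fast-dLLM but heavily simplified (the sizes, weights, and "which positions changed" are all invented for illustration): between denoising steps, Keys and Values are recomputed only at positions whose tokens actually changed, and served from the cache everywhere else.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 16, 8
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))

k_cache = np.zeros((seq_len, d))
v_cache = np.zeros((seq_len, d))
recomputed = 0  # count how much fresh K/V work we actually do

def refresh(hidden: np.ndarray, changed: list[int]) -> None:
    """Recompute K/V only at changed positions; reuse the rest."""
    global recomputed
    for i in changed:
        k_cache[i] = hidden[i] @ Wk
        v_cache[i] = hidden[i] @ Wv
        recomputed += 1

hidden = rng.standard_normal((seq_len, d))
refresh(hidden, list(range(seq_len)))   # step 0: compute everything once
for _ in range(9):                      # 9 more denoising steps, where
    refresh(hidden, [3, 7])             # only 2 positions are still active

print(recomputed)  # 16 + 9 * 2 = 34 updates, versus 10 * 16 = 160
```

That 34-versus-160 gap is the whole pitch: the answers are slightly stale for the stable positions, but most of the work disappears.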

The block-based approach takes a more structural route. Models like BD3-LM (Block Discrete Denoising Diffusion Language Model) split the output into chunks, or "blocks," and generate them one block at a time from left to right. Within each block, the model uses diffusion, refining all the tokens in that block simultaneously. But between blocks, it behaves autoregressively, which means completed blocks can be cached using the exact same KV cache trick that traditional models enjoy. It's a hybrid that gets the best of both worlds: the parallel generation speed of diffusion within each block, and the efficient caching of autoregressive models between blocks. Think of it as writing a book chapter by chapter, where you draft each chapter all at once but move through the chapters sequentially.
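The cost structure of that hybrid is worth making concrete. The numbers below are illustrative assumptions, not figures from the BD3-LM paper: we compare a block-wise scheme (each denoising pass recomputes only the live block, with the finished prefix served from cache) against fully cache-free diffusion (every pass recomputes the whole sequence).

```python
def blockwise_cost(seq_len: int, block: int, passes: int) -> int:
    """Token positions computed when each pass touches only its block."""
    num_blocks = seq_len // block
    return num_blocks * passes * block     # prefix K/V come from cache

def no_cache_cost(seq_len: int, block: int, passes: int) -> int:
    """Same total number of passes, but each recomputes everything."""
    num_blocks = seq_len // block
    return num_blocks * passes * seq_len

# Toy settings: 64 tokens, blocks of 8, 4 denoising passes per block.
print(blockwise_cost(64, 8, 4))  # 256 positions recomputed
print(no_cache_cost(64, 8, 4))   # 2048 positions without caching
```

The savings grow with sequence length, which is why the block boundary, the point where diffusion hands off to autoregression, is doing so much of the work here.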

The guided approach pairs a large diffusion model with a smaller, faster "guide" model. The big model does the heavy creative work during the diffusion steps, while the lightweight model helps decide which tokens to unmask at each step. This sidesteps some of the need for repeated expensive computation by offloading the decision-making to something cheaper.
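The guide's job reduces to a cheap selection step. This is a hypothetical sketch of that division of labor (the scores and the top-k rule are invented for illustration, not taken from any specific guided-diffusion system): the small model assigns confidence scores to masked positions, and only the winners are handed to the expensive model to unmask.

```python
import heapq

# Fake guide-model scores: position -> confidence that it can be unmasked.
masked = {0: 0.9, 1: 0.2, 2: 0.7, 3: 0.4, 4: 0.8}
TOP_K = 2

def pick_positions(scores: dict[int, float], k: int) -> list[int]:
    """Guide's decision: which positions the big model unmasks next."""
    return heapq.nlargest(k, scores, key=scores.get)

print(pick_positions(masked, TOP_K))  # -> [0, 4]
```

The big model still does the hard generative work; it just does less of it per step, because the cheap model has already decided where that work should go.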

Two Philosophies of Remembering

What makes the comparison between these two paradigms so fascinating is how they treat memory. A traditional autoregressive model's cache is precise and permanent. Once a word is generated, its cached data is valid forever (or at least for the rest of that conversation). A diffusion model's cache, where it exists at all, is either approximate and temporary (trusting that adjacent steps are similar enough) or structural (achieved by breaking the problem into blocks that can be locked in sequentially).

One is like writing facts in pen. The other is like working in pencil, knowing you'll erase and revise repeatedly, but looking for ways to avoid redrawing the parts that haven't changed.

Both approaches are, at their core, answering the same question: what work have we already done that we don't need to do again? The answers just happen to be completely different, and the race to find better ones is very much still on.

Why This Matters Beyond the Lab

Caching might sound like a plumbing problem, the kind of thing only engineers care about. But these optimizations directly determine how fast, how cheaply, and how accessibly AI tools can run. Every redundant calculation skipped is a fraction of a second saved, a bit of energy conserved, and a step closer to running powerful models on everyday hardware instead of warehouse-sized data centers.

Diffusion LLMs represent one of the most exciting shifts in language model architecture in years, with genuine potential to change how fast and flexible text generation can be. But that potential is gated, at least in part, by the caching problem. The teams solving it aren't just making these models faster. They're making a whole new way of generating language viable.

The next time you get a near-instant response from a chatbot, spare a thought for the engineering that made sure the AI didn't waste time rediscovering what it already knows.