Diffusion models: the intro
How do prompts become images? Crash introduction to the science behind the magic
In this post I will explain the core ideas powering image generation models - rigorously, but with as little jargon as possible. In the good tradition of academic books (once a math guy, always a math guy), we start with the theory.
Why does it work?
Diffusion models are the magic behind all the AI-generated images you've probably seen: Dall-E, Stable Diffusion, Flux, Nano Banana. The core idea is simple: to create something, the AI first learns how to destroy it - not exactly intuitive, I know. But bear with me. The process has two steps:
The forward process (= the destruction): it starts by taking a picture and then adding more and more random noise to it. Doing this step by step *does* work, but it is horribly inefficient - which is why we have a computational shortcut that lets us jump from the clean image to any level of noise in a single shot, without going through the intermediate small steps (see the sketch after this list). Because of this, the algo can generate training pairs {clean, noisy} almost instantly.
The reverse process (= the creation): the model starts with a noisy picture, and its task is to figure out exactly how to clean it up a bit, i.e. to reverse one step of that noising process. This part is repeated millions of times on images with different levels of noise: this is the part the algo learns. At every single step, the neural network looks at the current noisy picture and has to make its best guess at what the previous, slightly cleaner version looked like.
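To make that "single shot" shortcut concrete, here is a minimal PyTorch sketch (the variable names are mine, not from any particular library): with a fixed noise schedule, the noisy image at step t is just a weighted mix of the clean image and fresh Gaussian noise - no loop over intermediate steps required.

```python
import torch

# A fixed "noise schedule": how much noise gets added at each of T steps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # per-step noise amounts
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # how much of the original image survives by step t

def add_noise(clean_image, t):
    """Jump from the clean image straight to noise level t, in a single shot."""
    noise = torch.randn_like(clean_image)
    signal_scale = alphas_cumprod[t].sqrt()
    noise_scale = (1.0 - alphas_cumprod[t]).sqrt()
    return signal_scale * clean_image + noise_scale * noise, noise

# Example: a random 3x64x64 "image", noised to step 500 without visiting steps 1..499.
x0 = torch.rand(3, 64, 64)
x500, added_noise = add_noise(x0, t=500)
```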
Mathematically, figuring out the perfect reverse step is impossible - but if you control the amount of noise added at each step, you can reduce the problem to merely “computationally expensive”. Throw enough GPUs at a deep neural network, and voila: we can un-bake the cake. The original idea was to take the noisy picture and predict what the previous, cleaner version looked like - not feasible. What *could* be predicted successfully was the noise:
the algo adds real noise to the image
we predict the noise
we compare the predicted noise to the real one and train the model to minimize the difference.
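Putting those three bullets together, one training step looks roughly like this - a sketch only, reusing `add_noise` and `T` from the earlier snippet, with `model` standing in for whatever network predicts the noise:

```python
import torch
import torch.nn.functional as F

def training_step(model, clean_images, optimizer):
    # 1. add real noise to the images, at a random point of the schedule
    t = torch.randint(0, T, (1,)).item()
    noisy_images, real_noise = add_noise(clean_images, t)

    # 2. the network predicts which noise it thinks was added
    predicted_noise = model(noisy_images, t)

    # 3. train to minimize the difference between predicted and real noise
    loss = F.mse_loss(predicted_noise, real_noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```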
The additional advantage of learning to predict the noise is that the noise schedule (= how much noise we add at different steps in the forward process) acts like a built-in teacher:
high noise → the algo is forced to learn the big, general shapes (e.g. “blob at the top is probably a head”)
low noise → the focus shifts to fine-grained details and textures
By training on this whole spectrum of noise all at once, we force the model to develop a multi-scale understanding of what images are supposed to look like: from the biggest shapes down to the tiny details.
Once the model has been trained, the algorithm becomes very good at cleaning up noise. You can now use it to generate an image: when you type a prompt, you are giving the model an image of pure noise and a guideline on how to de-noise it (since there are near-infinite possibilities). The reverse process is applied, step by step, to create a totally new image from scratch.
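Here is what that looks like as code - a stripped-down version of the reverse loop, reusing `betas`, `alphas_cumprod` and `T` from the earlier sketch, using the basic DDPM-style update, and leaving the text conditioning out entirely for brevity:

```python
import torch

@torch.no_grad()
def generate(model, shape=(3, 64, 64)):
    x = torch.randn(shape)                    # start from 100% pure static
    alphas = 1.0 - betas
    for t in reversed(range(T)):              # walk back from very noisy to clean
        predicted_noise = model(x, t)
        # peel off this step's share of the predicted noise
        x = (x - betas[t] / (1.0 - alphas_cumprod[t]).sqrt() * predicted_noise) / alphas[t].sqrt()
        if t > 0:
            # re-inject a little randomness, except at the very last step
            x = x + betas[t].sqrt() * torch.randn(shape)
    return x
```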
The whole idea is borrowed from thermodynamics, and the mathematical framework that powers it is DDPM: denoising diffusion probabilistic models. A proper technical explanation can be found in the reading list (see the bottom of the page).
Deep dive: the architecture
So the idea is sound - how do we turn it into practice? It takes a few components.
Efficiency in the latent space
The first diffusion models had a massive problem: they tried to do all their work directly on the pixels of the image - doable, but extremely slow. What if we were to compress the image, do all the hard work on the small version, and then inflate it back? Enter the Variational Autoencoder (VAE) - a smart compression solution composed of two parts:
The Encoder: it takes a high-res image and squishes it down into a much smaller "latent" map. This tiny map is like a compressed blueprint that keeps all the important info about what's in the picture.
The Decoder: this part takes the latent map and rebuilds it back into a full-resolution image almost perfectly (= without much information loss).
The *entire* slow, step-by-step diffusion process happens in the compressed "latent space" - this trick cuts the computational cost by several orders of magnitude.
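As an illustration, here is a sketch of that workflow with the `AutoencoderKL` class from Hugging Face's diffusers library - the checkpoint name and the 0.18215 scaling factor are the ones commonly used with Stable Diffusion 1.x, so treat them as an example rather than gospel:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.rand(1, 3, 512, 512) * 2 - 1       # a fake image, scaled to [-1, 1]

with torch.no_grad():
    # Encoder: 512x512x3 pixels -> 64x64x4 latent "blueprint"
    latents = vae.encode(image).latent_dist.sample() * 0.18215

    # ... the whole slow diffusion process would happen here, on the small latents ...

    # Decoder: inflate the latents back into a full-resolution image
    reconstruction = vae.decode(latents / 0.18215).sample

print(image.shape, "->", latents.shape)          # [1, 3, 512, 512] -> [1, 4, 64, 64]
```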
The denoising engine
The workhorse of the entire system is the U-Net architecture, which is responsible for cleaning up the noise at every single step. It’s called a U-Net because, well, that’s what its diagram looks like:
How does it work?
Going down the U (encoder): it takes the noisy image and shrinks it down, layer by layer. As it shrinks the image, it's trying to understand the "big picture" concepts.
Going up the U (decoder): it takes that compressed, abstract idea and builds it back up to the original size, adding more and more detail until it can make its final prediction of what the noise looks like.
The most important part is the skip connections (the horizontal arrows in the graph above): direct connections from the "down" side of the U to the "up" side. When the image is shrunk, fine-grained details like sharp edges and textures are lost. The skip connections act like a cheat sheet: they let the algo recover that earlier information as it’s building the image back up.
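To see the skip connections in action, here is a deliberately tiny toy U-Net in PyTorch (nothing like production size - real ones are much deeper and also take the timestep and the text prompt as inputs). The `torch.cat` line is the cheat sheet: the decoder gets the encoder's fine-grained features handed straight across.

```python
import torch
import torch.nn as nn

class ToyUNet(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.down1 = nn.Conv2d(channels, 32, 3, padding=1)                # top of the U
        self.down2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)            # shrink: "big picture" view
        self.up1   = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)   # grow back up
        self.out   = nn.Conv2d(64, channels, 3, padding=1)                # 64 = 32 upsampled + 32 skipped

    def forward(self, x):
        d1 = torch.relu(self.down1(x))       # fine details live here
        d2 = torch.relu(self.down2(d1))      # compressed, abstract view
        u1 = torch.relu(self.up1(d2))        # back to the original resolution
        u1 = torch.cat([u1, d1], dim=1)      # skip connection: reuse the fine details
        return self.out(u1)                  # final prediction of the noise

noise_pred = ToyUNet()(torch.randn(1, 3, 64, 64))  # output shape: (1, 3, 64, 64)
```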
Last but absolutely not least: ATTENTION. The paper that made it the centerpiece of modern ML has its own Wiki page - that’s how big of a deal it is:
https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
This is the part that lets the algo pay attention to your text prompt: when you write "a woman in a red hat," the cross-attention part is what connects the word "red" to the "hat" part of the image it's drawing. It's the bridge between your words and the result.
Images and words
The text encoder
The main image-drawing AI is a brilliant artist, but it's also illiterate (not unheard of in history). It only understands numbers → we need a translator: the text encoder. The big models usually use a powerful, pre-trained one - CLIP is a popular choice:
First, a tokenizer cuts your sentence up into smaller pieces (to estimate text size you can assume 1 token ≈ 1 word; it’s not quite correct, but gets you in the ballpark)
The tokens are fed into a transformer that analyzes the concepts and the relationships between them, producing a numerical vector representation of your text: an embedding.
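As an illustration, here is roughly what that pipeline looks like with the CLIP text encoder from Hugging Face's transformers library (the checkpoint name is just a commonly used one; every big image model ships its own variant):

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a woman in a red hat"

# 1. the tokenizer cuts the sentence into pieces and maps them to ids
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")

# 2. the transformer turns the token ids into one embedding vector per token
embeddings = text_encoder(**tokens).last_hidden_state
print(embeddings.shape)   # torch.Size([1, 77, 768]) - 77 token slots, 768 numbers each
```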
The cross-attention mechanism
Cross-attention is the mechanism that injects the textual guidance into the spatial domain of the image being generated - or, in plain English, it connects the text (= what you want) to how the noise becomes an image. It operates within the attention blocks of the U-Net and has three key components:
The Query vectors are derived from the spatial features of the noisy image latent representation. Each query can be thought of as a question from a specific patch of the image: "What content should be here?"
The Key and Value vectors are derived from the text embeddings produced by the text encoder. The keys represent the semantic concepts available in the prompt, and the values contain the information associated with those concepts.
Long story short, the image and the text now live as vectors in the same space - which means we can treat them with linear algebra: similarity scores, normalization, etc. The query for each patch of the image is compared against every key, and the resulting similarity scores decide how much of each value gets blended into that patch - this is what tells the algo what to draw there. The process is repeated at multiple layers and spatial resolutions within the U-Net, allowing for a highly detailed and context-aware alignment between the text prompt and the final image.
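Stripped of all the engineering, one cross-attention layer is only a few lines. The sketch below uses made-up dimensions; real models add multiple heads, normalization and other refinements:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, image_dim=320, text_dim=768, inner_dim=320):
        super().__init__()
        self.to_q = nn.Linear(image_dim, inner_dim)  # queries come from the image patches
        self.to_k = nn.Linear(text_dim, inner_dim)   # keys come from the text embedding
        self.to_v = nn.Linear(text_dim, inner_dim)   # values come from the text embedding

    def forward(self, image_patches, text_embeddings):
        q = self.to_q(image_patches)                 # (batch, num_patches, inner_dim)
        k = self.to_k(text_embeddings)               # (batch, num_tokens, inner_dim)
        v = self.to_v(text_embeddings)

        # similarity between every patch ("what should be here?") and every token
        scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
        weights = scores.softmax(dim=-1)             # normalize: how much each word matters per patch

        return weights @ v                           # blend the word information into each patch

attn = CrossAttention()
out = attn(torch.randn(1, 64 * 64, 320), torch.randn(1, 77, 768))  # -> (1, 4096, 320)
```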
Orchestrating the de-noising: the scheduler
The U-Net can clean up a noisy image at any given step - but when you're making a picture, you have to start with 100% pure static and guide it all the way down to a clean image. The recipe for doing that is the scheduler: the algorithm that takes the U-Net's prediction and uses a specific formula to calculate what the next, slightly less noisy image should look like.
The biggest reason to care about which scheduler you pick is the speed vs. quality trade-off.
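With a library like Hugging Face diffusers, playing with that trade-off is a one-line swap - the pipeline and scheduler names below are just one common setup, and the step counts are ballpark illustrations, not recommendations:

```python
from diffusers import StableDiffusionPipeline, DDIMScheduler, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Slower, "classic" sampling: many small denoising steps.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
image_slow = pipe("a woman in a red hat", num_inference_steps=50).images[0]

# A faster scheduler can reach comparable quality in far fewer steps.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image_fast = pipe("a woman in a red hat", num_inference_steps=20).images[0]
```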
What can we do with it?
text-to-image: you type a prompt, and it generates a picture. Classic, everyone has seen it by now :-)
image-to-image: you start with an image and you want to change the style. Instead of starting with pure static, the process takes your original picture, adds some noise to it, and then does the same cleanup routine guided by your new text prompt (see the sketch after this list). The whole idea is to apply a new style to your picture while keeping the main subject intact.
Inpainting: You can draw a mask over a part of an image you don't like and the AI will fill it in. It looks at all the stuff around the hole you made and generates something new that matches the lighting, style, and content.
Outpainting: Maybe instead of changing something in the image you want to extend a picture beyond its original borders. It works like a big-picture version of inpainting: you put your photo on a bigger, blank canvas and tell the algo to treat all the blank space as a "hole" to be filled in.
Instruction-based editing: You modify the image using natural language commands, like "make his shirt blue" or "add sunglasses to her face”. This is based on inversion: you feed your real picture into the algo and say effectively "Tell me the exact recipe - the specific starting noise and latent representation - that would have created this exact image". Once you have that original recipe, you can make tiny edits to it and then run the process forward again. This is what allows for precise editing while keeping everything else you want to preserve intact - in theory anyway, since in practice ymmv.
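To make a couple of these concrete: here is how image-to-image and inpainting look in Hugging Face diffusers (one possible setup among many; the `strength` knob controls how much noise gets added to your original picture before the cleanup starts):

```python
from diffusers import StableDiffusionImg2ImgPipeline, StableDiffusionInpaintPipeline
from PIL import Image

original = Image.open("photo.png").convert("RGB")

# image-to-image: add some noise to the original, then denoise towards the new prompt
img2img = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
stylized = img2img(prompt="the same scene as a watercolor painting",
                   image=original, strength=0.6).images[0]

# inpainting: white pixels in the mask mark the "hole" to be filled in
mask = Image.open("mask.png").convert("RGB")
inpaint = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
fixed = inpaint(prompt="a wooden bench", image=original, mask_image=mask).images[0]
```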
The future
The core idea of diffusion can be applied to other data types - and it can be improved for images as well.
Beyond 2D
The success of diffusion models in 2D image synthesis has catalyzed intense research into applying the same principles to higher-dimensional and more structured data types.
Video Generation
So the big jump now is from still images to video, and the main hurdle is 'temporal coherence': in plain English, it's not enough for each frame to look real on its own, they have to flow together logically. In practice this means making sure things move correctly and that a person doesn't suddenly have a different face halfway through.
To get this to work, people are taking the classic math approach: how does a mathematician conceptualize a 5-dimensional space? He imagines an n-dimensional one, and then substitutes n = 5.
If you think of time as just another dimension, adapting the usual U-Net / Transformer to work with 3D data (= 2D image + time dimension) is not trivial, but fairly elementary. The model learns to denoise an entire video clip at once, and the attention mechanism looks at pixels both within and across different frames.
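One common way to bolt the time dimension onto an existing 2D architecture is to simply reshape: for each pixel location, run attention across the frames. The sketch below is purely conceptual - real video models interleave spatial and temporal attention and are far more elaborate:

```python
import torch
import torch.nn as nn

frames = torch.randn(1, 16, 320, 32, 32)           # (batch, time, channels, height, width)
b, t, c, h, w = frames.shape

temporal_attn = nn.MultiheadAttention(embed_dim=320, num_heads=8, batch_first=True)

# treat every pixel location as its own sequence of 16 time steps
x = frames.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
x, _ = temporal_attn(x, x, x)                       # attention across frames only
frames = x.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)  # back to (batch, time, channels, h, w)
```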
3D Generation
Unlike a flat 2D picture, you can describe a 3D object in different ways: like a cloud of dots, a grid of tiny cubes (voxels), a mesh, to name just a few. People are trying two main approaches right now:
Direct: take the 3D model and apply that same "add noise, then learn to remove it" trick. It's about teaching the AI how to build a 3D shape from complete scratch.
Coaching with 2D: this approach takes advantage of how good the 2D image models already are. You start with a rough 3D blob, and keep tuning using 2D shots from different angles - with an image generation model acting as a scorer.
Audio Generation
It turns out the diffusion idea isn't limited to images: by applying that same step-by-step denoising process to audio formats like spectrograms or even raw sound waves, these models are getting state-of-the-art results for tasks like text-to-speech, music generation, and general sound effects.
Emerging research
We have come a long way since the first Stable Diffusion or Dall-E, but there is a lot going on around image generation models:
Improving efficiency: we get much more, much faster than we used to - but the fact remains, sampling is slow (even on beefy GPUs). People are working on schedulers that can achieve high-quality results in fewer steps, and on model distillation (= a large, slow "teacher" diffusion model is used to train a compact, fast "student" model capable of single-step generation). The core idea behind the latter is very similar to what happens with text LLMs: you take a huge model - think DeepSeek and its gazillion parameters - generate a ton of content from the big one, and then finetune a smaller one (e.g. Llama 14B) to replicate the big one.
Architectural innovations: replacing the U-Net backbone with Diffusion Transformers (more scalable) seems promising for improving both performance and efficiency.
Faithfulness: even though text prompts are super powerful, the models can still mess up or just ignore you if your instructions are too complicated or nuanced (remember: we are embedding the prompt, so information loss is inevitable). This is especially true if what you're asking for goes against all the "normal" images it was trained on (rule 34, no exceptions).
Deepening theoretical understanding: even though diffusion models are incredibly successful, we don't have a complete theoretical grasp on why they work so well. The math and theory behind them are still playing catch-up to the real-world results. There are still some big open questions that researchers are trying to answer: for instance, there isn't a formal explanation for how they're so good at generalizing what they've learned to create totally new images.
Conclusion
That concludes our crash intro - you should now have a good grasp of how diffusion models work. If you're interested in learning more, here is a reading list: