Stable Diffusion: full architecture (UNet + VAE + CLIP)

Stable Diffusion (Rombach et al., 2022) made high-quality text-to-image generation accessible by combining three ideas, each powerful on its own: running diffusion in a compressed latent space produced by a VAE, conditioning on text embeddings from CLIP, and denoising with a U-Net. Understanding how these three components fit together is the key to understanding why SD works and how to control it.
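To make the division of labour concrete, the following minimal sketch traces tensor shapes through the three components during sampling. It does not load or run any model. The sizes (512-pixel images, an 8x VAE downsampling factor, 4 latent channels, 77 CLIP tokens of width 768) are assumptions taken from the commonly used SD v1 configuration rather than from the text above, and the names `SDShapes` and `trace_generation` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SDShapes:
    """Hypothetical container for SD v1-style sizes (assumed, not from the text)."""
    image_size: int = 512      # pixels per side
    image_channels: int = 3    # RGB
    vae_downsample: int = 8    # spatial reduction factor of the VAE encoder
    latent_channels: int = 4   # channels of the latent that the U-Net denoises
    text_tokens: int = 77      # CLIP context length
    text_dim: int = 768        # CLIP text embedding width


def image_shape(cfg):
    return (cfg.image_channels, cfg.image_size, cfg.image_size)


def latent_shape(cfg):
    side = cfg.image_size // cfg.vae_downsample
    return (cfg.latent_channels, side, side)


def text_shape(cfg):
    return (cfg.text_tokens, cfg.text_dim)


def numel(shape):
    """Number of scalar values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n


def trace_generation(cfg):
    """List the stages of text-to-image sampling with the shape each one outputs."""
    return [
        ("CLIP text encoder: prompt -> embeddings", text_shape(cfg)),
        ("initial Gaussian noise in latent space", latent_shape(cfg)),
        ("U-Net: (latent, timestep, embeddings) -> predicted noise", latent_shape(cfg)),
        ("VAE decoder: denoised latent -> image", image_shape(cfg)),
    ]


cfg = SDShapes()
for stage, shape in trace_generation(cfg):
    print(f"{stage}: {shape}")

# How many pixel values each latent value stands in for.
ratio = numel(image_shape(cfg)) // numel(latent_shape(cfg))
print("pixel values per latent value:", ratio)
```

Under these assumed sizes the latent holds 48 times fewer values than the image, which is why the repeated U-Net denoising steps are so much cheaper in latent space than they would be on pixels. The VAE decoder runs only once, at the end, and CLIP runs only once, at the start.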
