Generative AI creates new content — images, text, audio, video — that did not exist before. Diffusion models (Stable Diffusion, DALL-E, Midjourney) have revolutionized image generation, while GANs and VAEs remain important for specific applications. Understanding generative model architectures is increasingly tested in ML interviews, especially at companies building creative AI tools. This guide covers the key generative model families with interview-level depth.
Diffusion Models
Diffusion models generate images by learning to reverse a noise-adding process.

Training:
(1) Forward process (adding noise): start with a clean image x_0 and add Gaussian noise over T steps: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * noise. After T steps (typically T = 1000), x_T is essentially pure noise. This process is NOT learned; it is a fixed mathematical procedure.
(2) Reverse process (denoising): a neural network (typically a U-Net) learns to predict the noise added at each step: given x_t, predict the noise epsilon_t. The model is trained with a simple loss: MSE between predicted noise and actual noise.
(3) Timestep conditioning: the network is conditioned on the timestep t (how noisy the image is) so it can denoise appropriately at each noise level.

Generation: start with pure noise x_T and apply the denoising network T times: x_{T-1} = denoise(x_T, T), x_{T-2} = denoise(x_{T-1}, T-1), ..., down to x_0, the final image. Each step removes a small amount of noise, gradually revealing a coherent image.

Why diffusion models succeed: the training objective is simple (predict noise, with none of the adversarial training instability of GANs), outputs are diverse (different noise inputs produce different images), and quality is excellent (state-of-the-art FID scores on image benchmarks).
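The forward process and training loss above can be sketched numerically. A minimal numpy sketch on a toy flattened "image" — the linear noise schedule and the all-zeros "prediction" are stand-in assumptions, not a real trained U-Net:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy forward diffusion on a flattened 64-value "image" (no real model involved).
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # DDPM-style linear noise schedule (an assumption)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative product alpha_bar_t

x0 = rng.standard_normal(64)         # pretend this is a clean image

# Stepwise forward process: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * noise
x = x0.copy()
for t in range(T):
    x = np.sqrt(alphas[t]) * x + np.sqrt(1.0 - alphas[t]) * rng.standard_normal(64)

# After T steps only sqrt(alpha_bar_T) of the original signal survives -- nearly none.
print(float(np.sqrt(alpha_bars[-1])))

# Training uses the equivalent closed form to jump to any timestep directly:
t = 500
eps = rng.standard_normal(64)
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# The loss is MSE between the network's noise prediction and eps;
# here a dummy all-zeros "prediction" stands in for a trained U-Net.
loss = np.mean((np.zeros(64) - eps) ** 2)
```

Note how the closed form lets training sample a random timestep per batch instead of simulating all T noising steps.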
Stable Diffusion and Text-to-Image
Stable Diffusion (Stability AI) generates images from text prompts.

Architecture:
(1) Text encoder (CLIP): converts the text prompt into a text embedding. CLIP was trained on 400M image-text pairs, so it understands the relationship between text descriptions and visual concepts.
(2) Latent diffusion: instead of diffusing in pixel space (high-dimensional: 512x512x3 = ~786K values), diffuse in a compressed latent space (64x64x4 = ~16K values). A VAE encoder compresses images into latent representations, the U-Net denoises in latent space (much cheaper), and a VAE decoder converts the denoised latent back to pixel space.
(3) Cross-attention conditioning: the U-Net's cross-attention layers attend to the text embedding, guiding the denoising toward an image that matches the text description. At each denoising step, the text embedding influences which noise is removed.

Generation: encode the text prompt with CLIP, start with random noise in latent space, denoise for 20-50 steps (each step applies the U-Net conditioned on the text embedding), then decode the final latent to pixels with the VAE decoder.

Classifier-Free Guidance (CFG): at each step, compute two noise predictions, one conditioned on the text (text-guided) and one unconditional. The final prediction is guided = unconditional + scale * (text_guided - unconditional). A higher scale (typically 7-15) produces images that match the prompt more closely but with less diversity.
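The CFG formula is just a linear extrapolation between the two predictions. A minimal numpy sketch — the "embeddings" are random stand-ins, not real model outputs:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_text, scale):
    """Classifier-free guidance: extrapolate toward the text-conditioned prediction."""
    return eps_uncond + scale * (eps_text - eps_uncond)

rng = np.random.default_rng(1)
eps_u = rng.standard_normal(16)   # unconditional noise prediction (stand-in values)
eps_t = rng.standard_normal(16)   # text-conditioned noise prediction (stand-in values)

guided = cfg_combine(eps_u, eps_t, scale=7.5)   # typical scale in the 7-15 range
```

Note that scale = 1 recovers the text-conditioned prediction exactly and scale = 0 the unconditional one; values above 1 push past the conditioned prediction, which is why high scales trade diversity for prompt adherence.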
GANs (Generative Adversarial Networks)
GANs train two networks adversarially: a Generator creates fake images from random noise, and a Discriminator distinguishes real images from fakes. The Generator improves by trying to fool the Discriminator; the Discriminator improves by detecting fakes more reliably. At equilibrium, the Generator produces images indistinguishable from real ones.

Training dynamics: the Generator's loss decreases when it fools the Discriminator; the Discriminator's loss decreases when it correctly classifies real vs. fake. If one network dominates, training fails: mode collapse (the Generator produces limited variety) or instability (oscillating losses).

GAN variants:
(1) StyleGAN (NVIDIA): generates high-quality faces with controllable style (age, gender, expression, hair). Uses progressive growing and a mapping network for a disentangled latent space.
(2) CycleGAN: image-to-image translation without paired data (horse -> zebra, summer -> winter).
(3) Pix2Pix: image-to-image translation with paired data (sketch -> photo, semantic map -> image).
(4) DCGAN: the foundational convolutional GAN architecture.

GANs vs. diffusion: GANs generate faster (a single forward pass vs. 20-50 denoising steps), but they are harder to train (mode collapse, training instability), produce less diverse outputs, and have largely been superseded by diffusion models for image generation. GANs remain relevant for real-time applications (fast generation), specific image-to-image translation tasks, and video generation (some approaches still use GAN discriminators).
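The adversarial objective can be sketched as two binary cross-entropy losses. A numpy sketch with invented stand-in discriminator scores (no actual networks); the generator loss shown is the common non-saturating form:

```python
import numpy as np

def bce(pred, target):
    # Binary cross-entropy on sigmoid outputs in (0, 1).
    eps = 1e-12
    return -np.mean(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))

# Stand-in discriminator scores D(x) in (0, 1) -- invented numbers for illustration.
d_real = np.array([0.90, 0.80, 0.95])   # D's scores on real images
d_fake = np.array([0.20, 0.10, 0.30])   # D's scores on generator samples

# The Discriminator wants real -> 1 and fake -> 0.
d_loss = bce(d_real, np.ones(3)) + bce(d_fake, np.zeros(3))

# The Generator (non-saturating form) wants D(fake) -> 1.
g_loss = bce(d_fake, np.ones(3))
```

With these scores the Discriminator is winning (low d_loss, high g_loss) — exactly the imbalance that, sustained over training, starves the Generator of gradient signal.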
VAEs (Variational Autoencoders)
VAEs learn a compressed latent representation of data that can be sampled to generate new data.

Architecture: an encoder maps input x to a distribution in latent space (mean mu and standard deviation sigma). A latent vector z is sampled from this distribution via the reparameterization trick: z = mu + sigma * epsilon, where epsilon ~ N(0, 1). A decoder reconstructs x from z.

Loss: reconstruction loss (MSE between input and output) plus a KL divergence term that regularizes the latent distribution toward a standard normal. The KL term keeps the latent space smooth and continuous: nearby points in latent space decode to similar outputs. This enables interpolation (blend two images by interpolating their latent vectors) and sampling (generate new images by sampling from the standard normal in latent space).

VAEs vs. GANs: VAEs produce blurrier outputs (the reconstruction loss encourages averaging, while the adversarial loss in GANs encourages sharp outputs), but VAEs have a well-behaved latent space that is useful for interpolation and manipulation. GANs produce sharper, more realistic images with a less interpretable latent space.

VAEs in practice: used in Stable Diffusion as the image compressor/decompressor (VAE encoder/decoder). Also used for anomaly detection (data far from the learned distribution is anomalous), data compression, and drug discovery (generating novel molecular structures).
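The reparameterization trick and the KL term have short closed forms. A numpy sketch on a toy 2-D latent — the mu and log-variance values are invented, and no encoder/decoder networks are involved:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # z = mu + sigma * epsilon keeps sampling differentiable w.r.t. mu and sigma.
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = np.array([0.5, -0.3])       # toy encoder outputs (invented values)
log_var = np.array([0.0, -1.0])  # log-variance parameterization, as commonly used

z = reparameterize(mu, log_var)
kl = kl_to_standard_normal(mu, log_var)
```

The KL term is zero exactly when mu = 0 and log_var = 0, i.e. when the encoder's distribution already matches the standard normal prior.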
Evaluation of Generative Models
Evaluating generative quality is hard — there is no single “correct” output.

Metrics:
(1) FID (Fréchet Inception Distance): compares the distribution of generated images with that of real images in the feature space of an Inception network. Lower FID means the generated images are closer to real ones in visual quality and diversity. The standard metric for image generation.
(2) IS (Inception Score): measures quality (each image should be confidently classified by Inception) and diversity (the set of images should cover many classes). Higher is better. Limitation: it never compares against the real data distribution.
(3) CLIP Score: for text-to-image, the cosine similarity between the CLIP embedding of the generated image and the CLIP embedding of the text prompt. Higher means the image matches the prompt better.
(4) Human evaluation: the gold standard. Show generated images to human raters and judge quality (does it look realistic?), relevance (does it match the prompt?), and preference (which of two generations is better?). Expensive, but necessary for production evaluation.

For interviews: know FID (the standard) and CLIP Score (for text-to-image), and mention that human evaluation is the ultimate measure. Recognizing that automated metrics are imperfect proxies shows maturity.