How Diffusion Models Work


2025-09-03

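Diffusion models are currently the dominant paradigm for image generation. They produce high-quality samples by learning to reverse a gradual noising process. This article covers their mathematical foundations, key architectures, and practical applications.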

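Core Idea

A diffusion model consists of two processes: forward diffusion, which gradually adds noise until the data becomes nearly pure noise, and reverse denoising, which recovers data from noise step by step.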

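Forward Diffusion Process

The forward process is a Markov chain that adds a small amount of Gaussian noise to the data at each step:

$$
q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)
$$

where $\beta_t$ is the noise schedule parameter. Because a chain of Gaussian steps is itself Gaussian, the reparameterization trick lets us sample the noisy version at any timestep $t$ directly from $x_0$:

$$
x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.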
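The closed form follows from composing two consecutive steps; the sketch below fills in that step. The symbols $\epsilon_t$, $\epsilon_{t-1}$ and $\bar{\epsilon}$ are introduced here for illustration and denote independent standard Gaussian noise terms.

```latex
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2}
       + \sqrt{\alpha_t (1-\alpha_{t-1})}\, \epsilon_{t-1}
       + \sqrt{1-\alpha_t}\, \epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2}
       + \sqrt{1-\alpha_t \alpha_{t-1}}\, \bar{\epsilon},
\qquad \bar{\epsilon} \sim \mathcal{N}(0, I)
\end{aligned}
```

The last line uses the fact that the sum of two independent zero-mean Gaussians is Gaussian with variance $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t\alpha_{t-1}$. Repeating the argument down to $x_0$ turns the product of $\alpha_s$ into $\bar{\alpha}_t$.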

import torch

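def forward_diffusion(x0, t, noise_schedule):
    """Forward diffusion: add t steps of noise to x0 in one shot."""
    # t is assumed to be a scalar index; for a batch of timesteps,
    # reshape alpha_bar so that it broadcasts against x0
    alpha_bar = noise_schedule['alpha_bar'][t]
    noise = torch.randn_like(x0)
    # x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise
    xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * noise
    return xt, noise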
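As a sanity check on the closed form, the sketch below runs the Markov chain one step at a time on a scalar and compares the empirical mean and variance with $\sqrt{\bar{\alpha}_T}\,x_0$ and $1-\bar{\alpha}_T$. It uses only the standard library. The helper names, the 50-step schedule, the starting value 2.0 and the sample count are illustrative assumptions, not part of the original article.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced betas from beta_start to beta_end (inclusive)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def simulate_chain(x0, betas, rng):
    """Apply q(x_t | x_{t-1}) once per beta, starting from the scalar x0."""
    x = x0
    for beta in betas:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)          # fixed seed so the run is repeatable
betas = linear_betas(50)        # short schedule keeps the simulation fast
alpha_bar_T = math.prod(1.0 - b for b in betas)

x0 = 2.0
samples = [simulate_chain(x0, betas, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
empirical_var = sum((s - empirical_mean) ** 2 for s in samples) / len(samples)

print(f"mean: empirical {empirical_mean:.3f}, closed form {math.sqrt(alpha_bar_T) * x0:.3f}")
print(f"var:  empirical {empirical_var:.3f}, closed form {1.0 - alpha_bar_T:.3f}")
```

The two columns should agree up to Monte Carlo error, which is why training can jump straight to any timestep instead of simulating the chain.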

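DDPM (Denoising Diffusion Probabilistic Models)

DDPM (Ho et al., 2020) is the foundational work behind modern diffusion models. The model learns to predict the noise $\epsilon$ that was mixed into $x_0$ to produce $x_t$, and the training objective minimizes the MSE between the predicted and the true noise:

$$
\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]
$$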

import torch.nn as nn

class SimpleDiffusion(nn.Module):
    def __init__(self, model, num_timesteps=1000):
        super().__init__()
        self.model = model  # U-Net noise-prediction network
        self.T = num_timesteps
        # Linear noise schedule
        betas = torch.linspace(1e-4, 0.02, num_timesteps)
        alphas = 1 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)
        # Buffers follow the module across devices but are not trained
        self.register_buffer('betas', betas)
        self.register_buffer('alpha_bar', alpha_bar)

    def training_loss(self, x0):
        batch_size = x0.shape[0]
        # Sample a random timestep for each example in the batch
        t = torch.randint(0, self.T, (batch_size,), device=x0.device)
        noise = torch.randn_like(x0)
        # Forward diffusion; the reshape assumes x0 has shape (B, C, H, W)
        alpha_bar_t = self.alpha_bar[t].view(-1, 1, 1, 1)
        xt = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1 - alpha_bar_t) * noise
        # Predict the noise and compare it with the noise actually added
        predicted_noise = self.model(xt, t)
        loss = nn.functional.mse_loss(predicted_noise, noise)
        return loss

Sampling

# Method of SimpleDiffusion: place it inside the class body
@torch.no_grad()
def sample(self, shape):
    """Generate images by denoising step by step from pure noise."""
    x = torch.randn(shape, device=self.betas.device)

    for t in reversed(range(self.T)):
        t_batch = torch.full((shape[0],), t, device=x.device, dtype=torch.long)
        predicted_noise = self.model(x, t_batch)

        alpha = 1 - self.betas[t]
        alpha_bar = self.alpha_bar[t]

        # One denoising step: mean of p(x_{t-1} | x_t)
        x = (1 / torch.sqrt(alpha)) * (
            x - (self.betas[t] / torch.sqrt(1 - alpha_bar)) * predicted_noise
        )
        # Add fresh noise with variance beta_t (except on the final step)
        if t > 0:
            x += torch.sqrt(self.betas[t]) * torch.randn_like(x)

    return x
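To see the sampling update work without training a network, the sketch below runs the same loop on a single scalar and replaces the U-Net with an exact "oracle" noise predictor. It assumes a toy dataset in which every example equals the constant `MU`, so the true noise can be solved for from $x_t$ directly. The names `ddpm_sample_scalar`, `oracle_noise` and `MU` are hypothetical and exist only for this illustration.

```python
import math
import random

def ddpm_sample_scalar(predict_noise, betas, rng):
    """Ancestral DDPM sampling for one scalar, using sigma_t^2 = beta_t."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)

    x = rng.gauss(0.0, 1.0)  # start from pure noise
    for t in reversed(range(len(betas))):
        beta, alpha_bar = betas[t], alpha_bars[t]
        eps = predict_noise(x, t, alpha_bar)
        # Posterior mean: subtract the scaled predicted noise, then rescale
        x = (x - beta / math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(1.0 - beta)
        if t > 0:  # no fresh noise on the final step
            x += math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

MU = 3.0  # hypothetical dataset: every training example equals MU

def oracle_noise(x, t, alpha_bar):
    """Exact noise for point-mass data: invert x_t = sqrt(ab)*MU + sqrt(1-ab)*eps."""
    return (x - math.sqrt(alpha_bar) * MU) / math.sqrt(1.0 - alpha_bar)

# Same linear schedule as SimpleDiffusion: 1000 steps from 1e-4 to 0.02
betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]
sample = ddpm_sample_scalar(oracle_noise, betas, random.Random(0))
print(f"sample = {sample:.6f}, target = {MU}")
```

With a perfect noise predictor the final step lands on `MU` up to floating-point rounding, whatever noise the loop started from. A trained network only approximates this oracle, which is where sample quality is won or lost.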

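Noise Scheduling

The noise schedule determines how much noise is added at each step and has a large effect on sample quality.

Schedule	Description	Notes
Linear	beta increases linearly	Original DDPM scheme
Cosine	Based on a cosine function	Smoother; from Improved DDPM
Scaled Linear	Linear in sqrt(beta), then squared	Used by Stable Diffusion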
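The three schedules in the table can be compared by looking at how much signal, $\bar{\alpha}_t$, survives at a given step. The sketch below is a plain-Python illustration under stated assumptions: the cosine offset `s=0.008` and the clip at 0.999 follow the Improved DDPM formulation, and the scaled-linear endpoints (0.00085 and 0.012) are values commonly seen in Stable Diffusion v1 scheduler configs. Check them against the scheduler you actually use.

```python
import math

def cumulative_alpha_bar(betas):
    """Running product of (1 - beta): the fraction of signal variance left."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    return [beta_start + i * (beta_end - beta_start) / (T - 1) for i in range(T)]

def scaled_linear_betas(T, beta_start=0.00085, beta_end=0.012):
    # Interpolate linearly in sqrt(beta), then square
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + i * (hi - lo) / (T - 1)) ** 2 for i in range(T)]

def cosine_betas(T, s=0.008, max_beta=0.999):
    # alpha_bar follows a squared cosine; betas are recovered from its ratios
    def f(u):
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1 - f((i + 1) / T) / f(i / T), max_beta) for i in range(T)]

T = 1000
schedules = {
    "linear": cumulative_alpha_bar(linear_betas(T)),
    "cosine": cumulative_alpha_bar(cosine_betas(T)),
    "scaled_linear": cumulative_alpha_bar(scaled_linear_betas(T)),
}
for name, alpha_bar in schedules.items():
    row = ", ".join(f"t={t}: {alpha_bar[t]:.4f}" for t in (0, 249, 499, 749, 999))
    print(f"{name:>13} | {row}")
```

Halfway through, the cosine schedule still keeps roughly half of the signal variance while the linear schedule has already destroyed most of it, which is the sense in which the cosine schedule is "smoother".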

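U-Net Architecture

The denoising network in a diffusion model is usually a U-Net, an encoder-decoder structure with skip connections.

Key components:

Time embedding: encodes the discrete timestep t as a sinusoidal positional encoding and injects it into every residual block
Self-Attention: applied on low- and mid-resolution feature maps to capture global relationships
Cross-Attention: injects conditioning information (such as text embeddings)
ResNet blocks: each downsampling/upsampling level contains several residual blocks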
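The time embedding in the list above can be sketched in a few lines. This is one common variant of the sinusoidal encoding (sine terms followed by cosine terms, with geometrically spaced frequencies). Real implementations differ in ordering and frequency spacing, and usually pass the result through a small MLP before adding it to each residual block. The function name and the `max_period` default are assumptions made for this example.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t as a list of length dim (dim even)."""
    half = dim // 2
    # Frequencies decay geometrically from 1 down towards 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(t=10, dim=8)
print([round(v, 4) for v in emb])
```

Nearby timesteps get similar vectors and distant ones get clearly different vectors, so the network can condition smoothly on the noise level.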

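Classifier-Free Guidance

CFG mixes the conditional and unconditional predictions at inference time to strengthen the agreement between the generated result and the condition:

$$
\hat{\epsilon} = \epsilon_\theta(x_t, \varnothing) + w \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))
$$

where $w$ is the guidance scale; 7.5 is a common default. With $w = 1$ the formula reduces to the purely conditional prediction, and larger values push the result further toward the condition. During training, the condition is randomly dropped (replaced by an empty condition) with some probability, for example 10%, so that a single network learns both predictions.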

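def guided_sample_step(model, xt, t, condition, guidance_scale=7.5):
    """One Classifier-Free Guidance sampling step."""
    # Unconditional prediction (the model is assumed to treat None as the empty condition)
    noise_uncond = model(xt, t, condition=None)
    # Conditional prediction
    noise_cond = model(xt, t, condition=condition)
    # Extrapolate from the unconditional toward the conditional prediction
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    return noise_pred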
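A small numeric sketch makes the role of the guidance scale concrete and also shows the training-time condition dropout. Plain lists stand in for noise tensors, and `cfg_combine` and `maybe_drop_condition` are hypothetical helper names introduced here. Using `None` as the empty condition is likewise an assumption.

```python
import random

def cfg_combine(eps_uncond, eps_cond, w):
    """Elementwise eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def maybe_drop_condition(condition, rng, drop_prob=0.1, null_condition=None):
    """Training-time dropout: replace the condition with the empty one at rate drop_prob."""
    return null_condition if rng.random() < drop_prob else condition

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
print(cfg_combine(eps_uncond, eps_cond, 0.0))  # [0.0, 1.0]  -> unconditional
print(cfg_combine(eps_uncond, eps_cond, 1.0))  # [1.0, 3.0]  -> conditional
print(cfg_combine(eps_uncond, eps_cond, 7.5))  # [7.5, 16.0] -> extrapolated

rng = random.Random(0)
dropped = sum(maybe_drop_condition("a cat", rng) is None for _ in range(10000))
print(f"dropped {dropped} of 10000 conditions")
```

For $w > 1$ the result lies beyond the conditional prediction, which is why very large scales tend to trade diversity and naturalness for prompt adherence.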

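Latent Diffusion Models (Latent Diffusion / Stable Diffusion)

The core innovation of Stable Diffusion is running the diffusion process in a latent space rather than in pixel space, which greatly reduces computational cost.

Components:

VAE: compresses a 512x512 image into a 64x64x4 latent (48x fewer values than the 512x512x3 RGB input)
U-Net: performs denoising in the latent space
CLIP Text Encoder: encodes the text prompt into conditioning vectors
Cross-Attention: injects the text condition into the U-Net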
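The 48x figure for the VAE is simple arithmetic over tensor sizes, assuming a 3-channel RGB input. The helper name below is chosen for this example only.

```python
def num_values(height, width, channels):
    """Number of scalar entries in a height x width x channels tensor."""
    return height * width * channels

pixel_values = num_values(512, 512, 3)   # RGB image
latent_values = num_values(64, 64, 4)    # VAE latent
compression = pixel_values // latent_values
downsample_factor = 512 // 64

print(pixel_values, latent_values)  # 786432 16384
print(compression)                  # 48
print(downsample_factor)            # 8 (per spatial side)
```

Every U-Net forward pass therefore touches 48 times fewer values than it would in pixel space, and attention layers see 64 times fewer spatial positions (64x64 instead of 512x512).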

ControlNet

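ControlNet adds a trainable branch to the Stable Diffusion U-Net that accepts an additional control signal (such as an edge map, depth map, or pose), enabling precise spatial control.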

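import cv2
import numpy as np
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
)

# Build the Canny edge map. "input.png" is a hypothetical path and the
# thresholds (100, 200) are illustrative assumptions.
source = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(source, 100, 200)
canny_edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Use the Canny edge map to control generation
image = pipe(
    prompt="a beautiful landscape painting",
    image=canny_edge_image,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]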

SDXL

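SDXL is an upgraded version of Stable Diffusion. Its main improvements:

A larger U-Net (2.6B parameters)
Two text encoders (CLIP ViT-L + OpenCLIP ViT-bigG)
Two-stage generation: base model + refiner model
Support for higher resolution (1024x1024)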

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(
    prompt="A photo of an astronaut riding a horse on Mars",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=1024,
    width=1024,
).images[0]

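Summary

Diffusion models achieve high-quality generation through a forward-noising, reverse-denoising framework. From DDPM to Latent Diffusion to SDXL, the key innovations include better noise schedules, diffusion in latent space, Classifier-Free Guidance, and conditional control mechanisms such as ControlNet. Diffusion models have become a general-purpose paradigm for generating images, video, audio, and 3D content.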

Contributors

withesse

Last updated

2026/3/14 13:09
