Rethinking Classifier-Free Guidance for Diffusion Models
Abstract
...However, employing CFG requires either training an unconditional model alongside the main diffusion model or modifying the training procedure by periodically inserting a null condition. There is also no clear extension of CFG to unconditional models. In this paper, we revisit the core principles of CFG and introduce a new method, independent condition guidance (ICG), which provides the benefits of CFG without the need for any special training procedures. Our approach streamlines the training process of conditional diffusion models and can also be applied during inference on any pretrained conditional model. Additionally, by leveraging the time-step information encoded in all diffusion networks, we propose an extension of CFG, called timestep guidance (TSG), which can be applied to any diffusion model, including unconditional ones. Our guidance techniques are easy to implement and have the same sampling cost as CFG. Through extensive experiments, we demonstrate that ICG matches the performance of standard CFG across various conditional diffusion models. Moreover, we show that TSG improves generation quality in a manner similar to CFG, without relying on any conditional information.
Introduction
Diffusion models have recently emerged as the main methodology [DDPM, SD] ... [basic explain of diffusion models] ... [noise sample not always high-quality, controls like CG, CFG will limit the diversity].
CFG's limitations
- CFG requires the underlying model to be trained in a specific way so that it also learns the unconditional score function, typically by replacing the conditioning vector with a null vector with probability p. Replacing the condition may also not be straightforward when the model is multimodal and consumes several conditioning signals, such as text, images, and audio, at the same time, or when the null vector (usually the zero vector in practice) carries a specific meaning.
- It is not clear how to extend the benefits of classifier-free guidance beyond conditional models to unconditional generation.
Contributions
independent condition guidance (ICG): In this paper, we analyze the methodology behind classifier-free guidance and show theoretically that similar behavior can be achieved without additional training of an unconditional model. The main idea is that by using a conditioning vector independent of the input data, the conditional score function becomes equivalent to the unconditional score. This insight leads us to propose independent condition guidance (ICG), a method that replicates the behavior of CFG at inference time without requiring separate training of an unconditional model.
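As a concrete illustration, the ICG update described above can be sketched as follows. The `denoiser(z_t, t, y)` interface and the distribution of the independent condition (a standard Gaussian embedding here) are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def icg_denoise(denoiser, z_t, t, y, w, cond_dim, rng=None):
    """Independent condition guidance (ICG) sketch: the CFG null branch is
    replaced by a condition drawn independently of the input, so no
    unconditional model needs to be trained.
    Assumptions: a generic denoiser(z_t, t, y) callable and a Gaussian
    random condition embedding of dimension cond_dim."""
    rng = np.random.default_rng(rng)
    y_indep = rng.standard_normal(cond_dim)   # independent of z_t and y
    d_cond = denoiser(z_t, t, y)
    d_indep = denoiser(z_t, t, y_indep)       # stands in for the unconditional branch
    return d_indep + w * (d_cond - d_indep)
```

With `w = 1` the update reduces to the plain conditional prediction, mirroring the unguided case of CFG.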
timestep guidance (TSG): Inspired by the above, we also introduce a novel technique to extend classifier-free guidance to a more general setting that includes unconditional generation. This method, which we call timestep guidance (TSG), employs a perturbed version of the time-step embedding in diffusion models to create a guidance signal similar to CFG. Time-step guidance aims to improve the accuracy of denoising at each sampling step by leveraging the time-step information learned by the diffusion model to steer sampling trajectories toward better noise-removal paths.
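The TSG idea can likewise be sketched in a few lines. The exact form of the time-step perturbation is not specified here, so the additive shift below is purely an illustrative assumption:

```python
import numpy as np

def tsg_denoise(denoiser, z_t, t, w, delta=0.1):
    """Timestep guidance (TSG) sketch: the weak branch evaluates the
    denoiser at a perturbed time step. The additive perturbation
    (t + delta, clipped to [0, 1]) is an assumption for illustration."""
    t_pert = min(t + delta, 1.0)       # perturbed time-step input
    d_main = denoiser(z_t, t)
    d_weak = denoiser(z_t, t_pert)     # prediction with the perturbed time step
    return d_weak + w * (d_main - d_weak)
```

As with CFG, `w = 1` recovers the unguided prediction; larger `w` pushes the sample along the direction in which the correctly-timed prediction differs from the perturbed one.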
Related Work
Score-based diffusion models [42, 43, 39, 14] learn the data distribution by reversing a forward diffusion process that progressively transforms the data into Gaussian noise. These models have quickly surpassed the fidelity and diversity of previous generative modeling methods [27, 9], achieving state-of-the-art results in various domains, including unconditional image generation [9, 18], text-to-image generation [32, 37, 2, 33, 30, 47], video generation [5, 4, 10], image-to-image translation [36, 22], motion synthesis [45, 46], and audio generation [7, 20, 17].
Despite these recent advances, diffusion guidance, including classifier [9] and classifier-free guidance [13], still plays an essential role in improving the quality of generations as well as increasing the alignment between the condition and the output image [28].
SAG [15] and PAG [1] have recently been proposed to increase the quality of UNet-based diffusion models by modifying the predictions of the self-attention layers. Our method is complementary to these approaches, as one can combine ICG updates with the update signal from the perturbed attention modules [15]. In addition, we make no assumptions about the network architecture.
Another line of work includes guiding the generation of the diffusion model with a differentiable loss function or an off-the-shelf classifier [41, 8, 48, 3, 11]. These methods are primarily focused on solving inverse problems, typically with unconditional models, while we are instead concerned with achieving the benefits of CFG in conditional models without any additional training requirements. With TSG, we also generalize our approach to extend CFG-like benefits to unconditional models.
CADS [35] also perturbs the condition vector, but with the goal of increasing the diversity of generations. CADS differs from ICG in focusing on the conditional branch to improve diversity, while ICG is concerned with the unconditional branch to simulate CFG. Since CADS is designed to enhance the diversity of CFG, it can be used alongside ICG to improve the diversity of outputs at high guidance scales (see Appendix C.1).
Background
This section provides an overview of diffusion models. Let \(\boldsymbol{x} \sim p_{\text {data }}(\boldsymbol{x})\) be a data point, \(t \in[0,1]\) be the time step, and \(\boldsymbol{z}_t=\boldsymbol{x}+\sigma(t) \boldsymbol{\epsilon}\) be the forward process of the diffusion model that adds noise to the data. Here \(\sigma(t)\) is the noise schedule and determines how much information is destroyed at each time step \(t\), with \(\sigma(0)=0\) and \(\sigma(1)=\sigma_{\max }\). Karras et al. [18] showed that this forward process corresponds to the ordinary differential equation (ODE)
\[ \mathrm{d} \boldsymbol{z}=-\dot{\sigma}(t) \sigma(t) \nabla_{\boldsymbol{z}_t} \log p_t\left(\boldsymbol{z}_t\right) \mathrm{d} t \]
or, equivalently, a stochastic differential equation (SDE) given by
\[ \mathrm{d} \boldsymbol{z}=-\dot{\sigma}(t) \sigma(t) \nabla_{\boldsymbol{z}_t} \log p_t\left(\boldsymbol{z}_t\right) \mathrm{d} t-\beta(t) \sigma(t)^2 \nabla_{\boldsymbol{z}_t} \log p_t\left(\boldsymbol{z}_t\right) \mathrm{d} t+\sqrt{2 \beta(t)} \sigma(t) \mathrm{d} \omega_t \]
Here \(\mathrm{d} \omega_t\) is the standard Wiener process, and \(p_t\left(\boldsymbol{z}_t\right)\) is the time-dependent distribution of noisy samples, with \(p_0=p_{\text {data }}\) and \(p_1=\mathcal{N}\left(\mathbf{0}, \sigma_{\max }^2 \boldsymbol{I}\right)\). Assuming that we have access to the time-dependent score function \(\nabla_{\boldsymbol{z}_t} \log p_t\left(\boldsymbol{z}_t\right)\), we can sample from the data distribution \(p_{\text {data }}\) by solving the ODE or SDE backward in time (from \(t=1\) to \(t=0\) ). The unknown score function \(\nabla_{\boldsymbol{z}_t} \log p_t\left(\boldsymbol{z}_t\right)\) is estimated via a neural denoiser \(D_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t, t\right)\) that is trained to predict the clean samples \(\boldsymbol{x}\) from the corresponding noisy samples \(\boldsymbol{z}_t\). The framework allows for conditional generation by training a denoiser \(D_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t, t, \boldsymbol{y}\right)\) that accepts additional input signals \(\boldsymbol{y}\), such as class labels or text prompts.
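The forward process and a single backward Euler step of the probability-flow ODE can be sketched as follows, assuming for illustration a linear noise schedule \(\sigma(t)=\sigma_{\max} t\) (which satisfies \(\sigma(0)=0\) and \(\sigma(1)=\sigma_{\max}\); the text does not fix a particular schedule):

```python
import numpy as np

SIGMA_MAX = 80.0  # illustrative value of sigma_max

def sigma(t):
    # Linear schedule sigma(t) = sigma_max * t (an assumption for this sketch).
    return SIGMA_MAX * t

def forward_noise(x, t, rng=None):
    """Sample z_t = x + sigma(t) * eps with eps ~ N(0, I)."""
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(x.shape)
    return x + sigma(t) * eps

def euler_ode_step(z_t, t, t_next, score):
    """One backward Euler step of dz = -sigma'(t) sigma(t) score(z_t, t) dt.
    For the linear schedule, sigma'(t) * sigma(t) = SIGMA_MAX**2 * t."""
    drift = -(SIGMA_MAX ** 2) * t * score(z_t, t)
    return z_t + drift * (t_next - t)   # t_next < t when solving backward
```

In practice `score` is the neural estimate described next, and the step is iterated from \(t=1\) down to \(t=0\).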
Training objective Given a noisy sample \(\boldsymbol{z}_t\) at time step \(t\), the denoiser \(D_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t, t, \boldsymbol{y}\right)\) with parameters \(\boldsymbol{\theta}\) can be trained with the standard MSE loss (also called denoising score matching loss)
\[ \underset{\boldsymbol{\theta}}{\arg \min } \mathbb{E}_t\left[\left\|D_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t, t, \boldsymbol{y}\right)-\boldsymbol{x}\right\|^2\right] . \]
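A single-sample version of this objective can be sketched as below; the linear schedule \(\sigma(t)=\sigma_{\max} t\) is again an illustrative assumption:

```python
import numpy as np

def dsm_loss(denoiser, x, y, t, sigma_max=80.0, rng=None):
    """One-sample denoising score matching loss ||D(z_t, t, y) - x||^2,
    with z_t = x + sigma(t) * eps and the (assumed) linear schedule
    sigma(t) = sigma_max * t."""
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(x.shape)
    z_t = x + sigma_max * t * eps
    return float(np.sum((denoiser(z_t, t, y) - x) ** 2))
```

Training averages this quantity over data points, conditions, and time steps, as in the expectation above.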
The trained denoiser approximates the time-dependent conditional score function \(\nabla_{\boldsymbol{z}_t} \log p_t\left(\boldsymbol{z}_t \mid \boldsymbol{y}\right)\) via
\[ \nabla_{\boldsymbol{z}_t} \log p_t\left(\boldsymbol{z}_t \mid \boldsymbol{y}\right) \approx \frac{D_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t, t, \boldsymbol{y}\right)-\boldsymbol{z}_t}{\sigma(t)^2} \]
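This score estimate is a one-liner in code (generic `denoiser` interface assumed):

```python
import numpy as np

def score_from_denoiser(denoiser, z_t, t, y, sigma_t):
    """Score estimate (D(z_t, t, y) - z_t) / sigma(t)^2 from the
    relation above; sigma_t is the noise level sigma(t)."""
    return (denoiser(z_t, t, y) - z_t) / sigma_t ** 2
```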
Classifier-free guidance (CFG) CFG is an inference method for improving the quality of generated outputs by mixing the predictions of a conditional and an unconditional model [13]. Specifically, given a null condition \(\boldsymbol{y}_{\text {null }}=\varnothing\) corresponding to the unconditional case, CFG modifies the output of the denoiser at each sampling step according to \[ \hat{D}_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t, t, \boldsymbol{y}\right)=D_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t, t, \boldsymbol{y}_{\mathrm{null}}\right)+w_{\mathrm{CFG}}\left(D_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t, t, \boldsymbol{y}\right)-D_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t, t, \boldsymbol{y}_{\mathrm{null}}\right)\right) \]
where \(w_{\mathrm{CFG}}=1\) corresponds to the non-guided case. The unconditional model \(D_{\boldsymbol{\theta}}\left(\boldsymbol{z}_t, t, \boldsymbol{y}_{\text {null }}\right)\) is trained by randomly assigning the null condition \(\boldsymbol{y}_{\text {null }}=\varnothing\) to the input of the denoiser with probability \(p\), typically with \(p \in[0.1,0.2]\). One can also train a separate denoiser to estimate the unconditional score in Equation (5) [19]. Similar to the truncation method in GANs [6], CFG increases the quality of individual images at the expense of reduced diversity [26].
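The CFG update itself is straightforward to implement; the sketch below assumes a generic `denoiser(z_t, t, y)` callable that accepts the null condition:

```python
import numpy as np

def cfg_denoise(denoiser, z_t, t, y, y_null, w_cfg):
    """Classifier-free guidance update:
    D_hat = D(z, t, y_null) + w * (D(z, t, y) - D(z, t, y_null)).
    w_cfg = 1 recovers the plain conditional prediction."""
    d_uncond = denoiser(z_t, t, y_null)
    d_cond = denoiser(z_t, t, y)
    return d_uncond + w_cfg * (d_cond - d_uncond)
```

ICG and TSG reuse exactly this mixing rule; they differ only in how the weak (unconditional-like) branch is obtained.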