Classifier Guidance
Diffusion Models Beat GANs on Image Synthesis
We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models.
Architecture Improvements
We explore the following architectural changes:
- increasing depth versus width, holding model size relatively constant;
- increasing the number of attention heads;
- using attention at \(32 \times 32\), \(16 \times 16\), and \(8 \times 8\) resolutions rather than only at \(16 \times 16\);
- using the BigGAN [8] residual block for upsampling and downsampling the activations, following [67];
- rescaling residual connections with \(\frac{1}{\sqrt{2}}\), following [67, 33, 34].
In the rest of the paper, we use this final improved model architecture as our default:
- 128 base channels
- BigGAN up/down sampling
- train the model for 700k iterations
- variable width with 2 residual blocks per resolution, multiple heads with 64 channels per head, attention at \(32 \times 32\), \(16 \times 16\), and \(8 \times 8\) resolutions
- adaptive group normalization for injecting timestep and class embeddings into residual blocks
\[ \operatorname{AdaGN}(h, y)=y_s \operatorname{GroupNorm}(h)+y_b \]
where \(y_s\) and \(y_b\) are linear projections of the timestep and class embeddings, respectively.
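A minimal PyTorch-style sketch of AdaGN (the module name, tensor shapes, and the single projection layer are assumptions for illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive group normalization: scales and shifts GroupNorm(h) using
    projections of the combined timestep + class embedding."""
    def __init__(self, num_groups: int, num_channels: int, emb_dim: int):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, num_channels)
        # One linear layer producing both y_s and y_b from the embedding.
        self.proj = nn.Linear(emb_dim, 2 * num_channels)

    def forward(self, h: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        # h: (B, C, H, W); emb: (B, emb_dim) timestep + class embedding.
        y_s, y_b = self.proj(emb).chunk(2, dim=1)
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(h) + y_b
```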
Classifier Guidance (CG)
In particular, we can train a classifier \(p_\phi\left(y \mid x_t, t\right)\) on noisy images \(x_t\), and then use gradients \(\nabla_{x_t} \log p_\phi\left(y \mid x_t, t\right)\) to guide the diffusion sampling process towards an arbitrary class label \(y\).
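A sketch of how such a gradient can be obtained in practice, assuming a noisy-image classifier `classifier(x, t)` that returns logits (the function name and `scale` argument are illustrative; `scale` plays the role of the guidance scale):

```python
import torch
import torch.nn.functional as F

def classifier_gradient(classifier, x_t, t, y, scale=1.0):
    """Compute grad_{x_t} log p_phi(y | x_t, t) with autograd.
    `classifier(x, t)` is assumed to return class logits."""
    x_in = x_t.detach().requires_grad_(True)
    logits = classifier(x_in, t)
    log_probs = F.log_softmax(logits, dim=-1)
    # Pick the log-probability of the target class for each sample.
    selected = log_probs[torch.arange(y.shape[0]), y].sum()
    grad = torch.autograd.grad(selected, x_in)[0]
    return scale * grad
```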
Covariance Matrix
\[ var(X) = \frac{\sum_{i=1}^n (x_i - \overline{X})(x_i - \overline{X})}{n-1} \]
\[ cov(X, Y) = \frac{\sum_{i=1}^n (x_i - \overline{X})(y_i - \overline{Y})}{n-1} \]
- \(cov(X, X)=var(X)\)
- \(cov(X,Y)=cov(Y,X)\)
For \(n\) dimensions, the covariance matrix is \[ C_{n \times n}=\left(c_{ij}\right), \quad c_{ij}=\operatorname{cov}\left(\operatorname{Dim}_i, \operatorname{Dim}_j\right) \] For \(n=3\): \[ \Sigma=\left(\begin{array}{ccc} \operatorname{cov}(x, x) & \operatorname{cov}(x, y) & \operatorname{cov}(x, z) \\ \operatorname{cov}(y, x) & \operatorname{cov}(y, y) & \operatorname{cov}(y, z) \\ \operatorname{cov}(z, x) & \operatorname{cov}(z, y) & \operatorname{cov}(z, z) \end{array}\right) \]
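A quick NumPy check of these definitions on toy data (the array names and sample size are arbitrary):

```python
import numpy as np

# Toy 3-dimensional data: rows are samples, columns are the dimensions.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))

# Sample covariance matrix; np.cov divides by n - 1, matching the formulas above.
sigma = np.cov(data, rowvar=False)

# cov(X, X) = var(X) and cov(X, Y) = cov(Y, X).
assert np.allclose(np.diag(sigma), data.var(axis=0, ddof=1))
assert np.allclose(sigma, sigma.T)
```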
Determinant of the Covariance Matrix
The determinant of the covariance matrix \(|\Sigma|\) is a measure of the spread (or volume) of the Gaussian distribution in the multidimensional space.
Physical Meaning:
- The determinant of \(\Sigma\) determines the volume of the ellipsoid that encloses a given probability region (e.g., \(68\%\) for one standard deviation, \(95\%\) for two, etc.).
- Larger \(|\Sigma|\): The distribution is more "spread out," indicating higher variance in the data.
- Smaller \(|\Sigma|\): The distribution is more "concentrated," indicating lower variance in the data.
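A small numerical illustration of this point (the specific matrices are arbitrary):

```python
import numpy as np

# A "spread out" and a "concentrated" 2D covariance matrix.
sigma_wide = np.array([[4.0, 0.0],
                       [0.0, 4.0]])
sigma_narrow = np.array([[0.25, 0.0],
                         [0.0, 0.25]])

print(np.linalg.det(sigma_wide))    # 16.0   -> larger volume, more spread
print(np.linalg.det(sigma_narrow))  # 0.0625 -> smaller volume, more concentrated
```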
Multivariate Gaussian Distribution
The probability density function (PDF) of a multivariate Gaussian can be derived from first principles, starting from the one-dimensional Gaussian and using basic linear algebra and calculus.
Step 1: Start with the 1D Gaussian PDF
For a univariate Gaussian random variable \(X \sim \mathcal{N}(\mu, \sigma^2)\), the PDF is: \[ p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right) \] This formula tells us that \(p(x)\) is the probability density of a single random variable \(x\), where:
- \(\mu\) is the mean,
- \(\sigma^2\) is the variance,
- \((x - \mu)^2\) is the squared distance from the mean.
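As a quick numerical sanity check of this formula (the values are arbitrary):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0
x = 0.5

# 1D Gaussian density written out explicitly, checked against scipy.
p = np.exp(-0.5 * (x - mu) ** 2 / sigma ** 2) / np.sqrt(2 * np.pi * sigma ** 2)
assert np.isclose(p, norm.pdf(x, loc=mu, scale=sigma))
```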
Step 2: Generalize to Multiple Dimensions
Now consider a random vector \(\mathbf{x} \in \mathbb{R}^d\) following a multivariate Gaussian distribution:
\(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\)
where:
- \(\boldsymbol{\mu}\) is the mean vector \((d \times 1)\).
- \(\boldsymbol{\Sigma}\) is the covariance matrix \((d \times d)\).
In multiple dimensions, the Gaussian distribution considers:
- The covariance structure: How the dimensions are correlated (via \(\boldsymbol{\Sigma}\)).
- Ellipsoidal shapes: The level sets of the density are ellipsoids rather than spheres; their shape and orientation are determined by \(\boldsymbol{\Sigma}\).
Step 3: Generalize the Squared Distance
In 1D, the term \(\frac{(x - \mu)^2}{\sigma^2}\) in the exponent measures how far \(x\) is from the mean, normalized by the variance. In multiple dimensions, this generalizes to the Mahalanobis distance: \[ (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \] Here:
- \((\mathbf{x} - \boldsymbol{\mu})\): A column vector representing the deviation from the mean.
- \(\boldsymbol{\Sigma}^{-1}\): The inverse covariance matrix, which weights the deviation according to the variance and correlation in each dimension.
- \((\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\): A scalar value representing the "distance" of \(\mathbf{x}\) from \(\boldsymbol{\mu}\), scaled by \(\boldsymbol{\Sigma}^{-1}\).
This quadratic form accounts for the spread and correlation in the multivariate space.
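A short NumPy sketch of this quadratic form (the mean, covariance, and point are arbitrary):

```python
import numpy as np

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([1.0, 1.0])

d = x - mu
# (x - mu)^T Sigma^{-1} (x - mu); solve() avoids forming Sigma^{-1} explicitly.
mahalanobis_sq = d @ np.linalg.solve(sigma, d)
print(mahalanobis_sq)  # a single scalar "distance"
```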
Step 4: Normalization Constant
In 1D, the normalization constant is \(\frac{1}{\sqrt{2\pi \sigma^2}}\). This ensures that the total probability integrates to 1. For d-dimensions, the normalization constant generalizes as: \[ \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \] Where:
- \((2\pi)^{d/2}\): Accounts for the \(d\)-dimensional space.
- \(|\boldsymbol{\Sigma}|^{1/2}\): The square root of the determinant of \(\boldsymbol{\Sigma}\), which measures the "volume" of the distribution. It ensures that the PDF integrates to 1 in \(d\)-dimensional space.
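As a sanity check, for \(d = 1\) with \(\boldsymbol{\Sigma} = \sigma^2\) this reduces to the 1D constant from Step 1: \[ \frac{1}{(2\pi)^{1/2} |\sigma^2|^{1/2}} = \frac{1}{\sqrt{2\pi \sigma^2}}. \]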
Step 5: Combine Components
Finally, combining the normalization constant and the exponential term gives the PDF of the multivariate Gaussian: \[ p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \]
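A minimal sketch that assembles this PDF by hand and checks it against scipy (the numbers are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.0, 0.0])

d = len(mu)
diff = x - mu
# Normalization constant times the exponentiated Mahalanobis term.
norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
p = norm_const * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

assert np.isclose(p, multivariate_normal(mean=mu, cov=sigma).pdf(x))
```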
Probability Density Function
The probability density function of a multivariate Gaussian distribution \(\mathcal{N}(\mu, \Sigma)\) is given by: \[ p(x)=\frac{1}{(2 \pi)^{d / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right) \]
To get the logarithm of the probability density function, take the log of the equation above:
\[ \log p(x)=\log \left(\frac{1}{(2 \pi)^{d / 2}|\Sigma|^{1 / 2}}\right)-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu) \]
This expands as:
\[ \log p(x)=-\frac{d}{2} \log (2 \pi)-\frac{1}{2} \log |\Sigma|-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu) \]
The first two terms ( \(-\frac{d}{2} \log (2 \pi)\) and \(-\frac{1}{2} \log |\Sigma|\) ) are constants with respect to \(x\). These are often grouped together into a single constant \(C\), because they don't affect gradient computations during optimization. So:
\[ \log p(x)=-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)+C \]
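For the isotropic case \(\Sigma = \sigma^2 I\), which is the form that appears in the DDPM derivation below, this becomes \[ \log p(x) = -\frac{\|x - \mu\|^2}{2\sigma^2} + C. \]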
For DDPM
For DDPM, \(p(y \mid x_t, x_{t+1})=p(y \mid x_t)\): given \(x_t\), the label \(y\) is conditionally independent of \(x_{t+1}\).
\[ \begin{aligned} p\left(x_t \mid x_{t+1}, y\right) & =\frac{p\left(x_t, x_{t+1}, y\right)}{p\left(x_{t+1}, y\right)} \\ & =\frac{p\left(x_t, x_{t+1}, y\right)}{p\left(y \mid x_{t+1}\right) p\left(x_{t+1}\right)} \\ & =\frac{p\left(x_t \mid x_{t+1}\right) p\left(x_{t+1}\right) p\left(y \mid x_t, x_{t+1}\right) }{p\left(y \mid x_{t+1}\right) p\left(x_{t+1}\right)} \\ & =\frac{p\left(x_t \mid x_{t+1}\right) p\left(y \mid x_t, x_{t+1}\right)}{p\left(y \mid x_{t+1}\right)} \\ & =\frac{p\left(x_t \mid x_{t+1}\right) p\left(y \mid x_t\right)}{p\left(y \mid x_{t+1}\right)} \\ \end{aligned} \]
\[ p_{\theta, \phi}\left(x_t \mid x_{t+1}, y\right)=Z \cdot p_\theta\left(x_t \mid x_{t+1}\right) \cdot p_\phi\left(y \mid x_t\right) \]
- First factor: \(p_\theta\left(x_t \mid x_{t+1}\right) \sim \mathcal{N}(\mu, \Sigma)\), so by the log-density above,
\[ \log{p_\theta\left(x_t \mid x_{t+1}\right)} = -\frac{1}{2}\left(x_t-\mu\right)^T \Sigma^{-1}\left(x_t-\mu\right) + C \]
- Second factor: Taylor-expand \(\log p_\phi\left(y \mid x_t\right)\) to first order around \(x_t=\mu\):
\[ \begin{aligned} \log p_\phi\left(y \mid x_t\right) & \left.\approx \log p_\phi\left(y \mid x_t\right)\right|_{x_t=\mu}+\left.\left(x_t-\mu\right)^T \nabla_{x_t} \log p_\phi\left(y \mid x_t\right)\right|_{x_t=\mu} \\ & =C_1 + \left(x_t-\mu\right)^T g \end{aligned} \]
\[ C_1 \text{ is constant and } g=\left.\nabla_{x_t} \log p_\phi\left(y \mid x_t\right)\right|_{x_t=\mu} \]
Combined: \[ \begin{aligned} \log{p_{\theta, \phi}\left(x_t \mid x_{t+1}, y\right)} &= \log{Z} + \log{p_\theta\left(x_t \mid x_{t+1}\right)} + \log{p_\phi\left(y \mid x_t\right)} \\ & \approx-\frac{1}{2}\left(x_t-\mu\right)^T \Sigma^{-1}\left(x_t-\mu\right)+ \left(x_t-\mu\right)^T g + C_2 \\ & =-\frac{1}{2}\left[x_t^T \Sigma^{-1} x_t-2 x_t^T \Sigma^{-1} \mu+\mu^T \Sigma^{-1} \mu\right]+x_t^T g-\mu^T g + C_2\\ & =-\frac{1}{2}\left(x_t-\mu-\Sigma g\right)^T \Sigma^{-1}\left(x_t-\mu-\Sigma g\right)+\frac{1}{2} g^T \Sigma g+C_2 \\ & =-\frac{1}{2}\left(x_t-\mu-\Sigma g\right)^T \Sigma^{-1}\left(x_t-\mu-\Sigma g\right)+C_3 \\ \end{aligned} \] where \(C_2 = \log Z + C + C_1\) and \(C_3 = \frac{1}{2} g^T \Sigma g + C_2\) collect all terms that do not depend on \(x_t\).
\[ p_{\theta, \phi}\left(x_t \mid x_{t+1}, y\right) \sim \mathcal{N}(\mu+\Sigma g, \Sigma) \]
The standard DDPM sampling step is: \[ \mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)+\sigma_t \mathbf{z} \] With Classifier Guidance, the mean is shifted by \(\Sigma g = \sigma_t^2 g\) (taking \(\Sigma = \sigma_t^2 I\)): \[ \mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)+\sigma_t^2 g+\sigma_t \mathbf{z} \]
\[ \sigma_t^2 =\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t \]
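Putting this together, a minimal sketch of one classifier-guided DDPM sampling step (the callables `eps_model` and `classifier_grad`, the float schedule arguments, and the `scale` factor are assumptions for illustration, not the paper's code):

```python
import torch

def guided_ddpm_step(eps_model, classifier_grad, x_t, t,
                     alpha_t, alpha_bar_t, alpha_bar_prev, scale=1.0):
    """One classifier-guided DDPM step: sample from N(mu + scale * Sigma * g, Sigma).
    eps_model(x, t) predicts the added noise; classifier_grad(x, t) returns
    grad_x log p_phi(y | x, t). Schedule values are Python floats."""
    beta_t = 1.0 - alpha_t
    sigma2_t = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t  # posterior variance

    eps = eps_model(x_t, t)
    mu = (x_t - beta_t / (1.0 - alpha_bar_t) ** 0.5 * eps) / alpha_t ** 0.5

    g = classifier_grad(mu, t)        # gradient evaluated at the mean, as in the derivation
    mu = mu + scale * sigma2_t * g    # shifted mean: mu + s * Sigma * g

    z = torch.randn_like(x_t)
    return mu + sigma2_t ** 0.5 * z
```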
For DDIM
\[ p_{\theta, \phi}\left(x_t \mid x_{t+1}, y\right)=Z \cdot p_\theta\left(x_t \mid x_{t+1}\right) \cdot p_\phi\left(y \mid x_t\right) \]
If we have a model \(\epsilon_\theta\left(x_t\right)\) that predicts the noise added to a sample, it can be used to derive a score function (in my view, the prefactor is merely a scaling coefficient): \[ \nabla_{x_t} \log p_\theta(x_t \mid x_{t+1})= -\frac{1}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t) \] Then: \[ \begin{aligned} \nabla_{x_t} \log{p_{\theta, \phi}\left(x_t \mid x_{t+1}, y\right)} &\approx \nabla_{x_t} \log p_\theta(x_t \mid x_{t+1}) + \nabla_{x_t} \log p_{\phi}(y \mid x_t)\\ &= -\frac{1}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t) + \nabla_{x_t} \log p_{\phi}(y \mid x_t)\\ &= -\frac{1}{\sqrt{1-\bar{\alpha}_t}} \hat{\epsilon}_\theta(x_t) \end{aligned} \] where the modified noise prediction is \[ \hat{\epsilon}_\theta(x_t) := \epsilon_\theta(x_t) - \sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t} \log p_{\phi}(y \mid x_t). \] The DDIM update is \[ x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x_0+\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\, \epsilon_t +\sigma_t z, \] which for \(\sigma_t = 0\) (deterministic DDIM) becomes \[ x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x_0+\sqrt{1-\bar{\alpha}_{t-1}}\, \epsilon_t, \] where \(x_0\) denotes the model's prediction of the clean sample. With the Classifier Gradient, replace \(\epsilon_t\) with \(\hat{\epsilon}_t\): \[ \begin{aligned} x_{t-1} &= \sqrt{\bar{\alpha}_{t-1}} x_0+\sqrt{1-\bar{\alpha}_{t-1}} \hat{\epsilon}_t \\ &= \sqrt{\bar{\alpha}_{t-1}} x_0+\sqrt{1-\bar{\alpha}_{t-1}} \left(\epsilon_t - \sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t} \log p_\phi\left(y \mid x_t\right)\right)\\ &= \sqrt{\bar{\alpha}_{t-1}} x_0+\sqrt{1-\bar{\alpha}_{t-1}} \left(\epsilon_t - \sqrt{1-\bar{\alpha}_t}\, g\right) \end{aligned} \] with \(g = \nabla_{x_t} \log p_\phi\left(y \mid x_t\right)\), here evaluated at \(x_t\) itself.
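Analogously, a sketch of one deterministic classifier-guided DDIM step with \(\sigma_t = 0\); as above, `eps_model`, `classifier_grad`, the float schedule arguments, and `scale` are assumed for illustration:

```python
import torch

def guided_ddim_step(eps_model, classifier_grad, x_t, t,
                     alpha_bar_t, alpha_bar_prev, scale=1.0):
    """One deterministic classifier-guided DDIM step (sigma_t = 0).
    eps_model(x, t) predicts the noise; classifier_grad(x, t) returns
    grad_x log p_phi(y | x, t)."""
    eps = eps_model(x_t, t)
    g = classifier_grad(x_t, t)
    # Modified noise prediction: eps_hat = eps - sqrt(1 - alpha_bar_t) * g.
    eps_hat = eps - scale * (1.0 - alpha_bar_t) ** 0.5 * g

    # Predicted clean sample x_0, using the guided noise estimate.
    x0_pred = (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps_hat) / alpha_bar_t ** 0.5

    # DDIM update with sigma_t = 0.
    return alpha_bar_prev ** 0.5 * x0_pred + (1.0 - alpha_bar_prev) ** 0.5 * eps_hat
```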