Classifier Guidance
Diffusion Models Beat GANs on Image Synthesis
We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models.
Architecture Improvements
We explore the following architectural changes:
- increasing depth versus width, holding model size relatively constant;
- increasing the number of attention heads;
- using attention at \(32 \times 32\), \(16 \times 16\), and \(8 \times 8\) resolutions rather than only at \(16 \times 16\);
- using the BigGAN [8] residual block for upsampling and downsampling the activations, following [67];
- rescaling residual connections with \(\frac{1}{\sqrt{2}}\), following [67, 33, 34].
In the rest of the paper, we use this final improved model architecture as our default:
- 128 base channels
- BigGAN up/down sampling
- train the model for 700k iterations
- variable width with 2 residual blocks per resolution, multiple heads with 64 channels per head, attention at \(32 \times 32\), \(16 \times 16\), and \(8 \times 8\) resolutions
- adaptive group normalization for injecting timestep and class embeddings into residual blocks
\[ \operatorname{AdaGN}(h, y)=y_s \operatorname{GroupNorm}(h)+y_b \]
where \(y_s\) and \(y_b\) are linear projections of the timestep and class embeddings, respectively.
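A minimal PyTorch-style sketch of AdaGN (the module name, tensor shapes, and the single projection layer are assumptions for illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive group normalization: scales and shifts GroupNorm(h) using
    projections of the combined timestep + class embedding."""
    def __init__(self, num_groups: int, num_channels: int, emb_dim: int):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, num_channels)
        # One linear layer producing both y_s and y_b from the embedding.
        self.proj = nn.Linear(emb_dim, 2 * num_channels)

    def forward(self, h: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        # h: (B, C, H, W); emb: (B, emb_dim) timestep + class embedding.
        y_s, y_b = self.proj(emb).chunk(2, dim=1)
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(h) + y_b
```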
Classifier Guidance (CG)
In particular, we can train a classifier \(p_\phi\left(y \mid x_t, t\right)\) on noisy images \(x_t\), and then use gradients \(\nabla_{x_t} \log p_\phi\left(y \mid x_t, t\right)\) to guide the diffusion sampling process towards an arbitrary class label \(y\).
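A sketch of how such a gradient can be obtained in practice, assuming a noisy-image classifier `classifier(x, t)` that returns logits (the function name and `scale` argument are illustrative; `scale` plays the role of the guidance scale):

```python
import torch
import torch.nn.functional as F

def classifier_gradient(classifier, x_t, t, y, scale=1.0):
    """Compute grad_{x_t} log p_phi(y | x_t, t) with autograd.
    `classifier(x, t)` is assumed to return class logits."""
    x_in = x_t.detach().requires_grad_(True)
    logits = classifier(x_in, t)
    log_probs = F.log_softmax(logits, dim=-1)
    # Pick the log-probability of the target class for each sample.
    selected = log_probs[torch.arange(y.shape[0]), y].sum()
    grad = torch.autograd.grad(selected, x_in)[0]
    return scale * grad
```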
Covariance Matrix
\[ var(X) = \frac{\sum_{i=1}^n (x_i - \overline{X})(x_i - \overline{X})}{n-1} \]
\[ cov(X, Y) = \frac{\sum_{i=1}^n (x_i - \overline{X})(y_i - \overline{Y})}{n-1} \]
- \(cov(X, X)=var(X)\)
- \(cov(X,Y)=cov(Y,X)\)
For \(n\) dimensions, the covariance matrix is \[ C_{n \times n}=\left(c_{ij}\right), \quad c_{ij}=\operatorname{cov}\left(\operatorname{Dim}_i, \operatorname{Dim}_j\right) \] For \(n=3\): \[ \Sigma=\left(\begin{array}{ccc} \operatorname{cov}(x, x) & \operatorname{cov}(x, y) & \operatorname{cov}(x, z) \\ \operatorname{cov}(y, x) & \operatorname{cov}(y, y) & \operatorname{cov}(y, z) \\ \operatorname{cov}(z, x) & \operatorname{cov}(z, y) & \operatorname{cov}(z, z) \end{array}\right) \]
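A quick NumPy check of these definitions on toy data (the array names and sample size are arbitrary):

```python
import numpy as np

# Toy 3-dimensional data: rows are samples, columns are the dimensions.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))

# Sample covariance matrix; np.cov divides by n - 1, matching the formulas above.
sigma = np.cov(data, rowvar=False)

# cov(X, X) = var(X) and cov(X, Y) = cov(Y, X).
assert np.allclose(np.diag(sigma), data.var(axis=0, ddof=1))
assert np.allclose(sigma, sigma.T)
```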
Determinant of the Covariance Matrix
The determinant of the covariance matrix \(|\Sigma|\) is a measure of the spread (or volume) of the Gaussian distribution in the multidimensional space.
Physical Meaning:
- The determinant of \(\Sigma\) determines the volume of the ellipsoid that encloses a given probability region (e.g., \(68\%\) for one standard deviation, \(95\%\) for two, etc.).
- Larger \(|\Sigma|\): The distribution is more "spread out," indicating higher variance in the data.
- Smaller \(|\Sigma|\): The distribution is more "concentrated," indicating lower variance in the data.
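A small numerical illustration of this point (the specific matrices are arbitrary):

```python
import numpy as np

# A "spread out" and a "concentrated" 2D covariance matrix.
sigma_wide = np.array([[4.0, 0.0],
                       [0.0, 4.0]])
sigma_narrow = np.array([[0.25, 0.0],
                         [0.0, 0.25]])

print(np.linalg.det(sigma_wide))    # 16.0   -> larger volume, more spread
print(np.linalg.det(sigma_narrow))  # 0.0625 -> smaller volume, more concentrated
```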
Multivariate Gaussian Distribution
The probability density function (PDF) of a multivariate Gaussian can be derived from first principles, starting from the one-dimensional Gaussian and using basic linear algebra and calculus.
Step 1: Start with the 1D Gaussian PDF
For a univariate Gaussian random variable \(X \sim \mathcal{N}(\mu, \sigma^2)\), the PDF is: \[ p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right) \] This formula tells us that \(p(x)\) is the probability density of a single random variable \(x\), where:
- \(\mu\) is the mean,
- \(\sigma^2\) is the variance,
- \((x - \mu)^2\) is the squared distance from the mean.
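As a quick numerical sanity check of this formula (the values are arbitrary):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0
x = 0.5

# 1D Gaussian density written out explicitly, checked against scipy.
p = np.exp(-0.5 * (x - mu) ** 2 / sigma ** 2) / np.sqrt(2 * np.pi * sigma ** 2)
assert np.isclose(p, norm.pdf(x, loc=mu, scale=sigma))
```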
Step 2: Generalize to Multiple Dimensions
Now consider a random vector \(\mathbf{x} \in \mathbb{R}^d\) following a multivariate Gaussian distribution:
\(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\)
where:
- \(\boldsymbol{\mu}\) is the mean vector \((d \times 1)\).
- \(\boldsymbol{\Sigma}\) is the covariance matrix \((d \times d)\).
In multiple dimensions, the Gaussian distribution considers:
- The covariance structure: How the dimensions are correlated (via \(\boldsymbol{\Sigma}\)).
- Ellipsoidal shapes: The level sets of the density are ellipsoids rather than spheres; their shape and orientation are determined by \(\boldsymbol{\Sigma}\).
Step 3: Generalize the Squared Distance
In 1D, the term \(\frac{(x - \mu)^2}{\sigma^2}\) in the exponent measures how far \(x\) is from the mean, normalized by the variance. In multiple dimensions, this generalizes to the Mahalanobis distance: \[ (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \] Here:
- \((\mathbf{x} - \boldsymbol{\mu})\): A column vector representing the deviation from the mean.
- \(\boldsymbol{\Sigma}^{-1}\): The inverse covariance matrix, which weights the deviation according to the variance and correlation in each dimension.
- \((\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\): A scalar value representing the "distance" of \(\mathbf{x}\) from \(\boldsymbol{\mu}\), scaled by \(\boldsymbol{\Sigma}^{-1}\).
This quadratic form accounts for the spread and correlation in the multivariate space.
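A short NumPy sketch of this quadratic form (the mean, covariance, and point are arbitrary):

```python
import numpy as np

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([1.0, 1.0])

d = x - mu
# (x - mu)^T Sigma^{-1} (x - mu); solve() avoids forming Sigma^{-1} explicitly.
mahalanobis_sq = d @ np.linalg.solve(sigma, d)
print(mahalanobis_sq)  # a single scalar "distance"
```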
Step 4: Normalization Constant
In 1D, the normalization constant is \(\frac{1}{\sqrt{2\pi \sigma^2}}\). This ensures that the total probability integrates to 1. For d-dimensions, the normalization constant generalizes as: \[ \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \] Where:
- \((2\pi)^{d/2}\): Accounts for the \(d\)-dimensional space.
- \(|\boldsymbol{\Sigma}|^{1/2}\): The square root of the determinant of \(\boldsymbol{\Sigma}\), which measures the "volume" of the distribution. It ensures that the PDF integrates to 1 in \(d\)-dimensional space.
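As a sanity check, for \(d = 1\) with \(\boldsymbol{\Sigma} = \sigma^2\) this reduces to the 1D constant from Step 1: \[ \frac{1}{(2\pi)^{1/2} |\sigma^2|^{1/2}} = \frac{1}{\sqrt{2\pi \sigma^2}}. \]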
Step 5: Combine Components
Finally, combining the normalization constant and the exponential term gives the PDF of the multivariate Gaussian: \[ p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \]
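A minimal sketch that assembles this PDF by hand and checks it against scipy (the numbers are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.0, 0.0])

d = len(mu)
diff = x - mu
# Normalization constant times the exponentiated Mahalanobis term.
norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
p = norm_const * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

assert np.isclose(p, multivariate_normal(mean=mu, cov=sigma).pdf(x))
```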
Probability Density Function
The probability density function of a multivariate Gaussian distribution \(\mathcal{N}(\mu, \Sigma)\) is given by: \[ p(x)=\frac{1}{(2 \pi)^{d / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right) \]
To get the logarithm of the probability density function, take the log of the equation above:
\[ \log p(x)=\log \left(\frac{1}{(2 \pi)^{d / 2}|\Sigma|^{1 / 2}}\right)-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu) \]
This expands as:
\[ \log p(x)=-\frac{d}{2} \log (2 \pi)-\frac{1}{2} \log |\Sigma|-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu) \]
The first two terms ( \(-\frac{d}{2} \log (2 \pi)\) and \(-\frac{1}{2} \log |\Sigma|\) ) are constants with respect to \(x\). These are often grouped together into a single constant \(C\), because they don't affect gradient computations during optimization. So:
\[ \log p(x)=-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)+C \]
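For the isotropic case \(\Sigma = \sigma^2 I\), which is the form that appears in the DDPM derivation below, this becomes \[ \log p(x) = -\frac{\|x - \mu\|^2}{2\sigma^2} + C. \]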
For DDPM
For DDPM, \(p(y \mid x_t, x_{t+1})=p(y \mid x_t)\): given \(x_t\), the label \(y\) is conditionally independent of \(x_{t+1}\).
\[ \begin{aligned} p\left(x_t \mid x_{t+1}, y\right) & =\frac{p\left(x_t, x_{t+1}, y\right)}{p\left(x_{t+1}, y\right)} \\ & =\frac{p\left(x_t, x_{t+1}, y\right)}{p\left(y \mid x_{t+1}\right) p\left(x_{t+1}\right)} \\ & =\frac{p\left(x_t \mid x_{t+1}\right) p\left(x_{t+1}\right) p\left(y \mid x_t, x_{t+1}\right) }{p\left(y \mid x_{t+1}\right) p\left(x_{t+1}\right)} \\ & =\frac{p\left(x_t \mid x_{t+1}\right) p\left(y \mid x_t, x_{t+1}\right)}{p\left(y \mid x_{t+1}\right)} \\ & =\frac{p\left(x_t \mid x_{t+1}\right) p\left(y \mid x_t\right)}{p\left(y \mid x_{t+1}\right)} \\ \end{aligned} \]
\[ p_{\theta, \phi}\left(x_t \mid x_{t+1}, y\right)=Z \cdot p_\theta\left(x_t \mid x_{t+1}\right) \cdot p_\phi\left(y \mid x_t\right) \]
- First factor: \(p_\theta\left(x_t \mid x_{t+1}\right) \sim \mathcal{N}(\mu, \Sigma)\), so by the log-density above,
\[ \log{p_\theta\left(x_t \mid x_{t+1}\right)} = -\frac{1}{2}\left(x_t-\mu\right)^T \Sigma^{-1}\left(x_t-\mu\right) + C \]
- Second factor: Taylor-expand \(\log p_\phi\left(y \mid x_t\right)\) to first order around \(x_t=\mu\):
\[ \begin{aligned} \log p_\phi\left(y \mid x_t\right) & \left.\approx \log p_\phi\left(y \mid x_t\right)\right|_{x_t=\mu}+\left.\left(x_t-\mu\right)^T \nabla_{x_t} \log p_\phi\left(y \mid x_t\right)\right|_{x_t=\mu} \\ & =C_1 + \left(x_t-\mu\right)^T g \end{aligned} \]
\[ C_1 \text{ is constant and } g=\left.\nabla_{x_t} \log p_\phi\left(y \mid x_t\right)\right|_{x_t=\mu} \]
Combined: \[ \begin{aligned} \log{p_{\theta, \phi}\left(x_t \mid x_{t+1}, y\right)} &= \log{Z} + \log{p_\theta\left(x_t \mid x_{t+1}\right)} + \log{p_\phi\left(y \mid x_t\right)} \\ & \approx-\frac{1}{2}\left(x_t-\mu\right)^T \Sigma^{-1}\left(x_t-\mu\right)+ \left(x_t-\mu\right)^T g + C_2 \\ & =-\frac{1}{2}\left[x_t^T \Sigma^{-1} x_t-2 x_t^T \Sigma^{-1} \mu+\mu^T \Sigma^{-1} \mu\right]+x_t^T g-\mu^T g + C_2\\ & =-\frac{1}{2}\left(x_t-\mu-\Sigma g\right)^T \Sigma^{-1}\left(x_t-\mu-\Sigma g\right)+\frac{1}{2} g^T \Sigma g+C_2 \\ & =-\frac{1}{2}\left(x_t-\mu-\Sigma g\right)^T \Sigma^{-1}\left(x_t-\mu-\Sigma g\right)+C_3 \\ \end{aligned} \] where \(C_2 = \log Z + C + C_1\) and \(C_3 = \frac{1}{2} g^T \Sigma g + C_2\) collect all terms that do not depend on \(x_t\).
\[ p_{\theta, \phi}\left(x_t \mid x_{t+1}, y\right) \sim \mathcal{N}(\mu+\Sigma g, \Sigma) \]
The standard DDPM sampling step is: \[ \mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)+\sigma_t \mathbf{z} \] With Classifier Guidance, the mean is shifted by \(\Sigma g = \sigma_t^2 g\) (taking \(\Sigma = \sigma_t^2 I\)): \[ \mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)+\sigma_t^2 g+\sigma_t \mathbf{z} \]
\[ \sigma_t^2 =\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t \]
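Putting this together, a minimal sketch of one classifier-guided DDPM sampling step (the callables `eps_model` and `classifier_grad`, the float schedule arguments, and the `scale` factor are assumptions for illustration, not the paper's code):

```python
import torch

def guided_ddpm_step(eps_model, classifier_grad, x_t, t,
                     alpha_t, alpha_bar_t, alpha_bar_prev, scale=1.0):
    """One classifier-guided DDPM step: sample from N(mu + scale * Sigma * g, Sigma).
    eps_model(x, t) predicts the added noise; classifier_grad(x, t) returns
    grad_x log p_phi(y | x, t). Schedule values are Python floats."""
    beta_t = 1.0 - alpha_t
    sigma2_t = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t  # posterior variance

    eps = eps_model(x_t, t)
    mu = (x_t - beta_t / (1.0 - alpha_bar_t) ** 0.5 * eps) / alpha_t ** 0.5

    g = classifier_grad(mu, t)        # gradient evaluated at the mean, as in the derivation
    mu = mu + scale * sigma2_t * g    # shifted mean: mu + s * Sigma * g

    z = torch.randn_like(x_t)
    return mu + sigma2_t ** 0.5 * z
```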
For DDIM
\[ p_{\theta, \phi}\left(x_t \mid x_{t+1}, y\right)=Z \cdot p_\theta\left(x_t \mid x_{t+1}\right) \cdot p_\phi\left(y \mid x_t\right) \]
If we have a model \(\epsilon_\theta\left(x_t\right)\) that predicts the noise added to a sample, it can be used to derive a score function (in my view, the prefactor is merely a scaling coefficient): \[ \nabla_{x_t} \log p_\theta(x_t \mid x_{t+1})= -\frac{1}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t) \] Then: \[ \begin{aligned} \nabla_{x_t} \log{p_{\theta, \phi}\left(x_t \mid x_{t+1}, y\right)} &\approx \nabla_{x_t} \log p_\theta(x_t \mid x_{t+1}) + \nabla_{x_t} \log p_{\phi}(y \mid x_t)\\ &= -\frac{1}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t) + \nabla_{x_t} \log p_{\phi}(y \mid x_t)\\ &= -\frac{1}{\sqrt{1-\bar{\alpha}_t}} \hat{\epsilon}_\theta(x_t) \end{aligned} \] where the modified noise prediction is \[ \hat{\epsilon}_\theta(x_t) := \epsilon_\theta(x_t) - \sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t} \log p_{\phi}(y \mid x_t). \] The DDIM update is \[ x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x_0+\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\, \epsilon_t +\sigma_t z, \] which for \(\sigma_t = 0\) (deterministic DDIM) becomes \[ x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x_0+\sqrt{1-\bar{\alpha}_{t-1}}\, \epsilon_t, \] where \(x_0\) denotes the model's prediction of the clean sample. With the Classifier Gradient, replace \(\epsilon_t\) with \(\hat{\epsilon}_t\): \[ \begin{aligned} x_{t-1} &= \sqrt{\bar{\alpha}_{t-1}} x_0+\sqrt{1-\bar{\alpha}_{t-1}} \hat{\epsilon}_t \\ &= \sqrt{\bar{\alpha}_{t-1}} x_0+\sqrt{1-\bar{\alpha}_{t-1}} \left(\epsilon_t - \sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t} \log p_\phi\left(y \mid x_t\right)\right)\\ &= \sqrt{\bar{\alpha}_{t-1}} x_0+\sqrt{1-\bar{\alpha}_{t-1}} \left(\epsilon_t - \sqrt{1-\bar{\alpha}_t}\, g\right) \end{aligned} \] with \(g = \nabla_{x_t} \log p_\phi\left(y \mid x_t\right)\), here evaluated at \(x_t\) itself.
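Analogously, a sketch of one deterministic classifier-guided DDIM step with \(\sigma_t = 0\); as above, `eps_model`, `classifier_grad`, the float schedule arguments, and `scale` are assumed for illustration:

```python
import torch

def guided_ddim_step(eps_model, classifier_grad, x_t, t,
                     alpha_bar_t, alpha_bar_prev, scale=1.0):
    """One deterministic classifier-guided DDIM step (sigma_t = 0).
    eps_model(x, t) predicts the noise; classifier_grad(x, t) returns
    grad_x log p_phi(y | x, t)."""
    eps = eps_model(x_t, t)
    g = classifier_grad(x_t, t)
    # Modified noise prediction: eps_hat = eps - sqrt(1 - alpha_bar_t) * g.
    eps_hat = eps - scale * (1.0 - alpha_bar_t) ** 0.5 * g

    # Predicted clean sample x_0, using the guided noise estimate.
    x0_pred = (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps_hat) / alpha_bar_t ** 0.5

    # DDIM update with sigma_t = 0.
    return alpha_bar_prev ** 0.5 * x0_pred + (1.0 - alpha_bar_prev) ** 0.5 * eps_hat
```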