DDPM

Denoising Diffusion Probabilistic Models

Hints:

  1. Goal: maximize the log-likelihood of the data (the evidence) \(\log p_\theta(\mathbf{x}_0)\)
  2. Computing the evidence directly is intractable, because it requires integrating over all latent variables
  3. Therefore, we use variational inference and minimize \(D_{KL}(q \,\|\, p_{\theta})\) (\(q\) is known and is constructed as a Markov chain)
  4. We use latent variables and the Markov chain to compute gradients of a bound on the log-likelihood, rather than computing the distribution itself directly.

The process of generating data with a DDPM involves two main phases:

  1. forward process (diffusion): \(q\)
  2. reverse process (denoising): \(p_{\theta}\)

Latent Variable

Latent variables are random variables that belong to the same model as the observed random variables but are not directly observed. They directly influence the observed data and the model parameters, so when the model objective, such as the log-likelihood mentioned above, cannot be optimized directly, latent variables can be introduced to make the problem tractable.

  • \(\mathbf{x}_0\) : the raw image
  • \(\mathbf{x}_T\) : (approximately) pure Gaussian noise
  • \(\mathbf{x}_1, \cdots, \mathbf{x}_T\), denoted \(\mathbf{x}_{1: T}\) : the latent variables

\[ p_\theta\left(\mathbf{x}_0\right):= \int \cdots \int p_\theta\left(\mathbf{x}_0, \mathbf{x}_1, \cdots, \mathbf{x}_T\right) d \mathbf{x}_1 \cdots d \mathbf{x}_T := \int p_\theta\left(\mathbf{x}_{0: T}\right) d \mathbf{x}_{1: T} \]

\[ \log p_\theta\left(\mathbf{x}_0\right)=\log \int p_\theta\left(\mathbf{x}_0, \mathbf{x}_{1: T}\right) d \mathbf{x}_{1: T} \]


Variational Inference

Mathematically, variational inference often employs the Kullback–Leibler divergence (KL divergence, denoted \(\mathcal{D}_{KL}\)) to measure how close two distributions are; it is a non-negative scalar that equals zero exactly when the two distributions coincide.


ELBO

The term \(D_{KL}\left(q_\phi(z \mid x) \| p_\theta(z \mid x)\right)\) is used to make the learned distribution \(q_\phi(z \mid x)\) approach the true posterior distribution \(p_\theta(z \mid x)\). \[ \log p_\theta(x)=D_{K L}\left(q_\phi(z \mid x) \| p_\theta(z \mid x)\right)+\mathcal{L}(\theta, \phi ; x) \] \(\mathcal{L}(\theta, \phi; x)\) represents the variational lower bound, also known as the Evidence Lower BOund (ELBO).

Assuming that the true distribution of \(\mathrm{x}\) remains unchanged, i.e., \(\log p_\theta(x)\) is constant, the closer \(q_\phi(z \mid x)\) is to \(p_\theta(z \mid x)\), the smaller \(D_{KL}\left(q_\phi(z \mid x) \| p_\theta(z \mid x)\right)\) becomes, and the larger \(\mathcal{L}(\theta, \phi; x)\) becomes.

Therefore, our optimization direction is to maximize \(\mathcal{L}(\theta, \phi; x)\), thus making the learned posterior distribution \(q_\phi(z \mid x)\) increasingly close to the true posterior distribution \(p_\theta(z \mid x)\).


Variational Bound

What distinguishes diffusion models from other types of latent variable models is that the approximate posterior \(q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)\), called the forward process or diffusion process, is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule \(\beta_1, \ldots, \beta_T\) : \[ q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right):=\prod_{t=1}^T q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right), \quad q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right):=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right) \]

\[ \theta=\underset{\theta}{\arg \min } \mathcal{D}_{K L}\left(q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)\right) \]

Denote \(\mathbf{x}_{1: T}\) as \(\mathbf{z}\) and \(p_{\theta}\) as \(p\). Since \(p(z \mid x)\) is hard to calculate, we have \[ \begin{aligned} \mathcal{D}_{K L}(q(z \mid x) \| p(z \mid x)) & =\mathbb{E}_q\left[\log \frac{q(z \mid x)}{p(z \mid x)}\right] \\ & =\mathbb{E}_q\left[\log \frac{q(z \mid x) p(x)}{p(z, x)}\right] \\ & =\mathbb{E}_q\left[\log \frac{q(z \mid x)}{p(z, x)}+\log p(x)\right] \\ & = \underbrace{\mathbb{E}_q\left[\log \frac{q(z \mid x)}{p(z, x)}\right]}_{-ELBO} +\mathbb{E}_q\left[ \log p(x)\right] \end{aligned} \]

\[ \mathbb{E}_q\left[ \log p(x)\right] = \mathcal{D}_{K L}(q(z \mid x) \| p(z \mid x)) + \underbrace{\mathbb{E}_q\left[ - \log \frac{q(z \mid x)}{p(z, x)}\right]}_{ELBO} \]

Since \(\mathcal{D}_{K L}(q(z \mid x) \| p(z \mid x)) \geq 0\), \(\mathbb{E}_q[-\log p(x)] \leq \mathbb{E}_q\left[-\log \frac{p(z, x)}{q(z \mid x)}\right]\), therefore: \[ \begin{aligned} \mathbb{E}_q\left[-\log p_\theta\left(\mathbf{x}_0\right)\right] \leq \mathbb{E}_q\left[-\log \frac{p_\theta\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}\right] \end{aligned} \] where \(p_\theta\left(\mathbf{x}_{0: T}\right) := p\left(\mathbf{x}_T\right) \prod_{t=1}^T p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\) and \(q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right) := \prod_{t=1}^T q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)\).

\(p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right) :=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \mathbf{\Sigma}_\theta\left(\mathbf{x}_t, t\right)\right)\) and \(q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) :=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right)\)

Loss

Therefore, the variational bound of negative \(ELBO\), or the bound of negative log likelihood, can be formed to the loss function \(L\) as follows: \[ \begin{aligned} \mathbb{E}_q\left[-\log p_\theta\left(\mathbf{x}_0\right)\right] & \leq \mathbb{E}_q\left[-\log \frac{p\left(\mathbf{x}_T\right) \prod_{t=1}^T p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{\prod_{t=1}^T q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}\right] \\ & =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}\right] := L \end{aligned} \]

We can further reformulate the loss function \(L\) \[ \begin{aligned} L & =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}\right] \\ & =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t>1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}-\log \frac{p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}{q\left(\mathbf{x}_1 \mid \mathbf{x}_0\right)}\right] \end{aligned} \] Since \[ \begin{aligned} q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) & =q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0\right) \\ & =\frac{q\left(\mathbf{x}_t, \mathbf{x}_{t-1}, \mathbf{x}_0\right)}{q\left(\mathbf{x}_{t-1}, \mathbf{x}_0\right)} \\ & =q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \frac{q\left(\mathbf{x}_t, \mathbf{x}_0\right)}{q\left(\mathbf{x}_{t-1}, \mathbf{x}_0\right)} \\ & =q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \frac{q\left(\mathbf{x}_t \mid \mathbf{x}_0\right) q\left(\mathbf{x}_0\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_0\right) q\left(\mathbf{x}_0\right)} \\ & =q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \frac{q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_0\right)} \end{aligned} \] Therefore: \[ \begin{aligned} L & =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t>1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)} \cdot \frac{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_0\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)}-\log \frac{p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}{q\left(\mathbf{x}_1 \mid \mathbf{x}_0\right)}\right] \\ & = \mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t>1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)} -\sum_{t>1} \log q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_0\right)+\sum_{t>1} \log q\left(\mathbf{x}_t \mid \mathbf{x}_0\right) -\log{p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)} + \log{q\left(\mathbf{x}_1 \mid \mathbf{x}_0\right)} \right] \\ & = \mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t>1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)} -\sum_{t>1} \log q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_0\right)+ \sum_{t \geq 1} \log q\left(\mathbf{x}_t \mid \mathbf{x}_0\right) -\log{p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)} \right] \\ & =\mathbb{E}_q\left[-\log \frac{p\left(\mathbf{x}_T\right)}{q\left(\mathbf{x}_T \mid \mathbf{x}_0\right)}-\sum_{t>1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)}-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)\right] \\ & =\mathbb{E}_q\left[D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T \mid \mathbf{x}_0\right) \| p\left(\mathbf{x}_T\right)\right)+\sum_{t>1} D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\right)-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)\right] \end{aligned} \]

Efficient training is therefore possible by optimizing random terms of \(L\) with stochastic gradient descent. \[ \mathbb{E}_q[\underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T \mid \mathbf{x}_0\right) \| p\left(\mathbf{x}_T\right)\right)}_{L_T}+\sum_{t>1} \underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\right)}_{L_{1:T-1}} \underbrace{-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}_{L_0}] \] This equation uses KL divergence to directly compare \(p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\) against the forward process posteriors \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\).


Diffusion and Denoise

Noise Schedule

In the noise schedule of DDPM (Denoising Diffusion Probabilistic Models), \(\beta_t\) represents the magnitude of noise added at time step \(t\), rather than the total noise accumulated up to time \(t\).

Diffusion (Forward) Process

\(\beta_t:=1-\alpha_t\), \(\bar{\alpha}_t:=\prod_{s=1}^t \alpha_s\)
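
These quantities can be precomputed once. Below is a minimal NumPy sketch with an illustrative linear schedule; the endpoints \(10^{-4}\) to \(0.02\) and \(T=1000\) are assumptions for illustration, not prescribed by this note.

```python
import numpy as np

# Illustrative linear beta schedule; T and the endpoints are assumptions.
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # beta_1, ..., beta_T
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = prod_{s<=t} alpha_s

print(alpha_bars[0], alpha_bars[-1])    # near 1 early on, near 0 at t = T
```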

With a manually defined variance schedule \(\beta_1, \ldots, \beta_T\): \[ q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{\alpha_t} \mathbf{x}_{t-1}, (1-\alpha_t) \mathbf{I}\right) \]

\[ q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{\bar{\alpha}_t} \mathbf{x}_0,\left(1-\bar{\alpha}_t\right) \mathbf{I}\right) \]

\[ \mathbf{x}_t=\sqrt{\alpha_t} \mathbf{x}_{t-1}+\sqrt{1-\alpha_t} \boldsymbol{\epsilon}_{t} \]

\[ \mathbf{x}_t=\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \, \overline{\boldsymbol{\epsilon}}_t \]

where \(\beta_t = 1-\alpha_t\) is a small hyperparameter (so \(\alpha_t\) is close to 1) and \(\boldsymbol{\epsilon}_{t} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) is Gaussian noise. \[ \operatorname{SNR}(t)=\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t} \]
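
A minimal sketch of sampling \(\mathbf{x}_t \sim q(\mathbf{x}_t \mid \mathbf{x}_0)\) in closed form, assuming the `alpha_bars` array from the schedule sketch above and an `x0` NumPy array scaled to a standard range; the names are placeholders, not from the original.

```python
import numpy as np

def q_sample(x0, t, alpha_bars, rng=np.random.default_rng()):
    """Closed-form forward sample: x_t = sqrt(a_bar_t)*x_0 + sqrt(1-a_bar_t)*eps."""
    eps = rng.standard_normal(x0.shape)             # eps ~ N(0, I)
    a_bar = alpha_bars[t]
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return xt, eps                                  # eps is the regression target for eps_theta
```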

Denoise (Reverse) Process

\[ p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \sigma_t^2 \mathbf{I} \right) \]

\[ \mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)+\sigma_t \mathbf{z} \]

Where \[ \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right) \]

\[ \sigma_t^2 =\frac{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t \]

Experimentally, \(\sigma_t^2=\beta_t\) and \(\sigma_t^2=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t\) had similar results, since \(\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \to 1\).

The first choice is optimal for \(\mathbf{x}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) , and the second is optimal for \(\mathbf{x}_0\) deterministically set to one point. These are the two extreme choices corresponding to upper and lower bounds on reverse process entropy for data with coordinatewise unit variance.

Predicting \(\boldsymbol{\epsilon}\) essentially applies a signal-to-noise-ratio (SNR) weighting to the prediction of \(\mathbf{x}_0\): the ratio of the \(\mathbf{x}_0\) coefficient to the \(\boldsymbol{\epsilon}\) coefficient in the posterior mean is \(\sqrt{\mathrm{SNR}(t)}\), so accurate predictions are rewarded more under high-SNR conditions. \[ \frac{\sqrt{\bar{\alpha}_{t-1}}({1-{\alpha}_t}) } {1-\bar{\alpha}_t} \Big/ \Big( \frac{1}{\sqrt{\alpha_t}} \cdot \frac{1 - \alpha_t}{\sqrt{1-\bar{\alpha}_t}} \Big) = \frac{\sqrt{\bar{\alpha}_t}}{\sqrt{1-\bar{\alpha}_t}} = \sqrt{\mathrm{SNR}(t)} \]
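
A sketch of a single reverse step \(\mathbf{x}_t \to \mathbf{x}_{t-1}\). Here `eps_pred` stands in for the network output \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\), the schedule arrays are assumed to come from the sketch above, the variance choice is \(\sigma_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t\), and the 0-based indexing is an implementation assumption.

```python
import numpy as np

def p_sample_step(xt, t, eps_pred, betas, alphas, alpha_bars,
                  rng=np.random.default_rng()):
    """One denoising step using the posterior mean and sigma_t^2 = tilde-beta_t."""
    alpha_t, a_bar_t = alphas[t], alpha_bars[t]
    mean = (xt - (1.0 - alpha_t) / np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(alpha_t)
    if t == 0:                                      # final step: return the mean, no noise added
        return mean
    sigma2 = (1.0 - alpha_bars[t - 1]) / (1.0 - a_bar_t) * betas[t]
    return mean + np.sqrt(sigma2) * rng.standard_normal(xt.shape)
```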

Proof: Forward Process from \(\mathbf{x}_0\)

\[ \begin{align} \mathbf{x}_t &=\sqrt{\alpha_t} \mathbf{x}_{t-1}+\sqrt{1-\alpha_t} \boldsymbol{\epsilon}_{t}\\ &= \sqrt{\alpha_t}(\sqrt{\alpha_{t-1}} \mathbf{x}_{t-2}+\sqrt{1-\alpha_{t-1}} \boldsymbol{\epsilon}_{t-1}) +\sqrt{1-\alpha_t} \boldsymbol{\epsilon}_{t} \\ &= \sqrt{\alpha_t}\Big( \sqrt{\alpha_{t-1}} (\sqrt{\alpha_{t-2}} \mathbf{x}_{t-3}+\sqrt{1-\alpha_{t-2}} \boldsymbol{\epsilon}_{t-2}) +\sqrt{1-\alpha_{t-1}} \boldsymbol{\epsilon}_{t-1} \Big) +\sqrt{1-\alpha_t} \boldsymbol{\epsilon}_{t} \\ & \cdots \\ &= \sqrt{\alpha_{t} \alpha_{t-1} \cdots \alpha_{2} \alpha_{1}} \mathbf{x}_0 + \sqrt{\alpha_{t} \alpha_{t-1} \cdots \alpha_{2} (1 - \alpha_{1}) } \boldsymbol{\epsilon}_{1} + \sqrt{\alpha_{t} \alpha_{t-1} \cdots (1-\alpha_{2}) } \boldsymbol{\epsilon}_{2} + \cdots + \sqrt{\alpha_{t} (1 - \alpha_{t-1}) }\boldsymbol{\epsilon}_{t-1} + \sqrt{1 - \alpha_{t}} \boldsymbol{\epsilon}_{t} \end{align} \]

By leveraging the additive property of normal distributions, their sum also forms a normal distribution with a mean of 0 and a variance equal to the sum of the squares of the coefficients. Therefore: \[ \begin{align} var &= \big[{\alpha}_{t} {\alpha}_{t-1} \cdots {\alpha}_{2} (1 - {\alpha}_{1})\big] + \big[{\alpha}_{t} {\alpha}_{t-1} \cdots (1-{\alpha}_{2})\big] + \cdots + \big[{\alpha}_{t} (1 - {\alpha}_{t-1})\big] + \big[1 - {\alpha}_{t} \big] \\ &= {\alpha}_{t} {\alpha}_{t-1} \cdots {\alpha}_{2} (1 - {\alpha}_{1}) + {\alpha}_{t} {\alpha}_{t-1} \cdots (1-{\alpha}_{2}) + \cdots + 1 - {\alpha}_{t} {\alpha}_{t-1} \\ & \cdots \\ &= 1 - {\alpha}_{t} {\alpha}_{t-1} \cdots {\alpha}_{2} {\alpha}_{1} \\ &= 1 - \overline{\alpha}_t \end{align} \] Therefore: \[ \mathbf{x}_t=\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \, \overline{\boldsymbol{\epsilon}}_t \]

\[ q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{\bar{\alpha}_t} \mathbf{x}_0,\left(1-\bar{\alpha}_t\right) \mathbf{I}\right) \]

\[ \operatorname{SNR}(t)=\frac{\overline{\alpha_t}}{1-\overline{\alpha_t}} \]
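
As a quick numeric sanity check (not part of the original derivation), one can iterate the single-step update and compare the empirical mean and variance of \(\mathbf{x}_T\) against the closed form; the schedule values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)     # illustrative schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

x0, n = 2.0, 200_000                   # a scalar "image" and the number of Monte-Carlo samples

xt = np.full(n, x0)                    # iterate x_t = sqrt(alpha_t) x_{t-1} + sqrt(1-alpha_t) eps_t
for t in range(T):
    xt = np.sqrt(alphas[t]) * xt + np.sqrt(1.0 - alphas[t]) * rng.standard_normal(n)

# Closed form predicts mean sqrt(alpha_bar_T) * x0 and variance 1 - alpha_bar_T
print(xt.mean(), np.sqrt(alpha_bars[-1]) * x0)
print(xt.var(),  1.0 - alpha_bars[-1])
```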

Proof: Reverse Process

\[ \begin{align} q\left(x_{t-1} \mid x_t\right) \sim N\left(\mu, \sigma_t^2\right)&=N\left(\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{1 - \alpha_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_t\right), \sigma_t^2 \right) \\ \mathbf{x}_{t-1}&=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_t\right)+\sigma_t \mathbf{z} \\ \end{align} \]

The probability density function (PDF) of a normal distribution, also known as the Gaussian distribution, is given by: \[ f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \] The reverse process of the diffusion model (generative process) is a parameterized Gaussian distribution. \[ \begin{aligned} q\left(x_{t-1} \mid x_t, x_0\right) & =\frac{q\left(x_t \mid x_{t-1}, x_0\right) q\left(x_{t-1} \mid x_0\right)}{q\left(x_t \mid x_0\right)} \\ & =\frac{q\left(x_t \mid x_{t-1}\right) q\left(x_{t-1} \mid x_0\right)}{q\left(x_t \mid x_0\right)} \end{aligned} \] Starting from the probability density function, the probability density of \(q(x_{t-1} \mid x_t, x_0)\) is as follows: \[ \begin{aligned} & \frac{\frac{1}{\sqrt{2 \pi} \sqrt{1-\alpha_t}} e^{-\frac{\left(x_t-\sqrt{\alpha_t} x_{t-1}\right)^2}{2\left(1-\alpha_t\right)}} \frac{1}{\sqrt{2 \pi} \sqrt{1-\bar{\alpha}_{t-1}}} e^{-\frac{\left(x_{t-1}-\sqrt{\bar{\alpha}_{t-1}} x_0\right)^2}{2\left(1-\bar{\alpha}_{t-1}\right)}}}{\frac{1}{\sqrt{2 \pi} \sqrt{1-\overline{\alpha_t}}} e^{-\frac{\left(x_t-\sqrt{\bar{\alpha}_t} x_0\right)^2}{2\left(1-\bar{\alpha}_t\right)}}} \\ & =\frac{1}{\sqrt{2 \pi} \sqrt{\frac{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t}}} e^{-\left[\frac{\left(x_t-\sqrt{\alpha_t} x_{t-1}\right)^2}{2\left(1-\alpha_t\right)}+\frac{\left(x_{t-1}-\sqrt{\bar{\alpha}_{t-1}} x_0\right)^2}{2\left(1-\bar{\alpha}_{t-1}\right)}-\frac{\left(x_t-\sqrt{\bar{\alpha}_t} x_0\right)^2}{2\left(1-\bar{\alpha}_t\right)}\right]} \end{aligned} \] Therefore \[ \sigma^2=\frac{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \] Since \(q\left(x_t \mid x_0\right) \sim N\left(\sqrt{\overline{\alpha_t}} x_0,\left(1-\bar{\alpha}_t\right) I\right)\) , for the exponential part \[ \begin{aligned} & -\left[\frac{\left(x_t-\sqrt{\alpha_t} x_{t-1}\right)^2}{2\left(1-\alpha_t\right)}+\frac{\left(x_{t-1}-\sqrt{\bar{\alpha}_{t-1}} x_0\right)^2}{2\left(1-\bar{\alpha}_{t-1}\right)}-\frac{\left(x_t-\sqrt{\bar{\alpha}_t} x_0\right)^2}{2\left(1-\bar{\alpha}_t\right)}\right] \\ & =-\left[\frac{\left(x_t-\sqrt{\alpha_t} x_{t-1}\right)^2}{2 \sigma^2} * \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}+\frac{\left(x_{t-1}-\sqrt{\bar{\alpha}_{t-1}} x_0\right)^2}{2 \sigma^2} * \frac{1-\alpha_t}{1-\bar{\alpha}_t}-\frac{\left(x_t-\sqrt{\bar{\alpha}_t} x_0\right)^2}{2 \sigma^2} * \frac{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)}{\left(1-\bar{\alpha}_t\right)^2}\right] \\ & =-\frac{1}{2 \sigma^2}\left[x_{t-1}^2-2 \frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right) x_t+\sqrt{\bar{\alpha}_{t-1}}\left(1-\alpha_t\right) x_0}{1-\bar{\alpha}_t} x_{t-1}+\left(\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right) x_t+\sqrt{\bar{\alpha}_{t-1}}\left(1-\alpha_t\right) x_0}{1-\bar{\alpha}_t}\right)^2\right] \\ & =-\frac{1}{2 \sigma^2}\left[x_{t-1}-\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right) x_t+\sqrt{\bar{\alpha}_{t-1}}\left(1-\alpha_t\right) x_0}{1-\bar{\alpha}_t}\right]^2 \end{aligned} \]

\[ \mu=\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right) x_t+\sqrt{\bar{\alpha}_{t-1}}\left(1-\alpha_t\right) x_0}{1-\bar{\alpha}_t} \]

\[ \begin{aligned} & x_t=\sqrt{\bar{\alpha}_t} x_0+\sqrt{1-\bar{\alpha}_t} \epsilon_t \\ &=>\quad x_0=\frac{x_t-\sqrt{1-\overline{\alpha_t}} \epsilon_t}{\sqrt{\overline{\alpha_t}}} \end{aligned} \]

\[ \begin{aligned} \mu & =\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right) x_t+\sqrt{\bar{\alpha}_{t-1}}\left(1-\alpha_t\right) x_0}{1-\bar{\alpha}_t} \\ & =\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right) x_t+\frac{1}{\sqrt{\alpha_t}}\left(1-\alpha_t\right)\left(x_t-\sqrt{1-\bar{\alpha}_t} \epsilon\right)}{1-\bar{\alpha}_t} \\ & =\frac{1}{\sqrt{\alpha_t}} \cdot \frac{\alpha_t\left(1-\bar{\alpha}_{t-1}\right) x_t+\left(1-\alpha_t\right) x_t-\left(1-\alpha_t\right) \sqrt{1-\bar{\alpha}_t} \epsilon}{1-\bar{\alpha}_t} \\ & =\frac{1}{\sqrt{\alpha_t}} \cdot \frac{\left(1-\bar{\alpha}_t\right) x_t-\left(1-\alpha_t\right) \sqrt{1-\bar{\alpha}_t} \epsilon}{1-\bar{\alpha}_t} \\ & =\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon\right) \end{aligned} \]

\[ \begin{aligned} \mu & =\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon\right) \\ \sigma^2 & =\frac{\left(1-\alpha_t\right)\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t \end{aligned} \]

\[ \begin{align} q\left(x_{t-1} \mid x_t\right) \sim N\left(\mu, \sigma_t^2\right)&=N\left(\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{1 - \alpha_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_t\right), \sigma_t^2 \right) \\ \mathbf{x}_{t-1}&=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_t\right)+\sigma_t \mathbf{z} \\ \end{align} \]

\[ \begin{cases} \mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_t\right)+\sigma_t \mathbf{z} \\ x_t=\sqrt{\overline{\alpha_t}} x_0+\sqrt{1-\overline{\alpha_t}} \varepsilon_t \rightarrow \varepsilon_t =\frac{x_t-\sqrt{\overline{\alpha_t}}x_0}{\sqrt{1-\overline{\alpha_t}}} \end{cases} \] Therefore: \[ \begin{aligned} & \mathbf{x}_{t-1}= \frac{1}{\sqrt{\alpha_t}} \big( \mathbf{x}_t- \frac{(1-{\alpha}_t)({\mathbf{x}}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0) }{1-\overline{\alpha}_t} \big) +\sigma_t \mathbf{z} \\ & \mathbf{x}_{t-1}= \frac{1}{\sqrt{\alpha_t}} \left(\frac{\alpha_t - \bar{\alpha}_t}{1-\bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_t}({1-{\alpha}_t}) }{1-\bar{\alpha}_t} \mathbf{x}_0 \right)+\sigma_t \mathbf{z} \\ & \mathbf{x}_{t-1}= \frac{\sqrt{\bar{\alpha}_{t-1}}({1-{\alpha}_t}) }{1-\bar{\alpha}_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{x}_t +\sigma_t \mathbf{z} \\ \end{aligned} \]
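
The coefficient identity stated earlier (the \(\sqrt{\mathrm{SNR}}\) ratio) can be checked numerically. This small sketch, with an assumed schedule, compares the \(\mathbf{x}_0\) coefficient, the \(\boldsymbol{\epsilon}\) coefficient, and \(\sqrt{\mathrm{SNR}(t)}\).

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500                                # any interior step
coef_x0  = np.sqrt(alpha_bars[t - 1]) * (1 - alphas[t]) / (1 - alpha_bars[t])
coef_eps = (1 - alphas[t]) / (np.sqrt(alphas[t]) * np.sqrt(1 - alpha_bars[t]))
sqrt_snr = np.sqrt(alpha_bars[t] / (1 - alpha_bars[t]))

print(coef_x0 / coef_eps, sqrt_snr)    # the two values should match
```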


Loss function

\[ \mathbb{E}_q[\underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T \mid \mathbf{x}_0\right) \| p\left(\mathbf{x}_T\right)\right)}_{L_T}+\sum_{t>1} \underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\right)}_{L_{1: T-1}} \underbrace{-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}_{L_0}] \]

\[ L_{\text {simple }}(\theta):=\mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}, t\right)\right\|^2\right] \]
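
A minimal sketch of estimating \(L_{\text{simple}}\) for one batch. Here `eps_model` is a placeholder for the noise-prediction network \(\boldsymbol{\epsilon}_\theta\) (in practice a neural network trained by SGD on this objective); the function name and array shapes are assumptions for illustration.

```python
import numpy as np

def l_simple(x0_batch, eps_model, alpha_bars, rng=np.random.default_rng()):
    """Monte-Carlo estimate of L_simple for one batch of x_0."""
    n = x0_batch.shape[0]
    t = rng.integers(0, len(alpha_bars), size=n)               # t sampled uniformly
    eps = rng.standard_normal(x0_batch.shape)                  # target noise
    a_bar = alpha_bars[t].reshape(n, *([1] * (x0_batch.ndim - 1)))
    xt = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)              # mean squared error
```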


  1. Forward process and \(L_T\)

    We ignore the fact that the forward process variances \(\beta_t\) are learnable by reparameterization and instead fix them to constants (see Section 4 of the DDPM paper for details). Thus, in our implementation, the approximate posterior \(q\) has no learnable parameters, so \(L_T\) is a constant during training and can be ignored.

  2. Reverse process and \(L_{1: T-1}\)

    From the section above: \[ \begin{align} p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \sigma_t^2 \mathbf{I} \right) \\ \mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)+\sigma_t \mathbf{z} \end{align} \] Using the KL divergence between two univariate Gaussians (derived at the end of this note): \[ L_{t-1}=\mathbb{E}_q\left[\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right)\right\|^2\right]+C \] where \(C\) is a constant that does not depend on \(\theta\). Reparameterizing with \(\mathbf{x}_t=\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}\), this term becomes \[ \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t\left(1-\bar{\alpha}_t\right)}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}, t\right)\right\|^2\right] \]

  3. Data scaling, reverse process decoder, and \(L_0\)

\[ \log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right) = \log \left( \frac{p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}{q\left(\mathbf{x}_0 \mid \mathbf{x}_1, \mathbf{x}_0\right)} \cdot q\left(\mathbf{x}_0 \mid \mathbf{x}_1, \mathbf{x}_0\right) \right) = -\log \frac{q\left(\mathbf{x}_0 \mid \mathbf{x}_1, \mathbf{x}_0\right)}{p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}+\log q\left(\mathbf{x}_0 \mid \mathbf{x}_1, \mathbf{x}_0\right) \]

Taking the expectation under \(q\): \[ \mathbb{E}_q\left[-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)\right] = \mathcal{D}_{K L}\left(q\left(\mathbf{x}_0 \mid \mathbf{x}_1, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)\right)-\mathbb{E}_q\left[\log q\left(\mathbf{x}_0 \mid \mathbf{x}_1, \mathbf{x}_0\right)\right] \] so \(-\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\) has the same KL form as the other terms, up to a constant that does not depend on \(\theta\).

The \(t=1\) case corresponds to \(L_0\)


KL divergence between two univariate Gaussians

\(\text{Let } p(x)=\mathcal{N}\left(\mu_1, \sigma_1^2\right) \text{ and } q(x)=\mathcal{N}\left(\mu_2, \sigma_2^2\right).\) \[ \begin{aligned} K L(p, q) & =-\int p(x) \log q(x) d x+\int p(x) \log p(x) d x \\ & =\frac{1}{2} \log \left(2 \pi \sigma_2^2\right)+\frac{\sigma_1^2+\left(\mu_1-\mu_2\right)^2}{2 \sigma_2^2}-\frac{1}{2}\left(1+\log 2 \pi \sigma_1^2\right) \\ & =\log \frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+\left(\mu_1-\mu_2\right)^2}{2 \sigma_2^2}-\frac{1}{2} \end{aligned} \]
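
The closed form can be verified with a small Monte-Carlo estimate of \(\mathbb{E}_p[\log p(x)-\log q(x)]\); this sketch uses arbitrary example parameters.

```python
import numpy as np

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """Closed-form KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

# Monte-Carlo check: E_p[log p(x) - log q(x)] with x ~ p
rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.3, 1.2, -0.5, 0.8             # arbitrary example parameters
x = rng.normal(mu1, s1, size=1_000_000)
log_p = -0.5 * np.log(2 * np.pi * s1**2) - (x - mu1)**2 / (2 * s1**2)
log_q = -0.5 * np.log(2 * np.pi * s2**2) - (x - mu2)**2 / (2 * s2**2)
print(kl_gauss(mu1, s1, mu2, s2), np.mean(log_p - log_q))   # the two estimates should agree
```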