In this post I will write about diffusion models, which have seen an upswing in interest following recent successes such as DALL-E. Diffusion models fall into the class of generative models; compared to GANs they are much easier to train and do not require the adversarial setup that can be tricky to get right. There are however some drawbacks: while a GAN produces its final result in a single inference pass, a diffusion model must traverse a chain of many (in some cases $> 4000$) inference steps to arrive at the final result. The concept of diffusion models predates 2020, when the Ho et al. 2020 article was released, but in that article the authors make some key improvements that later sparked renewed interest in them.
In this post I will first introduce the concept of diffusion models and thereafter go into how they can be used; in that part I will reference my own work with them.
Diffusion process
The diffusion process is a Markov process that takes some data $x_0$ and, over $T$ discrete steps (timesteps), adds noise to it, often Gaussian. For each transition from $x_{t-1}$ to $x_t$, $\text{noise}(x_t) > \text{noise}(x_{t-1})$, until at step $T$ the state $x_T$ can be modeled by an isotropic Gaussian distribution $N(\mu, \Sigma)$ where $\Sigma = \sigma^2 I$, i.e. there is no cross-correlation between the elements of $x_{T}$.
Diffusion Process
The image below shows the forward diffusion process, where noise is added by the Markov process $q(x_t| x_{t-1})$ until $T$ steps (timesteps) are reached. $p_\theta(x_{t-1}| x_{t})$ is the parameterized Markov chain (model) that at each step reverses the diffusion process.
The diffusion process is described by,
$$ q(x_t | x_{t-1}) = N(x_t | \sqrt{1 - \beta_t}\,x_{t-1}, \beta_t I) $$
where $\beta_t$ is a variance schedule. In [Ho et al. 2020](https://arxiv.org/pdf/2006.11239) the schedule was set to increase linearly, $\beta_t = t \frac{0.02 - 0.0001}{1000} + 0.0001$ for $T = 1000$. As $t$ increases, the previous state is scaled down by $\sqrt{1 - \beta_t}$ to counteract the growth in variance as noise is added; this keeps the input to the diffusion model consistently scaled across timesteps. The diffusion process can now be simulated accordingly,
$$ x_t = \sqrt{1 - \beta_t} x_{t-1} + \sqrt{\beta_t} \epsilon \quad \text{where} \quad \epsilon \sim N(0, I) $$
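As a minimal sketch of this simulation (assuming PyTorch and the linear schedule above; the variable names are my own, not taken from any particular implementation), one noising step could look like:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # linear variance schedule beta_1 ... beta_T

def forward_step(x_prev, t):
    """Simulate one noising step x_{t-1} -> x_t of the diffusion process."""
    eps = torch.randn_like(x_prev)  # epsilon ~ N(0, I)
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * eps
```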
If we reparameterize $q(x_t | x_{t-1})$ with $\alpha_t = 1 - \beta_t$ and define $\bar{\alpha}_{t} = \prod^t_{i=1} \alpha_i$, we can write the diffusion process for any $t$ conditioning only on $x_{0}$,
$$ q(x_t | x_{0}) = N(x_t | \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I) $$
This follows from,
$$ x_{t} = \sqrt{\alpha_t} x_{t-1} + \sqrt{1-\alpha_t} \epsilon_1 $$
$$ x_{t} = \sqrt{\alpha_t} (\sqrt{\alpha_{t-1}} x_{t-2} + \sqrt{1-\alpha_{t-1}} \epsilon_2) + \sqrt{1-\alpha_t} \epsilon_1 $$
$$ x_{t} = \sqrt{\alpha_t \alpha_{t-1}} x_{t-2} + \sqrt{\alpha_t (1 - \alpha_{t-1})} \epsilon_2 + \sqrt{1 - \alpha_t} \epsilon_1 $$
From that we can iteratively express $x_t$ as a function of $x_{0}$,
$$ x_t = \sqrt{\bar{\alpha}_t} x_{0} + \sqrt{1-\bar{\alpha}_t}\, \epsilon $$
Here we have used that the sum of two independent Gaussian random variables with distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ is distributed as $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$. In the case above, $\left(\sqrt{\alpha_t (1 - \alpha_{t-1})}\right)^2 + \left(\sqrt{1 - \alpha_t}\right)^2 = 1 - \alpha_t \alpha_{t-1}$.
Having $x_t$ expressed as a function of $x_{0}$ makes it easy, during training, to add noise corresponding to any randomly drawn timestep $t = 1, \dots, T$ without having to go through all the timesteps of the diffusion process leading up to $t$.
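Continuing the sketch above, sampling $x_t$ directly from $x_0$ then only needs the cumulative products $\bar{\alpha}_t$ (again a hedged sketch, not the exact code from my repository):

```python
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)  # \bar{\alpha}_t = prod of alpha_1 ... alpha_t

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form, without stepping through 1..t."""
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(alphas_bar[t]) * x0 + torch.sqrt(1.0 - alphas_bar[t]) * eps
    return x_t, eps  # eps is returned since it becomes the training target later on
```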
Diffusion model - reversing the diffusion process
A diffusion model can be seen as a parameterized Markov chain that reverses the diffusion process. The model transitions from a state $x_t$ to a state $x_{t-1}$, where $\text{noise}(x_t) > \text{noise}(x_{t-1})$; in this way the diffusion model "denoises" the data into a noiseless sample matching the original data distribution after at most $T$ steps. The diffusion model can be regarded as a latent variable model where $x_1, x_{2}, \dots, x_T$ are latent variables produced by the diffusion process. If we were to draw an analogy with Variational Autoencoders, the diffusion process could be regarded as the encoder and the diffusion model as the decoder; one significant difference, however, is that the latent variables have the same dimensions as the original data.
The reversal of one diffusion step is expressed by, $$ p_\theta(x_{t-1}| x_{t}) = N(x_{t-1} | \mu_{\theta}(x_{t}, t), \Sigma_{\theta}(x_t, t)) $$
where $\theta$ is the set of parameters that parameterize the Gaussian distribution expressing the transition probabilities from $x_t$ to $x_{t-1}$, i.e. they describe the conditional terms acting on $x_{t-1}$ given $x_t$. In Ho et al. 2020, $\Sigma_{\theta}$ is not learned but fixed to a time-dependent constant, so we can rewrite the above expression as,
$$ p_\theta(x_{t-1}| x_{t}) = N(x_{t-1} | \mu_{\theta}(x_{t}, t), \Sigma(t)) $$
The model that reverses the diffusion process is an approximation of that process; in order to understand how good the approximation is, we need to be able to compare it to the true reverse diffusion process, in a similar way as we do for Variational Autoencoders.
If we again look at the diffusion process, but this time at the reverse transition from $x_t$ to $x_{t-1}$ conditioned on $x_0$, we can express it as,
$$ q(x_{t-1} | x_{t}, x_{0}) = N(x_{t-1} | \tilde{\mu}(x_{t}, x_{0}), \tilde{\beta}_t I) $$
We want to express this in terms of quantities we can calculate given the dynamics from the section above. The definition of conditional probability states $p(A | B) = \frac{p(A, B)}{p(B)}$. In our case we can write the q-expression above as,
$$ q(x_{t-1} | x_{t}, x_{0}) = \frac{q(x_t , x_{t-1}, x_{0})}{q(x_t, x_0)} = \frac{q(x_t | x_{t-1}, x_{0}) q(x_{t-1} , x_0)}{q(x_t, x_0)} = $$
$$ = q(x_t | x_{t-1}, x_{0}) \frac{q(x_{t-1}|x_{0})\, q(x_0)}{q(x_{t}|x_{0})\, q(x_0)} = q(x_t | x_{t-1}, x_{0}) \frac{q(x_{t-1}|x_{0})}{q(x_{t}|x_{0})} $$
All terms on the right-hand side can be expressed by known Gaussian distributions. By substituting these Gaussian densities into the expression above, $\tilde{\mu}(x_{t}, x_{0})$ and $\tilde{\beta}_t$ can be calculated.
$$ \tilde{\mu}(x_{t}, x_{0}) = \frac{1}{\sqrt{\alpha_t}} \left( x_{t} - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_t \right) $$
where $x_0$ has been substituted via the closed-form relation $x_0 = \tfrac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_t\big)$, and
$$ \tilde{\beta_t} = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t $$
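As a small sketch (continuing the assumed schedule code above), these posterior quantities can be computed directly from the noise $\epsilon_t$ and the schedule:

```python
def q_posterior(x_t, eps_t, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0), with x_0 expressed through
    the noise eps_t that produced x_t (see the closed-form expression above)."""
    a_bar_prev = alphas_bar[t - 1] if t > 0 else torch.tensor(1.0)
    mu_tilde = (x_t - (1.0 - alphas[t]) / torch.sqrt(1.0 - alphas_bar[t]) * eps_t) / torch.sqrt(alphas[t])
    beta_tilde = (1.0 - a_bar_prev) / (1.0 - alphas_bar[t]) * betas[t]
    return mu_tilde, beta_tilde
```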
Similarly to when training a Variational Autoencoder, we can optimize the ELBO, which reduces to minimizing the KL divergence between the process that generates the latent variables ($q$) and the model that reverses that process ($p$), in our case $D_{KL}(q(x_{t-1} | x_{t}, x_{0}) \,||\, p_\theta(x_{t-1}| x_{t}))$.
In Ho et al. 2020 the authors showed that this more complex training objective can be reduced to simply minimizing the distance between the added noise and an estimate of the added noise. In that case the model outputs an estimate $\epsilon_{\theta}(x_{t}, t)$ of the added noise, and the training objective becomes to minimize,
$$ L_{\text{simple}} = || \epsilon_{\theta}(x_{t}, t) - \epsilon_t||^2 $$
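A hedged sketch of what one training step with this objective could look like, assuming `model` is a network (e.g. a UNet) that takes the noised batch and the timesteps and predicts the noise; the helper names here are my own:

```python
import torch.nn.functional as F

def training_step(model, x0):
    """One L_simple step: draw random timesteps, noise x0, predict the added noise."""
    t = torch.randint(0, T, (x0.shape[0],))   # one timestep per sample in the batch
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)   # broadcast over image dims (B, C, H, W)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    eps_hat = model(x_t, t)                   # the model estimates the added noise
    return F.mse_loss(eps_hat, eps)
```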
So, instead of the model outputting $x_{t-1}$, the model outputs the estimate of the added noise, $\epsilon_{\theta}(x_{t}, t)$. When we reverse the diffusion process we gradually, one timestep at a time, remove the estimated noise from $x_{t}$ to form $x_{t-1}$ accordingly,
$$ \mu_{\theta}(x_{t}, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \right) $$
$$ \Sigma(t) = \left( \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t \right) $$
$$ x_{t-1} = \mu_{\theta}(x_{t}, t) + \sqrt{\Sigma(t)}\, \epsilon_{\text{posterior}} \quad \text{where} \quad \epsilon_{\text{posterior}} \sim N(0, I) $$
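Put together, the reverse process can be sketched as the loop below (again an assumed PyTorch implementation continuing the snippets above, not the exact code from my repository); note that no noise is added at the final step:

```python
@torch.no_grad()
def sample(model, shape):
    """Reverse the diffusion process, starting from pure noise x_T ~ N(0, I)."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps_hat = model(x, torch.full((shape[0],), t))
        mu = (x - (1.0 - alphas[t]) / torch.sqrt(1.0 - alphas_bar[t]) * eps_hat) / torch.sqrt(alphas[t])
        if t > 0:
            sigma2 = (1.0 - alphas_bar[t - 1]) / (1.0 - alphas_bar[t]) * betas[t]  # Sigma(t)
            x = mu + torch.sqrt(sigma2) * torch.randn_like(x)
        else:
            x = mu  # no noise is added at the last step
    return x
```

Calling e.g. `sample(model, (16, 3, 64, 64))` would then produce a batch of 16 samples, assuming that is the shape the network was trained on.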
So how can it be used? - applications
From the sections above we now know that a diffusion model can reverse the diffusion process, turning noisy data into something that looks like it came from the original data distribution. We can go even further and "denoise" the isotropic Gaussian noise at $t = T$, thereby letting pure noise drive the generation of novel data samples.
In my work with them (repository, notebook) I have used the CelebA dataset to train a UNet-like network to reverse the diffusion process and generate novel data samples, in this case faces.
Each row shows the reversal of the diffusion process for timesteps that are linearly spaced between $t = T/2 = 500$ and $t=0$.
In cases where we want to generate new data samples that should share some high-level characteristics with a known sample, we can add an appropriate amount of noise, so that the noised sample $x_t$ no longer contains details but still retains the high-level characteristics we want to keep.
Left image shows the original image $x_0$, the middle one $x_{100}$ and the right one shows the final output of the reversal of the diffusion process.
In the cases above, the key characteristics of the faces have been preserved while the generated faces do not depict the same person.
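A sketch of this idea, continuing the functions above: noise the image only up to some intermediate timestep (here assumed to be $t_{\text{start}} = 100$, matching the figure) and run the reverse loop from there instead of from $T$:

```python
@torch.no_grad()
def generate_variation(model, x0, t_start=100):
    """Keep the coarse structure of x0: noise it up to t_start, then denoise from there."""
    a_bar = alphas_bar[t_start]
    x = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * torch.randn_like(x0)
    for t in reversed(range(t_start)):
        eps_hat = model(x, torch.full((x.shape[0],), t))
        mu = (x - (1.0 - alphas[t]) / torch.sqrt(1.0 - alphas_bar[t]) * eps_hat) / torch.sqrt(alphas[t])
        if t > 0:
            sigma2 = (1.0 - alphas_bar[t - 1]) / (1.0 - alphas_bar[t]) * betas[t]
            x = mu + torch.sqrt(sigma2) * torch.randn_like(x)
        else:
            x = mu
    return x
```

The same function can be fed an interpolation of two images as `x0`, which is what the next example does.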
We can also use the model to denoise an interpolation of two faces, see the image below, where I have interpolated the two images on the right into the middle one, $x_{0}$; the fourth image shows $x_{200}$, and the last image shows the final output.
Extra
Markov Chains
A Markov chain consists of a set of states $s_i \in S$; the transition from a state $s_i$ to any other state $s_j$ is determined only by the current state, i.e. the probability of transitioning from the current state $s_i$ to $s_j$ is given by $p(s_j | s_i)$, and no previously visited states have an impact on the transition probabilities.
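As a toy illustration (my own example, not tied to the diffusion setting), a Markov chain can be represented by a transition matrix whose row $i$ holds $p(s_j | s_i)$, and simulated by repeatedly sampling the next state from the row of the current state:

```python
import numpy as np

# Row i holds the transition probabilities p(s_j | s_i); each row sums to 1.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

rng = np.random.default_rng(0)
state, trajectory = 0, [0]
for _ in range(10):
    state = rng.choice(3, p=P[state])  # next state depends only on the current state
    trajectory.append(int(state))
print(trajectory)
```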