In this post I will write about Autoencoders and Variational Autoencoders. The former compresses data into a dense representation that can be used in various applications; the latter extends Autoencoders with additional properties that make it possible to generate data that appears to follow the original data distribution. Both fall into the class of Latent Variable Models.

Latent Variable Models

Latent Variable Models are a class of models that map an observable variable $X$ to a latent (hidden/unknown) variable $Z$. Assume that for each $x_{i} \in X$ we have an observable outcome $y_{i}$ from the set of outcomes $Y$. In this setup we assume there is a hidden underlying process $P$ that relates $X$ to $Y$; that process is modeled by the latent variable $Z$.

Autoencoder

An autoencoder is a latent variable model composed of an encoder $E(x_i)$ and a decoder $D(z_i)$ that together describe the process of mapping $x_{i}$ back to $x_{i}$. This seems trivial; the caveat, however, is that we restrict the latent variable $z_{i}$ to be a point in a lower-dimensional space (compared to $x_{i}$).

$$ E(x_{i}) = z_{i}, \qquad D(z_{i}) = \hat{x_i} $$

This restriction forces the model to learn the underlying process of compressing $x$ into a dense representation $z$.

The loss function for training an Autoencoder is,

$$ L = \frac{1}{n} \sum_{i=1}^{n} ||x_{i} - D(E(x_{i}))||_2 $$
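As a concrete illustration, here is a minimal sketch of such an autoencoder in PyTorch. The architecture, layer sizes, and the names `input_dim` and `latent_dim` are illustrative choices of mine, not something prescribed by the discussion above.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # E(x_i) = z_i: map the input to a lower-dimensional latent point
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # D(z_i) = x_hat_i: map the latent point back to input space
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
x = torch.randn(16, 784)          # dummy batch standing in for real data

# Reconstruction loss: mean over the batch of ||x - D(E(x))||_2
x_hat = model(x)
loss = torch.mean(torch.norm(x - x_hat, dim=1))
loss.backward()
```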

Variational Autoencoder

Another common restriction imposed on $Z$ is a prior distribution, often a Gaussian. This restriction can be enforced by minimizing the KL-divergence between the distribution of $Z$ and the prior during training. It has several benefits, the most obvious being that it makes the latent space much easier to work with. Because we now know the distribution of $Z$, we can sample $\hat{z} \sim$ prior and use the decoder to generate a sample $\hat{x}$ that looks like it was drawn from $X$. By imposing this restriction we have gone from an Autoencoder to a Variational Autoencoder: from something that merely compresses data to something that can actually generate data.

Dressed in more statistical terms, it looks like the following.

The prior distribution $p(z)$ imposes restrictions on the latent space. The likelihood $p(x|z)$ describes the mapping $z \rightarrow x$, and the posterior distribution $p(z|x)$ describes the reverse mapping. The process of generating a data point would then be,

$$ z \sim p(z) \qquad x \sim p(x | z)$$
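A sketch of this generative process, assuming a standard normal prior and using an untrained stand-in network in place of a trained decoder:

```python
import torch
import torch.nn as nn

latent_dim, input_dim = 32, 784           # illustrative sizes
decoder = nn.Sequential(                   # stand-in for a trained decoder D
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, input_dim),
)

with torch.no_grad():
    z = torch.randn(8, latent_dim)   # z ~ p(z), here a standard normal prior
    x_generated = decoder(z)         # x ~ p(x|z), taken as the decoder output
```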

According to Bayes' theorem,

$$ p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$

where $p(x) = \int p(x|z)p(z) dz$. The integral is, however, intractable to compute because it requires evaluation over all possible configurations of the latent variables. This in turn makes the evaluation of $p(z|x)$ intractable.

What we can do is find a distribution $q(z | x)$ that approximates the posterior distribution. We also need a way of evaluating how good the approximation is: the KL-divergence measures how much information is lost when using $q$ to approximate $p$.

$$D_{KL}(q(z | x) || p(z | x)) = \int q(z | x) \log \frac{q(z | x)}{p(z | x)}dz = $$

$$ = \int q(z | x) \log q(z | x) dz - \int q(z | x) \log p(z, x) dz + \int q(z | x) \log p(x) dz = $$

where we used that $p(z, x) = p(z|x)p(x)$

$$ = \textbf{E}_q[\log q(z | x)] - \textbf{E}_q[\log p(z, x)]+ \log p(x) $$

where the expectation is over $q(z|x)$, and the last term simplifies to $\log p(x)$ since $\int q(z|x) dz = 1$.

The Evidence is the (log) marginal likelihood, $\log p(x)$. The Evidence Lower Bound (ELBO) is a lower bound on the Evidence,

$$ ELBO = \textbf{E}_q[\log p(z, x)] - \textbf{E}_q[\log q(z | x)] $$

With this we rewrite the above expression for the Evidence,

$$ \log p(x) = ELBO + D_{KL}(q(z | x) || p(z | x)) $$

One property of the KL-divergence is that it is always greater than or equal to zero. It follows that the ELBO is a lower bound on $\log p(x)$, and since $\log p(x)$ does not depend on $q$, minimizing the KL-divergence is the same as maximizing the ELBO. To learn the posterior, instead of minimizing the intractable KL-divergence we can maximize the tractable ELBO.

We can rewrite the ELBO for a datapoint $x_i$ accordingly,

$$ ELBO = \textbf{E}_q[ \log p(x_i|z)] - D_{KL}(q(z | x_i) || p(z)) $$

where we have expanded the joint distribution and used that $D_{KL}(q(z | x_i) || p(z)) = E_q[\log q(z|x_i)] - E_q[\log p(z)]$

Relating this back to the discussion above, if we parametrize $q(z|x)$ and $p(x|z)$ with the encoder and decoder respectively, we have the Variational Autoencoder setup. Maximizing the ELBO then maximizes the expected reconstruction log-likelihood, $\textbf{E}_q[ \log p(x_i|z)]$ (i.e., minimizes the reconstruction error), and minimizes the regularization term, $D_{KL}(q(z | x_i) || p(z))$, which restricts the distribution of the latent space to stay close to the prior $p(z)$.

When constructing the loss for a Variational Autoencoder we need to be able to sample latent variables from $q(z | x_i)$. However, sampling from a distribution is not differentiable, so gradients cannot flow through it during the training backward pass. The reparameterization trick enables us to express $z$ as a deterministic function of the distribution's parameters and an independent noise term.

If $q(z | x_i)$ is chosen to be a Gaussian (to match a Gaussian prior $p(z)$), the reparameterization trick looks like

$$ z \sim q(z | x_i) \rightarrow z \sim N(\mu, \sigma) \rightarrow z_{x_i} = \mu_{x_i} + \epsilon \sigma_{x_i} \qquad \text{where} \quad \epsilon \sim N(0, 1) $$

So, instead of the encoder mapping $x_i$ directly to a point in the latent space, it outputs the parameters of the underlying Gaussian distribution, $z_{x_i} = [\mu_{x_i}, \sigma_{x_i}]$, from which the sample is computed.
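A minimal sketch of the reparameterization trick in PyTorch, with stand-in tensors in place of actual encoder outputs (the names `mu` and `log_var` are illustrative, and parametrizing the variance in log-space is a common convention, not something required by the math above):

```python
import torch

# Stand-ins for the encoder outputs mu_{x_i} and log(sigma_{x_i}^2)
mu = torch.zeros(16, 32, requires_grad=True)
log_var = torch.zeros(16, 32, requires_grad=True)
sigma = torch.exp(0.5 * log_var)

eps = torch.randn_like(sigma)   # epsilon ~ N(0, 1), sampled outside the graph
z = mu + eps * sigma            # z_{x_i} = mu_{x_i} + eps * sigma_{x_i}

# Gradients flow to mu and log_var through the deterministic expression above.
z.sum().backward()
```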

The closed-form KL-divergence between two Gaussians is

$$ D_{KL}(q(z | x_i) || p(z)) = \log\frac{\sigma_{p}}{\sigma_{q}} + \frac{\sigma_{q}^2 + (\mu_{q} - \mu_{p})^2}{2\sigma_{p}^2} - \frac{1}{2} $$
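As a quick sanity check, the closed-form expression can be compared against a Monte Carlo estimate of $\textbf{E}_q[\log q(z) - \log p(z)]$; the parameter values below are arbitrary examples.

```python
import torch

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) for two univariate Gaussians."""
    return (torch.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2)
            - 0.5)

mu_q, sigma_q = torch.tensor(1.0), torch.tensor(0.5)
mu_p, sigma_p = torch.tensor(0.0), torch.tensor(1.0)
q = torch.distributions.Normal(mu_q, sigma_q)
p = torch.distributions.Normal(mu_p, sigma_p)

z = q.sample((100_000,))
mc_estimate = (q.log_prob(z) - p.log_prob(z)).mean()
print(kl_gaussians(mu_q, sigma_q, mu_p, sigma_p), mc_estimate)
```

`torch.distributions.kl_divergence(q, p)` returns the same closed-form value and can be used instead of writing the formula out by hand.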

We now have everything we need to construct the loss for a Variational Autoencoder.
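Tying the pieces together, here is a minimal sketch of a Variational Autoencoder and its loss, assuming a Gaussian $q(z|x_i)$ and a standard normal prior $p(z) = N(0, I)$; the architecture, dimensions, and the choice of a squared-error reconstruction term are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)       # mu_{x_i}
        self.to_log_var = nn.Linear(256, latent_dim)  # log sigma_{x_i}^2
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.to_mu(h), self.to_log_var(h)
        # Reparameterization trick: z = mu + eps * sigma
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        return self.decoder(z), mu, log_var

def vae_loss(x, x_hat, mu, log_var):
    # Reconstruction term: -E_q[log p(x|z)], with a Gaussian likelihood this
    # reduces (up to a constant) to a squared error between x and the decoder output.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Regularization term: closed-form D_KL(N(mu, sigma^2) || N(0, 1)), summed over dimensions.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

model = VAE()
x = torch.randn(16, 784)              # dummy batch standing in for real data
x_hat, mu, log_var = model(x)
loss = vae_loss(x, x_hat, mu, log_var)
loss.backward()
```

Minimizing this loss is equivalent to maximizing the per-datapoint ELBO derived above: the reconstruction term corresponds to $\textbf{E}_q[\log p(x_i|z)]$ and the KL term to $D_{KL}(q(z|x_i) || p(z))$.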