# Variational Autoencoders 2: Maths

Variational Autoencoders 1: Overview
Variational Autoencoders 2: Maths
Variational Autoencoders 3: Training, Inference and comparison with other models

Last time we wrote the probability of $X$ under a model with a latent variable $z$ as follows:

$\displaystyle P(X) = \int P\left(X\vert z; \theta\right)P(z)dz$  (1)

and we said the key idea behind VAEs is to sample $z$ not from the whole distribution $P\left(z\right)$, but from a simpler distribution $Q\left(z\vert X\right)$. The reason is that most values of $z$ are likely to give $P\left(X\vert z\right)$ close to zero, and therefore contribute little to the estimate of $P\left(X\right)$. If we instead sample $z \sim Q\left(z\vert X\right)$, those values of $z$ are more likely to generate the $X$ in the training set. Moreover, we hope that $Q$ will have fewer modes than $P\left(z\right)$, and will therefore be easier to sample from. The intuition is that the locations of the modes of $Q\left(z\vert X\right)$ depend on $X$, and this flexibility compensates for the fact that $Q\left(z\vert X\right)$ is simpler than $P\left(z\right)$.
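To make the "most of $z$ contributes almost nothing" point concrete, here is a minimal sketch on a toy model that is an assumption of this example, not something from the series: a 1-D latent $z \sim \mathcal{N}(0, 1)$ with $X\vert z \sim \mathcal{N}(z, 0.1^2)$. The helper names and the 1% threshold are illustrative. It compares how many samples give a non-negligible $P\left(X\vert z\right)$ when $z$ is drawn from $P\left(z\right)$ versus from a $Q$ that depends on $X$ (here, for simplicity, the exact posterior of the toy model).

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def useful_fraction(z_samples, x, noise_std, threshold=0.01):
    """Fraction of samples whose likelihood P(x|z) exceeds
    `threshold` times the largest possible likelihood (reached at z = x)."""
    likelihood = gaussian_pdf(x, z_samples, noise_std)
    peak = gaussian_pdf(x, x, noise_std)
    return float(np.mean(likelihood > threshold * peak))

rng = np.random.default_rng(0)
x, noise_std, n = 2.5, 0.1, 100_000  # hypothetical observation and noise level

# Sampling z from the prior P(z) = N(0, 1).
z_prior = rng.standard_normal(n)

# Sampling z from a Q(z|X) that depends on X. In this toy model the exact
# posterior is Gaussian, so we use it as Q for illustration.
post_mean = x / (1.0 + noise_std**2)
post_std = np.sqrt(noise_std**2 / (1.0 + noise_std**2))
z_q = rng.normal(post_mean, post_std, size=n)

frac_prior = useful_fraction(z_prior, x, noise_std)
frac_q = useful_fraction(z_q, x, noise_std)
print(f"useful samples from P(z):   {frac_prior:.3f}")
print(f"useful samples from Q(z|X): {frac_q:.3f}")
```

In this setup only a small fraction of the prior samples land where $P\left(X\vert z\right)$ is non-negligible, while almost all samples from $Q$ do.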

But how can $Q\left(z\vert X\right)$ help with modelling $P\left(X\right)$? If $z$ is sampled from $Q$, then using $f$ we get $E_{z \sim Q}P\left(X\vert z\right)$. We then need to relate this quantity to $P\left(X\right)$, which is what we actually want to estimate. The relationship between $E_{z \sim Q}P\left(X\vert z\right)$ and $P\left(X\right)$ is the backbone of VAEs.

We start with the KL divergence between $Q\left(z\vert X\right)$ and $P\left(z\vert X\right)$:

$\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log Q\left(z\vert X\right) - \log P\left(z\vert X\right)\right]$

The unknown quantity in this equation is $P\left(z\vert X\right)$, but at least we can rewrite it with Bayes' rule:

$\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log Q\left(z\vert X\right) - \log P\left(X\vert z\right) - \log P\left(z\right)\right] + \log P\left(X\right)$
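Spelled out, this step uses Bayes' rule in log form:

$\log P\left(z\vert X\right) = \log P\left(X\vert z\right) + \log P\left(z\right) - \log P\left(X\right)$

Substituting this for $\log P\left(z\vert X\right)$ inside the expectation flips the three signs, and because $\log P\left(X\right)$ does not depend on $z$, it can be pulled out of $E_{z\sim Q}$ as a constant.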

Rearranging, and applying the definition of the KL divergence between $Q\left(z\vert X\right)$ and $P\left(z\right)$, we have:

$\log P\left(X\right) - \mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] - \mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]$    (2)
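As a sanity check, equation (2) can be verified numerically on a toy model where every term has a closed form. The sketch below assumes a 1-D linear-Gaussian model that is not part of this series: $P\left(z\right) = \mathcal{N}(0, 1)$, $P\left(X\vert z\right) = \mathcal{N}(z, s^2)$ and $Q\left(z\vert X\right) = \mathcal{N}(m, v)$ with arbitrary $m$ and $v$. The function names and the numbers are illustrative.

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    """KL divergence D[N(m1, v1) || N(m2, v2)] for 1-D Gaussians (v = variance)."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def both_sides(x, s2, m, v):
    """Return (lhs, rhs, log P(x)) of equation (2) for the toy model
    z ~ N(0, 1), x|z ~ N(z, s2), Q(z|x) = N(m, v)."""
    # Marginal: x ~ N(0, 1 + s2).
    log_px = -0.5 * np.log(2.0 * np.pi * (1.0 + s2)) - x**2 / (2.0 * (1.0 + s2))
    # True posterior P(z|x), Gaussian for this model.
    post_m, post_v = x / (1.0 + s2), s2 / (1.0 + s2)
    lhs = log_px - kl_gauss(m, v, post_m, post_v)
    # E_{z~Q}[log P(x|z)], using E_Q[(x - z)^2] = (x - m)^2 + v.
    expected_loglik = -0.5 * np.log(2.0 * np.pi * s2) - ((x - m) ** 2 + v) / (2.0 * s2)
    rhs = expected_loglik - kl_gauss(m, v, 0.0, 1.0)
    return lhs, rhs, log_px

lhs, rhs, log_px = both_sides(x=1.3, s2=0.25, m=0.7, v=0.4)
print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}, log P(x) = {log_px:.6f}")
```

The two sides agree for any choice of $m$ and $v$, and both stay below $\log P\left(X\right)$ unless $Q$ equals the true posterior.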

If you forget everything else, equation (2) is the thing you should remember. It is therefore important to understand what it means:

• The left-hand side is exactly what we want to maximize, $\log P\left(X\right)$, minus an error term. The smaller this error term is, the closer we are to maximizing $\log P\left(X\right)$ itself. Since a KL divergence is never negative, the left-hand side is a lower bound on what we want to optimize, hence the name variational (Bayesian).
• If $Q$ happens to be a differentiable function, the right-hand-side is something we can optimize with gradient descent (we will see how to do it later). Note that the right-hand-side happens to take the form of encoder and decoder, where $Q$ encodes $X$ into $z$, and then $P$ decodes $z$ to reconstruct $X$, hence the name “Autoencoder”. However, VAEs don’t really belong to the family of Denoising and Sparse Autoencoders, although there are indeed some connections.
• Note that $P\left(z\vert X\right)$ on the left-hand side is intractable. However, by maximizing the left-hand side, we simultaneously minimize $\mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right]$, and therefore pull $Q\left(z\vert X\right)$ closer to $P\left(z\vert X\right)$. If we use a flexible model for $Q$, then we can use $Q$ as an approximation of $P\left(z\vert X\right)$. This is a nice side effect of the whole framework.
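The last point can be illustrated with a small sketch: maximizing the right-hand side of (2) over the parameters of $Q$ pulls $Q$ towards the true posterior. This again assumes a toy 1-D model chosen for this example only ($P\left(z\right) = \mathcal{N}(0, 1)$, $P\left(X\vert z\right) = \mathcal{N}(z, s^2)$, $Q = \mathcal{N}(m, v)$), with gradients derived by hand for that model. It uses plain gradient ascent on $m$ and $\log v$, not the neural-network training procedure of an actual VAE.

```python
import numpy as np

def fit_q(x, s2, steps=2000, lr=0.1):
    """Gradient ascent on the right-hand side of (2) for Q = N(m, v).

    For this toy model the objective is, up to a constant,
        -((x - m)^2 + v) / (2 * s2) - 0.5 * (v + m^2 - 1 - log v).
    We optimize log v instead of v so that the variance stays positive.
    """
    m, log_v = 0.0, 0.0  # start with Q equal to the prior N(0, 1)
    for _ in range(steps):
        v = np.exp(log_v)
        grad_m = (x - m) / s2 - m                       # d objective / d m
        grad_log_v = 0.5 - 0.5 * v * (1.0 / s2 + 1.0)   # d objective / d log v
        m += lr * grad_m
        log_v += lr * grad_log_v
    return float(m), float(np.exp(log_v))

x, s2 = 1.3, 0.25  # hypothetical observation and likelihood variance
m_fit, v_fit = fit_q(x, s2)

# True posterior of the toy model, available in closed form here only.
post_mean, post_var = x / (1.0 + s2), s2 / (1.0 + s2)
print(f"fitted Q:       mean = {m_fit:.4f}, var = {v_fit:.4f}")
print(f"true posterior: mean = {post_mean:.4f}, var = {post_var:.4f}")
```

The fit never touches $P\left(z\vert X\right)$ directly. It only uses the two terms on the right-hand side, yet $Q$ ends up matching the posterior because this $Q$ family is flexible enough to contain it.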

Actually, the above maths existed long before VAEs. The trick was to use a feedforward network for $Q$, which gave rise to VAEs several years ago.

Next time, we will see how to do that, and hopefully conclude this series. Then we can move on to something more interesting.