Variational Autoencoders 1: Overview
Variational Autoencoders 2: Maths
Variational Autoencoders 3: Training, Inference and comparison with other models
Last time we saw the probability distribution of $X$ with a latent variable $z$, as follows:

$$P(X) = \int P(X|z)\,P(z)\,dz$$
and we said the key idea behind VAEs is to not sample $z$ from the whole prior distribution $P(z)$, but from a simpler distribution $Q(z|X)$ instead. The reason is that most samples of $z$ drawn from $P(z)$ will likely give $P(X|z)$ close to zero, and therefore make little contribution to the estimation of $P(X)$. If instead we sample $z \sim Q(z|X)$, those values of $z$ are more likely to generate the $X$ in the training set. Moreover, we hope that $Q(z|X)$ has fewer modes than $P(z)$, and is therefore easier to sample from. The intuition is that the locations of the modes of $Q(z|X)$ depend on $X$, and this flexibility compensates for the fact that $Q(z|X)$ is simpler than $P(z)$.
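To see why sampling from the prior is wasteful, consider a toy one-dimensional model (the model and the choice of $Q$ below are illustrative assumptions, not from this post): with a standard normal prior and a narrow Gaussian likelihood, almost every $z$ drawn from $P(z)$ gives a $P(X|z)$ that is numerically zero, while samples concentrated near the observed $X$ (reweighted by $P(z)/Q(z|X)$, i.e. importance sampling) estimate $P(X)$ with far lower variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative assumption):
#   prior        z ~ N(0, 1)
#   likelihood   x | z ~ N(z, 0.1^2)
# so the true marginal is P(x) = N(x; 0, 1 + 0.1^2).
def p_x_given_z(x, z, sigma=0.1):
    return np.exp(-(x - z) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x_obs = 5.0
n = 100_000

# Naive estimate: sample z from the prior. Almost all samples land near 0,
# where P(x_obs | z) is essentially zero, so the estimate is very noisy.
z_prior = rng.normal(0.0, 1.0, n)
naive = p_x_given_z(x_obs, z_prior).mean()

# Sample instead from a hand-picked Q(z|x) = N(x_obs, 0.1^2), concentrated
# on the z that could have produced x_obs, and correct each sample with the
# importance weight P(z) / Q(z|x).
z_q = rng.normal(x_obs, 0.1, n)
p_z = np.exp(-z_q**2 / 2) / np.sqrt(2 * np.pi)   # prior density at z_q
q_z = p_x_given_z(z_q, x_obs)                     # Q(z|x) density at z_q
smart = (p_x_given_z(x_obs, z_q) * p_z / q_z).mean()

# Closed-form marginal for comparison.
true_px = np.exp(-x_obs**2 / (2 * 1.01)) / np.sqrt(2 * np.pi * 1.01)
print(naive, smart, true_px)
```

Both estimators are unbiased, but the naive one needs astronomically many samples before it stops being dominated by the rare lucky draw near $z \approx 5$.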
But how can $Q(z|X)$ help with modelling $P(X)$? If $z$ is sampled from $Q(z|X)$, then using $P(X|z)$ we will get $\mathbb{E}_{z \sim Q}\left[\log P(X|z)\right]$. We will then need to show how this quantity relates to $\log P(X)$, which is the actual quantity we want to estimate. The relationship between $\mathbb{E}_{z \sim Q}\left[\log P(X|z)\right]$ and $\log P(X)$ is the backbone of VAEs.
We start with the KL divergence between $Q(z|X)$ and $P(z|X)$:

$$\mathcal{D}\left[Q(z|X)\|P(z|X)\right] = \mathbb{E}_{z \sim Q}\left[\log Q(z|X) - \log P(z|X)\right]$$
The unknown quantity in this equation is $P(z|X)$, but at least we can apply Bayes' rule to it:

$$\mathcal{D}\left[Q(z|X)\|P(z|X)\right] = \mathbb{E}_{z \sim Q}\left[\log Q(z|X) - \log P(X|z) - \log P(z)\right] + \log P(X)$$

where $\log P(X)$ comes out of the expectation because it does not depend on $z$.
Rearranging things a bit, and applying the definition of the KL divergence between $Q(z|X)$ and $P(z)$, we have:

$$\log P(X) - \mathcal{D}\left[Q(z|X)\|P(z|X)\right] = \mathbb{E}_{z \sim Q}\left[\log P(X|z)\right] - \mathcal{D}\left[Q(z|X)\|P(z)\right]$$
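For completeness, here is the rearrangement spelled out step by step, starting from the definition of the KL divergence and using Bayes' rule $P(z|X) = P(X|z)P(z)/P(X)$:

```latex
\begin{aligned}
\mathcal{D}\left[Q(z|X)\|P(z|X)\right]
  &= \mathbb{E}_{z \sim Q}\left[\log Q(z|X) - \log P(z|X)\right] \\
  &= \mathbb{E}_{z \sim Q}\left[\log Q(z|X) - \log P(X|z) - \log P(z)\right] + \log P(X) \\
\log P(X) - \mathcal{D}\left[Q(z|X)\|P(z|X)\right]
  &= \mathbb{E}_{z \sim Q}\left[\log P(X|z)\right]
     - \mathbb{E}_{z \sim Q}\left[\log Q(z|X) - \log P(z)\right] \\
  &= \mathbb{E}_{z \sim Q}\left[\log P(X|z)\right] - \mathcal{D}\left[Q(z|X)\|P(z)\right]
\end{aligned}
```

The second line pulls $\log P(X)$ out of the expectation (it does not depend on $z$), and the last line is just the definition of $\mathcal{D}\left[Q(z|X)\|P(z)\right]$.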
If you forget everything else, this formula is the one thing you should remember. It is therefore important to understand what it means:
- The left-hand side is exactly what we want to optimize, $\log P(X)$, minus an error term $\mathcal{D}\left[Q(z|X)\|P(z|X)\right]$. The smaller this error term is, the better we are at maximizing $\log P(X)$. Since the KL divergence is non-negative, the left-hand side is a lower bound of what we want to optimize, hence the name variational (Bayesian).
- If $Q(z|X)$ happens to be a differentiable function, the right-hand side is something we can optimize with gradient descent (we will see how to do it later). Note that the right-hand side takes the form of an encoder and a decoder, where $Q(z|X)$ encodes $X$ into $z$, and then $P(X|z)$ decodes $z$ to reconstruct $X$, hence the name “Autoencoder”. However, VAEs don’t really belong to the family of Denoising and Sparse Autoencoders, although there are indeed some connections.
- Note that $\mathcal{D}\left[Q(z|X)\|P(z|X)\right]$ on the left-hand side is intractable. However, by maximizing the right-hand side, we simultaneously minimize this divergence, and therefore pull $Q(z|X)$ closer to the true posterior $P(z|X)$. If we use a flexible model for $Q(z|X)$, then we can use $Q(z|X)$ as an approximation for $P(z|X)$. This is a nice side effect of the whole framework.
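Both terms on the right-hand side can be computed concretely when $Q(z|X)$ is a diagonal Gaussian and $P(z) = \mathcal{N}(0, I)$. Below is a minimal numpy sketch: the encoder output, the decoder, and the data point are all made-up values for illustration (in a real VAE they come from networks), but the two terms are computed the standard way, with the KL term in closed form and the reconstruction term estimated by sampling from $Q$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output for one data point X: a diagonal Gaussian
# Q(z|X) = N(mu, diag(sigma^2)). Made-up values for illustration.
mu = np.array([0.5, -1.0])
log_var = np.array([-0.2, 0.3])
sigma = np.exp(0.5 * log_var)

# KL[ N(mu, diag(sigma^2)) || N(0, I) ] has a closed form:
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# The reconstruction term E_{z~Q}[log P(X|z)] is estimated by sampling z
# from Q and scoring X under the decoder. Here the "decoder" is a made-up
# linear-Gaussian P(X|z) = N(x; A z, I) with a hand-picked matrix A.
A = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 1.0]])
x = np.array([0.4, -0.6, -1.1])

z = mu + sigma * rng.normal(size=(1000, 2))   # 1000 samples from Q(z|X)
mean_x = z @ A.T                              # decoder mean for each z
log_p = -0.5 * np.sum((x - mean_x) ** 2, axis=1) - 1.5 * np.log(2 * np.pi)
recon = log_p.mean()

elbo = recon - kl                             # the right-hand side
print(kl, recon, elbo)
```

Maximizing `elbo` with respect to the encoder and decoder parameters is exactly what VAE training does; the missing ingredient, covered next time, is how to backpropagate through the sampling step.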
Actually the above maths existed way before VAEs. The trick was to use a feedforward neural network for $Q(z|X)$, which gave rise to VAEs several years ago.
Next time, we will see how to do that, and hopefully conclude this series. Then we can move on to something more interesting.