# Variational Autoencoders 1: Overview

In a previous post, we briefly mentioned some recent approaches for Generative Modeling. Among those, RBMs and DBMs are probably very tricky because the estimation of gradients in those models is based on a good mixing of MCMC, which tends to get worse during the course of training because the model distribution gets sharper. Autogressive models like PixelRNN, WaveNet, etc… are easier to train but have no latent variables, which makes them somewhat less powerful. Therefore, the current frontier in Generative Modelling is probably GANs and Variational Autoencoders (VAEs).

While GANs are too mainstream, I thought I can probably write a post or two about Variational Autoencoders, at least to clear up some confusions I am having about them.

Formally, generative modeling is the area in Machine Learning that deals with models of distributions $P(X)$, defined over datapoints $X$ in some high-dimensional space $\mathcal{X}$. The whole idea is to construct models of $P(X)$ that assigns high probabilities to data points similar to those in the training set, and low probabilities every where else. For example, a generative models of images of cows should assign small probabilities to images of human.

However, computing the probability of a given example is not the most exciting thing about generative models. More often, we want to use the model to generate new samples that look like those in the training set. This “creativity” is something unique to generative models, and does not exist in, for instance, discriminative models. More formally, say we have a training set sampled from an unknown distribution $P_\text{org}(X)$, and we want to train a model $P$ which we can take sample from, such that $P$ is as close as possible to $P_\text{org}$.

Needless to say, this is a difficult problem. To make it tractable, traditional approaches in Machine Learning often have to 1) make strong assumptions about the structure of the data, or 2) make severe approximation, leading to suboptimal models, or 3) rely on expensive sampling procedures like MCMC. Those are all limitations which make Generative modeling a long-standing problem in ML research.

Without further ado, let’s get to the point. When $\mathcal{X}$ is a high-dimensional space, modeling is difficult mostly because it is tricky to handle the inter-dependencies between dimensions. For instance, if the left half of an image is a horse then probably the right half is likely another horse.

To further reduce this complexity, we add a latent variable $z$ in a high-dimensional space $\mathcal{Z}$ that we can easily sample from, according to a distribution $P(z)$ defined over $\mathcal{Z}$. Then say we have a family of deterministic function $f(z;\theta)$ parameterized by a vector $\theta$ in some space $\Theta$ where $f: \mathcal{Z} \times \Theta \rightarrow \mathcal{X}$. Now $f$ is deterministic, but since $z$ is a random variable, $f(z;\theta)$ is a random variable in $\mathcal{X}$.

During inference, we will sample $z$ from $P(z)$, and then train $\theta$ such that $f(z;\theta)$ is close to samples in the training set. Mathematically, we want to maximize the following probability for every sample $X$ in the training set:

$\displaystyle P(X) = \int P\left(X\vert z; \theta\right)P(z)dz$   (1)

This is the good old maximum likelihood framework, but we replace $f(z;\theta)$ by $P\left(X\vert z;\theta\right)$ (called the output distribution) to explicitly indicate that $X$ depends on $z$, so that we can use the integral to make it a proper probability distribution.

There are a few things to note here:

• In VAEs, the choice of the output distribution is often Gaussian, i.e. $P\left(X\vert z;\theta\right) = \mathcal{N}\left(X; f(z;\theta), \sigma^2 * I\right)$, meaning it is a Gaussian distribution with mean $f(z;\theta)$ and diagonal covariance matrix where $\sigma$ is a scalar hyper-parameter. This particular choice has some important motivations:
• We need the output distribution to be continuous, so that we can use gradient descent on the whole model. It wouldn’t be possible if we use discontinuous function like the Dirac distribution, meaning to use exactly the output value of $f(z;\theta)$ for $X$.
• We don’t really need to train our model such that $f(z;\theta)$ produces exactly some sample X in the training set. Instead, we want it to produce samples that are merely like X. In the beginning of training, there is no way for $f$ to gives exact samples in the training set. Hence by using a Gaussian, we allow the model to gradually (and gently) learn to produce samples that are more and more like those in the training set.
• It doesn’t have to be Gaussian though. For instance, if $X$ is binary, we can make $P\left(X\vert z;\theta\right)$ a Bernoulli parameterized by $f(z;\theta)$. The important property is that $P\left(X\vert z\right)$ can be computed, and is continuous in the domain of $\theta$.
• The distribution of $z$ is simply the normal distribution, i.e. $P(z) = \mathcal{N}\left(0,I\right)$. Why? How is it possible? Is there any limitation with this? A related question is why don’t we have several levels of latent variables. which potentially might help modelling complicated processes?
All those question can be answered by the key observation that any distribution in $d$ dimensions can be generated by taking $d$ variables from the normal distribution and mapping them through a sufficiently complicated function.
Let that sink for a moment. Readers who are interested in the mathematical details can have a look at the conditional distribution method described in this paper. You can also convince yourself if you remember how we can sample from any Gaussian as described in an earlier post.
Now, this observation means we don’t need to go to more than one level of latent variable, with a condition that we need a sufficiently complicated function for $f(z;\theta)$. Since deep neural nets has been shown to be a powerful function approximator, it makes a lot of sense to use deep neural nets for modeling $f$.
• Now the only business is to maximize (1). Using the law of large numbers, we can approximate the integral by the expected value over a large number of samples. So the plan will be to take a very large sample $\left\{z_1, ..., z_n\right\}$ from $P(z)$, then compute $P(X) \approx \frac{1}{n}\sum_i P\left(X\vert z_i;\theta\right)$. Unfortunately the plan is infeasible because in high dimensional spaces, $n$ needs to be very large in order to have a good enough approximation of $P(X)$ (imagine how much samples you would need for $200 \times 200 \times 3$ images, which is in 120K dimensional space?)
Now the key to realize is that we don’t need to sample $z$ from all over $P(z)$. In fact, we only need to sample $z$ such that $f(z;\theta)$ is more likely to be similar to samples in the training set. Moreover, it is likely that for most of $z$, $P(X\vert z)$ is nearly zero, and therefore contribute very little into the estimation of $P(X)$. So the question is: is there any way to sample $z$ such that it is likely to generate $X$, and only estimate $P(X)$ from those?
It is the key idea behind VAEs.

That’s quite enough for an overview. Next time we will do some maths and see how we go about maximizing (1). Hopefully I can then convince you that VAEs, GANs and GSNs are really not far away from each other, at least in their core ideas.